Finding Duplicate Addresses Using the Levenshtein Distance Metric in SQL

We’ve all been in the situation where the LIKE operator in SQL isn’t good enough for the needs of the task at hand. The LIKE operator lacks the ability to find strings that are similar but not quite the same. This is when the Levenshtein distance algorithm comes in handy.

I didn’t even know about this algorithm until I ran into this situation. It took me a bit, but I finally figured out how to put it to work.

The Levenshtein distance is a metric that measures the difference between two strings. That is the minimum number of single-character edits that are required to change one string into the other. Single-character edits can be insertions, deletions, and substitutions. For example, the difference distance between “books” and “back” is three.

  • books -> baoks (substitution “o” for “a”)
  • baoks -> backs (substitution “o” for “c”)
  • backs -> back (deletion of “s”)

Phil Factor writes about the edit distance and gives an implementation in MS SQL Server’s T-SQL for this algorithm in his post String Comparisons in SQL: Edit Distance and the Levenshtein algorithm.

Other DBMS already come with an implementation. For example, the LEVENSHTEIN function in PostgreSQL, the EDIT_DISTANCE_SIMILARITY function in Oracle, and the EDITDIST3 in SQLite.

The Levenshtein distance is very useful when trying to identify that a string like 931 Main St is the “same” as 931 Main Street. This is a common issue in systems that work with client information such as CRMs. In this scenario, calculating the Levenshtein distance and then transforming it into a ratio based on the length of the largest string can give you the percentage of similarity of the two strings. It is up to you to decide what percentage is good enough to determine that two addresses are close enough to be considered the same.

In the past, I have used a ratio of less than 50% difference with good results.

Obviously, this technique is not only for addresses. It can be used to find similar strings as needed. Spell-checkers use the Levenshtein distance to find words to suggest when a person makes a typo. Ispell suggest words with an edit distance of 1, for example.

An Example in PostgreSQL

Let’s say we have this table with two text columns. Each column holds an address.

CREATE TABLE Addresses(
  LINE1 text,
  LINE2 text
);

This are the sample records. Most of the addresses in each row are the same one but written in a different way. They are the same for a human, but they’re not equal to the LIKE operator.

INSERT INTO Addresses (LINE1, LINE2)
VALUES ('9128 LEEWARD CIR, INDIANAPOLIS, IN', '9128 Leeward Circle, Indianapolis, IN');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('101 OCEAN LANE DRIVE, KEY BISCAYNE, FL','101 Ocean Lane Drive Unit 1010, Key Biscayne, FL');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('9301 EVERGREEN DRIVE, PARMA, OH', '9301 Evergreen Dr, Parma, OH');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('1817 BERTRAND DR, LAFAYETTE, LA', '2924 Polo Ridge Ct, Charlotte, NC');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('201 E 87TH ST, NEW YORK, NY', '201 E 87th Street 3E, New York, NY');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('799 CARRIGAN AVE, OVIEDO, FL', '799 Carrigan Avenue, Oviedo, FL');
​
INSERT INTO Addresses (LINE1, LINE2)
VALUES ('4014 CADDIE DRIVE, ACWORTH, GA', '10617 WEYBRIDGE DR, TAMPA, FL');

Let’s use the LEVENSHTEIN function to calculate a ratio of difference and determine which addresses are equal.

SELECT
  Line1,
  Line2,
  LEVENSHTEIN(UPPER(line1),UPPER(line2)) as distance,
  LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) AS ratio
FROM Addresses

</ tr>

Results: 7 rows
# line1 line2 distance ratio
1 9128 LEEWARD CIR, INDIANAPOLIS, IN 9128 Leeward Circle, Indianapolis, IN 3 0.08…
2 101 OCEAN LANE DRIVE, KEY BISCAYNE, FL 101 Ocean Lane Drive Unit 1010, Key Biscayne, FL 10 0.20…
3 9301 EVERGREEN DRIVE, PARMA, OH 9301 Evergreen Dr, Parma, OH 3 0.09…
4 1817 BERTRAND DR, LAFAYETTE, LA 2924 Polo Ridge Ct, Charlotte, NC 23 0.69…
5 201 E 87TH ST, NEW YORK, NY 201 E 87th Street 3E, New York, NY 7 0.20…
6 799 CARRIGAN AVE, OVIEDO, FL 799 Carrigan Avenue, Oviedo, FL 3 0.09…
7 4014 CADDIE DRIVE, ACWORTH, GA 10617 WEYBRIDGE DR, TAMPA, FL 22 0.73…

Finally, lets add a where clause that checks for a ratio of sameness less than 0.5 to identify the addresses that are the same.

SELECT
  Line1,
  Line2,
  LEVENSHTEIN(UPPER(line1),UPPER(line2)) as distance,
  LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) AS ratio
FROM Addresses
WHERE LEVENSHTEIN(UPPER(line1),UPPER(line2))::decimal / GREATEST(length(line1), length(line2)) &lt; .5

This code filters out the two where columns 1 and 2 are not the same (too far apart, according to the algorithm). Above, these are the ones with a ration higher than 0.5 (i.e., rows 4 and 7).

</ tr>

Results: 5 rows
# line1 line2 distance ratio
1 9128 LEEWARD CIR, INDIANAPOLIS, IN 9128 Leeward Circle, Indianapolis, IN 3 0.08…
2 101 OCEAN LANE DRIVE, KEY BISCAYNE, FL 101 Ocean Lane Drive Unit 1010, Key Biscayne, FL 10 0.20…
3 9301 EVERGREEN DRIVE, PARMA, OH 9301 Evergreen Dr, Parma, OH 3 0.09…
5 201 E 87TH ST, NEW YORK, NY 201 E 87th Street 3E, New York, NY 7 0.20…
6 799 CARRIGAN AVE, OVIEDO, FL 799 Carrigan Avenue, Oviedo, FL 3 0.09…

You can see the example above in this SqlFiddle.

Finally…

If you have any questions about this, please reach out to [email protected].

Focus Mode

Contact Request

Close

We will call you right away. All information is kept private