SQL Server - Fuzzy logic for duplicate identification?

Asked By Yashgt on 12-Jun-08 10:37 PM
Hi,

We have a table of users with about 10000 rows. We have to import a
large text file into this table. While doing so we would like to
ensure that if a row already exists, it should not be imported. Rather
that True/false logic, we need to apply Fuzzy logic here so that if we
find similarities in the name, address, ad a few other columns, we
treat the row as a duplicate.

Such possibly duplicate rows should be identified and logged to some
table.

What transformation should I prefer?

Thanks,
Yash




Todd replied on 12-Jun-08 08:51 AM
Fuzzy Lookup is going to be your best friend in SSIS. The lookup table that
you will use will be the table of 10,000 names you already have on file. Set
the matching columns on Name, Address, City, and State (or whatever
combination applies). Add at least one Column from the lookup side. Then
downstream, have a Conditional Split Transform that checks the value of that
added column. It it has a value, then that Customer already exists. If not,
it is a new Customer. Set your pipeline to deal with the split data
accordingly.

HTH
--
Todd C

[If this response was helpful, please indicate by clicking the appropriate
answer at the bottom]
--CELKO-- replied to Yashgt on 12-Jun-08 10:37 PM
If this is address data for mailing, then get a specialized package
from Group 1, Melissa Data, etc.  They are very good at this.

Otherwise, the Levenshtein distance is pretty standard.  It measures
how many changes have to be made to convert one string to the other.
Plamen Ratchev replied to Yashgt on 12-Jun-08 08:49 AM
SQL Server has the SOUNDEX and DIFFERENCE functions:
http://msdn.microsoft.com/en-us/library/ms187384.aspx

However, the algorithm implemented is very simple and in many cases not good
enough. You may need to write your own soundex logic. Here is one algorithm
that is more advanced:
http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=519&lngWId=5

If you are performing the import via Integration Services, it does have
Fuzzy Lookup functionality:
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
http://msdn.microsoft.com/en-us/library/ms345128.aspx

HTH,

Plamen Ratchev
http://www.SQLStudio.com