Finding fuzzy string duplicates


This post has NOT been accepted by the mailing list yet.
Hello everyone,

I am working on cleaning a large panel data set.  I have done some cleaning and now have a list of unique company names.  However, inconsistent spellings still split some companies into separate cases, which I would like to match and consolidate under the same name/id.

I initially tried using reclink to match the data to itself.  However, reclink then returns only perfect matches, since each entry matches perfectly with itself.  What I would really like is the second-best (non-perfect) match for each entry, in order to identify misspelled duplicates.  To my knowledge, reclink doesn't have an option for ignoring the perfect matches.
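To make the goal concrete, here is a minimal sketch of the behaviour I am after, written in Python rather than Stata purely for illustration.  It is not how reclink works.  The function names, the sample company names, and the 0.85 threshold are all made up for this example.  Each name is compared only with the other names in the list, so the trivial self-matches never appear.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    # Hypothetical cleanup step: lowercase, drop periods and commas,
    # and collapse repeated whitespace.
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def near_duplicates(names, threshold=0.85):
    # combinations() yields each unordered pair of distinct list entries
    # once, so no entry is ever compared with itself.
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])


# Illustrative data only; the threshold is an assumption to tune by hand.
names = ["Acme Corporation", "Acme Corporaton", "Globex Inc.", "Globex Inc", "Initech"]
for a, b, score in near_duplicates(names):
    print(a, "|", b, "|", score)
```

This compares every pair, so the work grows quadratically with the number of names; a very long list would need some form of blocking first (for example, comparing only names that share a first letter).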

Is there a way to do this with reclink?  Or is there another method that would work better?