-
-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clustering methods #136
Comments
Thanks for your interest, and for these libraries. I really respect what you're doing. There are two main use cases that I have used clj-fuzzy heavily for:
Re: Language, this was all in English, but with some non-dictionary words (e.g., made up attributes and company names). |
First of all, are you trying to find name matching fuzzily than a human operator will explicitly decide to harmonize here or are your scripts taking the decision on their own according to some arbitrary threshold because precision might not be such a critical issue for later computations/aggregations? Anyhow, one way to speed up the process (but maybe you already used this method somehow), is to aggressively normalize strings using a fingerprint method. The idea is to:
It's the general idea but you can easily adapt the method to better fit your data. Then, you can just join strings in the same clusters if they have the same fingerprint. Just use a multimap or equivalent for that. It works in linear time, is very cheap to compute and usually has good precision. |
One other way, if you need to keep distance computations because of typos etc., is either to perform some kind of clever grouping of the strings before computing pairwise computations to reduce the bounds of the quadratic execution (this is called blocking, or sorted neighborhood) or to use data structures able to perform efficient nearest neighbor queries in metric space (a Vantage Point tree for instance). Exemple: you can group your strings by matching 6-gram and then perform pairwise computations in each group, reducing greatly needed computations since you won't compare obviously different strings. You can see a lot of those methods implemented in the library here. I can explain one of those methods better if you want me to. Also, there is another method that run in linear time that approximates Jaccard coefficient pairwise computations using a technique called minhashing. I can also explain it better if you so desire. |
I have done both. When the output was too big to possibly check each entry by hand I would spot check a few hundred samples and then conservatively set a threshold based on what I saw. I also once had a case where I knew approximately how many matches/clusters I wanted and set the threshold automatically to achieve that. Thank you for the explanation of fingerprinting. I see how if typos (e.g., transposed letters) are not a factor, then fingerprinting can reduce the geometric time complexity. I will keep this technique in mind for future cases. Thank you also for the cogent explanation of 'blocking'... I had definitely considered this, but didn't know the term or any of the methods. I had thought of first blocking by lower case first letter or first-two-letter pairs (assuming that there are no typos there exactly). Blocking cuts the number of pairwise comparisons dramatically. And I also appreciate the pointer to minshash, we use LSH in other contexts, but that particular technique I had not seen. Thank you kindly for all this great information. If I can ever be of service to you, let me know. |
Using 5/6/7-grams in your case should yield better precision. There are some papers out there that demonstrate why but I cannot find the references easily. Blocking by first letter or letters is also a good pick since it was demonstrated that typos very rarely occur on first letters (but this is more true for dictionary words than for names).
Just come back to me if you find some other ways to cut computations or enhance precisions of the matching algorithms etc. Record linkage is kind of my pet peeves and I love finding new things. |
When you say |
For |
There is also a scheme called skip-grams that is becoming quite popular. But more expensive to compute obviously. But a 1-skip -2/3-grams could yield good results with your company names' word tokens. |
Hello @harold. Can you tell me a bit more about the strings you are matching in your use-cases? Length? Language etc.
The text was updated successfully, but these errors were encountered: