Am I doing something wrong here #5

Hi,

I just went over your document clustering tutorial and it is really amazing! Great work!

I am trying to conduct a clustering of e-mails, so I have been altering the code a bit to fit my purpose. When I print the words in each cluster, I get the same word reiterated in the same cluster (cluster 0: word1, word2, word1, word3, word4, etc.), or the same word appears in two or more clusters.
Thanks @ouverz!

Think of the dataframe as a dictionary that maps each stem to all of the original words that reduce to it. If you use this dictionary to look up a stem, you can get back several different surface words, and two different stems can resolve to the same word. That's why a word can repeat within a cluster's top terms or show up in more than one cluster.

For 2. and 3. I'm not 100% sure I follow, but I'd be happy to take a look at the code if you can post a gist along with a sample of what the data looks like. Let me know if this helps!
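To make the dictionary analogy concrete, here's a minimal sketch of that kind of stem-to-word lookup (the stems, words, and variable names are made up for illustration):

```python
import pandas as pd

# Made-up vocabulary for illustration: several surface words share a stem.
stems = ['cluster', 'cluster', 'cluster', 'mail', 'mail']
words = ['clustering', 'clusters', 'clustered', 'mails', 'mailing']

# A lookup table indexed by stem, holding the original words.
vocab_frame = pd.DataFrame({'words': words}, index=stems)

# One stem fans out to every word that produced it:
print(vocab_frame.loc['cluster']['words'].tolist())
# -> ['clustering', 'clusters', 'clustered']
```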
Thanks for the info @brandomr. Here is the implementation of the clustered words, both stemmed & tokenised as well as tokenised only:

Non-stemmed:

Please find the code and sample data below:

Sample data:
@ouverz sorry for the delay. Looking at your sample and the resultant clusters, it looks like you have pretty homogenous documents, which will have significant overlap. Your clusters are going to be impacted by the parameters you provide for the vectorizer, so you might want to tune the min_df and max_df thresholds.
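For instance, a minimal sketch with scikit-learn's TfidfVectorizer (the corpus and threshold values here are just placeholders to experiment with):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; substitute your e-mail texts.
documents = [
    "meeting agenda for the quarterly review",
    "quarterly review notes and action items",
    "lunch order for the friday team meeting",
    "friday team lunch rescheduled to next week",
]

# max_df drops terms that appear in too large a share of documents
# (shared boilerplate); min_df drops terms that appear in too few.
tfidf = TfidfVectorizer(max_df=0.8, min_df=0.1, stop_words="english")
X = tfidf.fit_transform(documents)

print(X.shape)                        # (n_documents, n_surviving_terms)
print(tfidf.get_feature_names_out())  # see which terms the thresholds kept
```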
Thanks for your response. I apologise for getting back to this only now. What is happening with my model and data is quite odd: while the data seems homogenous, somehow unsupervised methods produce a near-perfect separation between two classes, which is mind-boggling at this point. One point that stands out is the sheer drop in features, from over 1000 to under 50, when I raise min_df from 0.1 to 0.2. Then the accuracy goes from under 70% to over 90%. Thanks a lot!
That does sound pretty intriguing. As for the number of features dropping when you increase min_df: a float min_df is a proportion of documents, so at 0.2 any term appearing in fewer than 20% of your documents gets discarded, and in a homogenous corpus that can wipe out most of the vocabulary.

As far as feature importance goes: with kmeans the top terms for a cluster are actually the terms nearest the centroid, so they are the "most important", or at least most associated with the cluster. Outside this, I'll have to think about how you might get feature importance from unsupervised methods. If you're using a supervised method like SVM or NB you can get feature importance from the classifier coefficients or weights. For an SVM with sklearn it's something like:
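(Rough sketch with a made-up labelled corpus; the data and variable names are illustrative.)

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up labelled corpus; substitute your e-mails and class labels.
texts = [
    "free prize claim your reward now",
    "exclusive offer click to claim your prize",
    "meeting moved to monday afternoon",
    "agenda attached for monday's meeting",
]
labels = [1, 1, 0, 0]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)

svm = LinearSVC()
svm.fit(X, labels)

# For a linear SVM each coefficient weights one term; the largest
# absolute values are the features that drive the decision hardest.
terms = tfidf.get_feature_names_out()
order = np.argsort(np.abs(svm.coef_[0]))[::-1]
for i in order[:5]:
    print(terms[i], svm.coef_[0][i])
```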
If you look around you can find explanations elsewhere of how to actually interpret the output, or you could check the math behind sklearn (if you're feeling bold).