Get top n words that are nearest to cluster centroid #16

Open
fkolokathi opened this issue Nov 22, 2017 · 4 comments

@fkolokathi

I cannot understand how taking the indices of the words with the highest tf-idf values in each cluster center gives you the top words that are nearest to the cluster centroid. Also, is the cluster centroid simply the center of each cluster?

@PabloRR100

I have another question along the same lines.
If you are clustering synopses (and therefore films), the centroid should represent a "fake" film, not a fake word. The points closest to the center should be the closest films, not the closest words, right?

@brandomr
Owner

@fkolokathi @PabloRR100 apologies, I haven't had a chance to look back at this in quite some time. Regarding @fkolokathi's question: I'm not sure what, beyond words, would make up the cluster centroid. As @PabloRR100 points out, the centroid is really a "fake film synopsis", not a fake word.

@PabloRR100 I think you're correct if my memory serves. Do you have any suggestions for how things could be improved for clarity?
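
To make the "fake film synopsis" point concrete, here is a minimal sketch in plain Python. It is not the code from this repo: the vocabulary, the tf-idf values, the cluster labels, and the helper names (`centroid`, `top_terms`, `nearest_docs`) are all made up for illustration. The centroid is the mean of the member documents' tf-idf vectors, so it has one weight per word. Sorting those weights gives the words the "fake synopsis" weights most heavily, while measuring distance to the centroid ranks documents, not words.

```python
import math

# Hypothetical toy data: rows are documents, columns are vocabulary terms.
vocab = ["love", "war", "ship", "alien"]
tfidf = [
    [0.9, 0.1, 0.0, 0.0],  # doc 0
    [0.7, 0.3, 0.0, 0.0],  # doc 1
    [0.0, 0.1, 0.6, 0.8],  # doc 2
    [0.0, 0.0, 0.9, 0.5],  # doc 3
]
labels = [0, 0, 1, 1]  # assumed cluster assignment, e.g. from k-means


def centroid(rows):
    """Mean of the member vectors: one weight per vocabulary term."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_terms(center, n):
    """Terms with the largest weights in the centroid vector."""
    order = sorted(range(len(center)), key=lambda j: center[j], reverse=True)
    return [vocab[j] for j in order[:n]]


def nearest_docs(center, n):
    """Indices of the n documents closest to the centroid (Euclidean)."""
    order = sorted(range(len(tfidf)), key=lambda i: math.dist(tfidf[i], center))
    return order[:n]


# One centroid per cluster, keyed by cluster label.
centers = {
    k: centroid([row for row, lab in zip(tfidf, labels) if lab == k])
    for k in sorted(set(labels))
}

for k, center in centers.items():
    print(k, top_terms(center, 2), nearest_docs(center, 2))
```

So "top words per cluster" here means the highest-weighted dimensions of the centroid, which is a different thing from the points nearest to it.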

@PabloRR100

Thank you so much for replying, @brandomr.
I am trying to wrap my head around this because I have a bunch of documents that I want to cluster and then plot a word cloud of the most relevant words for each cluster. So it is essentially the same use case. Until now I was using this "closeness" to the center as the importance weight for the word cloud.

What do you think about scoring only the words that appear in a cluster's documents, and using that score as their importance in the word cloud? For example, I could take the k words with the highest IDF, or use some metric that combines the average TF across the documents with the IDF.
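
A rough sketch of the second option (average TF times IDF) might look like the following. Everything here is an assumption for illustration: the tokenized documents, the labels, the function names, and the particular smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, which is just one common variant.

```python
import math
from collections import Counter

# Hypothetical tokenized documents and their cluster labels.
docs = [
    ["space", "ship", "alien", "ship"],
    ["alien", "planet", "space"],
    ["love", "wedding", "love"],
    ["love", "family", "space"],
]
labels = [0, 0, 1, 1]


def idf(docs):
    """Smoothed inverse document frequency over the whole corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def cluster_scores(docs, labels, cluster, idf_weights):
    """Score each term by (mean relative TF within the cluster) * IDF."""
    members = [doc for doc, lab in zip(docs, labels) if lab == cluster]
    mean_tf = Counter()
    for doc in members:
        for term, count in Counter(doc).items():
            # Relative frequency in this document, averaged over the cluster.
            mean_tf[term] += count / len(doc) / len(members)
    return {t: tf * idf_weights[t] for t, tf in mean_tf.items()}


weights = idf(docs)
scores = cluster_scores(docs, labels, 1, weights)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

The resulting `scores` dictionary maps each word to a weight, which is the kind of input a word cloud tool could use for sizing. In this toy data "space" scores low for cluster 1 because it shows up in most documents, which is the effect I am hoping the IDF part gives me.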

@brandomr
Owner

@PabloRR100 I think that makes sense. I'd definitely spot-check the results to make sure that what you are seeing is actually logical.

You might check out this paper on vennclouds and the associated repo that automatically generates dynamic word clouds comparing documents. That methodology might be useful for you.
