cosine_similarity(x,y) #11
Comments
@bkieler As for the specific question, you can test this with a basic example:
returns
you are returned a new array:
The difference is that you get back a new NumPy array in the latter case, but functionally the calculation is the same. The OOM errors you are seeing are likely because the TF-IDF matrix is massive. You might need to reduce the
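For anyone who wants to check the equivalence themselves, here is a minimal sketch. The matrix `X` is made up for illustration and is not the data from this issue; the only library call used is `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 3x3 matrix standing in for a (much larger) TF-IDF matrix.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [4.0, 1.0, 0.0]])

# One argument: every row is compared against every other row of X.
sim_one_arg = cosine_similarity(X)

# Two arguments: rows of the first matrix against rows of the second.
sim_two_arg = cosine_similarity(X, X)

# Both calls should give the same 3x3 result.
same = np.allclose(sim_one_arg, sim_two_arg)

# Passing only the last row gives a 1x3 array: that row against all rows.
last_row = cosine_similarity(X[-1:], X)

print(same, sim_one_arg.shape, last_row.shape)
```

Passing a single row as the first argument only computes one row of the full similarity matrix, which is why it uses far less memory.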
Hi! I have a similar problem. My TF-IDF matrix is huge, so I tried the workaround suggested by @bkieler, that is, calling cosine_similarity(matrix[len - 1], matrix). However, this causes a problem when visualizing the data, specifically in the line "pos = mds.fit_transform(dist)". Here "dist" has to be an array, but with the workaround above "dist" comes back as a value instead of an array. How should I modify the code (i.e. dist) to work with the workaround?
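A rough sketch of what the shapes look like may help here. The documents, the sample size, and the idea of subsampling rows are all assumptions for illustration, not something the project's code does. As far as I understand, an MDS fit on precomputed distances expects a square n x n matrix, whereas the single-row workaround produces a 1 x n array.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the real documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell today",
    "markets rallied after the news",
    "the news covered the stock rally",
]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, shape (6, n_terms)
n = tfidf.shape[0]

# The workaround: last row against all rows. Result is 1 x n, not n x n.
row_sim = cosine_similarity(tfidf[n - 1], tfidf)

# One possible way to get a square matrix that still fits in memory:
# take a random sample of rows (size 4 is arbitrary) and compare them
# with each other only.
rng = np.random.default_rng(0)
sample_idx = np.sort(rng.choice(n, size=4, replace=False))
sample = tfidf[sample_idx]

# Convert similarity to distance; clip tiny negatives from rounding.
dist = np.clip(1 - cosine_similarity(sample), 0, None)

print(row_sim.shape, dist.shape)
```

The square `dist` built from the sample is the kind of input that "pos = mds.fit_transform(dist)" can work with, at the cost of plotting only the sampled rows.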
I attempted to apply the method to clustering tweets. I may be misunderstanding how this works, but running it with cosine_similarity(matrix name) only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I received memory errors. I followed the documentation here, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, and changed the call to cosine_similarity(matrix[len - 1], matrix), which I found in another example elsewhere that I have since lost.
Is there a reason your code runs it without passing the x and y separately?
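To put a number on the memory problem: a dense 150,000 x 150,000 matrix of 8-byte floats needs roughly 168 GiB. The sketch below does that arithmetic and shows one possible way around it, computing similarities in row chunks and keeping only the top-k neighbours per row. The helper names `dense_similarity_gib` and `top_k_similar`, and the values of `k` and `chunk_size`, are hypothetical and chosen for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def dense_similarity_gib(n_rows, bytes_per_value=8):
    """Size in GiB of a dense n_rows x n_rows similarity matrix."""
    return n_rows * n_rows * bytes_per_value / 2**30


def top_k_similar(matrix, k=3, chunk_size=2):
    """Indices of the k most similar other rows for each row.

    Only a (chunk_size x n) block is held in memory at any time.
    """
    n = matrix.shape[0]
    top_idx = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Similarity of this chunk of rows against all rows.
        block = cosine_similarity(matrix[start:stop], matrix)
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Keep the k largest similarities per row.
        top_idx[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return top_idx


# Made-up data: 5 rows, 4 features.
X = np.random.default_rng(0).random((5, 4))
neighbours = top_k_similar(X, k=3, chunk_size=2)
full_size = dense_similarity_gib(150_000)
print(round(full_size, 1), neighbours.shape)
```

For real data, `chunk_size` would be set as large as memory allows, since each block is `chunk_size` x n values.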