cosine_similarity(x,y) #11

Open
bkieler opened this issue Dec 6, 2016 · 2 comments
bkieler commented Dec 6, 2016

I tried to apply this method to clustering tweets. I may be misunderstanding how it works, but calling cosine_similarity(matrix) with a single argument only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I got memory errors. Following the documentation at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, I switched to cosine_similarity(matrix[len - 1], matrix), which I found in another example that I can no longer locate.

Is there a reason your code runs it without passing the x and y separately?

brandomr (Owner) commented Dec 6, 2016

@bkieler as for your specific question, you can test this with a basic example:

from sklearn.metrics.pairwise import cosine_similarity

ar1 = [0,3,4,1,3,5]
ar2 = [1,2,4,3,1,3]

# each input must be 2D (one row per sample), so wrap the vectors in a list
print(cosine_similarity([ar1], [ar2]))

returns [[ 0.87773382]] and if you add

# ar3 stacks ar1 and ar2 as the two rows of one matrix
ar3 = [[0,3,4,1,3,5],[1,2,4,3,1,3]]
print(cosine_similarity(ar3))

you get back a 2x2 array:

array([[ 1.        ,  0.87773382],
       [ 0.87773382,  1.        ]])

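The difference is in what gets compared. With two arguments, each row of the first input is compared against each row of the second. With a single argument, every row of the matrix is compared against every other row, so you get back a square array with 1s on the diagonal. Functionally the calculation is the same: the off-diagonal value above is the same 0.87773382.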
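To make the one-argument versus two-argument behaviour concrete, here is a small sketch in Python 3 that reuses ar1 and ar2 from above. The names X, full, and last_row are introduced only for illustration. It also shows what a call of the form cosine_similarity(matrix[-1:], matrix) gives you: a single row of the full matrix, not the whole thing.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ar1 = [0, 3, 4, 1, 3, 5]
ar2 = [1, 2, 4, 3, 1, 3]
X = np.array([ar1, ar2])  # illustrative name: one row per vector

# One argument: every row against every row -> square (2, 2) matrix.
full = cosine_similarity(X)

# Two arguments: rows of the first input against rows of the second.
# Passing only the last row yields one row of the full matrix -> shape (1, 2).
last_row = cosine_similarity(X[-1:], X)

print(full.shape, last_row.shape)                   # prints (2, 2) (1, 2)
print(np.allclose(full, cosine_similarity(X, X)))   # one-arg form equals X vs. X
print(np.allclose(last_row, full[-1:]))             # last_row is the last row of full
```

So passing the inputs separately does not change the calculation, it only changes which rows are compared and therefore the shape of the result.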

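The OOM errors you are seeing are likely because the tfidf matrix is massive, and the pairwise result is larger still: for 150,000 tweets the dense similarity matrix has 150,000 x 150,000 entries, roughly 180 GB as 64-bit floats. You might need to reduce the max_features allowed in the TfidfVectorizer parameters, which shrinks the tfidf matrix itself. scikit-learn is trying to run this calculation in memory and you're just running out of it. If you want to operate on a large dataset you might need to use a computing cluster and something like Spark MLlib.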
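As a rough sketch of a single-machine alternative, the similarity matrix can be computed in row blocks so that only one block is held in memory at a time. The helper similarity_blocks, the block_size value, the max_features value, and the tiny docs list are all assumptions made for illustration and are not part of the original code. With 150,000 rows, a block of 1,000 rows is about 1.2 GB as 64-bit floats.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_blocks(matrix, block_size=1000):
    """Yield (start, block) pairs, where block holds the cosine similarity of
    rows start..start+block_size-1 against all rows of matrix.

    Hypothetical helper: only one (block_size x n_rows) array exists at a time.
    """
    n_rows = matrix.shape[0]
    for start in range(0, n_rows, block_size):
        stop = min(start + block_size, n_rows)
        yield start, cosine_similarity(matrix[start:stop], matrix)


# Placeholder corpus standing in for the tweets.
docs = ["the cat sat", "the dog sat", "a bird flew", "the cat flew"]

# max_features caps the vocabulary size (the value here is arbitrary).
tfidf = TfidfVectorizer(max_features=5000).fit_transform(docs)

for start, block in similarity_blocks(tfidf, block_size=2):
    # Reduce each block here (e.g. keep the top matches per row)
    # instead of storing it, otherwise memory use grows back to n x n.
    print(start, block.shape)
```

This only helps if each block is reduced or written to disk before the next one is computed. Anything that needs the whole square matrix at once is still limited by memory.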


GlorianY commented Feb 26, 2017

Hi!

I have a similar problem. My TF-IDF matrix is huge, so I tried the workaround suggested by @bkieler, that is, calling cosine_similarity(matrix[len - 1], matrix).

However, this leads to a problem when visualizing the data, specifically in the line "pos = mds.fit_transform(dist)". The problem is that "dist" has to be an array. With the workaround mentioned above, "dist" ends up as a single value instead of an array.

The question is, how should I modify the code (i.e. dist) so that it works with the workaround?
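One possible direction, sketched under two assumptions that the thread does not confirm: that "dist" in the original code is 1 - cosine_similarity(tfidf_matrix), and that "mds" is sklearn.manifold.MDS configured for a precomputed dissimilarity matrix. In that setting MDS needs a full square matrix, so a single row of similarities cannot be used. A square matrix over a random sample of documents keeps memory bounded instead. The helper sampled_distance_matrix and its parameters are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def sampled_distance_matrix(matrix, n_samples, seed=0):
    """Return (idx, dist): the sampled row indices and a square cosine
    distance matrix over those rows only. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n_rows = matrix.shape[0]
    size = min(n_samples, n_rows)
    idx = np.sort(rng.choice(n_rows, size=size, replace=False))

    dist = 1.0 - cosine_similarity(matrix[idx])
    # Clean up floating-point noise so the result is a valid
    # dissimilarity matrix: non-negative, symmetric, zero diagonal.
    dist = np.clip(dist, 0.0, None)
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    return idx, dist


if __name__ == "__main__":
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.manifold import MDS

    # Placeholder corpus standing in for the real documents.
    docs = ["the cat sat", "the dog sat", "a bird flew",
            "the cat flew", "the dog ran", "a bird sat"]
    tfidf = TfidfVectorizer().fit_transform(docs)

    idx, dist = sampled_distance_matrix(tfidf, n_samples=4)

    # Assumed MDS setup; parameter names may differ across scikit-learn versions.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
    pos = mds.fit_transform(dist)  # one 2D position per sampled document
    print(idx, pos.shape)
```

The returned idx maps each plotted point back to its original document, so labels or cluster assignments can be looked up for the sample. The plot then shows only the sampled documents, not the whole dataset.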
