cosine_similarity(x,y) #11

bkieler · 2016-12-06T03:39:22Z

I attempted to apply the method to clustering tweets. I may be misunderstanding how this works, but running it with cosine_similarity(matrix name) only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I received memory errors. I worked around this by following the documentation here, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, and calling cosine_similarity(matrix[len - 1], matrix), which I found in another example elsewhere that I have since lost.

Is there a reason your code runs it without passing the x and y separately?

brandomr · 2016-12-06T14:54:41Z

@bkieler As for the specific question, you can test this with a basic example:

from sklearn.metrics.pairwise import cosine_similarity

ar1 = [0,3,4,1,3,5]
ar2 = [1,2,4,3,1,3]

# cosine_similarity expects 2D inputs, so wrap each vector in a list
print(cosine_similarity([ar1], [ar2]))

returns [[ 0.87773382]] and if you add

ar3 = [[0,3,4,1,3,5],[1,2,4,3,1,3]]
print(cosine_similarity(ar3))

you are returned a new array:

array([[ 1.        ,  0.87773382],
       [ 0.87773382,  1.        ]])

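The difference is that with two arguments every row of the first input is compared with every row of the second, while with a single argument every row is compared with every other row of the same input, so you get back a square array. Functionally, the calculation is the same.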
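A small sketch of the two call forms mentioned in this thread, reusing the two example vectors. The names X, pairwise and last_vs_all are illustrative only; the point is the shape of what comes back.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two sample vectors stacked as rows of one 2D array
X = np.array([[0, 3, 4, 1, 3, 5],
              [1, 2, 4, 3, 1, 3]])

# One argument: every row against every row, giving a square (n, n) array
pairwise = cosine_similarity(X)

# Two arguments: only the last row against every row, giving a (1, n) array
last_vs_all = cosine_similarity(X[-1:], X)

print(pairwise.shape)     # square result
print(last_vs_all.shape)  # single-row result
```

The second form is far cheaper in memory, but it only holds the similarities for one row, which is the last row of the square result.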

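The OOM errors you are seeing are likely down to the size of the problem. The tfidf matrix is massive, and cosine_similarity(matrix) returns a dense n x n array, which for 150,000 tweets means 150,000 x 150,000 values (roughly 180 GB at 8 bytes each). Reducing the max_features allowed in the TfidfVectorizer parameters may help on the input side, but it does not shrink that output. scikit-learn tries to run this calculation in memory and you are simply running out of it. If you want to operate on a dataset this large, you might need a computing cluster and something like Spark MLlib.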
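To make the memory point concrete, here is a rough sketch. It assumes the similarity result is stored as a dense float64 array (8 bytes per value) and that, as a hypothetical goal, only the few most similar rows per tweet are needed, so the rows can be processed in chunks. The helper names and the toy tweets are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def dense_similarity_gigabytes(n_rows, bytes_per_value=8):
    """Approximate size of a dense n_rows x n_rows similarity array."""
    return n_rows * n_rows * bytes_per_value / 1e9


def top_k_similar(matrix, k=3, chunk_size=1000):
    """For each row, indices of the k most similar other rows.

    Only a (chunk_size, n) block of similarities is held at a time,
    instead of the full (n, n) array.
    """
    n = matrix.shape[0]
    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = cosine_similarity(matrix[start:stop], matrix)
        # Push each row's similarity with itself to the end of the ranking
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        result[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return result


# Toy stand-in for the tweets
tweets = ["cats are great", "cats are cute",
          "dogs are loyal", "dogs are great pets"]
tfidf = TfidfVectorizer().fit_transform(tweets)
neighbors = top_k_similar(tfidf, k=2, chunk_size=2)

print(dense_similarity_gigabytes(500))
print(dense_similarity_gigabytes(150_000))
print(neighbors)
```

Chunking bounds peak memory at roughly chunk_size x n values, but it only helps if the full n x n array is never needed at once.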

GlorianY · 2017-02-26T20:10:33Z

Hi!

I also have a similar problem. My TF-IDF matrix is huge, so I tried the workaround suggested by @bkieler, that is, using cosine_similarity(matrix[len - 1], matrix).

However, this leads to a problem when visualizing the data, specifically in the line "pos = mds.fit_transform(dist)". The problem is that "dist" has to be an array. Because of the workaround mentioned above, "dist" ends up as a value instead of an array.

The question is: how should I modify the code (i.e. dist) to work with the workaround?
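One possible direction, sketched under the assumption that visualizing a random sample of rows is acceptable: comparing a single row with the matrix gives a (1, n) result, whereas MDS with precomputed dissimilarities needs a square array, so the sketch builds a square "dist" from a sample only. The toy documents, sample_size and the random seed are hypothetical, and the MDS parameter names may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for a large corpus
docs = ["cats are great", "cats are cute", "dogs are loyal",
        "dogs are great pets", "birds can sing", "birds are loud"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# The workaround compares one row with all rows: shape (1, n), not square
one_row = cosine_similarity(tfidf_matrix[-1:], tfidf_matrix)

# Assumption: a random sample of rows is enough for the plot
rng = np.random.default_rng(0)
sample_size = 4
sample_idx = np.sort(rng.choice(tfidf_matrix.shape[0],
                                size=sample_size, replace=False))
sample = tfidf_matrix[sample_idx]

# Square (sample_size, sample_size) distances; clip tiny negative round-off
dist = np.clip(1 - cosine_similarity(sample), 0, None)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # one 2D coordinate per sampled row

print(one_row.shape, dist.shape, pos.shape)
```

The sampled rows can then be plotted as before, keeping sample_idx to look up the matching titles or labels.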
