
Explanation in README.md is somewhat incomplete. #24

Open

Fazel94 opened this issue May 28, 2016 · 12 comments

Comments

@Fazel94

Fazel94 commented May 28, 2016

How can I actually use this library to detect duplicate documents?
The insert function only accepts integers, so I suppose I should implement MinHash myself (is that right?) and then use your simhash library to detect near-duplicates.
How can I connect this to the simhash-db you provided? Is that something users have to implement themselves?
Sorry for being a noob in the near-duplicate detection business 😄, and thank you for sharing your work with us 👍

@darentanDZ

+1, same question!

@rth
Contributor

rth commented Jan 25, 2018

See this blog post, https://moz.com/devblog/near-duplicate-detection/, for a more general introduction, and feel free to make a PR to improve the README.

Here is an example of an estimator that uses simhash-py for near-duplicate detection; it takes as input a sparse CSR matrix created by scikit-learn's CountVectorizer or HashingVectorizer. I'm not saying there aren't better ways to use it, though.
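For anyone wondering what that looks like in practice, here is a minimal sketch (not rth's estimator, just an illustration of the same idea) that turns each row of a CountVectorizer matrix into one simhash. The hash64 helper is hypothetical, and it assumes simhash-py's compute accepts an iterable of unsigned 64-bit integers:

import hashlib

from simhash import compute
from sklearn.feature_extraction.text import CountVectorizer

def hash64(s):
    # Hypothetical helper: any stable unsigned 64-bit hash will do;
    # here, the first 8 bytes of an MD5 digest.
    return int.from_bytes(hashlib.md5(s.encode('utf-8')).digest()[:8], 'big')

docs = ["the quick brown fox jumps", "the quick brown fox leaps"]
X = CountVectorizer().fit_transform(docs)  # sparse CSR matrix

# One simhash per document, built from the hashed column indices of that
# row's nonzero terms (a plain bag of words; term counts are ignored).
simhashes = [
    compute([hash64(str(j)) for j in X.indices[X.indptr[i]:X.indptr[i + 1]]])
    for i in range(X.shape[0])
]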

@dlecocq

dlecocq commented Jan 25, 2018

I think the most confusing thing about how simhashing is discussed is that there are two aspects to it, and the term 'simhash' is often used to refer to both: (1) a function that generates a simhash from some input, and (2) methods for identifying near-duplicates in a set of simhashes.

To the first aspect: the input to the simhash function is a vector of hashes representative of the entity. There aren't really constraints on the size of this vector, but how it is generated has the most impact on the quality of the near-duplicates found. The Moz blog post that @rth referenced describes a number of ways in which that vector might be generated; hashing a rolling window of words is a pretty reasonable one (see the sketch below).
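To make the rolling-window idea concrete, here is a minimal, dependency-free sketch of that feature-extraction step; the shingles name and the window default are just for illustration:

def shingles(tokens, window=4):
    # Yield every run of `window` consecutive tokens, joined into one
    # string; each string would then be hashed to an unsigned 64-bit int.
    for i in range(max(1, len(tokens) - window + 1)):
        yield ' '.join(tokens[i:i + window])

# list(shingles(['a', 'b', 'c', 'd', 'e']))
# -> ['a b c d', 'b c d e']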

To the second aspect: simhash provides a way to identify all the pairs of simhashes that are 'close enough' to be near-duplicates. This was the most fun and interesting part to code up, but it is agnostic to how the simhashes were generated, so it has no impact on the quality of the matches found.

It is also important to note that, when running the near-duplicate check, all the input simhashes must have been generated with the same method for the results to be meaningful. For example, a simhash generated from a bag of words and one generated from shingles are not comparable.

I will try very hard to find some time to expand the README for this repo, since I, too, find it to be less than useful.

@bikashg

bikashg commented Apr 23, 2018

@dlecocq: I think the first aspect you described is about generating the minhash of a document (not its simhash, as you said). The second aspect is about taking the minhashes of the documents and predicting whether they are similar. "Simhash" is the whole process, which should cover both of the aspects you described.

Given that this library is named "simhash", I expected to feed it a list of documents and get their pairwise similarity as output. But it seems that this library only does the second aspect and leaves the first part to the user.

BTW, the blog post link mentioned by @rth is dead. Does anyone have an updated link?

@rth
Contributor

rth commented Apr 23, 2018

Yes, the original link to the seomoz blog now returns ERR_SPDY_PROTOCOL_ERROR for me as well.

Here is the cached url.

@dlecocq

dlecocq commented Apr 23, 2018

@bikashg - making an effective simhash for a document depends greatly on the type of documents used and on which aspect of the documents is meant to be matched, so a method that works well in all situations is not practical. With that said, there is a top-level shingle helper that can help quite a bit:

import re
from simhash import compute, shingle, unsigned_hash

# If `document` is a string with just words, this provides a VERY basic
# tokenization and subsequent simhash. Note the raw string for the regex,
# and the explicit list() for Python 3, where map() is a lazy iterator.
tokens = re.compile(r'\W+').split(document)
simhash_value = compute(list(map(unsigned_hash, shingle(tokens))))

With regard to 'only' performing the last simhash bit, this library also provides the ability to find all the matches once the simhashes are generated.
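As a rough sketch of that matching step, building on the snippet above, and assuming I recall simhash-py's find_all signature correctly (a list of 64-bit simhashes, the number of blocks to split each hash into, and the maximum number of differing bits, with blocks > bits):

import re
from simhash import compute, find_all, shingle, unsigned_hash

def simhash_of(text):
    # Same basic tokenize-shingle-hash pipeline as the snippet above.
    tokens = re.compile(r'\W+').split(text)
    return compute(list(map(unsigned_hash, shingle(tokens))))

# All simhashes come from the same generation method, per the note above.
docs = [
    'the quick brown fox jumps over the lazy dog',
    'the quick brown fox jumped over the lazy dog',
    'an entirely different sentence about simhashing',
]
simhashes = [simhash_of(d) for d in docs]

# Pairs of input hashes differing in at most 3 of 64 bits, using 6 blocks.
matches = find_all(simhashes, 6, 3)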

@bikashg

bikashg commented Apr 23, 2018

@dlecocq: Thanks for the explanation. Yes, and I also went through the test cases in test.py, which further helped me understand the whole pipeline.
My use of the word 'only' was probably harsh; sorry for that, I didn't mean it that way. Cheers for putting in the effort to build this library.

@b4hand
Contributor

b4hand commented Apr 23, 2018

The link to https://moz.com/devblog/near-duplicate-detection/ still works for me as of this morning. Maybe it was a transient failure?

@rth
Contributor

rth commented Apr 23, 2018

@b4hand For me it also works with wget and Firefox 57, but not Chrome 59 (see the error above); probably some SPDY-related server misconfiguration.

@b4hand
Contributor

b4hand commented Apr 23, 2018

As far as I know, we're not using SPDY. I've reported the issue to the team in charge of that page, so it should get fixed. Thanks for letting us know.

@dlecocq

dlecocq commented Apr 23, 2018

@bikashg - no worries. Maybe we could add a 'quickstart' section to the README? I'd welcome a PR, as it's sometimes hard to write good onboarding documentation after having worked on all the internals of a library.

@b4hand - https://moz.com/devblog/near-duplicate-detection/ does not work for me either, using Chrome 66.0.3359.117 on macOS 10.13.4.

@b4hand
Contributor

b4hand commented Aug 2, 2018

FYI, the devblog link should be working again for Chrome.
