HyperMinHash Jaccard similarity calculation #17

asteele-quantifind · 2020-10-06T17:49:24Z

The HyperMinHash Jaccard similarity calculation appears to only compare the "mantissa" portion of the register to increase c, rather than the whole register. The algorithm 2.1.4 in the paper seems to compare the full tuple to increase c. Is there a reason why only the mantissa is compared in your implementation?

yunwilliamyu · 2020-10-06T17:59:54Z

Hi there; author of the original paper here.

Honestly, the reason that we compared the whole register in the paper is because it's easier to analyze mathematically. In practice, for a nontrivial mantissa size, comparing only the mantissas should be sufficient, and makes the Jaccard similarity calculation basically the same as using b-bit MinHash. It doesn't actually change much in practice, unless you have super silly parameters.

asteele-quantifind · 2020-10-06T18:11:21Z

Thanks. What would smallest nontrivial mantissa size be? I can imagine if the mantissa was much greater than the log-log part, the increase in false positives would be minimal, but I don't have a good idea about what the tipping point would be.

yunwilliamyu · 2020-10-06T18:48:01Z

Depends on the number of buckets, the error range you want, and the Jaccard similarity you care about.

I think the best way to think about it is to look at the error analysis for b-bit MinHash:
https://arxiv.org/abs/0910.3349
Even 1-bit MinHash even works for some applications, as the authors point out, but that'd be a bit silly coupled with HyperMinHash for certain other regimes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HyperMinHash Jaccard similarity calculation #17

HyperMinHash Jaccard similarity calculation #17

asteele-quantifind commented Oct 6, 2020

yunwilliamyu commented Oct 6, 2020

asteele-quantifind commented Oct 6, 2020

yunwilliamyu commented Oct 6, 2020

HyperMinHash Jaccard similarity calculation #17

HyperMinHash Jaccard similarity calculation #17

Comments

asteele-quantifind commented Oct 6, 2020

yunwilliamyu commented Oct 6, 2020

asteele-quantifind commented Oct 6, 2020

yunwilliamyu commented Oct 6, 2020