Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MinHash output is not mergeable with hash sizes under 64 #169

Closed
andimiller opened this issue Oct 29, 2023 · 3 comments
Closed

MinHash output is not mergeable with hash sizes under 64 #169

andimiller opened this issue Oct 29, 2023 · 3 comments

Comments

@andimiller
Copy link

andimiller commented Oct 29, 2023

Describe the bug
The output of MinHash.compute cannot be merged with another output on hash sizes of under 64

To Reproduce

import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;

import java.util.Arrays;
import java.util.stream.IntStream;

public class MinHashBugRepro {

    public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
        PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
        byte[] output = pah.create(components);
        IntStream.range(0, components).forEach(idx ->
            pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx)))
        );
        return output;
    }

    public static void main(String[] args) {
        SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
        SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
        Hasher64 hasher = Hashing.wyhashFinal4();

        Long hello = hasher.hashCharsToLong("hello");
        Long world = hasher.hashCharsToLong("world");

        // when using a 32-bit hash, this merge function doesn't work
        byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
        byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
        byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged32 = merge(10, 32, one32, two32);
        System.out.println(Arrays.equals(merged32, three32));

        // when using a 64-bit hash this merge function does work
        byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
        byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
        byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged64 = merge(10, 64, one64, two64);
        System.out.println(Arrays.equals(merged64, three64));
    }
}

Expected behavior
I expect the data stored in the byte[]s to be mergeable, even if the library does not include a merge function

Additional context
I've tried writing different merge functions where I decode it signed, unsigned, swap the byte order, and have been unable to find a version that would work with the 32-bit output.

Possible fix
I am reasonably sure that this is because of this section:

          if (hash < work[i]) {
            work[i] = hash;
          }

and that it would work if it used an unsigned lessthan, rather than the signed one, or if it masked before comparing

Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
SuperMinHash merges fine with this merge method on all bit sizes, it does not work on SimHash or FastSimHash

@andimiller andimiller added the bug Something isn't working label Oct 29, 2023
@oertl
Copy link
Member

oertl commented Oct 29, 2023

Merging is not supported and should also not work for SuperMinHash signatures.

@oertl oertl removed the bug Something isn't working label Oct 29, 2023
@andimiller
Copy link
Author

correction, it doesn't work on SuperMinHash, my tests weren't thorough enough

@andimiller
Copy link
Author

closing this to leave #170 since it's not currently supported

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants