Store compressed vectors in dense ByteSequence for PQVectors #370

Open · wants to merge 4 commits into main
Conversation

@michaeljmarshall (Contributor) commented Nov 23, 2024

The current design stores the compressed vectors as a list of ByteSequence objects, one per vector. That costs us 16 bytes of overhead for the ByteSequence wrapper object plus 16 more for the backing array object in ArrayByteSequence. For a 128-dimension vector, PQ compresses down to 64 bytes, so the wrapper/array overhead is 32 bytes out of 96 total, i.e., 33% of the footprint. When many of these vectors are held in memory, an extra 33% is substantial.

This PR removes the per-vector wrapper in favor of a denser representation by:

  1. Storing the vectors in an array of ByteSequence references, where each ByteSequence holds as many vectors as it possibly can.
  2. Introducing a ByteSequence#slice method for getting a slice of the whole ByteSequence. This method is intended to be lightweight (see the sketch after this list).
  3. Introducing ArraySliceByteSequence to store the offset and length information, simplifying element retrieval.
  4. Introducing a ByteSequence#offset method, used in conjunction with the get method, so that the Panama and SimdOps methods do not need to perform an array copy when passing the array to Panama.
  5. Adding assertions in the native code when we call get on a MemorySegmentByteSequence, since we're technically not passing the offset. FWIW, we could consider updating the C code to take a long offset and a long length, since those MemorySegments can exceed int length.
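
As a rough illustration of points 2-4, here is a minimal sketch of the slice/offset idea with simplified signatures (the actual PR's ByteSequence is generic and carries more methods; this is not the exact code):

// Simplified sketch; names mirror the points above.
interface ByteSequence {
    byte get(int i);
    int length();
    int offset();                                 // point 4: offset into backing storage
    ByteSequence slice(int offset, int length);   // point 2: lightweight view, no copy
}

// Point 3: a view over a shared backing array holding many packed vectors.
final class ArraySliceByteSequence implements ByteSequence {
    private final byte[] data;   // shared chunk, packed with as many vectors as fit
    private final int offset;    // start of this vector within the chunk
    private final int length;    // compressed size of one vector, e.g. 64 bytes

    ArraySliceByteSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override public byte get(int i) { return data[offset + i]; }
    @Override public int length() { return length; }
    @Override public int offset() { return offset; }

    @Override public ByteSequence slice(int off, int len) {
        return new ArraySliceByteSequence(data, offset + off, len);
    }
}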

This PR is a more invasive alternative to #369, which only removed one of the two wrappers.

Rejected alternatives

Before implementing the PR this way, I tried a few alternatives.

  1. Using a MemorySegment in PQVectors to remove the chunking that is otherwise required. Because the total PQVector bytes can exceed int max, we're pushed to store the compressedVectors field as a ByteSequence[] instead of a single ByteSequence. A MemorySegment would also have allowed a simpler API because of its first-class support in the Panama API. However, this solution proved challenging due to compilation requirements and concerns about using a preview API (MemorySegment) in Java 20.
  2. Using a ByteBuffer instead of ArraySliceByteSequence. This is plausible, but really just comes down to API design preference; it felt a bit clunky to add. The main advantage is that it would let us avoid the offset method on the ByteSequence interface, which might be convenient. One way to add it would be to make PQVectors know how to create a ByteBuffer wrapping a slice of the compressedVectors data (a rough sketch follows this list).
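
For illustration only, the ByteBuffer variant of alternative 2 might have looked roughly like this (hypothetical method, not in the PR):

import java.nio.ByteBuffer;

// Hypothetical: PQVectors hands out a read-only ByteBuffer view of one
// vector's bytes inside a shared chunk, instead of an ArraySliceByteSequence.
ByteBuffer getVectorBuffer(byte[] chunk, int vectorIndex, int compressedBytesPerVector) {
    int start = vectorIndex * compressedBytesPerVector;
    return ByteBuffer.wrap(chunk, start, compressedBytesPerVector)
                     .slice()            // position 0 == start of this vector
                     .asReadOnlyBuffer();
}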

@michaeljmarshall (Contributor, Author) commented:

@jbellis @jkni @marianotepper - this is ready for review. The other alternative described above, using a ByteBuffer instead of creating an ArraySliceByteSequence, seems viable, but comes down to preference. Let me know what you think.

- return Objects.hash(pq, compressedVectors);
+ // We don't use the array structure in the hash code calculation because we allow for different chunking
+ // strategies. Instead, we use the first entry in the first chunk to provide a stable hash code.
+ return Objects.hash(pq, count(), count() > 0 ? get(0).get(0) : 0);
@jbellis (Owner) commented on the diff:

If you're worried about the performance of hashing millions of vectors, I'd prefer adding the first codepoint from each vector to using all of the first vector's codepoints.

@michaeljmarshall (Contributor, Author) replied:

I ended up just hashing based on every value, since that is the norm. If it becomes a performance bottleneck, we can address it then; I doubt we compute the hash code frequently.
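
For reference, hashing every value presumably ends up shaped roughly like this (illustrative sketch inside PQVectors, assuming java.util.Objects; not the exact code):

// Content hash over every compressed byte, in the spirit of Arrays.hashCode,
// independent of how the bytes are chunked.
@Override
public int hashCode() {
    int result = Objects.hash(pq, count());
    for (int i = 0; i < count(); i++) {
        ByteSequence v = get(i);
        for (int j = 0; j < v.length(); j++) {
            result = 31 * result + v.get(j);
        }
    }
    return result;
}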

// TODO how do we want to determine equality? With the current change, we are willing to write one
// thing and materialize another. It seems like the real concern should be whether the compressedVectors have
// the same data, not whether they are in a MemorySegment or a byte[] and not whether there is one chunk of many
// vectors or many chunks of one vector. This technically goes against the implementation of each of the
@jbellis (Owner) commented on the diff:

I'm fine with changing the ByteSequence equals implementations to match. You get the class comparison by default when IntelliJ generates it for you, and I think we just ran with that.

@michaeljmarshall (Contributor, Author) replied:

I implemented this, but it's not as elegant as I wanted. I could have made ByteSequence an abstract class instead of an interface and overridden equals/hashCode there, but that seemed like it went too far, so I added default methods that implementations of the interface can call. In practice, I doubt these objects are compared for strict equality frequently.
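
Since Java forbids default methods that override Object#equals or Object#hashCode, the helper-default-method approach presumably looks something like this (method names are illustrative, not the actual API):

// Extending the earlier ByteSequence sketch: content-based helpers that each
// concrete class calls from its own equals()/hashCode() overrides, ignoring
// whether the backing storage is a byte[] or a MemorySegment.
interface ByteSequence {
    int length();
    byte get(int i);

    default boolean equalTo(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteSequence)) return false;
        ByteSequence that = (ByteSequence) o;
        if (that.length() != length()) return false;
        for (int i = 0; i < length(); i++) {
            if (get(i) != that.get(i)) return false;
        }
        return true;
    }

    default int getHashCode() {    // Arrays.hashCode-style content hash
        int h = 1;
        for (int i = 0; i < length(); i++) {
            h = 31 * h + get(i);
        }
        return h;
    }
}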

@jbellis (Owner) commented Nov 25, 2024

I like where this is going; I'm just concerned that it doesn't break compatibility enough. :)

That is, the PQVectors API is hard to use "correctly" now since the only method that gives you One Big Array is load(). (Technically the constructor used by load is public but everyone else just calls it with an array of one vector per chunk.)

I think we should push the changes into VectorCompressor to solve this. Instead of introducing a new class to abstract long[] vs ByteSequence, encodeAll should return CompressedVectors directly. (By far the most common way encodeAll is used is to immediately wrap the result in CV anyway.) I think there's a code path or two where you don't need the full CV but it's not a big deal to construct one for those cases.

CompressedVectors is not generic so VC wouldn't have to be either, which makes me happy.
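
If I'm reading the suggestion right, the shape would be roughly the following (hypothetical signatures, for illustration only):

// Hypothetical before/after: VectorCompressor returns CompressedVectors
// directly, so the dense chunked layout becomes an internal detail.
interface VectorCompressor {
    // before (roughly): generic per-vector results that callers wrapped in
    // CompressedVectors themselves (ByteSequence[] for PQ, long[][] for BQ)
    //   T[] encodeAll(RandomAccessVectorValues ravv);

    // after: the compressor builds and returns the wrapper, no generics needed
    CompressedVectors encodeAll(RandomAccessVectorValues ravv);
}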

@jbellis (Owner) commented Nov 25, 2024

I'm 80% sure that it's correct to go ahead and change the API and release 3.1 instead of doing unnatural contortions to maintain compatibility.

@jbellis (Owner) commented Nov 25, 2024

Also I do not insist that BQ change to the chunking approach. If we can conform to the "right" API without doing that, it's fine to leave optimizing BQ until somebody needs it.

@michaeljmarshall (Contributor, Author) commented:

@jbellis - how would this work in CompactionGraph? I left the hybrid creation method in place because it seemed like we needed it for that class, but perhaps we just need to add an add method to PQVectors instead of letting callers mutate the ArrayList (the current design).

@jbellis (Owner) commented Nov 25, 2024

Yes, I think an add() or a set() method would be straightforward since CG knows up front how many it's going to generate. This is a better design than leaking the underlying List to be mutated, but that was the path of least resistance at one point!
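
For illustration, the set()-style flow in CompactionGraph might look like this (the factory and setter names here are assumptions, not the final API):

// Hypothetical: CompactionGraph preallocates for a known vector count and
// fills slots by ordinal, instead of mutating a leaked ArrayList.
PQVectors pqVectors = PQVectors.createMutable(pq, vectorCount); // assumed factory
for (int node = 0; node < vectorCount; node++) {
    pqVectors.set(node, pq.encode(vectors.getVector(node)));    // assumed setter
}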

@jbellis (Owner) commented Nov 26, 2024

Is it possible to unify encode/set into void encodeTo(VectorFloat, CompressedVectors, int) to avoid the copy?

@michaeljmarshall (Contributor, Author) replied:

I had started to look at that, but stopped because of our logic for encodedOmittedVector. It turns out that is just setting a zero vector, so as long as we have an on-heap data structure we're good to go; but we might not want to assume that for long, since an off-heap MemorySegment looks like it might have undefined values for regions that are not explicitly set.

Getting the change written up now.
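
Sketched out, the unified method might take roughly this shape (hypothetical helpers inside ProductQuantization; the zero-vector point matters because on-heap byte[] chunks are zero-initialized, whereas off-heap memory may not be):

// Hypothetical encodeTo: write the codes for one vector straight into the
// destination's backing chunk via a slice view, avoiding encode()'s copy.
void encodeTo(VectorFloat<?> v, PQVectors dest, int ordinal) {
    ByteSequence slot = dest.get(ordinal);       // slice into a shared chunk
    for (int m = 0; m < subspaceCount; m++) {    // assumed field
        slot.set(m, closestCentroidIndex(v, m)); // assumed helper, returns byte
    }
}
// encodedOmittedVector: with on-heap byte[] chunks the slot is already all
// zeros, so omitted vectors need no write; an off-heap MemorySegment would
// need an explicit zero fill.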

@michaeljmarshall (Contributor, Author) commented:

Note: I'll update the PR description tomorrow to align with recent commits.
