
RandomAccessScoreProvider with MapRandomAccessVectorValues + non-sequential IDs produces wrong centroid #354

Open
vbekiaris opened this issue Aug 14, 2024 · 3 comments


In 97e523c, the approximateCentroid() implementation for the BuildScoreProvider returned by BuildScoreProvider.randomAccessScoreProvider() was updated to allow for non-sequential node IDs.

However, the iteration only considers nodes with ID < ravv.size(). If there are "holes" in the ID sequence (e.g. add 100 nodes to a MapRandomAccessVectorValues, then remove 10 starting from ID 0), size() drops to 90 and the surviving nodes with ID >= 90 in that example are never included in the centroid calculation.
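
A minimal sketch of the iteration pattern being described, reconstructed as an assumption rather than the actual JVector source (the getVector/get accessors and the null check are illustrative; the real implementation may differ):

    // Hypothetical reconstruction of the centroid loop: it only visits ordinals
    // below size(), so surviving nodes with IDs >= size() are never accumulated.
    float[] sum = new float[ravv.dimension()];
    int count = 0;
    for (int node = 0; node < ravv.size(); node++) {    // stops at size(), missing higher IDs
        VectorFloat<?> v = ravv.getVector(node);
        if (v == null) {
            continue;                                    // hole below size(): skipped as well
        }
        for (int i = 0; i < ravv.dimension(); i++) {
            sum[i] += v.get(i);
        }
        count++;
    }
    for (int i = 0; i < sum.length; i++) {
        sum[i] /= count;                                 // average of only the visited vectors
    }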

A fix would probably require changing the RandomAccessVectorValues API to expose an iterator over assigned node IDs, or the highest node ID that is set (or something similar).
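
One possible shape of such a change, sketched purely as an assumption (no such methods exist on RandomAccessVectorValues today; the interface slice and the map field name are illustrative):

    // Hypothetical: a simplified slice of RandomAccessVectorValues with the proposed
    // accessor added as a default, so existing dense implementations stay unchanged.
    public interface RandomAccessVectorValues {
        int size();
        int dimension();
        VectorFloat<?> getVector(int node);

        // exclusive upper bound on assigned node IDs; dense implementations keep size()
        default int getIdUpperBound() {
            return size();
        }
    }

    // Hypothetical override in a Map-backed implementation: one past the highest key
    // actually present, so callers can iterate past size() and skip the holes.
    @Override
    public int getIdUpperBound() {
        return map.keySet().stream().mapToInt(Integer::intValue).max().orElse(-1) + 1;
    }

approximateCentroid() (and other size()-bounded loops) would then iterate up to getIdUpperBound() and skip any ordinal whose vector is missing.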

Test that demonstrates the issue:

    @Test
    void testRandomAccessScoreProvider() {
        Map<Integer, VectorFloat<?>> chm = new ConcurrentHashMap<>();
        // nodeIds 0..49 have vector [0, 1], 50..99 are [1, 0], so the centroid is [0.5, 0.5]
        for (int i = 0; i < 100; i++) {
            chm.put(i, createVector(i < 50));
        }
        RandomAccessVectorValues ravv = new MapRandomAccessVectorValues(chm, 2);
        var bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, VectorSimilarityFunction.COSINE);
        float[] centroid = (float[]) bsp.approximateCentroid().get();
        Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
        // remove even nodeIds, so only 50 nodeIds are left and they are no longer sequential;
        // the centroid should stay [0.5, 0.5], since we still have 25 [0, 1] and 25 [1, 0] vectors
        for (int i = 0; i < 100; i = i+2) {
            chm.remove(i);
        }
        centroid = (float[]) bsp.approximateCentroid().get();
        // fails - calculated centroid is [0, 0.5] because centroid calculation only took into account node IDs < 50
        Assertions.assertArrayEquals(new float[] {0.5f, 0.5f}, centroid);
    }

    VectorFloat<?> createVector(boolean vertical) {
        VectorFloat<?> vectorFloat = VectorizationProvider.getInstance().getVectorTypeSupport().createFloatVector(2);
        if (vertical) {
            vectorFloat.set(0, 0);
            vectorFloat.set(1, 1);
        } else {
            vectorFloat.set(0, 1);
            vectorFloat.set(1, 0);
        }
        return vectorFloat;
    }

jbellis (Owner) commented Aug 14, 2024

Thanks for the report and the test!

IMO this is working as designed: the implicit contract of RAVV is that it gives a valid vector for every ordinal in 0..size(). In other words, JVector is designed to support "holes" in a graph, but not in a RAVV.
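
Stated as code, that contract amounts to roughly the following (an illustrative check only, assuming a getVector accessor that returns null for unassigned ordinals):

    // Implicit RAVV contract: every ordinal below size() must yield a valid vector.
    // A Map-backed implementation with removed keys violates this.
    for (int node = 0; node < ravv.size(); node++) {
        assert ravv.getVector(node) != null : "ordinal " + node + " has no vector";
    }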

Apparently there are 31 usages of size(), so reviewing those to introduce getIdUpperBound (the way we did for GraphIndex) to accommodate holes isn't unreasonable, but I'm a bit skeptical that it's necessary. Can you share more about your use case?

(I'm also happy to eliminate the source of confusion by deleting MapRAVV. I thought it was going to be useful for Cassandra but we ended up not using it after all.)

shultseva commented:

Hi, I'd like to continue this discussion.

In our scenario, we have a collection where vectors can be added and removed, which makes it easy to create gaps in the RAVV.

Could you clarify how the library is intended to be used in this context? Should deletion be avoided, or should deleted node identifiers be reused?

jbellis (Owner) commented Aug 28, 2024

All of the on-heap data structures are designed around the principle that node ids are mostly contiguous (see DenseIntMap in particular), and the on-disk structure assumes they are entirely contiguous. Additionally, removeDeletedNodes cannot be called safely while other modifications are in flight. (See #272.)

I think this would work for you:

  1. Periodically pause mutations while running removeDeletedNodes.
  2. Once complete, you can reuse deleted node ids and overwrite them in your MapRAVV.
  3. Never remove from the MapRAVV; only overwrite when you reuse the id.

I would support modifying removeDeletedNodes to return a BitSet of removed IDs so user code doesn't have to track deletions a second time.
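
A sketch of that cycle, with the locking and ID-reuse bookkeeping written as assumptions (removeDeletedNodes is the builder method discussed above; markNodeDeleted and addGraphNode are used with assumed signatures, and the lock/free-list structure is illustrative glue, not a JVector API):

    // Mutations take the read lock so they can run concurrently; compaction takes
    // the write lock so it runs with no other modifications in flight.
    class VectorStore {
        private final Map<Integer, VectorFloat<?>> chm = new ConcurrentHashMap<>();
        private final Queue<Integer> freeIds = new ConcurrentLinkedQueue<>();
        private final Set<Integer> pendingDeletes = ConcurrentHashMap.newKeySet();
        private final ReadWriteLock lock = new ReentrantReadWriteLock();
        private final AtomicInteger nextId = new AtomicInteger();
        private final GraphIndexBuilder builder;  // constructed elsewhere

        VectorStore(GraphIndexBuilder builder) { this.builder = builder; }

        int add(VectorFloat<?> v) {
            lock.readLock().lock();
            try {
                Integer reused = freeIds.poll();              // 2. reuse a freed id if available
                int id = reused != null ? reused : nextId.getAndIncrement();
                chm.put(id, v);                               // 3. overwrite, never remove, in the MapRAVV map
                builder.addGraphNode(id, v);                  // assumed signature
                return id;
            } finally {
                lock.readLock().unlock();
            }
        }

        void delete(int id) {
            lock.readLock().lock();
            try {
                builder.markNodeDeleted(id);                  // assumed signature
                pendingDeletes.add(id);                       // vector stays in chm until compaction
            } finally {
                lock.readLock().unlock();
            }
        }

        void compact() {
            lock.writeLock().lock();                          // 1. pause all other mutations
            try {
                builder.removeDeletedNodes();
                freeIds.addAll(pendingDeletes);               // these ids are now safe to reuse
                pendingDeletes.clear();
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

The pendingDeletes set above is exactly the "tracking deletions a second time" that a BitSet return value from removeDeletedNodes would make unnecessary.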
