
Sparse index support #424

Merged — 21 commits into release-candidate/2025-01 from jhamon/sparse-indexes, Dec 19, 2024
Conversation

@jhamon (Collaborator) commented Dec 13, 2024

Problem

We need to incorporate changes for upcoming sparse index support.

Solution

Incorporating these changes presented a few different types of challenges:

Control plane changes

  • create_index adjusted to accept a new optional field, vector_type.

Data plane changes

  • Codegen for proto changes

    • Needed to add codegen scripts to support generating the new gRPC message formats from an updated proto file. This used to be done with scripts directly invoking the protoc CLI from the now-deprecated private repo pinecone-protos, but since we've moved away from that pattern, we needed to implement something here for the first time.
    • I decided to use the Buf CLI, which gives a more modern way to manage the required dependencies for stub generation via its plugin system (a sketch of a possible config appears after this list).
  • Dependency changes

    • Had to bump our protobuf and googleapis-common-protos dependencies to match the Buf CLI's plugin output. That's a good thing, though: we were on fairly old versions, and updating protobuf in particular is supposed to unlock a nontrivial performance benefit.
  • Vector object switcheroo for UX

    • The Vector object generated by OpenAPI for the REST pathway unfortunately insisted that an empty array (values=[]) be sent when upserting to sparse indexes. This seems annoying and stupid if you are focused on the sparse use case, so I ended up moving away from having users work with that generated object in favor of my own custom Vector dataclass (see the sketch after this list). The blast radius of this change was fairly large, requiring updates to the various vector factory classes and hundreds of tests.
    • For those keeping score, there are now 5 similar things that can be called Vector inside the SDK:
      • a class generated with OpenAPI
      • a grpc Message object
      • a user-supplied dictionary with specific keys defined by a TypedDict
      • a user-supplied tuple with either 2 or 3 elements representing id, values, and metadata
      • New: a @dataclass I created in this PR
    • Of the "vector" things listed above, the SDK used to provide the OpenAPI vector class to anyone writing code along the lines of from pinecone import Vector. This was already bad because of the interoperability problems it created between the gRPC and REST code paths in the vector factory classes (e.g. passing an OpenAPI Vector object into the gRPC-based upsert implementation required extra logic to convert it into the gRPC Message object). With the introduction of sparse, there was no workaround that would let the user instantiate the OpenAPI vector without also passing values=[], which seemed like a bridge too far from a UX standpoint. So the decision was made to have the name Vector refer to the custom dataclass instead, and to make that the main vehicle for user input.
    • This object switch also caused a bunch of mypy type checks and tests to start failing in subtle ways that took extra time and effort to track down: even though the new and old objects are functionally equivalent to the user, they are not the same to the type checker, so many type annotations and tests had to be adjusted. But it seems worth it for the better UX, and it also gets us onto a standard footing for how input is supplied to both the REST and gRPC index clients.
  • I started writing new integration tests for sparse indexes, but they're disabled in this PR because they currently can't run on prod.
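
For reference, here is a minimal sketch of the kind of buf.gen.yaml that could drive the stub generation described above. The plugin list and output paths are assumptions for illustration, not the exact config added in this PR:

# buf.gen.yaml — hypothetical sketch; the real plugin choices and output
# paths in the repo may differ.
version: v1
plugins:
  # Generate the plain protobuf message stubs
  - plugin: buf.build/protocolbuffers/python
    out: pinecone/core/grpc/protos
  # Generate the gRPC service stubs
  - plugin: buf.build/grpc/python
    out: pinecone/core/grpc/protos

With a config along these lines, running buf generate regenerates the stubs from the proto files, replacing the old protoc invocations.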
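To make the object switch concrete, here is a minimal sketch of what a custom Vector dataclass along these lines could look like. The field names and defaults are assumptions based on the description above, not the SDK's exact definition:

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class SparseValues:
    indices: List[int]
    values: List[float]

@dataclass
class Vector:
    id: str
    # Dense values default to an empty list, so sparse-only upserts don't
    # have to pass values=[] explicitly the way the OpenAPI class required.
    values: List[float] = field(default_factory=list)
    metadata: Optional[Dict[str, Any]] = None
    sparse_values: Optional[SparseValues] = None

The UX win is that a sparse-focused caller can write Vector(id="1", sparse_values=SparseValues(indices=[1], values=[0.5])) without ever touching the dense values field.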

Other changes

  • I disabled tests on the CI dev release workflow. It seems like I end up doing this every time I am working on a complex change and want to do some hands-on testing with Google Colab even when 100% of tests aren't passing. I will leave the tests commented out until I can figure out how to wire up a config option that controls them.
  • I created some named type definitions under pinecone/data/types for the dictionary objects users can pass in various places for vectors, sparse values, metadata, filters, etc. (a sketch appears after this list). This increases the consistency and readability of types in the code; when they were defined inline, small inconsistencies sometimes crept in. This wasn't really planned in advance, but I had enough challenges wrangling mypy that it seemed simpler to centralize these definitions in one place than to iron out minor inconsistencies any other way.
  • A lot of fixture setup was moved out of tests/integration/data/conftest and into individual test files (see the fixture sketch after this list). The unified "upsert everything, then run all the tests" approach began when the product surface was much smaller and freshness was a much bigger problem, so it was advantageous to do all the post-upsert waiting at once before running any tests. In practice, it has been a big obstacle to development: you can't easily run a single test file without the heavyweight global setup, and it creates a big opportunity for test pollution if any test modifies state in the index. Breaking this up and putting setup closer to where it is exercised solves that problem and makes it easier to work with individual test files.
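
As an illustration of the centralized type definitions mentioned above, here is a hypothetical sketch of what they could look like; the actual names and shapes under pinecone/data/types may differ:

from typing import Any, Dict, List, Tuple, TypedDict, Union

class SparseValuesTypedDict(TypedDict):
    indices: List[int]
    values: List[float]

class VectorTypedDict(TypedDict, total=False):
    id: str
    values: List[float]
    sparse_values: SparseValuesTypedDict
    metadata: Dict[str, Any]

# The tuple form: (id, values) or (id, values, metadata)
VectorTuple = Union[
    Tuple[str, List[float]],
    Tuple[str, List[float], Dict[str, Any]],
]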
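And a sketch of the per-file test setup pattern; the index_client fixture and the data seeded here are illustrative, not the repo's actual test helpers:

import pytest
from pinecone import Vector

@pytest.fixture(scope="module")
def seeded_namespace(index_client):
    # Each test file seeds only the data it needs, instead of relying on a
    # global "upsert everything" step in conftest.
    namespace = "fixture-demo"
    index_client.upsert(
        vectors=[Vector(id=str(i), values=[0.1] * 8) for i in range(10)],
        namespace=namespace,
    )
    # A real test would also poll describe_index_stats until the upserts
    # are visible before yielding to the tests.
    yield namespace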

Usage

import random
from pinecone import Pinecone, ServerlessSpec, Vector, SparseValues

pc = Pinecone(api_key='key')

# Create sparse index
index_name = f"sparse-testing-{random.randint(0,10000)}"
pc.create_index(
    name=index_name,
    metric='dotproduct',
    spec=ServerlessSpec(cloud='aws', region='us-east-1'),
    vector_type='sparse'
)

# Get the index client
sparse_index = pc.Index(name=index_name)

# Upsert sparse (random data just for illustration purposes)
def unique_random_integers(n, range_start, range_end):
    if n > (range_end - range_start + 1):
        raise ValueError("Range too small for the requested number of unique integers")
    return random.sample(range(range_start, range_end + 1), n)

# Generate some random sparse vectors
sparse_index.upsert(
    vectors=[
        Vector(
            id=str(i),
            sparse_values=SparseValues(
                indices=unique_random_integers(10, 0, 10000),
                values=[random.random() for j in range(10)]
            )
        ) for i in range(10000)
    ],
    batch_size=500,
)

# Query sparse
sparse_index.query(
    top_k=10,
    sparse_vector={
        "indices": [1,2,3,4,5],
        "values": [random.random()]*5
    }
)

Type of Change

  • New feature (non-breaking change which adds functionality)

@jhamon jhamon force-pushed the jhamon/sparse-indexes branch from 2d00db9 to c58cba0 Compare December 16, 2024 18:45
@jhamon jhamon force-pushed the jhamon/sparse-indexes branch from ab1f9e8 to 9d8e5b8 Compare December 17, 2024 01:06
@jhamon jhamon marked this pull request as ready for review December 18, 2024 15:09
@@ -115,7 +114,6 @@ jobs:
lz4_version: '${{ matrix.lz4_version }}'
protobuf_version: '${{ matrix.protobuf_version }}'
googleapis_common_protos_version: '${{ matrix.googleapis_common_protos_version }}'
DD_TAGS: 'test.configuration.python:${{ matrix.python_version}},test.configuration.grpcio:${{ matrix.grpcio_version }},test.configuration.lz4:${{ matrix.lz4_version }},test.configuration.protobuf:${{ matrix.protobuf_version }},test.configuration.googleapis_common_protos:${{ matrix.googleapis_common_protos_version }}'
@jhamon (Collaborator, Author) commented:
Warnings were showing in the log about these settings for the datadog ddtrace plugin, so I removed them while I was in this file making adjustments.

from pinecone.models import ServerlessSpec, PodSpec, IndexList, CollectionList


class PineconeDBControlInterface(ABC):
@jhamon (Collaborator, Author) commented:
I made this interface to be a home for docstrings because I was tired of looking at them while trying to edit the Pinecone class. When docs are thorough, you can't see the method signature and the method body at the same time and it annoys me every time.

A contributor replied:
Great idea, thank you!

@austin-denoble (Contributor) left a comment:
Overall this LGTM. Nice work getting all of this in place while also improving the codebase in general. All the changes here make sense to me and feel like the proper direction compared to where we were a year ago.

Using buf seems like a good idea.

@jhamon jhamon merged commit 67704d5 into release-candidate/2025-01 Dec 19, 2024
82 of 83 checks passed
@jhamon jhamon deleted the jhamon/sparse-indexes branch December 19, 2024 16:27