
Sparse index support #424

Merged — 21 commits into release-candidate/2025-01 from jhamon/sparse-indexes, Dec 19, 2024
Conversation

@jhamon (Collaborator) commented Dec 13, 2024

Problem

We need to incorporate changes for upcoming sparse index support.

Solution

Incorporating these changes presented a few different types of challenges:

Control plane changes

  • create_index adjusted to accept a new optional field, vector_type.

Data plane changes

  • Codegen for proto changes

    • Needed to add codegen scripts to support generating the new gRPC message formats from an updated proto file. This used to be done with scripts directly invoking the protoc CLI from the now-deprecated private repo pinecone-protos, but since we've moved away from that pattern, we needed to implement something here for the first time.
    • I decided to use the Buf CLI, which gives a more modern way to manage the required dependencies for stub generation via its plugin system (a sketch of a possible config appears after this list).
  • Dependency changes

    • Had to bump our protobuf and googleapis-common-protos dependencies to match the Buf CLI's plugin output. That's a good thing, though: we were on fairly old versions, and updating protobuf in particular is supposed to unlock a nontrivial performance benefit.
  • Vector object switcheroo for UX

    • The Vector object generated by OpenAPI for the REST pathway unfortunately insisted that an empty array (values=[]) be sent when upserting to sparse indexes. This seems annoying and stupid if you are focused on the sparse use case, so I ended up moving away from having users work with that generated object in favor of my own custom Vector dataclass (see the sketch after this list). The blast radius of this change was fairly large, requiring updates to the various vector factory classes and hundreds of tests.
    • For those keeping score, there are now 5 similar things that can be called Vector inside the SDK:
      • a class generated with OpenAPI
      • a grpc Message object
      • a user-supplied dictionary with specific keys defined by a TypedDict
      • a user-supplied tuple with either 2 or 3 elements representing id, values, and metadata
      • New: a @dataclass I created in this PR
    • Of the "vector" things listed above, the SDK used to provide the OpenAPI vector class to anyone writing code along the lines of from pinecone import Vector. This was already bad because of the interoperability problems it created between the gRPC and REST code paths in the vector factory classes (e.g. passing an OpenAPI Vector object into the gRPC-based upsert implementation required extra logic to convert it into the gRPC Message object). With the introduction of sparse, there was no workaround that would let the user instantiate the OpenAPI vector without also passing values=[], which seemed like a bridge too far from a UX standpoint. So the decision was made to have the name Vector refer to the custom dataclass instead, and to make that the main vehicle for user input.
    • This object switch also caused a bunch of mypy type checks and tests to start failing in subtle ways that took extra time and effort to track down: even though the new and old objects are functionally equivalent to the user, they are not the same to the type checker, so many type annotations and tests had to be adjusted. But it seems worth it for the better UX, and it also gets us onto a standard footing for how input is supplied to both the REST and gRPC index clients.
  • I started writing new integration tests for sparse indexes, but they're disabled in this PR because they currently can't run on prod.
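
For reference, here is a minimal sketch of the kind of buf.gen.yaml that could drive the stub generation described above. The plugin list and output paths are assumptions for illustration, not the exact config added in this PR:

# buf.gen.yaml — hypothetical sketch; the real plugin choices and output
# paths in the repo may differ.
version: v1
plugins:
  # Generate the plain protobuf message stubs
  - plugin: buf.build/protocolbuffers/python
    out: pinecone/core/grpc/protos
  # Generate the gRPC service stubs
  - plugin: buf.build/grpc/python
    out: pinecone/core/grpc/protos

With a config along these lines, running buf generate regenerates the stubs from the proto files, replacing the old protoc invocations.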
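To make the object switch concrete, here is a minimal sketch of what a custom Vector dataclass along these lines could look like. The field names and defaults are assumptions based on the description above, not the SDK's exact definition:

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class SparseValues:
    indices: List[int]
    values: List[float]

@dataclass
class Vector:
    id: str
    # Dense values default to an empty list, so sparse-only upserts don't
    # have to pass values=[] explicitly the way the OpenAPI class required.
    values: List[float] = field(default_factory=list)
    metadata: Optional[Dict[str, Any]] = None
    sparse_values: Optional[SparseValues] = None

The UX win is that a sparse-focused caller can write Vector(id="1", sparse_values=SparseValues(indices=[1], values=[0.5])) without ever touching the dense values field.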

Other changes

  • I disabled tests on the CI dev release workflow. It seems like I end up doing this every time I am working on a complex change and want to do some hands-on testing with Google Colab even when 100% of tests aren't passing. I will leave the tests commented out until I can figure out how to wire up a config option that controls them.
  • I created some named type definitions under pinecone/data/types for the dictionary objects users can pass in various places for vectors, sparse values, metadata, filters, etc. (a sketch appears after this list). This increases the consistency and readability of types in the code; when they were defined inline, small inconsistencies sometimes crept in. This wasn't really planned in advance, but I had enough challenges wrangling mypy that it seemed simpler to centralize these definitions in one place than to iron out minor inconsistencies any other way.
  • A lot of fixture setup was moved out of tests/integration/data/conftest and into individual test files (see the fixture sketch after this list). The unified "upsert everything, then run all the tests" approach began when the product surface was much smaller and freshness was a much bigger problem, so it was advantageous to do all the post-upsert waiting at once before running any tests. In practice, it has been a big obstacle to development: you can't easily run a single test file without the heavyweight global setup, and it creates a big opportunity for test pollution if any test modifies state in the index. Breaking this up and putting setup closer to where it is exercised solves that problem and makes it easier to work with individual test files.
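
As an illustration of the centralized type definitions mentioned above, here is a hypothetical sketch of what they could look like; the actual names and shapes under pinecone/data/types may differ:

from typing import Any, Dict, List, Tuple, TypedDict, Union

class SparseValuesTypedDict(TypedDict):
    indices: List[int]
    values: List[float]

class VectorTypedDict(TypedDict, total=False):
    id: str
    values: List[float]
    sparse_values: SparseValuesTypedDict
    metadata: Dict[str, Any]

# The tuple form: (id, values) or (id, values, metadata)
VectorTuple = Union[
    Tuple[str, List[float]],
    Tuple[str, List[float], Dict[str, Any]],
]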
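And a sketch of the per-file test setup pattern; the index_client fixture and the data seeded here are illustrative, not the repo's actual test helpers:

import pytest
from pinecone import Vector

@pytest.fixture(scope="module")
def seeded_namespace(index_client):
    # Each test file seeds only the data it needs, instead of relying on a
    # global "upsert everything" step in conftest.
    namespace = "fixture-demo"
    index_client.upsert(
        vectors=[Vector(id=str(i), values=[0.1] * 8) for i in range(10)],
        namespace=namespace,
    )
    # A real test would also poll describe_index_stats until the upserts
    # are visible before yielding to the tests.
    yield namespace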

Usage

import random
from pinecone import Pinecone, ServerlessSpec, Vector, SparseValues

pc = Pinecone(api_key='key')

# Create sparse index
index_name = f"sparse-testing-{random.randint(0,10000)}"
pc.create_index(
    name=index_name,
    metric='dotproduct',
    spec=ServerlessSpec(cloud='aws', region='us-east-1'),
    vector_type='sparse'
)

# Get the index client
sparse_index = pc.Index(name=index_name)

# Upsert sparse (random data just for illustration purposes)
def unique_random_integers(n, range_start, range_end):
    if n > (range_end - range_start + 1):
        raise ValueError("Range too small for the requested number of unique integers")
    return random.sample(range(range_start, range_end + 1), n)

# Generate some random sparse vectors
sparse_index.upsert(
    vectors=[
        Vector(
            id=str(i),
            sparse_values=SparseValues(
                indices=unique_random_integers(10, 0, 10000),
                values=[random.random() for j in range(10)]
            )
        ) for i in range(10000)
    ],
    batch_size=500,
)

# Query sparse
sparse_index.query(
    top_k=10,
    sparse_vector={
        "indices": [1,2,3,4,5],
        "values": [random.random()]*5
    }
)

Type of Change

  • New feature (non-breaking change which adds functionality)

@jhamon jhamon force-pushed the jhamon/sparse-indexes branch from 2d00db9 to c58cba0 Compare December 16, 2024 18:45
@jhamon jhamon force-pushed the jhamon/sparse-indexes branch from ab1f9e8 to 9d8e5b8 Compare December 17, 2024 01:06
@jhamon jhamon marked this pull request as ready for review December 18, 2024 15:09
@@ -115,7 +114,6 @@ jobs:
lz4_version: '${{ matrix.lz4_version }}'
protobuf_version: '${{ matrix.protobuf_version }}'
googleapis_common_protos_version: '${{ matrix.googleapis_common_protos_version }}'
DD_TAGS: 'test.configuration.python:${{ matrix.python_version}},test.configuration.grpcio:${{ matrix.grpcio_version }},test.configuration.lz4:${{ matrix.lz4_version }},test.configuration.protobuf:${{ matrix.protobuf_version }},test.configuration.googleapis_common_protos:${{ matrix.googleapis_common_protos_version }}'
@jhamon (Collaborator, Author) commented:
Warnings were showing in the log about these settings for the datadog ddtrace plugin, so I removed them while I was in this file making adjustments.

from pinecone.models import ServerlessSpec, PodSpec, IndexList, CollectionList


class PineconeDBControlInterface(ABC):
@jhamon (Collaborator, Author) commented:
I made this interface to be a home for docstrings because I was tired of looking at them while trying to edit the Pinecone class. When docs are thorough, you can't see the method signature and the method body at the same time and it annoys me every time.

A contributor replied:
Great idea, thank you!

@austin-denoble (Contributor) left a comment:
Overall this LGTM. Nice work getting all of this in place while also improving the codebase in general. All the changes here make sense to me and feel like the proper direction compared to where we were a year ago.

Using buf seems like a good idea.

@jhamon jhamon merged commit 67704d5 into release-candidate/2025-01 Dec 19, 2024
82 of 83 checks passed
@jhamon jhamon deleted the jhamon/sparse-indexes branch December 19, 2024 16:27