Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add sqlalchemy vectorizer helper #265

Merged
merged 16 commits into from
Dec 19, 2024
Merged

Conversation

Askir
Copy link
Contributor

@Askir Askir commented Dec 2, 2024

This PR adds a helper descriptor Vectorizer that creates an SQLAlchemy Model which can be used to reference vectorizer managed embedding tables without explicitly configuring it yourself.

I've done another pass at the interface and now have renamed the Vectorizer to embedding_relationship detailed in this discussion:
#265 (comment)

@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 9216152 to 517ce9f Compare December 3, 2024 10:56
@Askir Askir marked this pull request as ready for review December 3, 2024 23:16
@Askir Askir requested a review from a team as a code owner December 3, 2024 23:16
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8742af8 to 36cf4d9 Compare December 4, 2024 13:13
@Askir Askir changed the title feat: add sqlalchemy vectorizer field feat: add sqlalchemy vectorizer helper Dec 4, 2024
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Left some comments to address

docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
return False
return True

context.configure(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is context coming from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context is the alembic context used to configure and run migrations programmatically. Alembic autogenerates a fitting env.py for the user. E.g. this is how it looks if you're using async sqlalchemy: https://github.com/sqlalchemy/alembic/blob/main/alembic/templates/async/env.py

I can add the from alembic import context if you'd like?

docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Show resolved Hide resolved
self._embedding_class = self.create_embedding_class(objtype)

# Set up relationship if requested
if self.add_relationship:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of configuring whether or not to add a relationship, why don't we always add it but allow the user to define the lazy behavior and default that to select? With select nothing is ever pre-loaded right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other suggestion, what if we simply forward all **kwargs, *args to the relationship function?
SQLAlchemy relationships are extremely configurable so I think just exposing anything the user might need while transparently adding the embedding model might be the cleanest interface.

This would basically make the Vectorizer helper an embedding_relationship function? We could even rename it as such?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add the relationship, does it mean that related data is prefetched with a join?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's exactly what we are discussing right now. If you set lazy="select" it is only fetched as soon as it is accessed. if lazy is set to "joined" it's eagerly joined. There is also "selectin" for a separate query with an in statement, and a few other options.

Copy link
Member

@JamesGuthrie JamesGuthrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about sqlalchemy to provide any useful feedback there.

docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
@Askir Askir force-pushed the jascha/add-vectorizer-field branch 3 times, most recently from e342f0b to c108a87 Compare December 7, 2024 06:34
@Askir Askir requested a review from cevian December 7, 2024 06:34
);
```

We recommend adding this to a migration script and run it via alembic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add another reference to the Working with alembic section below.

- `dimensions` (int): The size of the embedding vector (required)
- `target_schema` (str, optional): Override the schema for the embeddings table. If not provided, inherits from the parent model's schema
- `target_table` (str, optional): Override the table name for embeddings. Default is `{table_name}_embedding_store`
- `add_relationship` (bool, optional): Whether to automatically create a relationship to the embeddings table (default: False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a reference to the ### 2. Relationship Access section below. At this point I didn't know what are sqlalchemy relationships, and if I needed/wanted one or not.

@Askir Askir force-pushed the jascha/add-vectorizer-field branch 7 times, most recently from 3b47afc to 8fe145e Compare December 12, 2024 13:46
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments but nothing blocking merge. This looks super-useful.


Additional parameters are simply forwarded to the underlying [SQLAlchemy relationship](https://docs.sqlalchemy.org/en/20/orm/relationships.html) so you can configure it as you desire.

Think of the `embedding_relationship` as a normal SQLAlchemy relationship, but with a preconfigured model instance under the hood.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also maybe something similar belongs in the intro?

Suggested change
Think of the `embedding_relationship` as a normal SQLAlchemy relationship, but with a preconfigured model instance under the hood.
Think of the `embedding_relationship` as a normal SQLAlchemy relationship, but with a preconfigured embedding model instance under the hood.

docs/python-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8fe145e to 882f91e Compare December 19, 2024 11:40
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 4dae567 to 22233a0 Compare December 19, 2024 12:18
@Askir Askir merged commit 0230509 into main Dec 19, 2024
5 checks passed
@Askir Askir deleted the jascha/add-vectorizer-field branch December 19, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants