
feat: SQLAlchemy and alembic integration #208

Closed
wants to merge 23 commits

Conversation

Contributor

@Askir Askir commented Nov 8, 2024

Adds Python integration for pgai that allows defining and managing vectorizers through SQLAlchemy models and Alembic migrations. The integration provides a declarative interface for vector embeddings and automated migration support.

class Base(DeclarativeBase):
    pass


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    content: Mapped[str]  # source column for the vectorizer

    content_embeddings = VectorizerField(
        source_column="content",
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768
        )
    )


# Semantic search query
results = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed('text-embedding-3-small', 'search query')
        )
    )
    .limit(5)
)

Features:

  • Declarative SQLAlchemy field for vectorizer configuration
  • Alembic operations for creating and dropping vectorizers
  • Support for migration generation for vectorizer changes

Known missing features:

  • Downgrading of migrations
  • Custom sqlalchemy functions for ai.xxx functions
  • Comparison of existing vectorizer does not work yet! (Working on it)
  • Chunking configuration is not separated for character and recursive splitter (will fix)
  • Scheduling config only supports default and specific scheduling right now, not disabling scheduling (will fix)

How to review:

  1. Have a look at some tests and the docs to understand how it is supposed to work. If you don't like this or have questions, stop here and leave a comment on what to improve/change.
  2. Implementation-wise:
  • sqlalchemy/__init__.py: Contains the VectorizerField that defines sqlalchemy embedding models and registers the configuration for alembic.
  • alembic/autogenerate.py: Contains the comparison functions as well as the python code generation.
  • alembic/operations.py: Contains the SQL to be executed (and some alembic boilerplate).
  • configuration.py: Contains dataclasses that map to all the SQL parameters of create_vectorizer; these act as the input for both the VectorizerField and the alembic create_vectorizer and drop_vectorizer operations (see the sketch below).
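For orientation, here is a hedged sketch of what a migration using these operations might look like; the operation and import names below follow the descriptions above but are illustrative, not the PR's exact API:

```python
from alembic import op

# EmbeddingConfig comes from this PR's configuration.py; the import path
# and argument names below mirror ai.create_vectorizer and are illustrative.
from pgai.configuration import EmbeddingConfig  # hypothetical path


def upgrade() -> None:
    # Registers a vectorizer for blog_posts via the custom alembic operation.
    op.create_vectorizer(
        "blog_posts",
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768,
        ),
    )
```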

@Askir Askir changed the title feat: add docs about potential python integration feat: SQLAlchemy and alembic integration Nov 20, 2024
@Askir Askir marked this pull request as ready for review November 20, 2024 07:24
@Askir Askir requested a review from a team as a code owner November 20, 2024 07:24
Contributor

@alejandrodnm alejandrodnm left a comment

Did a very brief pass over the files, left some comments. I'll try to do a more thorough review later.

Comment on lines 26 to 28
"sqlalchemy>=2.0.36",
"psycopg2>=2.9.10",
"alembic>=1.14.0",
Contributor

Should we add these dependencies as extras? That way users would do pip install pgai[sqlalchemy]. If we add Django support next, then every user will have sqlalchemy and Django as deps, even if they just want to use the Vectorizer.
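For illustration, the extras could be declared along these lines in pyproject.toml (group name hypothetical):

```toml
[project.optional-dependencies]
sqlalchemy = [
    "sqlalchemy>=2.0.36",
    "alembic>=1.14.0",
]
```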

Contributor Author

Yes, for sure.
The question is whether the vectorizer itself should also live in its own extra? I feel like you won't need that in your core Python (e.g. FastAPI) application.
That's almost an argument for a separate package, but I think we want to make use of the pgai brand.

Contributor

I'm a little on the fence about naming this directory extensions; I don't want people to get confused with the db extension. Maybe I'm overreacting.

Contributor

nit: what do you think about moving these class definitions to a non-__init__.py file? I think putting them inside the __init__.py makes them less discoverable while browsing the codebase.

Contributor Author

I could also just have pgai.sqlalchemy and pgai.alembic? What do you think about that?
pgvector-python doesn't have an extension subpackage either.
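Concretely, that layout would read like this (import names illustrative, paths under discussion):

```python
from pgai.sqlalchemy import VectorizerField   # proposed flat subpackage
from pgai.alembic import operations           # module name illustrative
```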

Comment on lines +68 to +69
func.ai.openai_embed(
    'text-embedding-3-small',
    query_text,
    text('dimensions => 768')
)
Contributor

Can we reference this from the Vectorizer config? That way we can remove the cognitive load on the user. Maybe that's what @cevian was referring to when talking about improving the dev experience.

Contributor Author

This function is just plain SQLAlchemy core right now; it works even without my extension code. We can add custom SQLAlchemy functions, e.g. so you don't have to wrap the dimensions parameter in a text() clause, and I could maybe also run a subquery in such a function. We can definitely provide a Python helper that references the underlying vectorizer or loads it from the Python config.

Mat's idea was to do this on the SQL level though; then it's automatically also available in Python, so we save some effort there.
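A minimal sketch of such a custom SQLAlchemy function, assuming pgvector-python's Vector type for the return value (not part of this PR):

```python
from sqlalchemy.sql.functions import GenericFunction
from pgvector.sqlalchemy import Vector


class OpenAIEmbed(GenericFunction):
    """Typed wrapper so func.ai.openai_embed(...) returns a Vector-typed
    expression instead of an untyped generic function call."""

    name = "openai_embed"
    package = "ai"          # keeps the func.ai.openai_embed(...) spelling
    type = Vector()
    inherit_cache = True
```

Named arguments like dimensions => 768 would still need text() or a further wrapper; a helper could instead read them from the vectorizer config, as suggested above.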

Collaborator

I agree we can improve this UX. But I also think we can do that in a separate PR


### Model Relationships

You can optionally create SQLAlchemy relationships between your model and its embeddings:
Contributor

Would it be worth mentioning that we expose the relationship via the view that's created? I don't know much about SQLAlchemy, so maybe this relationship approach is better than querying the view directly.

Contributor Author

I need to look up exactly how relationships work, but generally speaking it's just an eager join that enables the related objects to be loaded as a list on the parent.
Mat mentioned that he's, understandably, a bit skeptical about automatic joins, because you can run into N+1 problems depending on how they are configured, which is why I made the relationship optional.

Collaborator

I wonder if we can set up a relationship but not have it do the eager join by default?
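For reference, SQLAlchemy relationships are lazy ("select") by default, so a mapping like the following would not join unless the attribute is accessed or a query explicitly opts in (class and table names hypothetical):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class BlogPostEmbedding(Base):
    # Hypothetical mapping of the embedding view/table the vectorizer creates.
    __tablename__ = "blog_posts_embedding_store"

    embedding_uuid: Mapped[str] = mapped_column(primary_key=True)
    post_id: Mapped[int] = mapped_column(ForeignKey("blog_posts.id"))
    chunk: Mapped[str]


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    content: Mapped[str]

    # lazy="select" is SQLAlchemy's default: no join at query time; a
    # separate SELECT runs on first attribute access.
    content_embeddings = relationship(BlogPostEmbedding, lazy="select", viewonly=True)


# Eager loading stays an explicit per-query opt-in, e.g. with
# session.query(BlogPost).options(sqlalchemy.orm.selectinload(BlogPost.content_embeddings))
```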

@Askir Askir force-pushed the jascha/python-integration-docs branch 4 times, most recently from 84a27f4 to 19ed565 Compare November 22, 2024 00:17
Comment on lines 149 to 162
def _build_hnsw_indexing_params(config: HNSWIndexingConfig) -> str:
    """Build HNSW indexing configuration parameters."""
    params = []
    if config.min_rows is not None:
        params.append(f"min_rows=>{config.min_rows}")
    if config.opclass is not None:
        params.append(f"opclass=>'{config.opclass}'")
    if config.m is not None:
        params.append(f"m=>{config.m}")
    if config.ef_construction is not None:
        params.append(f"ef_construction=>{config.ef_construction}")
    if config.create_when_queue_empty is not None:
        params.append(f"create_when_queue_empty=>{str(config.create_when_queue_empty).lower()}")
    return f"ai.indexing_hnsw({', '.join(params)})"
Contributor Author

One thing I am considering is making the dataclass config objects smarter.
Currently the logic for each config lives in multiple places:

  • rendering to SQL (here)
  • rendering to Python (in autogenerate.py)
  • loading from the db state (in compare_vectorizers)
  • storing the configuration in the metadata info dict on the vectorizer field

I think if I make these config objects "smarter" and add a few methods to each one:

  • from_db_state
  • to_sql()
  • to_python, or just __repr__?

it should make keeping this in sync with our actual underlying db code a lot easier, since you'll only have to touch one spot in the code and not have to understand how everything works.

Honestly I am a bit disappointed by alembic; I would have thought that at least the to_python part is covered by the library generically already.
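A rough sketch of that direction, using the HNSW config from this hunk (method names as proposed above; everything illustrative):

```python
from dataclasses import dataclass


@dataclass
class HNSWIndexingConfig:
    m: int | None = None
    ef_construction: int | None = None

    def to_sql(self) -> str:
        # Render only the parameters that were set, in ai.indexing_hnsw's
        # named-argument syntax (same output as _build_hnsw_indexing_params).
        params = []
        if self.m is not None:
            params.append(f"m=>{self.m}")
        if self.ef_construction is not None:
            params.append(f"ef_construction=>{self.ef_construction}")
        return f"ai.indexing_hnsw({', '.join(params)})"

    @classmethod
    def from_db_state(cls, config: dict) -> "HNSWIndexingConfig":
        # Inverse direction: rebuild the object from the stored db config.
        return cls(m=config.get("m"), ef_construction=config.get("ef_construction"))
```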

Contributor Author

Note: this actually kind of overlaps with the pydantic models we have in the vectorizer part of the library. Maybe there is also a way to re-use that config. But e.g. IndexingConfig does not exist there and needs to be loaded separately.

Collaborator

I strongly encourage us to have one set of dataclass objects across all of our libraries. That way we have code reuse and only one place to add things. (and I expect we'll constantly be adding things). My mental model would be one set of dataclass objects (probably pydantic based) that's used (imported) by both vectorizer and these kind of integrations. So that could include things like IndexingConfig that's used by only some parts of the code and not others.

Collaborator

It might actually be nice to have access to them in the extension too. I could use the pydantic models to validate the jsonb representations. But I don't know how cumbersome it would be to do this.
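Roughly, that validation could look like this (model and field names are illustrative, not the actual jsonb schema):

```python
from pydantic import BaseModel


class HNSWIndexingConfig(BaseModel):
    implementation: str = "hnsw"
    m: int | None = None
    ef_construction: int | None = None


# Validate a config jsonb payload as stored by the extension:
config = HNSWIndexingConfig.model_validate_json(
    '{"implementation": "hnsw", "m": 16, "ef_construction": 64}'
)
```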

else:
    # Check for configuration changes
    existing_config = existing_vectorizers[table_name]
    if _config_has_changed(model_config, existing_config):
Contributor Author

This change detection part doesn't work yet, no need to review.

Collaborator

@cevian cevian left a comment

I think this is great. Straightforward and helpful approach. Left a few comments. The most important one is about trying to get common, reusable data classes between all of our components.

    title = Column(Text, nullable=False)
    content = Column(Text, nullable=False)

    content_embeddings = VectorizerField(
Collaborator

I think this is a great interface. One nit: VectorizerField or Vectorizer? In my mind this isn't really a field...



def upgrade():
    # Update vectorizer configuration
    op.drop_vectorizer(1, drop_objects=True)
Collaborator

We should probably make it clearer that this is a dangerous and destructive operation which requires re-embedding everything again and paying $$. Perhaps just a code comment explaining this is sufficient.
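For example (sketch):

```python
def upgrade():
    # WARNING: destructive. drop_objects=True also drops the embeddings
    # table and queue, so re-creating the vectorizer means re-embedding
    # every row and paying the embedding provider again.
    op.drop_vectorizer(1, drop_objects=True)
```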

)


def _config_has_changed(
Collaborator

I wonder if this can be as easy as rendering the two configs as create_vectorizer calls and doing a string diff.
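i.e., roughly (rendering helper hypothetical):

```python
def _config_has_changed(model_config, existing_config) -> bool:
    # Render both configs through the same create_vectorizer SQL generator
    # and compare the strings; any drift shows up as an inequality.
    return _render_create_vectorizer(model_config) != _render_create_vectorizer(existing_config)
```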

Collaborator

@jgpruitt jgpruitt left a comment

Only got partway through. Will try to look at the rest after lunch.

text('dimensions => 768')
)
)
)
Collaborator

probably ought to have a limit clause here



@dataclass
class EmbeddingConfig:
Collaborator

I think we'll need an EmbeddingConfigOpenAI and EmbeddingConfigOllama, right?
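Presumably something along these lines (field sets illustrative, loosely mirroring ai.embedding_openai / ai.embedding_ollama):

```python
from dataclasses import dataclass


@dataclass
class EmbeddingConfigOpenAI:
    model: str
    dimensions: int


@dataclass
class EmbeddingConfigOllama:
    model: str
    base_url: str | None = None
```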



@dataclass
class ChunkingConfig:
Collaborator

Don't we need a ChunkingConfigCharacterTextSplitter and a ChunkingConfigRecursiveCharacterTextSplitter?

Contributor Author

I'm only using one right now, because the recursive character text splitter is essentially the same as the character one if you only provide one splitting point, right?
Maybe it's still good to support both for completeness.

Collaborator

Yeah, I'm also just wondering what will happen naming-convention-wise when we add a third chunking strategy.



@dataclass
class SchedulingConfig:
Collaborator

I think this should be SchedulingConfigTimescaledb or something similar. There's a non-zero chance we'll add support for pgcron back as an alternative.

"""Base type for embedding models with required attributes"""

embedding_uuid: Mapped[str]
id: Mapped[int]
Collaborator

We aren't constrained to an integer id foreign key. How should we handle that? Source tables may have compound primary keys of various types.

Contributor Author

Oh, you're right. I'll try to make this generic, but I'm a bit scared of breaking strict typing. I don't want the embedding class to have a possibly unbounded number of Any fields.
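One possible direction that keeps strict typing is a Protocol that only pins down the shared columns (a sketch; the chunk column name is assumed):

```python
from typing import Protocol

from sqlalchemy.orm import Mapped


class EmbeddingModel(Protocol):
    # Only the columns every embedding row shares are required here; the
    # source table's (possibly compound) primary-key columns are declared
    # by the concrete class instead of being hard-coded as id: Mapped[int].
    embedding_uuid: Mapped[str]
    chunk: Mapped[str]
```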

@Askir Askir force-pushed the jascha/python-integration-docs branch from 274665d to 51c4cb9 Compare November 28, 2024 21:24
@Askir Askir force-pushed the jascha/python-integration-docs branch from 8acabab to e64a08e Compare November 29, 2024 04:18
Contributor Author

Askir commented Dec 2, 2024

I made the autogen work but have now separated this rather large PR into 3 smaller ones:
#265, #266 and #267

Closing this PR.

@Askir Askir closed this Dec 2, 2024