
feat: SQLAlchemy and alembic integration #208

Closed
wants to merge 23 commits

Conversation

Contributor

@Askir Askir commented Nov 8, 2024

Adds Python integration for pgai that allows defining and managing vectorizers through SQLAlchemy models and Alembic migrations. The integration provides a declarative interface for vector embeddings and automated migration support.

class Base(DeclarativeBase):
    pass


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    content: Mapped[str]  # source column for the vectorizer

    content_embeddings = VectorizerField(
        source_column="content",
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768
        )
    )


# Semantic search query
results = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed('text-embedding-3-small', 'search query')
        )
    )
    .limit(5)
)

Features:

  • Declarative SQLAlchemy field for vectorizer configuration
  • Alembic operations for creating and dropping vectorizers
  • Support for migration generation for vectorizer changes

Known missing features:

  • Downgrading of migrations
  • Custom sqlalchemy functions for ai.xxx functions
  • Comparison of existing vectorizer does not work yet! (Working on it)
  • Chunking configuration is not separated for character and recursive splitter (will fix)
  • Scheduling config only supports default and specific scheduling right now, not disabling scheduling (will fix)

How to review:

  1. Have a look at some tests and the docs to understand how it is supposed to work. If you don't like this or have questions, stop here and leave a comment on what to improve/change.
  2. Implementation-wise:
  • sqlalchemy/__init__.py: Contains the VectorizerField that defines sqlalchemy embedding models and registers the configuration for alembic.
  • alembic/autogenerate.py: Contains the comparison functions as well as the python code generation.
  • alembic/operations.py: Contains the SQL to be executed (and some alembic boilerplate).
  • configuration.py: Contains dataclasses that map to all the SQL parameters of create_vectorizer; these act as the input for both the VectorizerField and the alembic create_vectorizer and drop_vectorizer operations (see the sketch below).
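For orientation, here is a hedged sketch of what a migration using these operations might look like; the operation and import names below follow the descriptions above but are illustrative, not the PR's exact API:

```python
from alembic import op

# EmbeddingConfig comes from this PR's configuration.py; the import path
# and argument names below mirror ai.create_vectorizer and are illustrative.
from pgai.configuration import EmbeddingConfig  # hypothetical path


def upgrade() -> None:
    # Registers a vectorizer for blog_posts via the custom alembic operation.
    op.create_vectorizer(
        "blog_posts",
        embedding=EmbeddingConfig(
            model="text-embedding-3-small",
            dimensions=768,
        ),
    )
```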

@Askir Askir changed the title feat: add docs about potential python integration feat: SQLAlchemy and alembic integration Nov 20, 2024
@Askir Askir marked this pull request as ready for review November 20, 2024 07:24
@Askir Askir requested a review from a team as a code owner November 20, 2024 07:24
Contributor

@alejandrodnm alejandrodnm left a comment

Did a very brief pass over the files, left some comments. I'll try to do a more thorough review later.

Comment on lines 26 to 28
"sqlalchemy>=2.0.36",
"psycopg2>=2.9.10",
"alembic>=1.14.0",
Contributor

Should we add these dependencies as extras? That way users would do pip install pgai[sqlalchemy]. If we add Django support next, then every user will have sqlalchemy and Django as deps, even if they just want to use the Vectorizer.
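For illustration, the extras could be declared along these lines in pyproject.toml (group name hypothetical):

```toml
[project.optional-dependencies]
sqlalchemy = [
    "sqlalchemy>=2.0.36",
    "alembic>=1.14.0",
]
```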

Contributor Author

Yes, for sure.
The question is whether the vectorizer itself should also live in its own extra? I feel like you won't need that in your core Python (e.g. FastAPI) application.
That's almost an argument for a separate package, but I think we want to make use of the pgai brand.

Contributor

I'm a little on the fence about naming this directory extensions; I don't want people to get confused with the db extension. Maybe I'm overreacting.

Contributor

nit: what do you think about moving these class definitions to a non-__init__.py file? I think putting them inside the __init__.py makes them less discoverable while browsing the codebase.

Contributor Author

I could also just have pgai.sqlalchemy and pgai.alembic? What do you think about that?
pgvector-python doesn't have an extension subpackage either.
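Concretely, that layout would read like this (import names illustrative, paths under discussion):

```python
from pgai.sqlalchemy import VectorizerField   # proposed flat subpackage
from pgai.alembic import operations           # module name illustrative
```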

Comment on lines +68 to +69
func.ai.openai_embed(
    'text-embedding-3-small',
    query_text,
    text('dimensions => 768')
)
Contributor

Can we reference this from the Vectorizer config? That way we can remove the cognitive load on the user. Maybe that's what @cevian was referring to when talking about improving the dev experience.

Contributor Author

This function is just plain SQLAlchemy core right now; it works even without my extension code. We can add custom SQLAlchemy functions, e.g. so you don't have to wrap the dimensions parameter in a text() clause, and I could maybe also run a subquery in such a function. We can definitely provide a Python helper that references the underlying vectorizer or loads it from the Python config.

Mat's idea was to do this on the SQL level though; then it's automatically also available in Python, so we save some effort there.
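A minimal sketch of such a custom SQLAlchemy function, assuming pgvector-python's Vector type for the return value (not part of this PR):

```python
from sqlalchemy.sql.functions import GenericFunction
from pgvector.sqlalchemy import Vector


class OpenAIEmbed(GenericFunction):
    """Typed wrapper so func.ai.openai_embed(...) returns a Vector-typed
    expression instead of an untyped generic function call."""

    name = "openai_embed"
    package = "ai"          # keeps the func.ai.openai_embed(...) spelling
    type = Vector()
    inherit_cache = True
```

Named arguments like dimensions => 768 would still need text() or a further wrapper; a helper could instead read them from the vectorizer config, as suggested above.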

Collaborator

I agree we can improve this UX. But I also think we can do that in a separate PR


### Model Relationships

You can optionally create SQLAlchemy relationships between your model and its embeddings:
Contributor

Would it be worth mentioning that we expose the relationship via the view that's created? I don't know much about SQLAlchemy, so maybe this relationship approach is better than querying the view directly.

Contributor Author

I need to look up exactly how relationships work, but generally speaking it's just an eager join that enables the related objects to be loaded as a list on the parent.
Mat mentioned that he's, understandably, a bit skeptical about automatic joins, because you can run into N+1 problems depending on how they are configured, which is why I made the relationship optional.

Collaborator

I wonder if we can set up a relationship but not have it do the eager join by default?
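For reference, SQLAlchemy relationships are lazy ("select") by default, so a mapping like the following would not join unless the attribute is accessed or a query explicitly opts in (class and table names hypothetical):

```python
from sqlalchemy import ForeignKey
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class BlogPostEmbedding(Base):
    # Hypothetical mapping of the embedding view/table the vectorizer creates.
    __tablename__ = "blog_posts_embedding_store"

    embedding_uuid: Mapped[str] = mapped_column(primary_key=True)
    post_id: Mapped[int] = mapped_column(ForeignKey("blog_posts.id"))
    chunk: Mapped[str]


class BlogPost(Base):
    __tablename__ = "blog_posts"

    id: Mapped[int] = mapped_column(primary_key=True)
    content: Mapped[str]

    # lazy="select" is SQLAlchemy's default: no join at query time; a
    # separate SELECT runs on first attribute access.
    content_embeddings = relationship(BlogPostEmbedding, lazy="select", viewonly=True)


# Eager loading stays an explicit per-query opt-in, e.g. with
# session.query(BlogPost).options(sqlalchemy.orm.selectinload(BlogPost.content_embeddings))
```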

@Askir Askir force-pushed the jascha/python-integration-docs branch 4 times, most recently from 84a27f4 to 19ed565 Compare November 22, 2024 00:17
Comment on lines 149 to 162
def _build_hnsw_indexing_params(config: HNSWIndexingConfig) -> str:
    """Build HNSW indexing configuration parameters."""
    params = []
    if config.min_rows is not None:
        params.append(f"min_rows=>{config.min_rows}")
    if config.opclass is not None:
        params.append(f"opclass=>'{config.opclass}'")
    if config.m is not None:
        params.append(f"m=>{config.m}")
    if config.ef_construction is not None:
        params.append(f"ef_construction=>{config.ef_construction}")
    if config.create_when_queue_empty is not None:
        params.append(f"create_when_queue_empty=>{str(config.create_when_queue_empty).lower()}")
    return f"ai.indexing_hnsw({', '.join(params)})"
Contributor Author

One thing I am considering is making the dataclass config objects smarter.
Currently the logic for each config lives in multiple places:

  • rendering to SQL (here)
  • rendering to Python (in autogenerate.py)
  • loading from the db state (in compare_vectorizers)
  • storing the configuration in the metadata info dict on the vectorizer field

I think if I make these config objects "smarter" and add a few methods to each one:

  • from_db_state
  • to_sql()
  • to_python, or just __repr__?

it should make keeping this in sync with our actual underlying db code a lot easier, since you'll only have to touch one spot in the code and not have to understand how everything works.

Honestly I am a bit disappointed by alembic; I would have thought that at least the to_python part is covered by the library generically already.
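A rough sketch of that direction, using the HNSW config from this hunk (method names as proposed above; everything illustrative):

```python
from dataclasses import dataclass


@dataclass
class HNSWIndexingConfig:
    m: int | None = None
    ef_construction: int | None = None

    def to_sql(self) -> str:
        # Render only the parameters that were set, in ai.indexing_hnsw's
        # named-argument syntax (same output as _build_hnsw_indexing_params).
        params = []
        if self.m is not None:
            params.append(f"m=>{self.m}")
        if self.ef_construction is not None:
            params.append(f"ef_construction=>{self.ef_construction}")
        return f"ai.indexing_hnsw({', '.join(params)})"

    @classmethod
    def from_db_state(cls, config: dict) -> "HNSWIndexingConfig":
        # Inverse direction: rebuild the object from the stored db config.
        return cls(m=config.get("m"), ef_construction=config.get("ef_construction"))
```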

Contributor Author

Note: this actually kind of overlaps with the pydantic models we have in the vectorizer part of the library. Maybe there is also a way to re-use that config. But e.g. IndexingConfig does not exist there and needs to be loaded separately.

Collaborator

I strongly encourage us to have one set of dataclass objects across all of our libraries. That way we have code reuse and only one place to add things. (and I expect we'll constantly be adding things). My mental model would be one set of dataclass objects (probably pydantic based) that's used (imported) by both vectorizer and these kind of integrations. So that could include things like IndexingConfig that's used by only some parts of the code and not others.

Collaborator

It might actually be nice to have access to them in the extension too. I could use the pydantic models to validate the jsonb representations. But I don't know how cumbersome it would be to do this.
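Roughly, that validation could look like this (model and field names are illustrative, not the actual jsonb schema):

```python
from pydantic import BaseModel


class HNSWIndexingConfig(BaseModel):
    implementation: str = "hnsw"
    m: int | None = None
    ef_construction: int | None = None


# Validate a config jsonb payload as stored by the extension:
config = HNSWIndexingConfig.model_validate_json(
    '{"implementation": "hnsw", "m": 16, "ef_construction": 64}'
)
```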

else:
    # Check for configuration changes
    existing_config = existing_vectorizers[table_name]
    if _config_has_changed(model_config, existing_config):
Contributor Author

This change detection part doesn't work yet, no need to review.

Collaborator

@cevian cevian left a comment

I think this is great. Straightforward and helpful approach. Left a few comments. The most important one is about trying to get common, reusable data classes between all of our components.

    title = Column(Text, nullable=False)
    content = Column(Text, nullable=False)

    content_embeddings = VectorizerField(
Collaborator

I think this is a great interface. One nit: VectorizerField or Vectorizer? In my mind this isn't really a field...



def upgrade():
    # Update vectorizer configuration
    op.drop_vectorizer(1, drop_objects=True)
Collaborator

We should probably make it clearer that this is a dangerous and destructive operation which requires re-embedding everything again and paying $$. Perhaps just a code comment explaining this is sufficient.
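For example (sketch):

```python
def upgrade():
    # WARNING: destructive. drop_objects=True also drops the embeddings
    # table and queue, so re-creating the vectorizer means re-embedding
    # every row and paying the embedding provider again.
    op.drop_vectorizer(1, drop_objects=True)
```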

)


def _config_has_changed(
Collaborator

I wonder if this can be as easy as rendering the two configs as create_vectorizer calls and doing a string diff.
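i.e., roughly (rendering helper hypothetical):

```python
def _config_has_changed(model_config, existing_config) -> bool:
    # Render both configs through the same create_vectorizer SQL generator
    # and compare the strings; any drift shows up as an inequality.
    return _render_create_vectorizer(model_config) != _render_create_vectorizer(existing_config)
```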

Collaborator

@jgpruitt jgpruitt left a comment

Only got partway through. Will try to look at the rest after lunch.

text('dimensions => 768')
)
)
)
Collaborator

probably ought to have a limit clause here



@dataclass
class EmbeddingConfig:
Collaborator

I think we'll need an EmbeddingConfigOpenAI and EmbeddingConfigOllama, right?
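Presumably something along these lines (field sets illustrative, loosely mirroring ai.embedding_openai / ai.embedding_ollama):

```python
from dataclasses import dataclass


@dataclass
class EmbeddingConfigOpenAI:
    model: str
    dimensions: int


@dataclass
class EmbeddingConfigOllama:
    model: str
    base_url: str | None = None
```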



@dataclass
class ChunkingConfig:
Collaborator

Don't we need a ChunkingConfigCharacterTextSplitter and a ChunkingConfigRecursiveCharacterTextSplitter?

Contributor Author

I'm only using one right now, because the recursive character text splitter is essentially the same as the character one if you only provide one splitting point, right?
Maybe it's still good to support both for completeness.

Collaborator

Yeah, I'm also just wondering what will happen naming-convention-wise when we add a third chunking strategy.



@dataclass
class SchedulingConfig:
Collaborator

I think this should be SchedulingConfigTimescaledb or something similar. There's a non-zero chance we'll add support for pgcron back as an alternative.

"""Base type for embedding models with required attributes"""

embedding_uuid: Mapped[str]
id: Mapped[int]
Collaborator

We aren't constrained to an integer id foreign key. How should we handle that? Source tables may have compound primary keys of various types.

Contributor Author

Oh, you're right. I'll try to make this generic, but I'm a bit scared of breaking strict typing. I don't want the embedding class to have a possibly unbounded number of Any fields.
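One possible direction that keeps strict typing is a Protocol that only pins down the shared columns (a sketch; the chunk column name is assumed):

```python
from typing import Protocol

from sqlalchemy.orm import Mapped


class EmbeddingModel(Protocol):
    # Only the columns every embedding row shares are required here; the
    # source table's (possibly compound) primary-key columns are declared
    # by the concrete class instead of being hard-coded as id: Mapped[int].
    embedding_uuid: Mapped[str]
    chunk: Mapped[str]
```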

@Askir Askir force-pushed the jascha/python-integration-docs branch from 274665d to 51c4cb9 Compare November 28, 2024 21:24
@Askir Askir force-pushed the jascha/python-integration-docs branch from 8acabab to e64a08e Compare November 29, 2024 04:18
Contributor Author

Askir commented Dec 2, 2024

I made the autogen work but have now separated this rather large PR into 3 smaller ones:
#265, #266 and #267

Closing this PR.

@Askir Askir closed this Dec 2, 2024