feat: add alembic operations for vectorizer (#266)
* feat: add alembic operations for vectorizer

* chore: cleanup set up of operations

* chore: add shared base class

* docs: update docs

* chore: unify sql generation

* chore: add more test cases

* chore: simplify code and tests a bit

* chore: use shared base classes, make use of more optional params

* chore: revert dockerfile change

* chore: move configuration to alembic package

* feat: add code generation for migration dataclasses

* chore: downgrade voyageai for tests

* chore: rename table_name to target_table_name

* feat: expose CreateVectorizer directly and add docs for it

* chore: update docs/python-integration.md

Co-authored-by: James Guthrie <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

* chore: update projects/pgai/pgai/vectorizer/generate/README.md

Co-authored-by: James Guthrie <[email protected]>
Signed-off-by: Jascha Beste <[email protected]>

* chore: fix link to code gen

* chore: remove default_value from code generation

* chore: add some tests for vectorizer creation from python

* chore: upgrade uv to 0.5.20

---------

Signed-off-by: Jascha Beste <[email protected]>
Co-authored-by: James Guthrie <[email protected]>
Askir and JamesGuthrie authored Jan 22, 2025
1 parent 3ce3134 commit b01acfe
Showing 27 changed files with 2,591 additions and 1,143 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -109,7 +109,7 @@ jobs:
       - name: Install uv
         uses: astral-sh/setup-uv@v3
         with:
-          version: "0.4.29"
+          version: "0.5.20"
           enable-cache: true
           cache-dependency-glob: "./projects/pgai/uv.lock"

2 changes: 1 addition & 1 deletion .github/workflows/release-please.yml
@@ -33,7 +33,7 @@ jobs:
       - name: Install uv
         uses: astral-sh/setup-uv@v3
         with:
-          version: "0.4.29"
+          version: "0.5.20"
           enable-cache: true
           cache-dependency-glob: "./projects/pgai/uv.lock"

5 changes: 5 additions & 0 deletions docs/adding-embedding-integration.md
@@ -56,6 +56,11 @@ testing we would like.
[embedders directory]:/projects/pgai/pgai/vectorizer/embedders
[embedders \_\_init\_\_.py]:/projects/pgai/pgai/vectorizer/embedders/__init__.py

## pgai library
The pgai library exposes helpers for creating a vectorizer from pure Python.
The classes behind these helpers are produced by a code generator. To add a new integration
to them, see the code generator docs in [/projects/pgai/pgai/vectorizer/generate](/projects/pgai/pgai/vectorizer/generate/README.md).

## Documentation

Ensure that the new integration is documented:
83 changes: 82 additions & 1 deletion docs/python-integration.md
@@ -1,3 +1,38 @@
# Creating vectorizers from Python

To create a vectorizer from Python, use the `CreateVectorizer` helper class from the `pgai.vectorizer` module.
It accepts all the options listed in the [SQL API](vectorizer-api-reference.md) and exposes a `to_sql`
method that generates a SQL query, which you can then run through the SQL library of your choice.

First install the pgai library:
```bash
pip install pgai
```

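Then build the vectorizer statement in Python. `to_sql()` only returns the SQL query; the vectorizer is created once you execute that query against your database: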

```python
from pgai.vectorizer import CreateVectorizer
from pgai.vectorizer.configuration import OpenAIConfig, CharacterTextSplitterConfig, PythonTemplateConfig

vectorizer_statement = CreateVectorizer(
    source_table="blog",
    target_table='blog_embeddings',
    embedding=OpenAIConfig(
        model='text-embedding-3-small',
        dimensions=768
    ),
    chunking=CharacterTextSplitterConfig(
        chunk_column='content',
        chunk_size=800,
        chunk_overlap=400,
        separator='.',
        is_separator_regex=False
    ),
    formatting=PythonTemplateConfig(template='$title - $chunk')
).to_sql()
```
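
Executing the generated statement is left to the caller. The following is a minimal sketch of that last step, assuming a DB-API 2.0 connection; the helper name `run_statement` is hypothetical and not part of pgai. The demo uses an in-memory SQLite database and a harmless statement purely so the snippet runs on its own. In practice you would pass a connection to a PostgreSQL database with pgai installed, together with the value returned by `to_sql()`.

```python
import sqlite3


def run_statement(connection, statement: str) -> None:
    """Execute one SQL statement on a DB-API 2.0 connection and commit it.

    `run_statement` is a hypothetical helper, not part of pgai.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(statement)
        connection.commit()
    except Exception:
        # Roll back so the connection stays usable after a failed statement.
        connection.rollback()
        raise
    finally:
        cursor.close()


# Stand-in demo: SQLite and a plain CREATE TABLE replace PostgreSQL and the
# statement produced by CreateVectorizer(...).to_sql().
demo_connection = sqlite3.connect(":memory:")
run_statement(
    demo_connection,
    "CREATE TABLE blog (id INTEGER PRIMARY KEY, content TEXT)",
)
```

Committing inside the helper keeps the sketch short; if you manage transactions yourself, drop the `commit()` call and let the surrounding code decide when to commit.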

# SQLAlchemy Integration with pgai Vectorizer

The `vectorizer_relationship` is a SQLAlchemy helper that integrates pgai's vectorization capabilities directly into your SQLAlchemy models.
@@ -165,7 +200,7 @@ for post, embedding in results:

## Working with alembic


### Excluding managed tables
The `vectorizer_relationship` generates a new SQLAlchemy model that is available under the attribute you specify. If you use alembic's autogenerate functionality to generate migrations, you need to exclude these models from the autogenerate process.
They are added to a list in your metadata called `pgai_managed_tables`, and you can exclude them by adding the following to your `env.py`:

@@ -183,3 +218,49 @@ context.configure(
```

This should now prevent alembic from generating tables for these models when you run `alembic revision --autogenerate`.


### Creating vectorizers
pgai provides native Alembic operations for managing vectorizers. For them to work, call `register_operations` in your `env.py` file; this registers the pgai operations under the global `op` context:

```python
from pgai.alembic import register_operations

register_operations()
```

You can then use the `create_vectorizer` operation to create a vectorizer for your model and the `drop_vectorizer` operation to remove it.

```python
from alembic import op
from pgai.vectorizer.configuration import (
    OpenAIConfig,
    CharacterTextSplitterConfig,
    PythonTemplateConfig
)


def upgrade() -> None:
    op.create_vectorizer(
        source_table="blog",
        target_table='blog_embeddings',
        embedding=OpenAIConfig(
            model='text-embedding-3-small',
            dimensions=768
        ),
        chunking=CharacterTextSplitterConfig(
            chunk_column='content',
            chunk_size=800,
            chunk_overlap=400,
            separator='.',
            is_separator_regex=False
        ),
        formatting=PythonTemplateConfig(template='$title - $chunk')
    )


def downgrade() -> None:
    # The keyword matches the `target_table` parameter of `drop_vectorizer`
    # in pgai/alembic/operations.py.
    op.drop_vectorizer(target_table="blog_embeddings", drop_all=True)
```

The `create_vectorizer` operation supports all configuration options available in the [SQL API](vectorizer-api-reference.md).
4 changes: 4 additions & 0 deletions projects/extension/Dockerfile
@@ -52,6 +52,10 @@ WORKDIR /pgai
COPY . .
RUN just build install

RUN mkdir -p /docker-entrypoint-initdb.d && \
    echo "#!/bin/bash" > /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
    echo "echo \"shared_preload_libraries = 'timescaledb'\" >> \${PGDATA}/postgresql.conf" >> /docker-entrypoint-initdb.d/configure-timescaledb.sh && \
    chmod +x /docker-entrypoint-initdb.d/configure-timescaledb.sh

###############################################################################
# image for use in extension development
2 changes: 1 addition & 1 deletion projects/pgai/Dockerfile
@@ -3,7 +3,7 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
     UV_PROJECT_ENVIRONMENT=/usr/local/
 WORKDIR /build
-COPY --from=ghcr.io/astral-sh/uv:0.4.29 /uv /uvx /bin/
+COPY --from=ghcr.io/astral-sh/uv:0.5.20 /uv /uvx /bin/

 COPY pyproject.toml uv.lock ./

7 changes: 7 additions & 0 deletions projects/pgai/pgai/alembic/__init__.py
@@ -0,0 +1,7 @@
from pgai.alembic.operations import (
    CreateVectorizerOp,
    DropVectorizerOp,
    register_operations,
)

__all__ = ["CreateVectorizerOp", "DropVectorizerOp", "register_operations"]
73 changes: 73 additions & 0 deletions projects/pgai/pgai/alembic/operations.py
@@ -0,0 +1,73 @@
from typing import Any

from alembic.operations import MigrateOperation, Operations
from sqlalchemy import text

from pgai.vectorizer.create_vectorizer import CreateVectorizer


class CreateVectorizerOp(MigrateOperation):
    """Alembic operation that creates a vectorizer."""

    def __init__(
        self,
        **kw: Any,
    ):
        # All keyword arguments are forwarded to CreateVectorizer.
        self.params = CreateVectorizer(
            **kw  # type: ignore
        )

    @classmethod
    def create_vectorizer(cls, operations: Operations, **kw: Any):
        op = CreateVectorizerOp(**kw)  # type: ignore
        return operations.invoke(op)


class DropVectorizerOp(MigrateOperation):
    """Alembic operation that drops the vectorizer of a target table."""

    def __init__(self, target_table: str | None, drop_all: bool):
        self.target_table = target_table
        self.drop_all = drop_all

    @classmethod
    def drop_vectorizer(
        cls,
        operations: Operations,
        target_table: str | None,
        drop_all: bool = True,
    ):
        op = DropVectorizerOp(target_table, drop_all)
        return operations.invoke(op)


def create_vectorizer(operations: Operations, operation: CreateVectorizerOp):
    params = operation.params
    operations.execute(params.to_sql())


def drop_vectorizer(operations: Operations, operation: DropVectorizerOp):
    connection = operations.get_bind()
    # Look up the vectorizer id by its target table.
    result = connection.execute(
        text("SELECT id FROM ai.vectorizer WHERE target_table = :table_name"),
        {"table_name": operation.target_table},
    ).scalar()

    # No matching vectorizer: nothing to drop.
    if result is None:
        return

    # Drop the vectorizer
    connection.execute(
        text("SELECT ai.drop_vectorizer(:id, drop_all=>:drop_all)"),
        {"id": result, "drop_all": operation.drop_all},
    )


_operations_registered = False


def register_operations():
    global _operations_registered

    # Register only once per process.
    if not _operations_registered:
        Operations.register_operation("create_vectorizer")(CreateVectorizerOp)
        Operations.register_operation("drop_vectorizer")(DropVectorizerOp)
        Operations.implementation_for(CreateVectorizerOp)(create_vectorizer)
        Operations.implementation_for(DropVectorizerOp)(drop_vectorizer)
        _operations_registered = True
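
The `drop_vectorizer` implementation above first looks up the vectorizer id by target table and returns early when nothing matches, so a downgrade does not fail if the vectorizer is already gone. The following sketch reproduces only that control flow, under loud assumptions: an in-memory SQLite table named `vectorizer` stands in for `ai.vectorizer`, a plain `DELETE` stands in for the call to `ai.drop_vectorizer`, and the function name `drop_vectorizer_if_exists` is hypothetical.

```python
import sqlite3


def drop_vectorizer_if_exists(
    connection: sqlite3.Connection, target_table: str
) -> bool:
    """Look up a vectorizer by target table and remove it if present.

    Returns True when a vectorizer was removed, False when none matched.
    """
    row = connection.execute(
        "SELECT id FROM vectorizer WHERE target_table = :table_name",
        {"table_name": target_table},
    ).fetchone()
    if row is None:
        # Nothing registered for this table: the operation is a no-op.
        return False
    # Stand-in for: SELECT ai.drop_vectorizer(:id, drop_all=>:drop_all)
    connection.execute("DELETE FROM vectorizer WHERE id = :id", {"id": row[0]})
    return True


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE vectorizer (id INTEGER PRIMARY KEY, target_table TEXT)"
)
connection.execute(
    "INSERT INTO vectorizer (target_table) VALUES ('blog_embeddings')"
)

removed = drop_vectorizer_if_exists(connection, "blog_embeddings")
removed_again = drop_vectorizer_if_exists(connection, "blog_embeddings")
```

The second call finds no row and returns without touching the database, which mirrors the early `return` in the operation.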
3 changes: 2 additions & 1 deletion projects/pgai/pgai/vectorizer/__init__.py
@@ -1,3 +1,4 @@
+from .create_vectorizer import CreateVectorizer
 from .vectorizer import Vectorizer, Worker

-__all__ = ["Vectorizer", "Worker"]
+__all__ = ["Vectorizer", "Worker", "CreateVectorizer"]