docs: update docs with review comments
Askir committed Dec 7, 2024
1 parent e8307cd commit c108a87
Showing 2 changed files with 417 additions and 53 deletions.
124 changes: 72 additions & 52 deletions docs/python-integration.md
@@ -1,22 +1,22 @@
# SQLAlchemy Integration with pgAI Vectorizer
# SQLAlchemy Integration with pgai Vectorizer

The `VectorizerField` is a SQLAlchemy helper type that integrates pgAI's vectorization capabilities directly into your SQLAlchemy models. This allows you to easily query vector embeddings created by pgai using familiar SQLAlchemy patterns.
The `Vectorizer` is a SQLAlchemy helper type that integrates pgai's vectorization capabilities directly into your SQLAlchemy models. This allows you to easily query vector embeddings created by pgai using familiar SQLAlchemy patterns.

## Installation

To use the SQLAlchemy integration, install pgAI with the SQLAlchemy extras:
To use the SQLAlchemy integration, install pgai with the SQLAlchemy extras:

```bash
pip install "pgai[sqlalchemy]"
```

## Basic Usage

Here's a basic example of how to use the `VectorizerField`:
Here's a basic example of how to use the `Vectorizer`:

```python
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from pgai.sqlalchemy import VectorizerField, EmbeddingModel
from pgai.sqlalchemy import Vectorizer, EmbeddingModel

class Base(DeclarativeBase):
    pass
@@ -29,46 +29,66 @@ class BlogPost(Base):
    content: Mapped[str]

    # Add vector embeddings for the content field
    content_embeddings = VectorizerField(
    content_embeddings = Vectorizer(
        dimensions=768,
        add_relationship=True,
    )

    # Optional: Type hint for the relationship
    content_embeddings_relation: Mapped[list[EmbeddingModel["BlogPost"]]]
```
Note: if you use Alembic's autogenerate functionality for migrations, also see [Working with alembic](#working-with-alembic).

## Configuration
### Semantic Search

The `VectorizerField` accepts the following parameters:
You can then perform semantic similarity search on the field using [pgvector-python's](https://github.com/pgvector/pgvector-python) distance functions:

- `dimensions` (int): The size of the embedding vector (required)
- `target_schema` (str, optional): Override the schema for the embeddings table. If not provided, inherits from the parent model's schema
- `target_table` (str, optional): Override the table name for embeddings. Default is `{table_name}_{field_name}_store`
- `add_relationship` (bool): Whether to automatically create a relationship to the embeddings table (default: False)
```python
from sqlalchemy import func, text

**Note:** The `VectorizerField` generates a new SQLAlchemy model that is available under the attribute that you specify. If you are using Alembic's autogenerate functionality to generate migrations, you may need to exclude these models from the autogenerate process.
They are tagged with `pgai_managed=True` so you can simply exclude them by adding the following to your `env.py`:
similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed(
                "text-embedding-3-small",
                "search query",
                text("dimensions => 768")
            )
        )
    )
    .limit(5)
    .all()
)
```

```python
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
        return False
    return True
Or if you already have the embeddings in your application:

context.configure(
    connection=connection,
    target_metadata=target_metadata,
    include_object=include_object
)
```python
similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            [3, 1, 2]
        )
    )
    .limit(5)
    .all()
)
```
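
To make the ordering in the queries above concrete, here is a minimal pure-Python sketch of what ordering by cosine distance does. It assumes cosine distance is defined as 1 minus cosine similarity, which is how pgvector defines it. The `chunks` dictionary and its vectors are made up for illustration and do not come from pgai or from any real embedding model.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Query embedding, matching the literal vector used in the query above.
query = [3, 1, 2]

# Hypothetical stored chunk embeddings (illustrative values only).
chunks = {
    "same direction": [6, 2, 4],
    "somewhat related": [1, 3, 2],
    "orthogonal": [1, -3, 0],
}

# Equivalent of ORDER BY cosine_distance(...): smallest distance first.
ranked = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
print(ranked)
```

A smaller distance means the chunk points in a more similar direction to the query, so the ascending `order_by` combined with `.limit(5)` returns the five closest chunks.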

The model is only created at runtime, so depending on how your alembic migrations are set up this step could be skipped. Simply see what happens if you run `alembic revision --autogenerate` and if the model is included, add the above code.
## Configuration

The `Vectorizer` accepts the following parameters:

- `dimensions` (int): The size of the embedding vector (required)
- `target_schema` (str, optional): Override the schema for the embeddings table. If not provided, inherits from the parent model's schema
- `target_table` (str, optional): Override the table name for embeddings. Default is `{table_name}_{field_name}_store`
- `add_relationship` (bool, optional): Whether to automatically create a relationship to the embeddings table (default: False)
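
As a quick illustration of the default `target_table` naming pattern listed above, the following sketch builds the name by hand. The helper `default_embedding_table_name` is hypothetical and not part of pgai, and the table name `blog_posts` is an assumption made only for this example.

```python
def default_embedding_table_name(table_name: str, field_name: str) -> str:
    """Hypothetical helper mirroring the documented `{table_name}_{field_name}_store` pattern."""
    return f"{table_name}_{field_name}_store"


# Assumes the parent model's table is called "blog_posts" (illustrative only).
name = default_embedding_table_name("blog_posts", "content_embeddings")
print(name)
```

If that name clashes with an existing table, passing `target_table` (and optionally `target_schema`) overrides it.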

## Setting up the Vectorizer

After defining your model, you need to create the vectorizer using pgAI's SQL functions:
After defining your model, you need to create the vectorizer using pgai's SQL functions:

```sql
SELECT ai.create_vectorizer(
@@ -85,9 +105,10 @@ SELECT ai.create_vectorizer(

We recommend adding this to a migration script and running it via Alembic.


## Querying Embeddings

The `VectorizerField` provides several ways to work with embeddings:
The `Vectorizer` provides several ways to work with embeddings:

### 1. Direct Access to Embeddings

@@ -110,35 +131,13 @@ blog_post = session.query(BlogPost).first()
for embedding in blog_post.content_embeddings_relation:  # Note: uses _relation suffix
    print(embedding.chunk)
```

### 3. Semantic Search

You can perform semantic similarity searches using [pgvector-pythons](https://github.com/pgvector/pgvector-python) distance functions:

Access the original posts through the parent relationship
```python
from sqlalchemy import func, text

similar_posts = (
    session.query(BlogPost.content_embeddings)
    .order_by(
        BlogPost.content_embeddings.embedding.cosine_distance(
            func.ai.openai_embed(
                "text-embedding-3-small",
                "search query",
                text("dimensions => 768")
            )
        )
    )
    .limit(5)
    .all()
)

# Access the original posts through the parent relationship
for embedding in similar_posts:
    print(embedding.parent.title)
```

### 4. Join Queries
### 3. Join Queries

You can combine embedding queries with regular SQL queries using the relationship:

@@ -153,4 +152,25 @@ results = (
for post, embedding in results:
    print(f"Title: {post.title}")
    print(f"Chunk: {embedding.chunk}")
```
```

## Working with alembic


The `Vectorizer` generates a new SQLAlchemy model that is available under the attribute you specify. If you use Alembic's autogenerate functionality to generate migrations, you will need to exclude these models from the autogenerate process.
Their table names are collected under the `pgai_managed_tables` key of your metadata's `info`, and you can exclude them by adding the following to your `env.py`:

```python
def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
        return False
    return True

context.configure(
    connection=connection,
    target_metadata=target_metadata,
    include_object=include_object
)
```

This should now prevent alembic from generating tables for these models when you run `alembic revision --autogenerate`.
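
To check the filter's behaviour without running a migration, the sketch below calls `include_object` directly. It uses a `SimpleNamespace` as a stand-in for the real `MetaData` object in `env.py`, and the table names `blog_posts` and `blog_posts_content_embeddings_store` are assumptions chosen for illustration.

```python
from types import SimpleNamespace

# Stand-in for target_metadata from env.py; only the `info` dict matters here.
# The registered table name is an assumed example, not taken from a real schema.
target_metadata = SimpleNamespace(
    info={"pgai_managed_tables": {"blog_posts_content_embeddings_store"}}
)


def include_object(object, name, type_, reflected, compare_to):
    # Skip tables that pgai manages; keep everything else.
    if type_ == "table" and name in target_metadata.info.get("pgai_managed_tables", set()):
        return False
    return True


# The embeddings table is excluded from autogenerate, the regular table is kept.
skipped = include_object(None, "blog_posts_content_embeddings_store", "table", False, None)
kept = include_object(None, "blog_posts", "table", False, None)
print(skipped, kept)
```

Because the lookup falls back to an empty set, the filter keeps every table when no pgai-managed tables have been registered.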