Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add alembic operations for vectorizer #266

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

Askir
Copy link
Contributor

@Askir Askir commented Dec 2, 2024

This PR adds native python operations to alembic so you don't have to write SQL to create vectorizers.

@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from c899380 to fd9f1bc Compare December 2, 2024 10:08
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from fd9f1bc to 6f5ff59 Compare December 3, 2024 13:37
@Askir Askir marked this pull request as ready for review December 3, 2024 23:16
@Askir Askir requested a review from a team as a code owner December 3, 2024 23:16
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8742af8 to 36cf4d9 Compare December 4, 2024 13:13
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 6f5ff59 to 525ab5b Compare December 4, 2024 13:20
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotta say I'm not convinced about the arguments for using a separate model than the models already in pgai/vectorizer or at least having both sets of models extend a base model. I think having 2 sets of models with similar params is really hard to maintain and quite a bit of code duplication. I'd like some more eyes on this tho. Can
James and/or Alejandro chime in here. In particular I'd like us to consider three designs:

  • simply extending the pydantic model we already have with optional fields that are present in either the stored json OR needed for the alembic stuff + having some kind of wrappers to create the config objects in alembic.
  • Factoring common data fields into base classes and using those as mixins. (kinda like the ApiKeyMixin now).
  • Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.
    leaving a few comments in but I think this is the big issue we need to resolve

parts.append(")")
return "\n".join(parts)

def to_python(self, autogen_context: AutogenContext) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The to_python stuff is only needed if we do autogeneration right? So should really be in the 3rd PR?

_operations_registered = False


def register_operations():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this also only needed for the autogen stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is necessary to make the create and drop vectorizer functions available under alembic op context.
So you can call op.create_vectorizer

All the alembic operations are managed this way. You can read up about the register functions here. They are meant to be used as a decorator but then they only work if the module is imported. So I think asking the user to explicitly call register_operations, is less confusing.

@override
def reverse(self) -> MigrateOperation:
"""Creates the upgrade operation"""
return CreateVectorizerOp(None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the None argument for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reverse function is used to allow reversing migrations (upgrade->downgrade) and it seems to have to create a valid MIgrateOperation Object.
However to reverse a drop operation I'd have to fully load the current config and pass all the parameters back into the create operations (similar to what the autogenerate PR does).

To avoid this I made the CreateVectorizerOps source_table nullable, this doesn't create a valid operation since you can't call ai.create_vectorizer(NULL) but at least the alembic interface is happy.

def to_python_arg(self) -> str: ...


def format_python_arg(config_type: str, instance: Any) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we right a similar function for sql to make all this simpler to write? I mean all parameters can be passed in with {name} => {value} and the way to render the value is type dependent, which we can switch on. We can also skip any None's etc

'content',
50, -- chunk_size
10 -- chunk_overlap
pgai provides native Alembic operations for managing vectorizers. For them to work you need to run `setup_alembic` in your env.py file. Which registers the pgai operations under the global op context:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is setup_alembic? Is it something that people familiar with alembic know about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this to register_operations now and explained it here: #266 (comment)
Hope that makes it a bit more clear what this is for.

@Askir Askir force-pushed the jascha/add-vectorizer-field branch 3 times, most recently from e342f0b to c108a87 Compare December 7, 2024 06:34
@Askir Askir force-pushed the jascha/add-vectorizer-field branch 5 times, most recently from 6bb0414 to 4c8e1a0 Compare December 12, 2024 12:55
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 4c8e1a0 to 3b47afc Compare December 12, 2024 13:33
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 3b47afc to 8fe145e Compare December 12, 2024 13:46
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 525ab5b to 5e76cf9 Compare December 12, 2024 16:44
@override
def reverse(self) -> MigrateOperation:
"""Creates the upgrade operation"""
return CreateVectorizerOp(None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reverse function is used to allow reversing migrations (upgrade->downgrade) and it seems to have to create a valid MIgrateOperation Object.
However to reverse a drop operation I'd have to fully load the current config and pass all the parameters back into the create operations (similar to what the autogenerate PR does).

To avoid this I made the CreateVectorizerOps source_table nullable, this doesn't create a valid operation since you can't call ai.create_vectorizer(NULL) but at least the alembic interface is happy.

_operations_registered = False


def register_operations():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is necessary to make the create and drop vectorizer functions available under alembic op context.
So you can call op.create_vectorizer

All the alembic operations are managed this way. You can read up about the register functions here. They are meant to be used as a decorator but then they only work if the module is imported. So I think asking the user to explicitly call register_operations, is less confusing.

Comment on lines 29 to 35
chunking=CharacterTextSplitterConfig(
chunk_column='content',
chunk_size=800,
chunk_overlap=400,
separator='.',
is_separator_regex=False
),
Copy link
Contributor Author

@Askir Askir Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cevian This is a bit of an ugly side effect of having merged the parent classes now.
The pydantic models that map the db config have these fields as non optional because the extension code provides the defaults.

But the input function should have these as optional for the same reason (if they are null the extension will provide e.g. chunk_size=800.

So now in the python code you have to unnecessarily always provide all the arguments (true for some of the other configs as well).

Alternatively I could give these the same defaults as the extension code provides, but then we have to again maintain the defaults in 2 places.

This was previously not an issue as I could just make the input arguments nullable while leaving the load from db models required. I can override this of course in the child classes, but then again this breaks the purpose of sharing a parent class in the first place.

@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8fe145e to 882f91e Compare December 19, 2024 11:40
Base automatically changed from jascha/add-vectorizer-field to main December 19, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants