-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add alembic operations for vectorizer #266
base: main
Are you sure you want to change the base?
Conversation
c899380
to
fd9f1bc
Compare
fd9f1bc
to
6f5ff59
Compare
8742af8
to
36cf4d9
Compare
6f5ff59
to
525ab5b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotta say I'm not convinced about the arguments for using a separate model than the models already in pgai/vectorizer or at least having both sets of models extend a base model. I think having 2 sets of models with similar params is really hard to maintain and quite a bit of code duplication. I'd like some more eyes on this tho. Can
James and/or Alejandro chime in here. In particular I'd like us to consider three designs:
- simply extending the pydantic model we already have with optional fields that are present in either the stored json OR needed for the alembic stuff + having some kind of wrappers to create the config objects in alembic.
- Factoring common data fields into base classes and using those as mixins. (kinda like the ApiKeyMixin now).
- Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.
leaving a few comments in but I think this is the big issue we need to resolve
projects/pgai/pgai/configuration.py
Outdated
parts.append(")") | ||
return "\n".join(parts) | ||
|
||
def to_python(self, autogen_context: AutogenContext) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The to_python stuff is only needed if we do autogeneration right? So should really be in the 3rd PR?
_operations_registered = False | ||
|
||
|
||
def register_operations(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this also only needed for the autogen stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this is necessary to make the create and drop vectorizer functions available under alembic op
context.
So you can call op.create_vectorizer
All the alembic operations are managed this way. You can read up about the register functions here. They are meant to be used as a decorator but then they only work if the module is imported. So I think asking the user to explicitly call register_operations
, is less confusing.
@override | ||
def reverse(self) -> MigrateOperation: | ||
"""Creates the upgrade operation""" | ||
return CreateVectorizerOp(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the None argument for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reverse function is used to allow reversing migrations (upgrade->downgrade) and it seems to have to create a valid MIgrateOperation Object.
However to reverse a drop operation I'd have to fully load the current config and pass all the parameters back into the create operations (similar to what the autogenerate PR does).
To avoid this I made the CreateVectorizerOps source_table nullable, this doesn't create a valid operation since you can't call ai.create_vectorizer(NULL)
but at least the alembic interface is happy.
projects/pgai/pgai/configuration.py
Outdated
def to_python_arg(self) -> str: ... | ||
|
||
|
||
def format_python_arg(config_type: str, instance: Any) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we right a similar function for sql to make all this simpler to write? I mean all parameters can be passed in with {name} => {value} and the way to render the value is type dependent, which we can switch on. We can also skip any None's etc
docs/python-integration.md
Outdated
'content', | ||
50, -- chunk_size | ||
10 -- chunk_overlap | ||
pgai provides native Alembic operations for managing vectorizers. For them to work you need to run `setup_alembic` in your env.py file. Which registers the pgai operations under the global op context: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is setup_alembic
? Is it something that people familiar with alembic know about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this to register_operations
now and explained it here: #266 (comment)
Hope that makes it a bit more clear what this is for.
e342f0b
to
c108a87
Compare
6bb0414
to
4c8e1a0
Compare
4c8e1a0
to
3b47afc
Compare
3b47afc
to
8fe145e
Compare
525ab5b
to
5e76cf9
Compare
@override | ||
def reverse(self) -> MigrateOperation: | ||
"""Creates the upgrade operation""" | ||
return CreateVectorizerOp(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reverse function is used to allow reversing migrations (upgrade->downgrade) and it seems to have to create a valid MIgrateOperation Object.
However to reverse a drop operation I'd have to fully load the current config and pass all the parameters back into the create operations (similar to what the autogenerate PR does).
To avoid this I made the CreateVectorizerOps source_table nullable, this doesn't create a valid operation since you can't call ai.create_vectorizer(NULL)
but at least the alembic interface is happy.
_operations_registered = False | ||
|
||
|
||
def register_operations(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this is necessary to make the create and drop vectorizer functions available under alembic op
context.
So you can call op.create_vectorizer
All the alembic operations are managed this way. You can read up about the register functions here. They are meant to be used as a decorator but then they only work if the module is imported. So I think asking the user to explicitly call register_operations
, is less confusing.
chunking=CharacterTextSplitterConfig( | ||
chunk_column='content', | ||
chunk_size=800, | ||
chunk_overlap=400, | ||
separator='.', | ||
is_separator_regex=False | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cevian This is a bit of an ugly side effect of having merged the parent classes now.
The pydantic models that map the db config have these fields as non optional because the extension code provides the defaults.
But the input function should have these as optional for the same reason (if they are null the extension will provide e.g. chunk_size=800
.
So now in the python code you have to unnecessarily always provide all the arguments (true for some of the other configs as well).
Alternatively I could give these the same defaults as the extension code provides, but then we have to again maintain the defaults in 2 places.
This was previously not an issue as I could just make the input arguments nullable while leaving the load from db models required. I can override this of course in the child classes, but then again this breaks the purpose of sharing a parent class in the first place.
8fe145e
to
882f91e
Compare
This PR adds native python operations to alembic so you don't have to write SQL to create vectorizers.