Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add alembic operations for vectorizer #266

Merged
merged 20 commits into from
Jan 22, 2025

Conversation

Askir
Copy link
Contributor

@Askir Askir commented Dec 2, 2024

This PR adds native python operations to alembic so you don't have to write SQL to create vectorizers.

@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from c899380 to fd9f1bc Compare December 2, 2024 10:08
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from fd9f1bc to 6f5ff59 Compare December 3, 2024 13:37
@Askir Askir marked this pull request as ready for review December 3, 2024 23:16
@Askir Askir requested a review from a team as a code owner December 3, 2024 23:16
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8742af8 to 36cf4d9 Compare December 4, 2024 13:13
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 6f5ff59 to 525ab5b Compare December 4, 2024 13:20
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotta say I'm not convinced about the arguments for using a separate model than the models already in pgai/vectorizer or at least having both sets of models extend a base model. I think having 2 sets of models with similar params is really hard to maintain and quite a bit of code duplication. I'd like some more eyes on this tho. Can
James and/or Alejandro chime in here. In particular I'd like us to consider three designs:

  • simply extending the pydantic model we already have with optional fields that are present in either the stored json OR needed for the alembic stuff + having some kind of wrappers to create the config objects in alembic.
  • Factoring common data fields into base classes and using those as mixins. (kinda like the ApiKeyMixin now).
  • Maybe I'm just being stubborn and we should have separate models, like Jascha has them now.
    leaving a few comments in but I think this is the big issue we need to resolve

projects/pgai/pgai/configuration.py Outdated Show resolved Hide resolved
projects/pgai/pgai/alembic/operations.py Show resolved Hide resolved
projects/pgai/pgai/alembic/operations.py Outdated Show resolved Hide resolved
projects/pgai/pgai/configuration.py Outdated Show resolved Hide resolved
@Askir Askir force-pushed the jascha/add-vectorizer-field branch 10 times, most recently from 3b47afc to 8fe145e Compare December 12, 2024 13:46
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 525ab5b to 5e76cf9 Compare December 12, 2024 16:44
@Askir Askir force-pushed the jascha/add-vectorizer-field branch from 8fe145e to 882f91e Compare December 19, 2024 11:40
Base automatically changed from jascha/add-vectorizer-field to main December 19, 2024 12:32
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch 7 times, most recently from c90ae69 to 7b90575 Compare January 7, 2025 13:57
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch 2 times, most recently from 2ea1180 to 786ecfc Compare January 16, 2025 13:39
@Askir Askir requested review from JamesGuthrie and cevian January 16, 2025 13:53
@Askir Askir force-pushed the jascha/add-alembic-migration-ops branch from 786ecfc to 9e98564 Compare January 17, 2025 11:59
Copy link
Collaborator

@cevian cevian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Two comments that need addressing. Would like for @JamesGuthrie to review next and then we can merge

docs/adding-embedding-integration.md Outdated Show resolved Hide resolved
@@ -1,3 +1,38 @@
# Creating vectorizers from python
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to add this to the docs, I'd like at least some tests of using this functionality outside of Alembic

Copy link
Member

@JamesGuthrie JamesGuthrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few minor bits and pieces.

docs/adding-embedding-integration.md Outdated Show resolved Hide resolved
docs/adding-embedding-integration.md Outdated Show resolved Hide resolved
docs/python-integration.md Outdated Show resolved Hide resolved
projects/pgai/pgai/vectorizer/generate/README.md Outdated Show resolved Hide resolved
n.nspname,
p.proargnames,
p.pronargdefaults,
string_to_array(array_to_string(p.proargtypes, ' '), ' ') as argtypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this amusing. I think that you could replace it with p.proargtypes::oid[]::text[], but I'm not 100% sure on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually didn't understand this code in detail either, it's something that claude came up with but works so I didn't worry too much about it.

type_name = type_info[1] # type: ignore
is_array = type_info[2] # type: ignore

default = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a bit misleading thanks for questioning it. I removed the default value from the code base now.

The generated classes don't actually have the correct default values the to_sql function just laves any None value out so that the ai.xyz() call uses the sqls default value. So the default is always None in this script (otherwise I'd have to correctly parse the sql default values into its python representation, which I am saving on this way.

@Askir Askir merged commit b01acfe into main Jan 22, 2025
5 checks passed
@Askir Askir deleted the jascha/add-alembic-migration-ops branch January 22, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants