[Feature]: Migration from existing embeddings #196
Comments
Hi,
The hardest part will be making sure that the chunking mechanism you configure is the same as the one you used before, as well as any embedding parameters. I'm not sure if we can simplify this somehow. Let me know if you have any ideas.
Thanks for the response. Why does the chunking need to be the same? Isn't that only relevant during indexing? As for simplifying the customization, I think allowing code to customize it would help here.
Chunking is used to split content that is too large for the embedding model into smaller "chunks". The configuration for this decides things such as chunk_size, the character separators used as cut points, etc. So if the chunking method you used before doesn't match the new one, you simply get inconsistent embeddings.
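To make that dependency concrete, here is a toy sketch of a recursive character splitter. It is not pgai's actual implementation; split_text and its parameters are illustrative. It just shows that the same document chunked under different chunk_size/separator settings yields different chunks, and therefore different embeddings:

```python
# Toy illustration (not pgai's splitter): different chunking settings
# produce different chunks from the same document.

def split_text(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Return pieces of at most chunk_size characters, cutting at the
    first separator whose pieces all fit; hard-cut as a last resort."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        parts = [p for p in text.split(sep) if p]
        if all(len(p) <= chunk_size for p in parts):
            return parts
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

doc = "First paragraph.\n\nSecond paragraph, a bit longer than the first one."

print(split_text(doc, chunk_size=60, separators=["\n\n"]))
# -> ['First paragraph.', 'Second paragraph, a bit longer than the first one.']

print(split_text(doc, chunk_size=30, separators=["\n\n"]))
# -> falls back to three hard 30-character cuts: entirely different chunks,
#    so re-embedding them gives different vectors than before
```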
Does it index the whole document at once, or does it compare the chunks and only reindex the ones that changed? From my understanding, retrieval performance should be fine as long as the same embedding model is used?
It recreates all embeddings when you change the document. That's an optimization we could make, though, to avoid unnecessary API costs. And yes, I think it should be fine; it's just not 100% the same result as before.
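A minimal sketch of what that chunk-level optimization could look like, assuming chunks are keyed by a content hash. The names here (chunk_hash, reembed_changed_chunks, the embed() callable) are hypothetical, not pgai's API:

```python
# Sketch: re-embed only chunks whose content hash changed since the
# last indexing run, reusing stored embeddings for unchanged chunks.

import hashlib

def chunk_hash(chunk: str) -> str:
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

def reembed_changed_chunks(
    old: dict[str, list[float]],   # stored chunk-hash -> embedding
    new_chunks: list[str],
    embed,                         # the (expensive) embedding API call
) -> dict[str, list[float]]:
    result = {}
    for chunk in new_chunks:
        h = chunk_hash(chunk)
        # Reuse the stored embedding when the chunk text is unchanged;
        # this is what avoids the unnecessary API cost.
        result[h] = old[h] if h in old else embed(chunk)
    return result
```

A real implementation would also need to track chunk ordering and delete embeddings for chunks that no longer exist, but the hash comparison is the core of the saving.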
I think not getting the exact same retrieval results as before is fine for my use case. So that should make the migration a lot easier?
Hey @kolaente did you manage to migrate your embeddings or is there anything else we can help you with? |
@alejandrodnm I haven't migrated yet since there were other things with higher priority (as always). If all goes well, I'll do this in the next 1-2 months. I also need a solution for #23 (comment) before I can use this in production. |
What problem does the new feature solve?
pgai looks very promising to me, as I have built quite a lot of code to handle syncing content and embeddings. The system I'm building runs in production with a few million document rows, and even more embeddings. I'd love to migrate to pgai, but I can't throw these embeddings away in the process: recomputing them would take a few weeks and cost a lot of money.
What does the feature do?
Provide a migration strategy for migrating from a home-made data_embeddings table to pgai. This could be tooling, documentation, or both.

Implementation challenges
No response
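For illustration, a very rough sketch of what such a migration could look like, assuming the old chunk boundaries match the newly configured chunking (per the discussion above). Every table and column name here (my_table_embedding_store, data_embeddings, and their columns) is hypothetical; the real target schema depends on how the vectorizer is configured:

```python
# Rough sketch: copy existing embeddings from a home-made
# data_embeddings table into the table pgai writes embeddings to,
# so they don't have to be recomputed. All identifiers are
# placeholders for whatever the actual schemas look like.

import psycopg

MIGRATE_SQL = """
INSERT INTO my_table_embedding_store (id, chunk_seq, chunk, embedding)
SELECT document_id, chunk_index, chunk_text, embedding
FROM data_embeddings;
-- a cast such as embedding::vector may be needed if the source
-- column is stored as a plain float array
"""

with psycopg.connect("postgresql://localhost/mydb") as conn:
    with conn.cursor() as cur:
        cur.execute(MIGRATE_SQL)
    conn.commit()
```

This only works if the source rows already carry per-chunk embeddings with a stable chunk order; otherwise the chunks would have to be reconstructed first.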
Are you going to work on this feature?
None