
[Feature]: Migration from existing embeddings #196

Open
kolaente opened this issue Nov 4, 2024 · 8 comments

Comments

@kolaente
Contributor

kolaente commented Nov 4, 2024

What problem does the new feature solve?

pgai looks very promising to me, as I have built quite a lot of code to handle syncing content and embeddings. The system I'm building runs in production and has a few million document rows, and even more embeddings. I'd love to migrate to pgai, but I can't throw these embeddings away in the process because it would take a few weeks to recompute them and cost a lot of money.

What does the feature do?

Provide a migration strategy for how to migrate from a home-made data_embeddings table to pgai. This could be tooling, or documentation, or both.

Implementation challenges

No response

Are you going to work on this feature?

None

@Askir
Contributor

Askir commented Nov 5, 2024

Hi,
that's a valid request. In general I think a migration with existing embeddings would work something like this:

  1. Create the vectorizer with ai.create_vectorizer
  2. Stop the worker, or disable the scheduling (see the documentation for details)
  3. Transform your existing embeddings into the new table
  4. Purge the vectorizer queue; this is a simple table in the ai schema
  5. Re-enable the vectorizer
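In SQL, the steps above might look roughly like the sketch below. This is a hedged outline, not a definitive recipe: the config functions (`ai.embedding_openai`, `ai.chunking_character_text_splitter`, `ai.scheduling_none`), the generated target table name (`documents_embedding_store`), its columns, and the queue table name (`ai._vectorizer_q_1`) are taken from pgai's documented examples and defaults, and may differ in your version and installation. The source table `public.documents` and the legacy `public.data_embeddings` table stand in for your own schema.

```sql
-- 1 + 2. Create the vectorizer with scheduling disabled so no worker
--        starts draining the queue before the migration is done.
--        (Function names/arguments assumed from pgai's examples; verify them.)
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    embedding  => ai.embedding_openai('text-embedding-3-small', 1536),
    chunking   => ai.chunking_character_text_splitter('body'),
    scheduling => ai.scheduling_none()
);

-- 3. Copy existing embeddings into the vectorizer's target table.
--    Table and column names here are hypothetical; inspect the table
--    pgai actually generated before writing this INSERT.
INSERT INTO public.documents_embedding_store (id, chunk_seq, chunk, embedding)
SELECT document_id, chunk_index, chunk_text, embedding
FROM public.data_embeddings;

-- 4. Purge the vectorizer's queue table in the ai schema.
--    The name is generated per vectorizer; look it up first, e.g. in
--    the ai.vectorizer catalog table.
TRUNCATE ai._vectorizer_q_1;

-- 5. Re-enable scheduling (or restart the worker) once the copy is verified.
```

The key invariant is that the rows you insert in step 3 must look exactly as if the vectorizer had produced them, which is why the chunking and embedding configuration matters so much.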

The hardest part will be making sure that the chunking mechanism you configure is the same as the one you used before, as well as any embedding parameters.

I'm not sure if we can simplify this somehow. Let me know if you have any ideas.

@kolaente
Contributor Author

kolaente commented Nov 7, 2024

Thanks for the response.

Why does the chunking need to be the same? Isn't that only relevant during indexing?

As for simplifying the customization: I think allowing code to customize it would help here.

@Askir
Contributor

Askir commented Nov 7, 2024

Chunking is used to split content that is too large for the embedding model into smaller "chunks". The configuration for this decides things such as chunk_size, character separators that are used as a "cut-point", etc.

So if the chunking method you used before doesn't match the new one, you simply get inconsistent embeddings.
That might not be a problem depending on how sensitive the embedding model is, but it does mean that the chunks produced by the pgai vectorizer (and therefore the embeddings) differ from what you had before, so you will end up with a change in retrieval performance.
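To make this concrete: if your old pipeline split documents into chunks of roughly 512 characters with a 50-character overlap on blank lines, the vectorizer's chunking config should mirror those exact parameters. The sketch below assumes pgai's `ai.chunking_character_text_splitter` config function and its argument names as shown in the project's examples; check them against your installed version, and treat all concrete values as placeholders for whatever your previous pipeline used.

```sql
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    chunking => ai.chunking_character_text_splitter(
        'body',       -- column to chunk
        512,          -- chunk_size: must match the old pipeline
        50,           -- chunk_overlap: must match the old pipeline
        E'\n\n'       -- separator used as the "cut-point"
    ),
    -- same model (and dimensions) that produced the old embeddings
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```

If any of these values drift from the old pipeline, newly vectorized documents will be chunked differently from the migrated ones, even though both sets came from the same source rows.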

@kolaente
Contributor Author

kolaente commented Nov 7, 2024

Does it index the whole document at once? Or does it compare the chunks and only reindex changed chunks?

From my understanding, the retrieval performance should be fine as long as the same embedding model is used?

@Askir
Contributor

Askir commented Nov 8, 2024

It recreates all embeddings whenever you change a document. Only re-embedding changed chunks is an optimization we could make, though, to avoid unnecessary API costs.

And yes, I think it should be fine; it's just not 100% the same result as before.

@kolaente
Contributor Author

kolaente commented Nov 8, 2024

I think not having the exact same results as before during retrieval is fine for my use case.

So that should make the migration a lot easier?

@alejandrodnm
Contributor

Hey @kolaente, did you manage to migrate your embeddings, or is there anything else we can help you with?

@kolaente
Contributor Author

@alejandrodnm I haven't migrated yet since there were other things with higher priority (as always). If all goes well, I'll do this in the next 1-2 months.

I also need a solution for #23 (comment) before I can use this in production.
