
[Feature]: Migration from existing embeddings #196

Open
kolaente opened this issue Nov 4, 2024 · 8 comments

Comments

@kolaente
Contributor

kolaente commented Nov 4, 2024

What problem does the new feature solve?

pgai looks very promising to me, as I have built quite a lot of code to handle syncing content and embeddings. The system I'm building runs in production and has a few million document rows, and even more embeddings. I'd love to migrate to pgai, but I can't throw these embeddings away in the process because it would take a few weeks to recompute them and cost a lot of money.

What does the feature do?

Provide a migration strategy for how to migrate from a home-made data_embeddings table to pgai. This could be tooling, or documentation, or both.

Implementation challenges

No response

Are you going to work on this feature?

None

@Askir
Contributor

Askir commented Nov 5, 2024

Hi,
that's a valid request. In general I think a migration with existing embeddings would work something like this:

  1. Create the vectorizer with ai.create_vectorizer
  2. Stop the worker, or disable the scheduling (see the documentation for details)
  3. Transform your existing embeddings into the new table
  4. Purge the vectorizer queue; this is a simple table in the ai schema
  5. Re-enable the vectorizer
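In SQL, the steps above might look roughly like the sketch below. This is a hedged outline, not a definitive recipe: the config functions (`ai.embedding_openai`, `ai.chunking_character_text_splitter`, `ai.scheduling_none`), the generated target table name (`documents_embedding_store`), its columns, and the queue table name (`ai._vectorizer_q_1`) are taken from pgai's documented examples and defaults, and may differ in your version and installation. The source table `public.documents` and the legacy `public.data_embeddings` table stand in for your own schema.

```sql
-- 1 + 2. Create the vectorizer with scheduling disabled so no worker
--        starts draining the queue before the migration is done.
--        (Function names/arguments assumed from pgai's examples; verify them.)
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    embedding  => ai.embedding_openai('text-embedding-3-small', 1536),
    chunking   => ai.chunking_character_text_splitter('body'),
    scheduling => ai.scheduling_none()
);

-- 3. Copy existing embeddings into the vectorizer's target table.
--    Table and column names here are hypothetical; inspect the table
--    pgai actually generated before writing this INSERT.
INSERT INTO public.documents_embedding_store (id, chunk_seq, chunk, embedding)
SELECT document_id, chunk_index, chunk_text, embedding
FROM public.data_embeddings;

-- 4. Purge the vectorizer's queue table in the ai schema.
--    The name is generated per vectorizer; look it up first, e.g. in
--    the ai.vectorizer catalog table.
TRUNCATE ai._vectorizer_q_1;

-- 5. Re-enable scheduling (or restart the worker) once the copy is verified.
```

The key invariant is that the rows you insert in step 3 must look exactly as if the vectorizer had produced them, which is why the chunking and embedding configuration matters so much.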

The hardest part will be making sure that the chunking mechanism you configure is the same as the one you used before, as well as any embedding parameters.

I'm not sure if we can simplify this somehow. Let me know if you have any ideas.

@kolaente
Contributor Author

kolaente commented Nov 7, 2024

Thanks for the response.

Why does the chunking need to be the same? Isn't that only relevant during indexing?

As for simplifying the customization: I think allowing code to customize it would help here.

@Askir
Contributor

Askir commented Nov 7, 2024

Chunking is used to split content that is too large for the embedding model into smaller "chunks". The configuration for this decides things such as chunk_size, character separators that are used as a "cut-point", etc.

So if the chunking method you used before doesn't match the new one, you simply get inconsistent embeddings.
That might not be a problem depending on how sensitive the embedding model is, but it does mean that the chunks produced by the pgai vectorizer (and therefore the embeddings) differ from what you had before, so you will end up with a change in retrieval performance.
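To make this concrete: if your old pipeline split documents into chunks of roughly 512 characters with a 50-character overlap on blank lines, the vectorizer's chunking config should mirror those exact parameters. The sketch below assumes pgai's `ai.chunking_character_text_splitter` config function and its argument names as shown in the project's examples; check them against your installed version, and treat all concrete values as placeholders for whatever your previous pipeline used.

```sql
SELECT ai.create_vectorizer(
    'public.documents'::regclass,
    chunking => ai.chunking_character_text_splitter(
        'body',       -- column to chunk
        512,          -- chunk_size: must match the old pipeline
        50,           -- chunk_overlap: must match the old pipeline
        E'\n\n'       -- separator used as the "cut-point"
    ),
    -- same model (and dimensions) that produced the old embeddings
    embedding => ai.embedding_openai('text-embedding-3-small', 1536)
);
```

If any of these values drift from the old pipeline, newly vectorized documents will be chunked differently from the migrated ones, even though both sets came from the same source rows.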

@kolaente
Contributor Author

kolaente commented Nov 7, 2024

Does it index the whole document at once? Or does it compare the chunks and only reindex changed chunks?

From my understanding, the retrieval performance should be fine as long as the same embedding model is used?

@Askir
Contributor

Askir commented Nov 8, 2024

It recreates all embeddings whenever you change a document. Only re-embedding changed chunks is an optimization we could make, though, to avoid unnecessary API costs.

And yes, I think it should be fine; it's just not 100% the same result as before.

@kolaente
Contributor Author

kolaente commented Nov 8, 2024

I think not having the exact same results as before during retrieval is fine for my use case.

So that should make the migration a lot easier?

@alejandrodnm
Contributor

Hey @kolaente, did you manage to migrate your embeddings, or is there anything else we can help you with?

@kolaente
Contributor Author

@alejandrodnm I haven't migrated yet since there were other things with higher priority (as always). If all goes well, I'll do this in the next 1-2 months.

I also need a solution for #23 (comment) before I can use this in production.
