Support bulk operations #23
What is the most performant/efficient way to embed lots of rows? Can we build functions or procedures to make this easy? If not, can we document guidance and provide example code?

Comments
I have hit this problem with an application I'm building (not yet with pgai). We were ingesting so much data into the system that we ran into OpenAI's rate limits. The solution was to build a batch processing job which creates OpenAI embedding batches and, on a schedule, checks whether OpenAI has processed each batch, then saves the returned embeddings into the database. I wonder if pgai could do something like this as well?
I ran into the same problem when I wanted to bulk-generate embeddings for 25 million+ rows; I could not do it without hitting OpenAI's rate limits.
Right now, the only way that works for me without any errors is processing in small batches, along these lines:
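A minimal sketch of that small-batch approach, assuming the OpenAI Python client; the batch size, model choice, and pacing are illustrative assumptions, not the commenter's actual values:

```python
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

BATCH_SIZE = 100  # illustrative; small enough to stay under most rate limits


def embed_in_small_batches(texts: list[str]) -> list[list[float]]:
    """Embed texts a small batch at a time, pausing between requests."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        response = client.embeddings.create(
            model="text-embedding-3-small",  # model choice is an assumption
            input=batch,
        )
        embeddings.extend(item.embedding for item in response.data)
        time.sleep(1)  # crude pacing so consecutive batches don't burst
    return embeddings
```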
Have you tried setting up a vectorizer? When you run the worker, you can specify the number of batches, the concurrency, and the poll interval. Batches are sent in a single request to OpenAI. Of course, if you try to do too much in a short period, you're bound to be rate limited. We currently don't support OpenAI's batch API. The alternative would be to configure the vectorizer to stay below the rate limit threshold. This works if you get spikes of ingest and you're fine with some delay between inserts and generating the embeddings.
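Picking those settings comes down to some back-of-the-envelope arithmetic. The sketch below is only sizing math; the rate limit, concurrency, and batch counts are assumed numbers, not pgai defaults:

```python
# Back-of-the-envelope sizing for vectorizer worker settings.
# All three numbers below are assumptions for illustration.
RATE_LIMIT_RPM = 3_000   # requests/min your OpenAI tier allows
CONCURRENCY = 2          # parallel workers you plan to run
BATCHES_PER_POLL = 10    # batches each worker sends per poll cycle

# Each batch is one request to OpenAI, so requests per poll cycle is:
requests_per_cycle = CONCURRENCY * BATCHES_PER_POLL

# Minimum poll interval (seconds) that keeps you under the rate limit:
min_poll_interval = requests_per_cycle / (RATE_LIMIT_RPM / 60)
print(f"poll at least every {min_poll_interval:.1f}s")  # -> 0.4s here
```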
I haven't set one up yet, still exploring options. Is it possible to either extend the vectorizer to support adding embeddings through OpenAI's batch API, or to add them manually via a service which I run? (The latter would do all the checking and batch creation etc., but would require marking a chunk as "this will be created asynchronously, please do not do anything".)
@kolaente Supporting the batch API in the vectorizer worker docker image could take some effort, and it's not on our current roadmap. But you can extend the vectorizer and make your own worker. Once you have something running, we can discuss integrating it into the pgai repo. When you create a vectorizer in your DB, a queue table is created to track the rows that still need embeddings. This is the query we use to fetch items from the queue: pgai/projects/pgai/pgai/vectorizer/vectorizer.py, lines 183 to 184 in 99d62f3.
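A custom worker could claim work from that queue in a similar way. Here's a rough sketch, assuming psycopg and a hypothetical queue table name; the real table name depends on the vectorizer you created, and the exact query in vectorizer.py may differ:

```python
import psycopg

# Both names below are assumptions for illustration.
QUEUE_TABLE = "ai._vectorizer_q_1"
BATCH_SIZE = 500


def claim_queue_items(conn: psycopg.Connection) -> list[tuple]:
    """Claim a batch of pending rows without blocking other workers."""
    with conn.cursor() as cur:
        # SKIP LOCKED lets several workers poll concurrently; each sees
        # only rows not already locked by another transaction.
        cur.execute(
            f"SELECT * FROM {QUEUE_TABLE} "
            f"LIMIT {BATCH_SIZE} FOR UPDATE SKIP LOCKED"
        )
        rows = cur.fetchall()
        # A real worker would delete these rows from the queue in the same
        # transaction once the embeddings are safely stored.
        return rows
```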
From a very simplistic point of view, I think this is more or less what you need to do (see the sketch after this list):
- Fetch the pending items from the vectorizer's queue table.
- Chunk them and submit them to OpenAI's batch API.
- On a schedule, check whether OpenAI has finished processing each batch.
- Store the returned embeddings in the target table and delete the processed items from the queue.
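A rough sketch of the submit-and-poll half of that flow, using the OpenAI batch endpoints; the model, the shape of `items`, and the wiring into the queue are assumptions:

```python
import json

from openai import OpenAI

client = OpenAI()


def submit_embedding_batch(items: list[tuple[str, str]]) -> str:
    """Upload queued (custom_id, text) pairs as one OpenAI batch job.

    Returns the batch id to poll later. The model is an assumption.
    """
    lines = [
        json.dumps({
            "custom_id": item_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": text},
        })
        for item_id, text in items
    ]
    batch_file = client.files.create(
        file=("batch.jsonl", "\n".join(lines).encode()),
        purpose="batch",
    )
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
    return batch.id


def collect_if_done(batch_id: str) -> list[dict] | None:
    """Run on a schedule; returns parsed results once the batch finishes."""
    batch = client.batches.retrieve(batch_id)
    if batch.status != "completed":
        return None  # still validating / in_progress / finalizing
    output = client.files.content(batch.output_file_id)
    return [json.loads(line) for line in output.text.splitlines()]
```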
There are many more pieces, which is why it's non-trivial to add to the project right now. If you implement something like this, we'd be very interested to learn about your experience. Hope this helps get you started. Feel free to reach out if you have more questions; you can always find us in the pgai Discord: https://discord.com/channels/1246241636019605616/1246243698111676447
@alejandrodnm Thanks! I'll look into implementing this and report back with my findings (might take a while until I get to it though). Would I need to fork and build everything from scratch to extend the vectorizer, or is there a clear path to extending it?
I've just opened a PR for this: #280