Performance expectations #5
Open · Sopamo opened this issue May 31, 2024 · 11 comments

Sopamo commented May 31, 2024

Thanks for working on this!

I've been testing embeddings in a RunPod serverless environment, but the performance isn't what I expected. For bge-m3, we're seeing an end-to-end latency of ~600ms. RunPod itself reports around 100ms delay time and around 110ms processing time.

I tried running bge-m3 locally on my machine (on a GeForce RTX 4080, directly via Python using BGEM3FlagModel). For the first embedding I see a very high latency as well (~180ms), but for subsequent embeddings the latency is very low, as expected: around 4-5ms for simple text.

I don't see an obvious reason why requests after the first one on a running worker would still take 100+ms. Is this something that can be improved? I would be willing to contribute, but wanted to ask first whether this performance is expected or whether there is potential to improve it.

I would also like to ask about the 100ms delay time. What could be the reasons for it being so high, even though the worker is already running?

We are using European data centers. Could it be that the requests are somehow routed through the US?

This is the Python script I used for testing locally:

import time

from FlagEmbedding import BGEM3FlagModel

# Load the model; the first encode call below is noticeably slower (see output).
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

sentences = ["What is BGE M3?"]
sentences2 = ["More text"]
sentences3 = ["<<< More text"]


def get_embeddings(inputs):
    # Time a single encode call; the dense vectors themselves are discarded.
    start_time = time.time()
    model.encode(inputs)['dense_vecs']
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")


get_embeddings(sentences)
get_embeddings(sentences2)
get_embeddings(sentences3)

Output:

Execution time: 0.214857816696167 seconds
Execution time: 0.004781007766723633 seconds
Execution time: 0.00433349609375 seconds
@michaelfeil (Contributor)

For the first request, the CUDA graph might be initialized on the GPU, which leads to the cold-start delay. The application is designed for high throughput.
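
If that one-time initialization is the cost being measured, a warm-up call right after loading should move it out of the first real request. A minimal sketch, reusing the BGEM3FlagModel API from the script above (the warm-up sentence is arbitrary):

import time

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

# Pay the one-time GPU initialization cost up front, before serving traffic.
model.encode(["warm-up"])['dense_vecs']

# Subsequent calls should now be in the low-millisecond range.
start = time.time()
model.encode(["What is BGE M3?"])['dense_vecs']
print(f"Warm latency: {time.time() - start:.4f} seconds")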

Sopamo commented Jul 1, 2024

Yes, I wasn't confused about the cold-start delay, but about the delay for requests that are sent to warm workers. Sorry, I think I wasn't explicit enough about this. The numbers we are seeing when running via RunPod were all measured with warm workers.

@michaelfeil (Contributor)

Interesting, thanks for flagging that - maybe you hit a "cold replica"?

Sopamo commented Jul 1, 2024

I just tried it again. For testing we only have a single worker deployed. The warm requests consistently have ~100ms delay and ~100ms execution time:

[Screenshot of the RunPod console for the serverless endpoint]

Here are the container logs, maybe that helps:
logs.txt

I would be open to debugging this further myself, but before I do that I wanted to ask if there are any obvious reasons for these delays.

@TimPietrusky

@Sopamo as far as I understand, there is no obvious reason for the delay.

When you tested this locally, you didn't use worker-infinity-embedding, right? So maybe that would be something to try out if you have the time and energy.

Sopamo commented Aug 15, 2024

@TimPietrusky thanks for getting back to me! That is true, will try!

@TimPietrusky

@Sopamo thank you very much! And please let us know if you find anything!

Sopamo commented Sep 6, 2024

@TimPietrusky I did some more testing. The delay seems to depend on the data center I'm using. I chose 4 different data centers that are far apart from each other, configured it to use a single worker, made a request to warm up the worker, and then did 5 requests directly after each other. The following are the lowest values for delay and execution time that I got. The execution times seem OK, but the delay times are all very high (from the perspective of trying to build real-time applications with it):

  • EU-SE-1
    • 0.1s Delay Time
    • 0.08s Execution Time
  • US-TX-3
    • 0.15s Delay Time
    • 0.05s Execution Time
  • CA-MTL-1
    • 0.16s Delay Time
    • 0.01s Execution Time
  • EUR-IS-1
    • 0.1s Delay Time
    • 0.05s Execution Time

I also tried running the worker locally:

  • I checked out https://github.com/runpod-workers/worker-infinity-embedding
  • Created a new venv
  • Installed requirements.txt
  • Ran MODEL_NAMES=BAAI/bge-reranker-v2-m3 python src/handler.py --rp_serve_api
  • Created a file post_data.json with this content: { "input": { "model": "BAAI/bge-reranker-v2-m3", "query": "my question", "docs": [ "blue", "red" ] }}
  • Benchmarked performance with ab (a rough Python equivalent is sketched after the results below): ab -n 300 -c 1 -p post_data.json -T 'application/json' http://localhost:8000/runsync
  • The results show that when running it locally, there is virtually no delay (p90 of 8ms):
Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      7
  80%      7
  90%      8
  95%      9
  98%     10
  99%     12
 100%     27 (longest request)
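
For anyone who prefers measuring from a script, here is a rough Python equivalent of the ab run above. This is a minimal sketch: the endpoint URL and payload are the ones from the list above, the request count of 300 matches the ab call, and the requests library is assumed to be installed.

import time

import requests

payload = {
    "input": {
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "my question",
        "docs": ["blue", "red"],
    }
}

latencies = []
for _ in range(300):
    start = time.time()
    response = requests.post("http://localhost:8000/runsync", json=payload)
    response.raise_for_status()
    latencies.append((time.time() - start) * 1000)

# Report percentiles comparable to what ab prints.
latencies.sort()
for p in (50, 75, 90, 95, 99):
    print(f"p{p}: {latencies[int(len(latencies) * p / 100) - 1]:.1f} ms")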

As a comparison, I did the same against the runsync endpoint of my worker that's running on RunPod:

  • ab -n 300 -c 1 -p post_data.json -T 'application/json' -H 'Authorization: xxx' https://api.runpod.ai/v2/xxx/runsync
  • The response times that ab reports aren't meaningful on their own, since they depend on my local network connection to the server, but the graphs within RunPod show that those 300 calls had a p90 delay time of 104ms and a p90 execution time of 94ms (see the attached screenshots).

So in total that gives a p90 of ~200ms instead of 7ms once the worker is running on RunPod infrastructure (in this case EU-RO-1). I suspect this is either caused by inefficiencies in the queueing code that runs in RunPod production but not in the local rp_serve_api version, or by a few HTTP requests the worker makes to communicate with some kind of central RunPod infrastructure, which add up to the additional time.

I'd love to get some feedback on whether this (using RunPod serverless for real-time applications) is something that you plan to improve, or whether you are currently focusing on providing good service for applications that are less latency-sensitive.

I'd also be happy to help if I can, so let me know if I can do anything else :)

Sopamo commented Oct 23, 2024

@TimPietrusky @michaelfeil I'm sure you have loads of to-dos building amazing GPU cloud features, but I wanted to send a single follow-up comment, because I think that if there were a way to improve the latency, RunPod could be much better at solving real-time AI use cases.
Having an easily scalable endpoint that provides single-digit to low double-digit millisecond latency would make RunPod very appealing for real-time AI.

Feel free to let me know if I can be of any help. Thanks for your work!

@michaelfeil (Contributor)

@Sopamo Low double-digit latency would be complicated; e.g., you would need to actually have a cluster in Germany, not Sweden. Ping to my hometown Frankfurt is ~170ms, while the US west coast is <10ms. Iceland may be >80ms away for you too.
https://www.meter.net/tools/world-ping-test/

Sopamo commented Oct 23, 2024

Yeah, that would be the idea: choosing (for example) Frankfurt as the data center for the main application and then choosing a RunPod server in Frankfurt for AI workloads.
I think the delay times can't be explained by physical distance; otherwise, for example, Sweden would have to be much faster than Texas (the ping test you linked shows 30ms to Sweden for me), but my tests showed it isn't.
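
One way to sanity-check the distance argument from the client side is to time a bare TCP connect to the API host; that gives a rough lower bound on the network component, independent of RunPod's own delay-time metric. A minimal sketch (api.runpod.ai is the host from the ab command above; the sample count is arbitrary):

import socket
import time

host, port = "api.runpod.ai", 443
samples = []
for _ in range(5):
    start = time.time()
    # A TCP handshake costs roughly one network round trip.
    with socket.create_connection((host, port), timeout=5):
        pass
    samples.append((time.time() - start) * 1000)

print(f"TCP connect time to {host}: min {min(samples):.1f} ms")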
