Performance expectations #5
Comments
For the first request, the cuda-graph might be initialized on the GPU, which leads to the cold-start delay. The application is designed for high throughput.
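For context, here is a minimal, self-contained sketch (not Infinity's actual code; the toy Linear model and shapes are made up) of the effect being described: capturing a CUDA graph costs extra time once, after which replays are cheap.

```python
# Sketch only: illustrates the one-time cost of CUDA graph capture vs. cheap replays.
# The Linear model and tensor shapes are placeholders.
import time
import torch

assert torch.cuda.is_available()
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream, as recommended before graph capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)
torch.cuda.synchronize()

# One-time capture: roughly the "cold start" part.
t0 = time.perf_counter()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)
torch.cuda.synchronize()
print(f"capture: {(time.perf_counter() - t0) * 1e3:.1f} ms")

# Warm "requests" only replay the captured graph.
for _ in range(3):
    t0 = time.perf_counter()
    graph.replay()
    torch.cuda.synchronize()
    print(f"replay:  {(time.perf_counter() - t0) * 1e3:.2f} ms")
```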
Yes, I wasn't confused about the cold start delay, but about the delay on requests that are sent to warm workers. Sorry, I think I wasn't explicit enough about this. The numbers we are seeing when running via runpod were all measured with warm workers.
Interesting, thanks for flagging that - maybe you hit a "cold replica"?
I just tried it again. For testing we only have a single worker deployed. The warm requests consistently have ~100ms delay and ~100ms execution time:
Here are the container logs, maybe that helps:
I would be open to debugging this further myself, but before I do that I wanted to ask if there are any obvious reasons for these delays.
@Sopamo as far as I understand, there is no obvious reason for the delay. When you tested this locally, you didn't use the worker-infinity-embedding, right? So maybe that would be something to try out if you have the time and energy.
@TimPietrusky thanks for getting back to me! That is true, will try!
@Sopamo thank you very much! And please let us know if you find anything!
@TimPietrusky I did some more testing. The delay seems to depend on the data center I'm using. I chose 4 different data centers that are far apart from each other, configured it to use a single worker, made a request to warm up the worker, and then did 5 requests directly after each other. The following are the lowest values for delay and execution that I got. The execution times seem ok, but the delay times are all very high (from the perspective of trying to build real-time applications with it):
I also tried running the worker locally:
As a comparison, I did the same against the runsync endpoint of my worker that's running in runpod:
So in total that gives us a p90 of ~200ms instead of 7ms once the worker is running on runpod infrastructure (in this case EU-RO-1). I feel like this might come down either to some inefficiencies in the queueing code that runs in runpod prod but not in the local rp_serve_api version, or to a few HTTP requests the worker makes to communicate with some kind of central runpod infrastructure, which add up to all of the additional time. I'd love to get some feedback on whether this (using runpod serverless for real-time applications) is something that you plan to improve, or if you are currently focusing on providing a good service for applications that don't rely on as little latency as possible. I'd also be happy to help if I can, so let me know if I can do anything else :)
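A timing loop along these lines reproduces this kind of measurement (a minimal sketch: the endpoint ID, API key, and input payload are placeholders rather than worker-infinity-embedding's exact schema, and delayTime/executionTime are the millisecond fields reported in RunPod's /runsync response):

```python
# Sketch of a warm-worker timing loop against a RunPod serverless endpoint.
# ENDPOINT_ID, API_KEY and the payload are placeholders; the input schema of
# worker-infinity-embedding may differ.
import time
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = "your-runpod-api-key"   # placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
# For the local test server (started with `python handler.py --rp_serve_api`),
# point URL at http://localhost:8000/runsync instead and drop the auth header.

payload = {"input": {"model": "BAAI/bge-m3", "input": ["hello world"]}}  # assumed schema
headers = {"Authorization": f"Bearer {API_KEY}"}

# First request warms up the worker; the following five are the measured ones.
requests.post(URL, json=payload, headers=headers, timeout=60)
for i in range(5):
    t0 = time.perf_counter()
    r = requests.post(URL, json=payload, headers=headers, timeout=60)
    total_ms = (time.perf_counter() - t0) * 1e3
    body = r.json()
    print(f"request {i}: end-to-end {total_ms:.0f} ms, "
          f"delay {body.get('delayTime')} ms, execution {body.get('executionTime')} ms")
```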
@TimPietrusky @michaelfeil I'm sure you have loads of to-dos building amazing GPU cloud features, but I wanted to send a single follow-up comment, because I think if there were a way to improve the latency, runpod could be much better at solving real-time AI use cases. Feel free to let me know if I can be of any help. Thanks for your work!
@Sopamo Low double-digit latency would be complicated, e.g. you would actually need to have a cluster in Germany, not Sweden. Ping to my hometown in Frankfurt is ~170ms, while the US west coast is <10ms. Iceland may be >80ms away for you too.
Yeah, that would be the idea, choosing (for example) Frankfurt as a data center for the main application and then choosing a runpod server in Frankfurt for AI workloads.
Thanks for working on this!
I've been testing running embeddings in a runpod serverless environment, but the performance isn't what I would have expected. For running bge-m3, we're seeing an end-to-end latency of ~600ms. Runpod itself reports around 100ms delay time and around 110ms processing time.
I tried running bge-m3 locally on my machine (on a GeForce 4080, directly via Python using BGEM3FlagModel), and for the first embedding I see a very high latency as well (~180ms), but for embeddings afterwards the latency is very low, as expected: around 4-5ms for simple text.
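A minimal version of this kind of local timing test (a sketch assuming the FlagEmbedding package; the sample text is a placeholder) looks roughly like this:

```python
# Sketch of a local timing test with BGEM3FlagModel from the FlagEmbedding
# package; the sample text is a placeholder.
import time
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
texts = ["What does end-to-end latency look like for a single short sentence?"]

for i in range(5):
    t0 = time.perf_counter()
    out = model.encode(texts)  # returns a dict; dense vectors are in out["dense_vecs"]
    print(f"embedding {i}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
```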
I don't see obvious reasons why requests after the first one on a running worker would still take 100+ms. Is this something that can be improved somehow? I would be willing to contribute, but would like to ask first if this performance is to be expected or if there is potential to improve it.
I would also like to ask about the 100ms delay time. What could be the reasons for it being so high, even though the worker is already running?
We are using European data centers. Could it be that the requests are somehow routed through the US?
This is the Python script I used for testing locally:
Output: