Performance expectations #5
Open · Sopamo opened this issue May 31, 2024 · 11 comments

Sopamo commented May 31, 2024

Thanks for working on this!

I've been testing embeddings in a RunPod serverless environment, but the performance isn't what I expected. For bge-m3, we're seeing an end-to-end latency of ~600ms. RunPod itself reports around 100ms delay time and around 110ms processing time.

I tried running bge-m3 locally on my machine (on a GeForce RTX 4080, directly via Python using BGEM3FlagModel). For the first embedding I see a very high latency as well (~180ms), but for subsequent embeddings the latency is very low, as expected: around 4-5ms for simple text.

I don't see an obvious reason why requests after the first one on a running worker would still take 100+ms. Is this something that can be improved? I would be willing to contribute, but wanted to ask first whether this performance is expected or whether there is potential to improve it.

I would also like to ask about the 100ms delay time. What could be the reasons for it being so high, even though the worker is already running?

We are using European data centers. Could it be that the requests are somehow routed through the US?

This is the Python script I used for testing locally:

import time

from FlagEmbedding import BGEM3FlagModel

# Load the model; the first encode call below is noticeably slower (see output).
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

sentences = ["What is BGE M3?"]
sentences2 = ["More text"]
sentences3 = ["<<< More text"]


def get_embeddings(inputs):
    # Time a single encode call; the dense vectors themselves are discarded.
    start_time = time.time()
    model.encode(inputs)['dense_vecs']
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")


get_embeddings(sentences)
get_embeddings(sentences2)
get_embeddings(sentences3)

Output:

Execution time: 0.214857816696167 seconds
Execution time: 0.004781007766723633 seconds
Execution time: 0.00433349609375 seconds
@michaelfeil (Contributor)

For the first request, the CUDA graph might be initialized on the GPU, which leads to the cold-start delay. The application is designed for high throughput.
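
If that one-time initialization is the cost being measured, a warm-up call right after loading should move it out of the first real request. A minimal sketch, reusing the BGEM3FlagModel API from the script above (the warm-up sentence is arbitrary):

import time

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

# Pay the one-time GPU initialization cost up front, before serving traffic.
model.encode(["warm-up"])['dense_vecs']

# Subsequent calls should now be in the low-millisecond range.
start = time.time()
model.encode(["What is BGE M3?"])['dense_vecs']
print(f"Warm latency: {time.time() - start:.4f} seconds")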

Sopamo commented Jul 1, 2024

Yes, I wasn't confused about the cold-start delay, but about the delay for requests that are sent to warm workers. Sorry, I think I wasn't explicit enough about this. The numbers we are seeing when running via RunPod were all measured with warm workers.

@michaelfeil (Contributor)

Interesting, thanks for flagging that - maybe you hit a "cold replica"?

Sopamo commented Jul 1, 2024

I just tried it again. For testing we only have a single worker deployed. The warm requests consistently have ~100ms delay and ~100ms execution time:

[Screenshot of the RunPod console for the serverless endpoint]

Here are the container logs, maybe that helps:
logs.txt

I would be open to debugging this further myself, but before I do that I wanted to ask if there are any obvious reasons for these delays.

@TimPietrusky

@Sopamo as far as I understand, there is no obvious reason for the delay.

When you tested this locally, you didn't use worker-infinity-embedding, right? So maybe that would be something to try out if you have the time and energy.

Sopamo commented Aug 15, 2024

@TimPietrusky thanks for getting back to me! That is true, will try!

@TimPietrusky

@Sopamo thank you very much! And please let us know if you find anything!

Sopamo commented Sep 6, 2024

@TimPietrusky I did some more testing. The delay seems to depend on the data center I'm using. I chose 4 different data centers that are far apart from each other, configured it to use a single worker, made a request to warm up the worker, and then did 5 requests directly after each other. The following are the lowest values for delay and execution time that I got. The execution times seem OK, but the delay times are all very high (from the perspective of trying to build real-time applications with it):

  • EU-SE-1
    • 0.1s Delay Time
    • 0.08s Execution Time
  • US-TX-3
    • 0.15s Delay Time
    • 0.05s Execution Time
  • CA-MTL-1
    • 0.16s Delay Time
    • 0.01s Execution Time
  • EUR-IS-1
    • 0.1s Delay Time
    • 0.05s Execution Time

I also tried running the worker locally:

  • I checked out https://github.com/runpod-workers/worker-infinity-embedding
  • Created a new venv
  • Installed requirements.txt
  • Ran MODEL_NAMES=BAAI/bge-reranker-v2-m3 python src/handler.py --rp_serve_api
  • Created a file post_data.json with this content: { "input": { "model": "BAAI/bge-reranker-v2-m3", "query": "my question", "docs": [ "blue", "red" ] }}
  • Benchmarked performance with ab (a rough Python equivalent is sketched after the results below): ab -n 300 -c 1 -p post_data.json -T 'application/json' http://localhost:8000/runsync
  • The results show that when running it locally, there is virtually no delay (p90 of 8ms):
Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      7
  80%      7
  90%      8
  95%      9
  98%     10
  99%     12
 100%     27 (longest request)
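
For anyone who prefers measuring from a script, here is a rough Python equivalent of the ab run above. This is a minimal sketch: the endpoint URL and payload are the ones from the list above, the request count of 300 matches the ab call, and the requests library is assumed to be installed.

import time

import requests

payload = {
    "input": {
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "my question",
        "docs": ["blue", "red"],
    }
}

latencies = []
for _ in range(300):
    start = time.time()
    response = requests.post("http://localhost:8000/runsync", json=payload)
    response.raise_for_status()
    latencies.append((time.time() - start) * 1000)

# Report percentiles comparable to what ab prints.
latencies.sort()
for p in (50, 75, 90, 95, 99):
    print(f"p{p}: {latencies[int(len(latencies) * p / 100) - 1]:.1f} ms")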

As a comparison, I did the same against the runsync endpoint of my worker that's running on RunPod:

  • ab -n 300 -c 1 -p post_data.json -T 'application/json' -H 'Authorization: xxx' https://api.runpod.ai/v2/xxx/runsync
  • The response times that ab reports aren't meaningful on their own, since they depend on my local network connection to the server, but the graphs within RunPod show that those 300 calls had a p90 delay time of 104ms and a p90 execution time of 94ms (see the attached screenshots).

So in total that gives a p90 of ~200ms instead of 7ms once the worker is running on RunPod infrastructure (in this case EU-RO-1). I suspect this is either caused by inefficiencies in the queueing code that runs in RunPod production but not in the local rp_serve_api version, or by a few HTTP requests the worker makes to communicate with some kind of central RunPod infrastructure, which add up to the additional time.

I'd love to get some feedback on whether this (using RunPod serverless for real-time applications) is something that you plan to improve, or whether you are currently focusing on providing good service for applications that are less latency-sensitive.

I'd also be happy to help if I can, so let me know if I can do anything else :)

Sopamo commented Oct 23, 2024

@TimPietrusky @michaelfeil I'm sure you have loads of to-dos building amazing GPU cloud features, but I wanted to send a single follow-up comment, because I think that if there were a way to improve the latency, RunPod could be much better at solving real-time AI use cases.
Having an easily scalable endpoint that provides single-digit to low double-digit millisecond latency would make RunPod very appealing for real-time AI.

Feel free to let me know if I can be of any help. Thanks for your work!

@michaelfeil (Contributor)

@Sopamo Low double-digit latency would be complicated; e.g., you would need to actually have a cluster in Germany, not Sweden. Ping to my hometown Frankfurt is ~170ms, while the US west coast is <10ms. Iceland may be >80ms away for you too.
https://www.meter.net/tools/world-ping-test/

Sopamo commented Oct 23, 2024

Yeah, that would be the idea: choosing (for example) Frankfurt as the data center for the main application and then choosing a RunPod server in Frankfurt for AI workloads.
I think the delay times can't be explained by physical distance; otherwise, for example, Sweden would have to be much faster than Texas (the ping test you linked shows 30ms to Sweden for me), but my tests showed it isn't.
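
One way to sanity-check the distance argument from the client side is to time a bare TCP connect to the API host; that gives a rough lower bound on the network component, independent of RunPod's own delay-time metric. A minimal sketch (api.runpod.ai is the host from the ab command above; the sample count is arbitrary):

import socket
import time

host, port = "api.runpod.ai", 443
samples = []
for _ in range(5):
    start = time.time()
    # A TCP handshake costs roughly one network round trip.
    with socket.create_connection((host, port), timeout=5):
        pass
    samples.append((time.time() - start) * 1000)

print(f"TCP connect time to {host}: min {min(samples):.1f} ms")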
