performance issue investigation #9

Open
supa-thibaud opened this issue Aug 23, 2024 · 8 comments

@supa-thibaud

I've been investigating a performance issue with SGLang on RunPod's serverless platform. Here are my key findings:

I identified that SGLang performs significantly worse on the serverless setup compared to a dedicated pod.

The main issue I discovered is that on serverless, requests are being processed sequentially rather than in batches, which is inefficient for SGLang.

In the pod setup, I observed SGLang handling multiple requests simultaneously, while in serverless, it's always "running-req: 1".

I updated the Dockerfile with more recent versions of Python and other dependencies, but this didn't resolve the issue.

I attempted to add concurrency settings similar to those used in the vLLM implementation, but this didn't solve the problem (a sketch of what I tried is included at the end of this comment).

I believe the issue is related to RunPod's serverless architecture, or possibly the async/await implementation in the worker, rather than SGLang itself.
When running tests, I observed that the serverless setup would queue multiple jobs (e.g., "queued 9") but only process one at a time ("Inprogress 1").

I've shared my findings and updated code in a GitHub repository (https://github.com/supa-thibaud/worker-sglang), but I'm at a point where I don't think I can do much more on my side to resolve this issue.

More details on the RunPod Discord: https://discord.com/channels/912829806415085598/1275353840073441362
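For reference, here is a minimal sketch of the concurrency approach I tried, modelled on worker-vllm. It assumes the RunPod Python SDK's `concurrency_modifier` option behaves as it does there, and that an SGLang HTTP server is already listening locally; `SGLANG_URL` and `MAX_CONCURRENCY` are placeholder names, not the actual worker-sglang code.

```python
# Minimal sketch (not the actual worker-sglang code) of the concurrency
# approach from worker-vllm. Assumes the runpod SDK's "concurrency_modifier"
# option and an SGLang server already listening locally; SGLANG_URL and
# MAX_CONCURRENCY are placeholders.
import os

import aiohttp
import runpod

SGLANG_URL = os.environ.get("SGLANG_URL", "http://127.0.0.1:30000/generate")
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "50"))


async def handler(job):
    """Forward one serverless job to the local SGLang server."""
    payload = {
        "text": job["input"]["prompt"],
        "sampling_params": job["input"].get("sampling_params", {}),
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(SGLANG_URL, json=payload) as resp:
            return await resp.json()


def adjust_concurrency(current_concurrency: int) -> int:
    """Tell the worker how many jobs it may run at the same time."""
    return MAX_CONCURRENCY


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})
```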

@pandyamarut
Collaborator

Thanks for the detailed description.
I think implementing custom batching at the worker level and forwarding it to SGLang should fix this, but let me try. I'll update you soon.
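For illustration only, a rough sketch of what batching at the worker level could look like: bundle several prompts into one serverless job and fan them out concurrently, so SGLang's continuous-batching scheduler can group them internally. The `prompts` input field and `SGLANG_URL` are hypothetical, not part of the current worker.

```python
# Rough illustration of worker-level batching (hypothetical, not the current
# worker): a single job carries several prompts, and the handler fans them out
# concurrently so SGLang's own scheduler can batch them internally.
import asyncio
import os

import aiohttp

SGLANG_URL = os.environ.get("SGLANG_URL", "http://127.0.0.1:30000/generate")


async def generate_one(session: aiohttp.ClientSession, prompt: str, params: dict):
    payload = {"text": prompt, "sampling_params": params}
    async with session.post(SGLANG_URL, json=payload) as resp:
        return await resp.json()


async def handler(job):
    """Handle a job whose input bundles multiple prompts ("prompts" is a hypothetical field)."""
    prompts = job["input"]["prompts"]
    params = job["input"].get("sampling_params", {})
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(generate_one(session, p, params) for p in prompts)
        )
    return results
```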

@supa-thibaud
Author

I hope you'll have more success than me with this.
I think I've done my best and can't go any further on my own (except testing your solutions).

I tried these without great results:

https://github.com/supa-thibaud/worker-sglang/tree/async
https://github.com/supa-thibaud/worker-sglang/tree/async_v2

Coding with async seems to help a little: I think I got 2 "in progress" at the same time, but never more, even when flooding the endpoint with dozens of light OpenAI-style requests on 2x H100s (2 GPUs/worker) with an 8B model.
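In case it helps reproduce the behaviour, this is roughly how I flooded the endpoint: many concurrent OpenAI-compatible requests fired at once, then watching how many show up as "in progress" in the console. It assumes the worker exposes RunPod's OpenAI-compatible route; the endpoint ID, API key, and model name are placeholders.

```python
# Rough reproduction sketch: fire many OpenAI-compatible requests concurrently
# at the serverless endpoint and watch the "in progress" count in the console.
# ENDPOINT_ID, RUNPOD_API_KEY and the model name are placeholders.
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)


async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"Short test prompt #{i}"}],
        max_tokens=32,
    )
    return resp.choices[0].message.content


async def main():
    results = await asyncio.gather(*(one_request(i) for i in range(50)))
    print(f"received {len(results)} responses")


asyncio.run(main())
```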

@supa-thibaud
Author

@pandyamarut any updates?

@pandyamarut
Collaborator

Yeah, this seems to be an issue with the interface. There are fixes being pushed for the RunPod SDK. I will test with those and update you as soon as possible. Thanks for being patient.

@supa-thibaud
Author

Hi @pandyamarut, any updates?

@pandyamarut
Collaborator

Sorry for the inconvenience @supa-thibaud. As soon as there's an update, I'll post it here. Thank you.

@nerdylive123

I'm interested in this topic, any updates yet? Also, regarding model download: I saw that worker-vllm has commented-out tensorizer code, but it seems to be missing here. Wouldn't that be useful for speeding up model loading or inference?

@kldzj

kldzj commented Nov 9, 2024

Any update on this issue? The worker is still processing one request at a time.
