performance issue investigation #9

Open
supa-thibaud opened this issue Aug 23, 2024 · 8 comments

@supa-thibaud

I've been investigating a performance issue with SGLang on RunPod's serverless platform. Here are my key findings:

I identified that SGLang performs significantly worse on the serverless setup compared to a dedicated pod.

The main issue I discovered is that on serverless, requests are being processed sequentially rather than in batches, which is inefficient for SGLang.

In the pod setup, I observed SGLang handling multiple requests simultaneously, while in serverless, it's always "running-req: 1".

I updated the Dockerfile with more recent versions of Python and other dependencies, but this didn't resolve the issue.

I attempted to add concurrency settings similar to those used in the vLLM implementation, but this didn't solve the problem (a sketch of what I tried is included at the end of this comment).

I believe the issue is related to RunPod's serverless architecture, or possibly the async/await implementation in the worker, rather than SGLang itself.
When running tests, I observed that the serverless setup would queue multiple jobs (e.g., "queued 9") but only process one at a time ("Inprogress 1").

I've shared my findings and updated code in a GitHub repository (https://github.com/supa-thibaud/worker-sglang), but I'm at a point where I don't think I can do much more on my side to resolve this issue.

More details on the RunPod Discord: https://discord.com/channels/912829806415085598/1275353840073441362
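For reference, here is a minimal sketch of the concurrency approach I tried, modelled on worker-vllm. It assumes the RunPod Python SDK's `concurrency_modifier` option behaves as it does there, and that an SGLang HTTP server is already listening locally; `SGLANG_URL` and `MAX_CONCURRENCY` are placeholder names, not the actual worker-sglang code.

```python
# Minimal sketch (not the actual worker-sglang code) of the concurrency
# approach from worker-vllm. Assumes the runpod SDK's "concurrency_modifier"
# option and an SGLang server already listening locally; SGLANG_URL and
# MAX_CONCURRENCY are placeholders.
import os

import aiohttp
import runpod

SGLANG_URL = os.environ.get("SGLANG_URL", "http://127.0.0.1:30000/generate")
MAX_CONCURRENCY = int(os.environ.get("MAX_CONCURRENCY", "50"))


async def handler(job):
    """Forward one serverless job to the local SGLang server."""
    payload = {
        "text": job["input"]["prompt"],
        "sampling_params": job["input"].get("sampling_params", {}),
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(SGLANG_URL, json=payload) as resp:
            return await resp.json()


def adjust_concurrency(current_concurrency: int) -> int:
    """Tell the worker how many jobs it may run at the same time."""
    return MAX_CONCURRENCY


runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})
```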

@pandyamarut
Collaborator

Thanks for the detailed description.
I think implementing custom batching at the worker level and forwarding it to SGLang should fix this, but let me try. I'll update you soon.
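For illustration only, a rough sketch of what batching at the worker level could look like: bundle several prompts into one serverless job and fan them out concurrently, so SGLang's continuous-batching scheduler can group them internally. The `prompts` input field and `SGLANG_URL` are hypothetical, not part of the current worker.

```python
# Rough illustration of worker-level batching (hypothetical, not the current
# worker): a single job carries several prompts, and the handler fans them out
# concurrently so SGLang's own scheduler can batch them internally.
import asyncio
import os

import aiohttp

SGLANG_URL = os.environ.get("SGLANG_URL", "http://127.0.0.1:30000/generate")


async def generate_one(session: aiohttp.ClientSession, prompt: str, params: dict):
    payload = {"text": prompt, "sampling_params": params}
    async with session.post(SGLANG_URL, json=payload) as resp:
        return await resp.json()


async def handler(job):
    """Handle a job whose input bundles multiple prompts ("prompts" is a hypothetical field)."""
    prompts = job["input"]["prompts"]
    params = job["input"].get("sampling_params", {})
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(generate_one(session, p, params) for p in prompts)
        )
    return results
```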

@supa-thibaud
Author

I hope you'll have more success than me with this.
I think I've done my best and can't go any further on my own (except testing your solutions).

I tried these without great results:

https://github.com/supa-thibaud/worker-sglang/tree/async
https://github.com/supa-thibaud/worker-sglang/tree/async_v2

Coding with async seems to help a little: I think I got 2 "in progress" at the same time, but never more, even when flooding the endpoint with dozens of light OpenAI-style requests on 2x H100s (2 GPUs/worker) with an 8B model.
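In case it helps reproduce the behaviour, this is roughly how I flooded the endpoint: many concurrent OpenAI-compatible requests fired at once, then watching how many show up as "in progress" in the console. It assumes the worker exposes RunPod's OpenAI-compatible route; the endpoint ID, API key, and model name are placeholders.

```python
# Rough reproduction sketch: fire many OpenAI-compatible requests concurrently
# at the serverless endpoint and watch the "in progress" count in the console.
# ENDPOINT_ID, RUNPOD_API_KEY and the model name are placeholders.
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
    api_key=os.environ["RUNPOD_API_KEY"],
)


async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"Short test prompt #{i}"}],
        max_tokens=32,
    )
    return resp.choices[0].message.content


async def main():
    results = await asyncio.gather(*(one_request(i) for i in range(50)))
    print(f"received {len(results)} responses")


asyncio.run(main())
```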

@supa-thibaud
Author

@pandyamarut any updates?

@pandyamarut
Collaborator

Yeah, this seems to be an issue with the interface. There are fixes being pushed for the RunPod SDK. I will test with those and update you as soon as possible. Thanks for being patient.

@supa-thibaud
Author

Hi @pandyamarut, any updates?

@pandyamarut
Collaborator

Sorry for the inconvenience @supa-thibaud. As soon as there's an update, I'll post it here. Thank you.

@nerdylive123

I'm interested in this topic, any updates yet? Also, regarding model download: I saw that worker-vllm has commented-out tensorizer code, but it seems to be missing here. Wouldn't that be useful for speeding up model loading or inference?

@kldzj

kldzj commented Nov 9, 2024

Any update on this issue? The worker is still processing one request at a time.
