Performance issue investigation #9
Comments
Thanks for the detailed description.
I hope you'll have more success than I did. I tried without great results: https://github.com/supa-thibaud/worker-sglang/tree/async. Coding with async seems slightly helpful: I think I got 2 "inprogress" at the same time, but never more, even when flooding with dozens of light (OpenAI) requests on 2 x H100s with 2 GPUs/worker and an 8B model.
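(For context, a minimal sketch of an async handler along those lines, assuming SGLang's default OpenAI-compatible server on localhost:30000; this is illustrative, not the exact code in the fork above.)

```python
import aiohttp
import runpod

# Assumed default endpoint of SGLang's OpenAI-compatible server.
SGLANG_URL = "http://127.0.0.1:30000/v1/chat/completions"

async def handler(job):
    # Forward the job's OpenAI-style payload to the local SGLang server
    # and return its JSON response as the job output.
    async with aiohttp.ClientSession() as session:
        async with session.post(SGLANG_URL, json=job["input"]) as resp:
            return await resp.json()

runpod.serverless.start({"handler": handler})
```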
@pandyamarut any updates?
Yeah, this seems to be an issue with the interface. There are fixes being pushed for the RunPod SDK. I will test with those and update you at the earliest. Thanks for being patient.
Hi @pandyamarut, any updates?
Sorry for the inconvenience, @supa-thibaud. As soon as there's an update, I'll post it here. Thank you.
I'm interested in this topic. Any updates yet? Also, regarding model download: I saw commented-out tensorizer code in vllm-worker, but it seems to be missing here. Wouldn't that help speed up model loading or inference?
Any update on this issue? The worker is still processing one request at a time. |
I've been investigating a performance issue with SGLang on RunPod's serverless platform. Here are my key findings:
- SGLang performs significantly worse on the serverless setup than on a dedicated pod.
- The main issue: on serverless, requests are processed sequentially rather than batched, which defeats SGLang's continuous batching.
- In the pod setup, SGLang handles multiple requests simultaneously; on serverless it is always "running-req: 1".
- I updated the Dockerfile with more recent versions of Python and other dependencies, but this didn't resolve the issue.
- I tried adding concurrency settings similar to those in the vLLM implementation (see the sketch after this list), but this didn't solve the problem either.
- I believe the issue lies with RunPod's serverless architecture or the async/await implementation, rather than with SGLang itself.
- Under load, the serverless setup would queue multiple jobs (e.g., "queued 9") but only process one at a time ("Inprogress 1").
- My findings and updated code are in a GitHub repository (https://github.com/supa-thibaud/worker-sglang), but at this point I don't think I can do much more on my side to resolve this.
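For reference, a minimal sketch of the concurrency approach tried, modeled on the vLLM worker's use of the RunPod SDK's concurrency_modifier hook. The handler body and the cap of 50 are illustrative placeholders, not values from the repo.

```python
import asyncio
import runpod

def concurrency_modifier(current_concurrency: int) -> int:
    # Tell the RunPod SDK how many jobs this worker may run in parallel.
    # 50 is an illustrative cap, not a tuned value.
    return 50

async def handler(job):
    # Placeholder work: a real handler would forward job["input"] to the
    # local SGLang server, as in the async sketch earlier in the thread.
    await asyncio.sleep(0)
    return {"echo": job["input"]}

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```

With this hook in place, a single worker should accept multiple jobs at once; the symptom described above is that even with it, only one job leaves the queue at a time.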
More details on the RunPod Discord: https://discord.com/channels/912829806415085598/1275353840073441362