[Feature] deepseek v3 60 tokens/sec on deepseek API vs. 13 tokens/sec on sglang #3196
Comments
Hello. I'm running on two H100 nodes and achieving a speed of 33 tokens per second. While this is likely not the maximum possible speed, it's still significantly higher than 13 tokens per second. Make sure you're fully utilizing InfiniBand, as this can greatly impact performance. If you're using the standard Docker image, that might be the source of the issue. You can find a potential solution in this GitHub issue: https://github.com/sgl-project/sglang/issues/2817 |
Also, wait for the next version of FlashInfer. It could be much quicker. |
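A rough way to sanity-check the InfiniBand point (a sketch, not taken from this thread; the model path and flags below are placeholders borrowed from the commands later in this issue): confirm the IB HCAs are visible, then launch the server with NCCL debug logging so the startup logs show whether NCCL picked IB or silently fell back to TCP sockets.

# Rough diagnostic sketch; multi-node flags omitted for brevity.
import os
import subprocess

# List InfiniBand devices and port state (requires infiniband-diags).
subprocess.run(["ibstat"], check=False)

# Show the GPU <-> NIC topology to confirm the NICs sit close to the GPUs.
subprocess.run(["nvidia-smi", "topo", "-m"], check=False)

# Launch sglang with NCCL_DEBUG=INFO so the logs report the chosen transport:
# look for "NET/IB" in the output; "NET/Socket" means NCCL fell back to TCP.
env = {**os.environ, "NCCL_DEBUG": "INFO"}
subprocess.run(
    ["python3", "-m", "sglang.launch_server",
     "--model-path", "/models/DeepSeek-V3",  # placeholder path
     "--tp", "16", "--trust-remote-code"],
    env=env,
    check=False,
)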
Thanks. 33 tokens/sec is definitely closer to what DeepSeek reports, but still a factor of 2 away. In the setups I tried, listed above, I didn't try 2*H100. Are you able to get the full 128k context in that mode? Can you share your startup command? Thanks! |
On a single 8*H200 system (so InfiniBand is not an issue) with this script:
I get:
i.e. quite poor even for just 8k context, and very poor for 120k context. Here's how I launch:
logs:
nvidia-smi
I followed the default instructions for 8*H200: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 |
If I use proper streaming code, here is what I get:
|
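A minimal sketch of that kind of streaming measurement, assuming sglang's OpenAI-compatible endpoint on the default port 30000 and the openai Python client (the port, model name, and prompt are placeholders, not the exact script used above):

# Measure TTFT and decode throughput from a single streamed request.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="default",  # placeholder; use the served model name
    messages=[{"role": "user", "content": "Write a long essay about GPUs."}],
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1  # each content chunk is roughly one token
end = time.perf_counter()

if first_token_at is not None:
    print(f"TTFT: {first_token_at - start:.2f} s")
    print(f"decode: {n_chunks / (end - first_token_at):.1f} tokens/s (approx.)")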
So then the question is: how do I speed things up for long-context queries? |
Thank you for sharing your code. I obtained similar results. When I mentioned a speed of 30 tokens per second in my initial message, I was using a relatively short prompt. So the issue remains relevant. I tested DeepSeek-R1 on 2 H100 nodes and achieved the following results:
To run the application, I used Apptainer (with the Docker image nvcr.io/nvidia/tritonserver:24.04-py3-min as the base image) and executed the following commands, one per node:

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --host 0.0.0.0 --port 8100 --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 0

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 1 |
I'll try vLLM and see how it goes. |
@pseudotensor SGLang should be the fastest among open-source options at present. Of course, there is still much room for improvement, and we are already optimizing it. Please stay tuned. |
Checklist
Motivation
The PRs for AMD + sglang and NVIDIA + sglang said DeepSeek V3 was "fully" supported, but the speed suggests something is off. A single sequence runs at only on the order of 13 tokens/sec for long generations, with a TTFT on the order of 2 seconds. This is consistent with vLLM as well, and it holds for 8*MI300X, 8*H200, and 2*8*H200.
For only 37B parameters + 14B MoE parameters active, this seems way too slow. Also, the DeepSeek API (before it started to break down) ran on the order of 60 tokens/sec early on, and they advertise 60 tokens/sec. That is more in line with the number of active parameters.
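As a rough sanity check (ballpark assumptions, not measurements: ~37B activated parameters per token, FP8 weights at about 1 byte/parameter, ~4.8 TB/s of HBM bandwidth per H200), the pure weight-read floor for single-sequence decode on 8*H200 is roughly

$$
\frac{37\ \text{GB of activated weights}}{8 \times 4.8\ \text{TB/s aggregate HBM bandwidth}} \approx 0.96\ \text{ms/token} \;\Rightarrow\; \lesssim 1000\ \text{tokens/s},
$$

so the observed 13 tokens/sec is far below even a naive memory-bandwidth ceiling; communication and kernel overheads account for part of the gap, but the API's ~60 tokens/sec suggests much more is attainable.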
What is missing from truly fully supporting DeepSeek V3 and R1? Can these features be enumerated and added to a roadmap?
Related resources
No response