
[Feature] deepseek v3 60 tokens/sec on deepseek API vs. 13 tokens/sec on sglang #3196

Open
pseudotensor opened this issue Jan 28, 2025 · 9 comments
Assignees
Labels
help wanted

Comments

@pseudotensor

Motivation

The PRs for AMD + sglang and NVIDIA + sglang stated that DeepSeek V3 is "fully" supported, but the speed seems to be off. A single sequence runs at only about 13 tokens/sec for long generations, with TTFT around 2 seconds. This is consistent with vLLM as well. True for 8*MI300X, 8*H200, or 2*8*H200.

For only 37B parameters + 14B MoE parameters active, this seems way too slow. Also, the DeepSeek API (before it started to break down) ran at around 60 tokens/sec early on, and they advertise 60 tokens/sec. That is more in line with the number of active parameters.
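
A rough bandwidth bound supports this (a back-of-envelope sketch with my own assumed numbers: FP8 weights and ~4.8 TB/s of HBM bandwidth per H200):

# Back-of-envelope decode bound, assuming single-sequence decode is memory-bandwidth limited.
active_params = 37e9          # ~37B parameters activated per token
bytes_per_param = 1           # FP8 weights
bandwidth_per_gpu = 4.8e12    # ~4.8 TB/s HBM per H200 (spec figure)
num_gpus = 8

bytes_per_token = active_params * bytes_per_param
tokens_per_sec_bound = (bandwidth_per_gpu * num_gpus) / bytes_per_token
print(f"{tokens_per_sec_bound:.0f} tokens/sec upper bound")  # ~1000, ignoring KV cache and overheads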

What is missing for truly fully supporting DeepSeek V3 and R1? Can these features be enumerated and added to a roadmap?

Related resources

No response

@EgorovMike219

Hello. I'm running on two H100s and achieving a speed of 33 tokens per second. While this is likely not the maximum possible speed, it's still significantly higher than 13 tokens per second. Make sure you're fully utilizing Infiniband, as this can greatly impact performance. If you're using the standard Docker image, that might be the source of the issue. You can find a potential solution on this GitHub page: https://github.com/sgl-project/sglang/issues/2817 .
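
A quick way to sanity-check this from inside the container (a minimal sketch of my own, assuming a Linux host that exposes RDMA devices under /sys/class/infiniband):

import os

# An empty device list inside the container usually means NCCL will fall back to
# TCP sockets instead of InfiniBand for inter-node traffic.
ib_path = "/sys/class/infiniband"
ib_devices = os.listdir(ib_path) if os.path.isdir(ib_path) else []
print("InfiniBand devices visible:", ib_devices or "none")

# With NCCL_DEBUG=INFO set, NCCL logs at startup which transport (IB vs. sockets) it picked.
print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG", "<unset>"))
print("NCCL_IB_DISABLE =", os.environ.get("NCCL_IB_DISABLE", "<unset>"))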

@zhaochenyang20
Collaborator

Also, wait for the next version of flashinfer. It could be much quicker.

@zhaochenyang20 zhaochenyang20 self-assigned this Jan 29, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted label Jan 29, 2025
@pseudotensor
Author

@EgorovMike219

> Hello. I'm running on two H100s and achieving a speed of 33 tokens per second. While this is likely not the maximum possible speed, it's still significantly higher than 13 tokens per second. Make sure you're fully utilizing Infiniband, as this can greatly impact performance. If you're using the standard Docker image, that might be the source of the issue. You can find a potential solution on this GitHub page: https://github.com/sgl-project/sglang/issues/2817 .

Thanks. 33 tokens/sec is definitely closer to what DeepSeek reports, but still a factor of 2 away.

In the setups I tried, listed above, I didn't try 2*H100. Are you able to get full 128k context in that mode? Can you share your startup command? Thanks!

@pseudotensor
Author

pseudotensor commented Jan 30, 2025

On a single 8*H200 system (so InfiniBand is not an issue), with this script:

import time
import openai
import tiktoken
from datetime import datetime

def count_tokens(prompt):
    # cl100k_base is not DeepSeek's tokenizer, so this is only a rough token count
    encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(prompt))

def measure_performance(prompt, model="deepseek-ai/DeepSeek-V3", max_tokens=64):
    client = openai.Client(base_url="URL:5000/v1", api_key="KEY")

    token_count = count_tokens(prompt)
    start_time = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt},
        ],
        temperature=0,
        max_tokens=max_tokens,
    )

    # Non-streaming call: the "first token" time here is really the full response time,
    # and splitting on whitespace only approximates the generated token count.
    first_token_time = time.time() - start_time
    total_tokens = len(response.choices[0].message.content.split())
    total_time = time.time() - start_time
    tps = total_tokens / total_time if total_time > 0 else 0

    return {
        "prompt_length": token_count,
        "max_tokens": max_tokens,
        "time_to_first_token": first_token_time,
        "tokens_per_second": tps,
        "total_time": total_time,
        "total_tokens": total_tokens,
    }

def generate_markdown(results):
    md_report = """# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
"""

    for res in results:
        md_report += f"| {res['prompt_length']} | {res['max_tokens']} | {res['time_to_first_token']:.4f} | {res['tokens_per_second']:.4f} | {res['total_time']:.4f} | {res['total_tokens']} |\n"

    return md_report

def main():
    test_cases = [
        ("List 3 countries and their capitals.", 64),
        ("word " * 8000 + "Long context test.", 256),
        ("word " * 118000 + "Extreme long context test.", 512)
    ]

    results = []
    for prompt, max_tokens in test_cases:
        res = measure_performance(prompt, max_tokens=max_tokens)
        results.append(res)

    markdown_report = generate_markdown(results)

    with open("performance_report.md", "w") as f:
        f.write(markdown_report)

    print("Performance report generated: performance_report.md")

if __name__ == "__main__":
    main()

I get:

# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
| 8 | 64 | 1.1590 | 16.3928 | 1.1590 | 19 |
| 8004 | 256 | 5.8025 | 6.5489 | 5.8025 | 38 |
| 118005 | 512 | 198.0117 | 0.2222 | 198.0117 | 44 |


i.e. quite poor even for just 8k context, and very poor for 120k context.
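
As an aside, the whitespace split in the script above undercounts generated tokens; the OpenAI-compatible endpoint also returns a usage block, so exact counts can be read directly (a minimal sketch, assuming the server populates usage on non-streaming responses):

# Exact token accounting from the server's own usage block (non-streaming response
# object from the script above); completion_tokens replaces the whitespace-split estimate.
usage = response.usage
if usage is not None:
    prompt_tokens = usage.prompt_tokens
    completion_tokens = usage.completion_tokens
    tps = completion_tokens / total_time if total_time > 0 else 0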

Here's how I launch:

docker run -d --gpus all --shm-size 32g -p 5000:5000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 5000 --host 0.0.0.0  --api-key KEY  --random-seed 1234

logs:

loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triloc(t"on/_sogpls-/wdoerckosdpea_caet/tsegnltainogn/.ppyyt"h:o310n:/16sg)l: aerror: noperation scheduled before its operands
g/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands

...

[2025-01-29 19:50:52 TP0] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP6] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP1] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP2] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP5] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP3] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP4] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP7] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:53] INFO:     Started server process [1]
[2025-01-29 19:50:53] INFO:     Waiting for application startup.
[2025-01-29 19:50:53] INFO:     Application startup complete.
[2025-01-29 19:50:53] INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
[2025-01-29 19:50:54] INFO:     127.0.0.1:40454 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-29 19:50:54 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-29 19:50:56 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:51:01] INFO:     127.0.0.1:40462 - "POST /generate HTTP/1.1" 200 OK


...

[2025-01-29 20:53:04 TP0] Prefill batch. #new-seq: 1, #new-token: 8192, #cached-token: 0, cache hit rate: 6.54%, token usage: 0.22, #running-req: 0, #queue-req: 1
[2025-01-29 20:53:27 TP0] Prefill batch. #new-seq: 1, #new-token: 3510, #cached-token: 0, cache hit rate: 6.36%, token usage: 0.24, #running-req: 0, #queue-req: 1
[2025-01-29 20:53:44 TP0] Decode batch. #running-req: 1, #token: 118041, token usage: 0.25, gen throughput (token/s): 0.20, #queue-req: 0

nvidia-smi

(sglang) shadeform@shadeform-system11:~$ nvidia-smi
Thu Jan 30 05:00:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:03:00.0 Off |                    0 |
| N/A   33C    P0            115W /  700W |  124585MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H200                    On  |   00000000:23:00.0 Off |                    0 |
| N/A   30C    P0            115W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H200                    On  |   00000000:43:00.0 Off |                    0 |
| N/A   32C    P0            114W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H200                    On  |   00000000:63:00.0 Off |                    0 |
| N/A   29C    P0            110W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H200                    On  |   00000000:83:00.0 Off |                    0 |
| N/A   33C    P0            112W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H200                    On  |   00000000:A3:00.0 Off |                    0 |
| N/A   30C    P0            116W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H200                    On  |   00000000:C3:00.0 Off |                    0 |
| N/A   33C    P0            116W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H200                    On  |   00000000:E3:00.0 Off |                    0 |
| N/A   30C    P0            113W /  700W |  124009MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     38119      C   sglang::scheduler                           12457... |
|    1   N/A  N/A     38120      C   sglang::scheduler                           12471... |
|    2   N/A  N/A     38121      C   sglang::scheduler                           12471... |
|    3   N/A  N/A     38122      C   sglang::scheduler                           12471... |
|    4   N/A  N/A     38123      C   sglang::scheduler                           12471... |
|    5   N/A  N/A     38124      C   sglang::scheduler                           12471... |
|    6   N/A  N/A     38125      C   sglang::scheduler                           12471... |
|    7   N/A  N/A     38126      C   sglang::scheduler                           12399... |
+-----------------------------------------------------------------------------------------+
(sglang) shadeform@shadeform-system11:~$ 

Followed default instructions for 8*H200:

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

@pseudotensor
Author

If I use proper streaming code, here is what I get:

[Image: streaming performance results]

import time
import openai
from transformers import AutoTokenizer
from datetime import datetime

def count_tokens(prompt, model_name="deepseek-ai/DeepSeek-V3"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return len(tokenizer.encode(prompt))

def measure_performance(prompt, model="deepseek-ai/DeepSeek-V3", max_tokens=64):
    client = openai.Client(base_url="URL:5000/v1", api_key="KEY")

    token_count = count_tokens(prompt, model)
    start_time = time.time()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=max_tokens,
        stream=True  # Enable streaming mode
    )

    first_token_time = None
    total_tokens = 0
    first_token_received = False

    for chunk in response:
        # Some chunks (e.g. the role-only first chunk or the final finish chunk) carry no text.
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if not delta:
            continue
        if not first_token_received:
            first_token_time = time.time() - start_time
            first_token_received = True
        total_tokens += len(delta.split())

    total_time = time.time() - start_time
    tps = total_tokens / total_time if total_time > 0 else 0

    return {
        "prompt_length": token_count,
        "max_tokens": max_tokens,
        "time_to_first_token": first_token_time,
        "tokens_per_second": tps,
        "total_time": total_time,
        "total_tokens": total_tokens,
    }

def generate_markdown(results):
    md_report = """# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
"""

    for res in results:
        md_report += f"| {res['prompt_length']} | {res['max_tokens']} | {res['time_to_first_token']:.4f} | {res['tokens_per_second']:.4f} | {res['total_time']:.4f} | {res['total_tokens']} |\n"

    return md_report

def main():
    test_cases = [
        ("Write an extremely long story.", 8192),
        ("word " * 8000 + "Write an extremely long story.", 8192),
        ("word " * 118000 + "Write an extremely long story.", 8192)
    ]

    results = []
    for prompt, max_tokens in test_cases:
        res = measure_performance(prompt, max_tokens=max_tokens)
        results.append(res)

    markdown_report = generate_markdown(results)

    with open("performance_report.md", "w") as f:
        f.write(markdown_report)

    print("Performance report generated: performance_report.md")

if __name__ == "__main__":
    main()
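
One caveat when reading these numbers: the tokens-per-second above divides by the total time including prefill, which dominates at long context. Decode-only throughput, which is the figure comparable to the advertised 60 tokens/sec, can be split out from the variables already computed in the streaming loop (a small sketch):

# Decode-only throughput: exclude time-to-first-token (prefill) from the denominator.
decode_time = total_time - (first_token_time or 0.0)
decode_tps = total_tokens / decode_time if decode_time > 0 else 0.0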

@pseudotensor
Author

So then the question is: how do I speed things up for long-context queries?

@EgorovMike219

> If I use proper streaming code, here is what I get:

Thank you for sharing your code. I obtained similar results. When I mentioned a speed of 30 tokens per second in my initial message, I was using a relatively short prompt. So the issue remains relevant.

I tested DeepSeek-R1 on 2 H100 nodes and achieved the following results:

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
| 7 | 8192 | 0.2580 | 32.1598 | 66.3561 | 2134 |
| 8007 | 8192 | 0.9969 | 26.0523 | 51.3967 | 1339 |
| 118007 | 8192 | 96.6305 | 5.3192 | 362.8391 | 1930 |

To run the application, I used Apptainer and executed the following commands (as the base image, I used the Docker image nvcr.io/nvidia/tritonserver:24.04-py3-min):

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --host 0.0.0.0 --port 8100 --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 0

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 1

@pseudotensor
Author

I'll try vLLM and see how it goes.

@zhyncs
Member

zhyncs commented Jan 30, 2025

@pseudotensor SGLang should be the fastest among open-source options at present. Of course, there is still much room for improvement, and we are already optimizing it. Please stay tuned.
