
[Feature] deepseek v3 60 tokens/sec on deepseek API vs. 13 tokens/sec on sglang #3196

Open
pseudotensor opened this issue Jan 28, 2025 · 9 comments
Assignees
Labels
help wanted

Comments

@pseudotensor

Motivation

The PRs for AMD + sglang and NVIDIA + sglang stated that DeepSeek V3 is "fully" supported, but the speed seems to be off. A single sequence runs at only about 13 tokens/sec for long generations, with TTFT around 2 seconds. This is consistent with vLLM as well. True for 8*MI300X, 8*H200, or 2*8*H200.

For only 37B parameters + 14B MoE parameters active, this seems way too slow. Also, the DeepSeek API (before it started to break down) ran at around 60 tokens/sec early on, and they advertise 60 tokens/sec. That is more in line with the number of active parameters.
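
A rough bandwidth bound supports this (a back-of-envelope sketch with my own assumed numbers: FP8 weights and ~4.8 TB/s of HBM bandwidth per H200):

# Back-of-envelope decode bound, assuming single-sequence decode is memory-bandwidth limited.
active_params = 37e9          # ~37B parameters activated per token
bytes_per_param = 1           # FP8 weights
bandwidth_per_gpu = 4.8e12    # ~4.8 TB/s HBM per H200 (spec figure)
num_gpus = 8

bytes_per_token = active_params * bytes_per_param
tokens_per_sec_bound = (bandwidth_per_gpu * num_gpus) / bytes_per_token
print(f"{tokens_per_sec_bound:.0f} tokens/sec upper bound")  # ~1000, ignoring KV cache and overheads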

What is missing for truly fully supporting DeepSeek V3 and R1? Can these features be enumerated and added to a roadmap?

Related resources

No response

@EgorovMike219

Hello. I'm running on two H100s and achieving a speed of 33 tokens per second. While this is likely not the maximum possible speed, it's still significantly higher than 13 tokens per second. Make sure you're fully utilizing Infiniband, as this can greatly impact performance. If you're using the standard Docker image, that might be the source of the issue. You can find a potential solution on this GitHub page: https://github.com/sgl-project/sglang/issues/2817 .
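
A quick way to sanity-check this from inside the container (a minimal sketch of my own, assuming a Linux host that exposes RDMA devices under /sys/class/infiniband):

import os

# An empty device list inside the container usually means NCCL will fall back to
# TCP sockets instead of InfiniBand for inter-node traffic.
ib_path = "/sys/class/infiniband"
ib_devices = os.listdir(ib_path) if os.path.isdir(ib_path) else []
print("InfiniBand devices visible:", ib_devices or "none")

# With NCCL_DEBUG=INFO set, NCCL logs at startup which transport (IB vs. sockets) it picked.
print("NCCL_DEBUG =", os.environ.get("NCCL_DEBUG", "<unset>"))
print("NCCL_IB_DISABLE =", os.environ.get("NCCL_IB_DISABLE", "<unset>"))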

@zhaochenyang20
Collaborator

Also, wait for the next version of flashinfer. It could be much quicker.

@zhaochenyang20 zhaochenyang20 self-assigned this Jan 29, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted label Jan 29, 2025
@pseudotensor
Author

@EgorovMike219

> Hello. I'm running on two H100s and achieving a speed of 33 tokens per second. While this is likely not the maximum possible speed, it's still significantly higher than 13 tokens per second. Make sure you're fully utilizing Infiniband, as this can greatly impact performance. If you're using the standard Docker image, that might be the source of the issue. You can find a potential solution on this GitHub page: https://github.com/sgl-project/sglang/issues/2817 .

Thanks. 33 tokens/sec is definitely closer to what DeepSeek reports, but still a factor of 2 away.

In the setups I tried, listed above, I didn't try 2*H100. Are you able to get full 128k context in that mode? Can you share your startup command? Thanks!

@pseudotensor
Author

pseudotensor commented Jan 30, 2025

On a single 8*H200 system (so InfiniBand is not an issue), with this script:

import time
import openai
import tiktoken
from datetime import datetime

def count_tokens(prompt):
    # cl100k_base is not DeepSeek's tokenizer, so this is only a rough token count
    encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(prompt))

def measure_performance(prompt, model="deepseek-ai/DeepSeek-V3", max_tokens=64):
    client = openai.Client(base_url="URL:5000/v1", api_key="KEY")

    token_count = count_tokens(prompt)
    start_time = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt},
        ],
        temperature=0,
        max_tokens=max_tokens,
    )

    # Non-streaming call: the "first token" time here is really the full response time,
    # and splitting on whitespace only approximates the generated token count.
    first_token_time = time.time() - start_time
    total_tokens = len(response.choices[0].message.content.split())
    total_time = time.time() - start_time
    tps = total_tokens / total_time if total_time > 0 else 0

    return {
        "prompt_length": token_count,
        "max_tokens": max_tokens,
        "time_to_first_token": first_token_time,
        "tokens_per_second": tps,
        "total_time": total_time,
        "total_tokens": total_tokens,
    }

def generate_markdown(results):
    md_report = """# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
"""

    for res in results:
        md_report += f"| {res['prompt_length']} | {res['max_tokens']} | {res['time_to_first_token']:.4f} | {res['tokens_per_second']:.4f} | {res['total_time']:.4f} | {res['total_tokens']} |\n"

    return md_report

def main():
    test_cases = [
        ("List 3 countries and their capitals.", 64),
        ("word " * 8000 + "Long context test.", 256),
        ("word " * 118000 + "Extreme long context test.", 512)
    ]

    results = []
    for prompt, max_tokens in test_cases:
        res = measure_performance(prompt, max_tokens=max_tokens)
        results.append(res)

    markdown_report = generate_markdown(results)

    with open("performance_report.md", "w") as f:
        f.write(markdown_report)

    print("Performance report generated: performance_report.md")

if __name__ == "__main__":
    main()

I get:

# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
| 8 | 64 | 1.1590 | 16.3928 | 1.1590 | 19 |
| 8004 | 256 | 5.8025 | 6.5489 | 5.8025 | 38 |
| 118005 | 512 | 198.0117 | 0.2222 | 198.0117 | 44 |


i.e. quite poor even for just 8k context, and very poor for 120k context.
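
As an aside, the whitespace split in the script above undercounts generated tokens; the OpenAI-compatible endpoint also returns a usage block, so exact counts can be read directly (a minimal sketch, assuming the server populates usage on non-streaming responses):

# Exact token accounting from the server's own usage block (non-streaming response
# object from the script above); completion_tokens replaces the whitespace-split estimate.
usage = response.usage
if usage is not None:
    prompt_tokens = usage.prompt_tokens
    completion_tokens = usage.completion_tokens
    tps = completion_tokens / total_time if total_time > 0 else 0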

Here's how I launch:

docker run -d --gpus all --shm-size 32g -p 5000:5000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 5000 --host 0.0.0.0  --api-key KEY  --random-seed 1234

logs:

loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triloc(t"on/_sogpls-/wdoerckosdpea_caet/tsegnltainogn/.ppyyt"h:o310n:/16sg)l: aerror: noperation scheduled before its operands
g/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands
loc("/sgl-workspace/sglang/python/sglang/srt/layers/attention/triton_ops/decode_attention.py":310:16): error: operation scheduled before its operands

...

[2025-01-29 19:50:52 TP0] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP6] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP1] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP2] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP5] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP3] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP4] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:52 TP7] max_total_num_tokens=480406, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=163840
[2025-01-29 19:50:53] INFO:     Started server process [1]
[2025-01-29 19:50:53] INFO:     Waiting for application startup.
[2025-01-29 19:50:53] INFO:     Application startup complete.
[2025-01-29 19:50:53] INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
[2025-01-29 19:50:54] INFO:     127.0.0.1:40454 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-29 19:50:54 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-29 19:50:56 TP5] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP3] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP6] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP4] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP1] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP2] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:50:56 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-29 19:51:01] INFO:     127.0.0.1:40462 - "POST /generate HTTP/1.1" 200 OK


...

[2025-01-29 20:53:04 TP0] Prefill batch. #new-seq: 1, #new-token: 8192, #cached-token: 0, cache hit rate: 6.54%, token usage: 0.22, #running-req: 0, #queue-req: 1
[2025-01-29 20:53:27 TP0] Prefill batch. #new-seq: 1, #new-token: 3510, #cached-token: 0, cache hit rate: 6.36%, token usage: 0.24, #running-req: 0, #queue-req: 1
[2025-01-29 20:53:44 TP0] Decode batch. #running-req: 1, #token: 118041, token usage: 0.25, gen throughput (token/s): 0.20, #queue-req: 0

nvidia-smi

(sglang) shadeform@shadeform-system11:~$ nvidia-smi
Thu Jan 30 05:00:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:03:00.0 Off |                    0 |
| N/A   33C    P0            115W /  700W |  124585MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H200                    On  |   00000000:23:00.0 Off |                    0 |
| N/A   30C    P0            115W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H200                    On  |   00000000:43:00.0 Off |                    0 |
| N/A   32C    P0            114W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H200                    On  |   00000000:63:00.0 Off |                    0 |
| N/A   29C    P0            110W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H200                    On  |   00000000:83:00.0 Off |                    0 |
| N/A   33C    P0            112W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H200                    On  |   00000000:A3:00.0 Off |                    0 |
| N/A   30C    P0            116W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H200                    On  |   00000000:C3:00.0 Off |                    0 |
| N/A   33C    P0            116W /  700W |  124729MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H200                    On  |   00000000:E3:00.0 Off |                    0 |
| N/A   30C    P0            113W /  700W |  124009MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     38119      C   sglang::scheduler                           12457... |
|    1   N/A  N/A     38120      C   sglang::scheduler                           12471... |
|    2   N/A  N/A     38121      C   sglang::scheduler                           12471... |
|    3   N/A  N/A     38122      C   sglang::scheduler                           12471... |
|    4   N/A  N/A     38123      C   sglang::scheduler                           12471... |
|    5   N/A  N/A     38124      C   sglang::scheduler                           12471... |
|    6   N/A  N/A     38125      C   sglang::scheduler                           12471... |
|    7   N/A  N/A     38126      C   sglang::scheduler                           12399... |
+-----------------------------------------------------------------------------------------+
(sglang) shadeform@shadeform-system11:~$ 

Followed default instructions for 8*H200:

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3

@pseudotensor
Author

If I use proper streaming code, here is what I get:

[Image: streaming performance results]

import time
import openai
from transformers import AutoTokenizer
from datetime import datetime

def count_tokens(prompt, model_name="deepseek-ai/DeepSeek-V3"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return len(tokenizer.encode(prompt))

def measure_performance(prompt, model="deepseek-ai/DeepSeek-V3", max_tokens=64):
    client = openai.Client(base_url="URL:5000/v1", api_key="KEY")

    token_count = count_tokens(prompt, model)
    start_time = time.time()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=max_tokens,
        stream=True  # Enable streaming mode
    )

    first_token_time = None
    total_tokens = 0
    first_token_received = False

    for chunk in response:
        # Some chunks (e.g. the role-only first chunk or the final finish chunk) carry no text.
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if not delta:
            continue
        if not first_token_received:
            first_token_time = time.time() - start_time
            first_token_received = True
        total_tokens += len(delta.split())

    total_time = time.time() - start_time
    tps = total_tokens / total_time if total_time > 0 else 0

    return {
        "prompt_length": token_count,
        "max_tokens": max_tokens,
        "time_to_first_token": first_token_time,
        "tokens_per_second": tps,
        "total_time": total_time,
        "total_tokens": total_tokens,
    }

def generate_markdown(results):
    md_report = """# Token Performance Analysis

## Summary of Response Time and Throughput

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
"""

    for res in results:
        md_report += f"| {res['prompt_length']} | {res['max_tokens']} | {res['time_to_first_token']:.4f} | {res['tokens_per_second']:.4f} | {res['total_time']:.4f} | {res['total_tokens']} |\n"

    return md_report

def main():
    test_cases = [
        ("Write an extremely long story.", 8192),
        ("word " * 8000 + "Write an extremely long story.", 8192),
        ("word " * 118000 + "Write an extremely long story.", 8192)
    ]

    results = []
    for prompt, max_tokens in test_cases:
        res = measure_performance(prompt, max_tokens=max_tokens)
        results.append(res)

    markdown_report = generate_markdown(results)

    with open("performance_report.md", "w") as f:
        f.write(markdown_report)

    print("Performance report generated: performance_report.md")

if __name__ == "__main__":
    main()
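
One caveat when reading these numbers: the tokens-per-second above divides by the total time including prefill, which dominates at long context. Decode-only throughput, which is the figure comparable to the advertised 60 tokens/sec, can be split out from the variables already computed in the streaming loop (a small sketch):

# Decode-only throughput: exclude time-to-first-token (prefill) from the denominator.
decode_time = total_time - (first_token_time or 0.0)
decode_tps = total_tokens / decode_time if decode_time > 0 else 0.0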

@pseudotensor
Author

So then the question is: how do I speed things up for long-context queries?

@EgorovMike219

> If I use proper streaming code, here is what I get:

Thank you for sharing your code. I obtained similar results. When I mentioned a speed of 30 tokens per second in my initial message, I was using a relatively short prompt. So the issue remains relevant.

I tested DeepSeek-R1 on 2 H100 nodes and achieved the following results:

| Prompt Length | Max Tokens | Time to First Token (s) | Tokens per Second | Total Time (s) | Total Tokens |
|--------------|------------|-------------------------|-------------------|---------------|-------------|
| 7 | 8192 | 0.2580 | 32.1598 | 66.3561 | 2134 |
| 8007 | 8192 | 0.9969 | 26.0523 | 51.3967 | 1339 |
| 118007 | 8192 | 96.6305 | 5.3192 | 362.8391 | 1930 |

To run the application, I used Apptainer and executed the following commands (as the base image, I used the Docker image nvcr.io/nvidia/tritonserver:24.04-py3-min):

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --host 0.0.0.0 --port 8100 --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 0

apptainer exec --nv -B /models/:/models sglang_v0.4.1_post4.sif python3 -m sglang.launch_server --model-path /models/DeepSeek-V3 --tp 16 --trust-remote-code --max-prefill-tokens 126000 --nccl-init-addr hgx224:20000 --nnodes 2 --node-rank 1

@pseudotensor
Author

I'll try vLLM and see how it goes.

@zhyncs
Member

zhyncs commented Jan 30, 2025

@pseudotensor SGLang should be the fastest among open-source options at present. Of course, there is still much room for improvement, and we are already optimizing it. Please stay tuned.
