-
Hi @imoneoi, thanks for the question. Yes, vLLM has some CPU-side overheads that can reduce its GPU utilization. For example, the tokenizer can be a performance bottleneck, especially when the slow tokenizer is used. As another example, the sampler can be slow, especially when the requests use different sampling parameters (e.g., some requests use nucleus sampling while others use beam search). FastAPI may also add overhead when the request rate is high. For now, it's difficult to tell which of these is causing the slowdown. Thanks again for reporting it. We will continue to identify and optimize these performance issues.
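If you want to rule out the slow tokenizer, one quick check is to make sure the fast tokenizer is actually being picked up. A minimal sketch (the model name is just an example; `tokenizer_mode` is available in recent vLLM releases):

```python
# Minimal sketch (model name is an example): request the fast Rust-based
# tokenizer explicitly so the slow Python tokenizer cannot become the CPU bottleneck.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # example model; substitute your own
    tokenizer_mode="auto",             # "auto" prefers the fast tokenizer when one is available
)

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```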
-
I have a similar issue. I'm using vLLM with Qwen2.5 32B GGUF q4 via the OpenAI endpoints to do chat completion. I know this is experimental, but I am VRAM poor and need this kind of model. (OK, I'm testing in the cloud with short-lived machines, but I'm looking for something that can be used beyond testing without ridiculous cost.) I've been increasing the number of GPUs, expecting token generation time to roughly halve each time I double them. I'm finding that the GPUs hit around 80-95% usage (which is good), but after a certain number of GPUs (depending on the GPU model), the performance gains hit a wall.
And the worst part is that it doesn't matter whether I have more than 2 vCPUs: it always bottlenecks on a single (real) core at 100% usage. I've tried with 24 vCPUs (12 real cores), same result. This is a single chat request, not parallel calls. I've played with every option I can find related to concurrency, parallelism, and threads. I guess it's a slow tokeniser, but also one that can't use multithreading? My scaling test looks roughly like the sketch below.
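A simplified sketch of that scaling test, in case it helps others reproduce it (the GGUF path, tokenizer repo, and tensor-parallel size are illustrative, not my exact setup):

```python
# Rough sketch for measuring where tensor-parallel scaling flattens out.
# Point `model` at your local GGUF file and `tokenizer` at the original HF repo,
# then re-run with a larger tensor_parallel_size and compare tokens/s.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="./qwen2.5-32b-instruct-q4_k_m.gguf",  # local GGUF file (illustrative path)
    tokenizer="Qwen/Qwen2.5-32B-Instruct",       # use the HF tokenizer instead of converting the GGUF one
    tensor_parallel_size=4,                      # double this and re-run to see where the wall is
)
params = SamplingParams(max_tokens=256)

start = time.perf_counter()
out = llm.generate(["Summarize the benefits of tensor parallelism."], params)
elapsed = time.perf_counter() - start
generated = len(out[0].outputs[0].token_ids)
print(f"{generated / elapsed:.1f} generated tokens/s")
```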
-
I run an OpenAI API server with a LLaMA-based model and 128 parallel requests, but I only see about 50% GPU utilization (nvidia-smi). Is that normal? Or is it due to some overhead, such as the tokenizer?
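For context, the load is generated with plain concurrent completion calls, roughly like the sketch below (the base URL, model name, and prompt are placeholders rather than the exact client code):

```python
# Simplified sketch: fire 128 concurrent completion requests at the
# OpenAI-compatible server and watch nvidia-smi while they run.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> int:
    resp = await client.completions.create(
        model="meta-llama/Llama-2-13b-hf",  # placeholder LLaMA-based model
        prompt=f"Request {i}: write a short paragraph about GPU utilization.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    totals = await asyncio.gather(*(one_request(i) for i in range(128)))
    print(f"completed {len(totals)} requests, {sum(totals)} completion tokens")

asyncio.run(main())
```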