Replies: 2 comments
-
I observed that TP is slightly faster (?). I tested with llama-2-13b-chat-hf, but with TP on 2 GPUs the model takes up an absurd amount of memory, 36.5 GB on each GPU, even though the dtype is bfloat16 ...
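As a rough sketch of the setup described above (not the commenter's actual code), this is how the TP=2 run would look with vLLM's offline `LLM` API. The `gpu_memory_utilization` knob is my assumption about the memory observation: vLLM pre-allocates that fraction of each GPU's memory for KV-cache blocks (default 0.9), so per-GPU usage reflects that cap rather than just the bfloat16 weight shards.

```python
# Sketch, not the original comment's code: llama-2-13b-chat-hf with TP=2.
# gpu_memory_utilization is an assumed explanation for the ~36.5 GB/GPU figure;
# lowering it shrinks the KV-cache pre-allocation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # model from the comment above
    dtype="bfloat16",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    gpu_memory_utilization=0.5,    # cap pre-allocated memory (default is 0.9)
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```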
-
When my model can fit on one GPU, it seems (# GPUs × per-GPU throughput) is always higher than the throughput with TP = # GPUs. So as long as the number of requests is greater than the number of GPUs, I think running a separate process on each GPU gives the best performance. Please correct me if I am wrong.
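A minimal sketch (my illustration, not the commenter's code) of the "separate process per GPU" approach, i.e. data parallelism instead of tensor parallelism: each worker pins itself to one GPU via `CUDA_VISIBLE_DEVICES` and runs an independent vLLM engine with TP=1. The file name, the two-GPU assumption, and the prompt-splitting logic are hypothetical.

```python
# worker.py (hypothetical): one independent vLLM engine per GPU.
import os
import sys

gpu_id = sys.argv[1]                          # e.g. "0" or "1"
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id   # pin this process before importing vllm

from vllm import LLM, SamplingParams

# TP=1: the whole model lives on the single visible GPU.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", dtype="bfloat16")

# Hypothetical request sharding for 2 GPUs: worker i takes prompts i, i+2, i+4, ...
prompts = [f"Question {i}" for i in range(100) if i % 2 == int(gpu_id)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
print(f"GPU {gpu_id}: generated {len(outputs)} completions")

# Launch one copy per GPU, e.g.:
#   python worker.py 0 &
#   python worker.py 1 &
```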
-
Using tensor parallelism vs. using one GPU, which is faster? Should I use TP when the model can fit on one GPU?