Replies: 2 comments
-
I observed that TP is slightly faster (?). I tested with llama-2-13b-chat-hf, but with TP on 2 GPUs the model takes up an absurd amount of memory, 36.5 GB on each GPU, even though the dtype is bfloat16 ...
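As a rough sketch of the setup described above (not the commenter's actual code), this is how the TP=2 run would look with vLLM's offline `LLM` API. The `gpu_memory_utilization` knob is my assumption about the memory observation: vLLM pre-allocates that fraction of each GPU's memory for KV-cache blocks (default 0.9), so per-GPU usage reflects that cap rather than just the bfloat16 weight shards.

```python
# Sketch, not the original comment's code: llama-2-13b-chat-hf with TP=2.
# gpu_memory_utilization is an assumed explanation for the ~36.5 GB/GPU figure;
# lowering it shrinks the KV-cache pre-allocation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # model from the comment above
    dtype="bfloat16",
    tensor_parallel_size=2,        # shard the weights across 2 GPUs
    gpu_memory_utilization=0.5,    # cap pre-allocated memory (default is 0.9)
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```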
-
When my model can fit on one GPU, it seems (# GPUs × per-GPU throughput) is always higher than the throughput with TP = # GPUs. So as long as the number of requests is greater than the number of GPUs, I think running a separate process on each GPU gives the best performance. Please correct me if I am wrong.
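A minimal sketch (my illustration, not the commenter's code) of the "separate process per GPU" approach, i.e. data parallelism instead of tensor parallelism: each worker pins itself to one GPU via `CUDA_VISIBLE_DEVICES` and runs an independent vLLM engine with TP=1. The file name, the two-GPU assumption, and the prompt-splitting logic are hypothetical.

```python
# worker.py (hypothetical): one independent vLLM engine per GPU.
import os
import sys

gpu_id = sys.argv[1]                          # e.g. "0" or "1"
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id   # pin this process before importing vllm

from vllm import LLM, SamplingParams

# TP=1: the whole model lives on the single visible GPU.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", dtype="bfloat16")

# Hypothetical request sharding for 2 GPUs: worker i takes prompts i, i+2, i+4, ...
prompts = [f"Question {i}" for i in range(100) if i % 2 == int(gpu_id)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
print(f"GPU {gpu_id}: generated {len(outputs)} completions")

# Launch one copy per GPU, e.g.:
#   python worker.py 0 &
#   python worker.py 1 &
```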
-
Using tensor parallelism vs. using one GPU, which is faster? Should I use TP when the model can fit on one GPU?