Deploying microsoft/Phi-3-medium-128k-instruct on more than 2 GPUs #5500

Ala-Na · 2024-06-13T10:47:40Z

Ala-Na
Jun 13, 2024

Hello there :)

I'm trying to deploy microsoft/Phi-3-medium-128k-instruct on NVIDIA L4 GPU with the latest version of VLLM (0.5.0).

I tried with 4 GPUs using the cli command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-medium-128k-instruct --trust-remote-code --port 8000 --tensor-parallel-size 4

But this throw an error as the number of KV heads (10) is not a multiple of 4.
Error details :

assert self.total_num_kv_heads % tp_size == 0, f"tp_size {tp_size} total_num_kv_heads {self.total_num_kv_heads} total_num_heads {self.total_num_heads}"
AssertionError

From my understanding, Phi-3-medium use the Phi3ForCasualLM architecture which is treated as a Llama model by VLLM.
The attention layer of this model throw this error as it tries to distribute the KV heads across multiple tensor parallel GPUs.

I can't use 2 GPUs only as there too low on memory and only have access to L4 GPUs for the moment.

If anyone has an idea on how to make this works, I'm all ears :)
Thanks

ccruttjr · 2024-07-11T01:50:18Z

ccruttjr
Jul 11, 2024

Bumping. vllm/model_executor/models/llama.py relies on these assert statements

assert self.total_num_heads % tp_size == 0
# ...
assert self.total_num_kv_heads % tp_size == 0
# or
assert tp_size % self.total_num_kv_heads == 0

And I also believe serving requires vocab size to be divisible by tp? And hidden size? And hidden layers? The issue is that

total_num_heads = 40
total_num_kv_heads = 10
vocab_size = 32064
hidden_size = 5120
num_hidden_layers = 40

Again, not sure about the vocab size, but, if that is the case, only rigs with 2 GPUs would work. If vocab size doesn't matter, we'd have to have either 2, 10, 20, or 40 GPUs

1 reply

ccruttjr Jul 11, 2024

Referenced in #5951

hao-fang · 2024-08-07T17:52:44Z

hao-fang
Aug 7, 2024

According to https://docs.vllm.ai/en/stable/serving/distributed_serving.html#multi-node-inference-and-serving

There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.

You can also additionally specify --pipeline-parallel-size to enable pipeline parallelism. For example, to run API server on 8 GPUs with pipeline parallelism and tensor parallelism:

vllm serve gpt2 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2

So I think you would need to use --pipeline-parallel-size in this case.

0 replies

jgen1 · 2024-08-13T18:49:26Z

jgen1
Aug 13, 2024

@ccruttjr did you ever find a solution to this issue? I am running into the same issue with Phi-3-medium-128k-instruct. I have 4 Nvidia Tesla T4s, which have 16 GB RAM each. Using dtype float16, the model should only be 26.0 GB, so it should fit on these GPUs easily. tensor-parallel-size 4 gives me this error
assert self.total_num_kv_heads % tp_size == 0
mentioned at the top of the post. I have also tried --tensor-parallel-size 2, with and without --pipeline-parallel-size 2, but both times I got CUDA out of memory errors because the model can't fit on just 2 of these GPUs. So is there any solution to using 4 GPUs for this model?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deploying microsoft/Phi-3-medium-128k-instruct on more than 2 GPUs #5500

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Deploying microsoft/Phi-3-medium-128k-instruct on more than 2 GPUs #5500

Ala-Na Jun 13, 2024

Replies: 3 comments · 1 reply

ccruttjr Jul 11, 2024

ccruttjr Jul 11, 2024

hao-fang Aug 7, 2024

jgen1 Aug 13, 2024

Ala-Na
Jun 13, 2024

Replies: 3 comments 1 reply

ccruttjr
Jul 11, 2024

hao-fang
Aug 7, 2024

jgen1
Aug 13, 2024