Multi-GPU inference and Specify which GPUs to be used during inference #278

EthanC111 · 2023-06-26T04:03:34Z

EthanC111
Jun 26, 2023

I have two questions:

I attempted multi-GPU inference (8 GPU inference on A100) on Llama-13B. I followed the steps described in [https://github.com/CUDA error: out of memory #188], first running $ ray start --head and then llm = LLM(model=<your model>, tensor_parallel_size=8).
However, I got the following error:
(Worker pid=1027546) AssertionError: 32001 is not divisible by 8 [repeated 7x across cluster]
Is there any way to resolve this issue?
Additionally, is there a way to specify which GPUs are used during inference? I tried using os.environ["CUDA_VISIBLE_DEVICES"]="2" but it doesn't seem to work - it continues to use the first GPU.

Thanks!

xcxhy · 2023-06-26T06:46:47Z

xcxhy
Jun 26, 2023

@EthanC111 the second error, you can use CUDA_VISIBLE_DEVICES=2 python test.py like this code on terminal to instead of setting in code.

2 replies

bigmover Aug 1, 2023

I use "CUDA_VISIABLE_DEVICES=1 python -m vllm.entrypoints.api_server " to place server on device 1. But it doesn't work! What should I do to figure it out?

JackYangzg Sep 7, 2023

I found this cannot special using which GPU

zhuohan123 · 2023-06-26T18:37:56Z

zhuohan123
Jun 26, 2023
Maintainer

For the first error, what exactly was the model you use? Did you change the vocabulary size? If your vocabulary size cannot be divided by the number of GPUs, you will see this error.

1 reply

alex-ji-repo Jul 7, 2023

how do we set vocabulary size? is it in the sampling_params?

renfeier · 2023-07-04T09:03:57Z

renfeier
Jul 4, 2023

I attempted multi-GPU inference (8 GPU inference on A100) on Llama-13B. I first running $ ray start --head and then llm = LLM(model="./models/open_llama_13b", tensor_parallel_size=4)
and then i got the flowing output:

2023-07-04 16:57:32,881 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: 30.152.83.176:6379...
[2023-07-04 16:57:32,888 I 186137 186137] global_state_accessor.cc:356: This node has an IP address of 30.152.83.253, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
2023-07-04 16:57:32,901 INFO worker.py:1636 -- Connected to Ray cluster.

it blocked , no more log output
@zhuohan123 what's wrong with me?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-GPU inference and Specify which GPUs to be used during inference #278

{{title}}

Replies: 3 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Multi-GPU inference and Specify which GPUs to be used during inference #278

EthanC111 Jun 26, 2023

Replies: 3 comments · 3 replies

xcxhy Jun 26, 2023

bigmover Aug 1, 2023

JackYangzg Sep 7, 2023

zhuohan123 Jun 26, 2023 Maintainer

alex-ji-repo Jul 7, 2023

renfeier Jul 4, 2023

I attempted multi-GPU inference (8 GPU inference on A100) on Llama-13B. I first running $ ray start --head and then llm = LLM(model="./models/open_llama_13b", tensor_parallel_size=4) and then i got the flowing output:

EthanC111
Jun 26, 2023

Replies: 3 comments 3 replies

xcxhy
Jun 26, 2023

zhuohan123
Jun 26, 2023
Maintainer

renfeier
Jul 4, 2023

I attempted multi-GPU inference (8 GPU inference on A100) on Llama-13B. I first running $ ray start --head and then llm = LLM(model="./models/open_llama_13b", tensor_parallel_size=4)
and then i got the flowing output: