Replies: 3 comments 3 replies
-
@EthanC111 For the second error, you can use CUDA_VISIBLE_DEVICES, but it has to be set before the process starts (and before CUDA is initialized), not from inside the script.
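A minimal sketch of that approach, assuming the intent is to pin the run to GPU 2 by setting CUDA_VISIBLE_DEVICES before anything initializes CUDA; the model path is just the one quoted later in this thread:

```python
import os

# Must be set before importing torch/vllm (or starting Ray); setting it
# afterwards has no effect because CUDA has already enumerated the devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

from vllm import LLM  # import only after the variable is set

llm = LLM(model="./models/open_llama_13b")  # example path from this thread
```

Equivalently, the variable can be set on the command line when launching the script, e.g. CUDA_VISIBLE_DEVICES=2 python my_script.py.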
2 replies
-
For the first error, what exactly was the model you used? Did you change the vocabulary size? If your vocabulary size is not divisible by the number of GPUs, you will see this error.
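A sketch of that constraint and one possible workaround (padding the embedding to a multiple of the GPU count with Hugging Face transformers before loading the checkpoint in vLLM); the paths and numbers below are only the ones quoted in this thread, not an official recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tensor_parallel_size = 8
model_path = "./models/open_llama_13b"   # example path from this thread

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

vocab_size = model.get_input_embeddings().weight.shape[0]   # e.g. 32001
if vocab_size % tensor_parallel_size != 0:
    # Round the vocabulary up to the next multiple of the GPU count,
    # mirroring the divisibility check that raised the AssertionError.
    padded = ((vocab_size + tensor_parallel_size - 1)
              // tensor_parallel_size) * tensor_parallel_size
    model.resize_token_embeddings(padded)

model.save_pretrained("./models/open_llama_13b_padded")
tokenizer.save_pretrained("./models/open_llama_13b_padded")
```

The padded checkpoint can then be passed to LLM(...) in place of the original path.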
1 reply
-
I attempted multi-GPU inference on Llama-13B (8 A100 GPUs). I first ran $ ray start --head and then llm = LLM(model="./models/open_llama_13b", tensor_parallel_size=4).
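For reference, a minimal end-to-end sketch of this launch (assuming ray start --head has already been run on the node; the path and tensor_parallel_size are the ones quoted above, and whether they work depends on the vocabulary-size constraint discussed in the previous reply):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./models/open_llama_13b",   # example path from this thread
    tensor_parallel_size=4,            # must divide the model's vocab size
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```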
0 replies
-
I have two questions:
I attempted multi-GPU inference on Llama-13B (8 A100 GPUs). I followed the steps described in #188 (CUDA error: out of memory), first running
$ ray start --head
and then llm = LLM(model=<your model>, tensor_parallel_size=8).
However, I got the following error:
(Worker pid=1027546) AssertionError: 32001 is not divisible by 8 [repeated 7x across cluster]
Is there any way to resolve this issue?
Additionally, is there a way to specify which GPUs are used during inference? I tried using
os.environ["CUDA_VISIBLE_DEVICES"]="2"
but it doesn't seem to work - it continues to use the first GPU.
Thanks!