How to specify which gpu to use? #691
-
If I have multiple GPUs, how can I specify which GPU to use individually? Previously, e.g. if I wanted to specify this, I used `model_kwargs = {"torch_dtype": torch.float16, 'device_map': 'sequential'}`. Then, I manually set the available memory of
Replies: 10 comments 26 replies
-
Use `CUDA_VISIBLE_DEVICES`.
-
Thank you for the suggestion.
-
In addition to @gesanqiu's method, you can also use Ray's placement group feature for this. Feel free to follow up on this discussion if you run into any issues or have any other suggestions!
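For reference, here is a minimal sketch of the Ray-based idea, using per-actor GPU reservations (a simpler variant of explicit placement groups). The actor class name, model, and prompts are just illustrative, assuming Ray and vLLM are installed:

```python
# Sketch: one vLLM engine per Ray actor, one GPU per actor. Ray sets
# CUDA_VISIBLE_DEVICES inside each actor, so every engine sees a single device.
import ray
from vllm import LLM

ray.init()

@ray.remote(num_gpus=1)
class VLLMWorker:
    def __init__(self, model: str):
        # Inside the actor, the assigned GPU always shows up as cuda:0.
        self.llm = LLM(model=model)

    def generate(self, prompt: str) -> str:
        out = self.llm.generate([prompt])
        return out[0].outputs[0].text

workers = [VLLMWorker.remote("gpt2") for _ in range(2)]
print(ray.get([w.generate.remote(f"Hello from worker {i}")
               for i, w in enumerate(workers)]))
```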
-
A simple solution is adding this to your code (see the sketch below). Then, only GPU 0 and GPU 1 are visible.
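The snippet itself did not survive in this copy of the thread; presumably it sets `CUDA_VISIBLE_DEVICES` before anything initializes CUDA, along these lines (the model name is a placeholder):

```python
import os

# Must run before torch / vllm create a CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="gpt2")  # only GPU 0 and GPU 1 are visible to this process
```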
-
`export CUDA_VISIBLE_DEVICES=1`
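The same idea extends to running several independent vLLM processes, one per card; a hedged example (`my_vllm_script.py` is a made-up script name):

```bash
# Each process only sees the single GPU it was given.
CUDA_VISIBLE_DEVICES=0 python my_vllm_script.py &
CUDA_VISIBLE_DEVICES=1 python my_vllm_script.py &
wait
```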
-
Hi
-
Any solution for this?
-
Hoping to solve it too :lol:
-
Did anyone find a solution? @zhuohan123
-
How does one load multiple models on multiple GPUs with vLLM in a single script using the `LLM` object? I'm trying to load the same model onto different GPUs within a single Python script using vLLM, but I'm encountering an error when initializing the second model.

What I'm Trying to Do:
My Script:

```python
def main():
    import os
    import sys
    import socket
    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/path/to/env/lib/python311.zip',
                    '/path/to/env/lib/python3.11',
                    '/path/to/env/lib/python3.11/lib-dynload',
                    '/path/to/env/lib/python3.11/site-packages',
                    '/path/to/py_src', '/path/to/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Clear GPU cache
    import torch
    import gc
    torch.cuda.empty_cache()
    gc.collect()

    from vllm import LLM
    model = 'gpt2'

    print('Allocating model 1 on GPU 0')
    llm1 = LLM(model=model, device='cuda:0')
    print('Allocating model 2 on GPU 1')
    llm2 = LLM(model=model, device='cuda:1')

    print('About to generate with both...')
    while True:
        prompt = "Hello from GPU 0"
        output = llm1.generate([prompt])
        print(f"Output from llm1: {output[0].outputs[0].text}")

        prompt = "Hello from GPU 1"
        output = llm2.generate([prompt])
        print(f"Output from llm2: {output[0].outputs[0].text}")


if __name__ == '__main__':
    import fire
    import time

    start = time.time()
    fire.Fire(main)
    print(f"Done! Time: {time.time()-start:.2f} sec")
```

How I'm Running the Script:

I run the script with the following command to set the `CUDA_VISIBLE_DEVICES` environment variable:

```bash
CUDA_VISIBLE_DEVICES=2,5 python script.py
```
The Issue:
When I run the script, the first model initializes correctly on cuda:0 (which should correspond to physical GPU 2). However, when initializing the second LLM instance on cuda:1, I encounter the following error:
```
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

Full Error Traceback:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
allocating model 1 gpu1
INFO 09-23 12:38:36 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:37 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-23 12:38:38 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-23 12:38:38 selector.py:33] Using XFormers backend.
INFO 09-23 12:38:39 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:40 model_runner.py:173] Loading model weights took 0.2378 GB
INFO 09-23 12:38:40 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
INFO 09-23 12:38:42 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-23 12:38:42 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-23 12:38:48 model_runner.py:1057] Graph capturing finished in 6 secs.
allocating model 2 gpu2
INFO 09-23 12:38:48 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:48 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:49 model_runner.py:173] Loading model weights took 0.0000 GB
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 99, in <module>
fire.Fire(main)
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 81, in main
llm2 = LLM(model=model, device=f'cuda:1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 118, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
self._initialize_kv_caches()
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/worker/worker.py", line 147, in determine_num_available_blocks
assert peak_memory > 0, (
^^^^^^^^^^^^^^^
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

What I've Tried:
Despite these attempts, the error persists when initializing the second `LLM` instance.

Questions:
Additional Information:
Any insights or suggestions on how to resolve this issue would be greatly appreciated!
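For what it's worth, one workaround consistent with the earlier `CUDA_VISIBLE_DEVICES` replies is to give each engine its own process instead of passing `device='cuda:N'` twice in one process. A minimal sketch, assuming one child process per physical GPU (model name and prompts are placeholders):

```python
# Sketch: one vLLM engine per child process; each child sees a single GPU.
import multiprocessing as mp
import os


def worker(physical_gpu: str, prompt: str) -> None:
    # Must be set before vllm/torch initialize CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = physical_gpu
    from vllm import LLM

    llm = LLM(model="gpt2")  # inside the child, the GPU appears as cuda:0
    out = llm.generate([prompt])
    print(f"GPU {physical_gpu}: {out[0].outputs[0].text}")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # fresh CUDA context per child
    procs = [mp.Process(target=worker, args=(gpu, f"Hello from GPU {gpu}"))
             for gpu in ("2", "5")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```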