How to specify which gpu to use? #691
-
If I have multiple GPUs, how can I specify which GPU to use individually? Previously, e.g. if I wanted to specify this, I used `model_kwargs = {"torch_dtype": torch.float16, 'device_map': 'sequential'}`. Then, I manually set the available memory of
Replies: 10 comments 26 replies
-
Use `CUDA_VISIBLE_DEVICES`.
-
Thank you for the suggestion.
-
In addition to @gesanqiu's method, you can also use Ray's placement group feature for this. Feel free to follow up on this discussion if you run into any issues or have any other suggestions!
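For reference, here is a minimal sketch of the Ray-based idea, using per-actor GPU reservations (a simpler variant of explicit placement groups). The actor class name, model, and prompts are just illustrative, assuming Ray and vLLM are installed:

```python
# Sketch: one vLLM engine per Ray actor, one GPU per actor. Ray sets
# CUDA_VISIBLE_DEVICES inside each actor, so every engine sees a single device.
import ray
from vllm import LLM

ray.init()

@ray.remote(num_gpus=1)
class VLLMWorker:
    def __init__(self, model: str):
        # Inside the actor, the assigned GPU always shows up as cuda:0.
        self.llm = LLM(model=model)

    def generate(self, prompt: str) -> str:
        out = self.llm.generate([prompt])
        return out[0].outputs[0].text

workers = [VLLMWorker.remote("gpt2") for _ in range(2)]
print(ray.get([w.generate.remote(f"Hello from worker {i}")
               for i, w in enumerate(workers)]))
```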
-
A simple solution is adding this to your code (see the sketch below). Then, only GPU 0 and GPU 1 are visible.
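The snippet itself did not survive in this copy of the thread; presumably it sets `CUDA_VISIBLE_DEVICES` before anything initializes CUDA, along these lines (the model name is a placeholder):

```python
import os

# Must run before torch / vllm create a CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="gpt2")  # only GPU 0 and GPU 1 are visible to this process
```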
-
`export CUDA_VISIBLE_DEVICES=1`
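The same idea extends to running several independent vLLM processes, one per card; a hedged example (`my_vllm_script.py` is a made-up script name):

```bash
# Each process only sees the single GPU it was given.
CUDA_VISIBLE_DEVICES=0 python my_vllm_script.py &
CUDA_VISIBLE_DEVICES=1 python my_vllm_script.py &
wait
```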
-
Hi
-
Any solution for this?
-
Hoping to solve it too :lol:
-
Did anyone find a solution? @zhuohan123
-
How does one load multiple models on multiple GPUs with vLLM in a single script using the `LLM` object? I'm trying to load the same model onto different GPUs within a single Python script using vLLM, but I'm encountering an error when initializing the second model.

What I'm Trying to Do:
My Script:

```python
def main():
    import os
    import sys
    import socket
    print(sys.executable)
    if socket.gethostname() == 'skampere1':
        print('Hardcoding the path since we are in skampere')
        sys.path = ['', '/path/to/env/lib/python311.zip',
                    '/path/to/env/lib/python3.11',
                    '/path/to/env/lib/python3.11/lib-dynload',
                    '/path/to/env/lib/python3.11/site-packages',
                    '/path/to/py_src', '/path/to/ultimate-utils/py_src']
        print(f'{sys.path=}')

    # Clear GPU cache
    import torch
    import gc
    torch.cuda.empty_cache()
    gc.collect()

    from vllm import LLM
    model = 'gpt2'

    print('Allocating model 1 on GPU 0')
    llm1 = LLM(model=model, device='cuda:0')
    print('Allocating model 2 on GPU 1')
    llm2 = LLM(model=model, device='cuda:1')

    print('About to generate with both...')
    while True:
        prompt = "Hello from GPU 0"
        output = llm1.generate([prompt])
        print(f"Output from llm1: {output[0].outputs[0].text}")

        prompt = "Hello from GPU 1"
        output = llm2.generate([prompt])
        print(f"Output from llm2: {output[0].outputs[0].text}")


if __name__ == '__main__':
    import fire
    import time

    start = time.time()
    fire.Fire(main)
    print(f"Done! Time: {time.time()-start:.2f} sec")
```

How I'm Running the Script:

I run the script with the following command to set the `CUDA_VISIBLE_DEVICES` environment variable:

```bash
CUDA_VISIBLE_DEVICES=2,5 python script.py
```
The Issue:
When I run the script, the first model initializes correctly on cuda:0 (which should correspond to physical GPU 2). However, when initializing the second LLM instance on cuda:1, I encounter the following error:
```
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

Full Error Traceback:

```bash
(beyond_scale_2) brando9@skampere1~/snap-cluster-setup $ CUDA_VISIBLE_DEVICES=2,5 python ~/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/bin/python
Hardcoding the path since we are in skampere
sys.path=['', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python311.zip', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/lib-dynload', '/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages', '/afs/cs.stanford.edu/u/brando9/beyond-scale-2-alignment-coeff/py_src', '/afs/cs.stanford.edu/u/brando9/ultimate-utils/py_src']
allocating model 1 gpu1
INFO 09-23 12:38:36 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:37 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-23 12:38:38 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-23 12:38:38 selector.py:33] Using XFormers backend.
INFO 09-23 12:38:39 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:40 model_runner.py:173] Loading model weights took 0.2378 GB
INFO 09-23 12:38:40 gpu_executor.py:119] # GPU blocks: 127654, # CPU blocks: 7281
INFO 09-23 12:38:42 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-23 12:38:42 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-23 12:38:48 model_runner.py:1057] Graph capturing finished in 6 secs.
allocating model 2 gpu2
INFO 09-23 12:38:48 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda:1, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
INFO 09-23 12:38:48 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-23 12:38:49 model_runner.py:173] Loading model weights took 0.0000 GB
Traceback (most recent call last):
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 99, in <module>
fire.Fire(main)
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/beyond-scale-2-alignment-coeff/py_src/alignment/synth_data_gen/af/gen_synth_data.py", line 81, in main
llm2 = LLM(model=model, device=f'cuda:1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 118, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
self._initialize_kv_caches()
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/lfs/skampere1/0/brando9/miniconda/envs/beyond_scale_2/lib/python3.11/site-packages/vllm/worker/worker.py", line 147, in determine_num_available_blocks
assert peak_memory > 0, (
^^^^^^^^^^^^^^^
AssertionError: Error in memory profiling. This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
```

What I've Tried:
Despite these attempts, the error persists when initializing the second `LLM` instance.

Questions:
Additional Information:
Any insights or suggestions on how to resolve this issue would be greatly appreciated!
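For what it's worth, one workaround consistent with the earlier `CUDA_VISIBLE_DEVICES` replies is to give each engine its own process instead of passing `device='cuda:N'` twice in one process. A minimal sketch, assuming one child process per physical GPU (model name and prompts are placeholders):

```python
# Sketch: one vLLM engine per child process; each child sees a single GPU.
import multiprocessing as mp
import os


def worker(physical_gpu: str, prompt: str) -> None:
    # Must be set before vllm/torch initialize CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = physical_gpu
    from vllm import LLM

    llm = LLM(model="gpt2")  # inside the child, the GPU appears as cuda:0
    out = llm.generate([prompt])
    print(f"GPU {physical_gpu}: {out[0].outputs[0].text}")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # fresh CUDA context per child
    procs = [mp.Process(target=worker, args=(gpu, f"Hello from GPU {gpu}"))
             for gpu in ("2", "5")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```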