I can train stack_llama_2 with 8 GPUs using DDP, for which I have to pass `{"device_map": {"": Accelerator().local_process_index}}`; detailed info can be found here.
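For reference, a minimal sketch of how the model gets loaded for the DDP run (the checkpoint name is just a placeholder for the SFT model I actually load):

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Under DDP each process holds a full copy of the model, so the device_map
# pins the whole model ("" = root module) to this process's local GPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; in practice this is the SFT checkpoint
    device_map={"": Accelerator().local_process_index},
)
```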
I want to use DeepSpeed stage 3 to train it, because I will train a 70B model later and a model that large can't be trained with DDP. So I ran it with deepspeed_zero3.yaml, and it failed with:
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 140, in <module>
model = AutoModelForCausalLM.from_pretrained(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2992, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
As in [this issue](#1348), I passed `device_map` to `AutoModelForCausalLM.from_pretrained`, but it seems DeepSpeed Zero-3 is not compatible with passing a `device_map`.
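One way I could keep a single script working for both runs would be to pass the `device_map` only when ZeRO-3 is not active. This is just a sketch (the checkpoint name is a placeholder, and on older transformers versions the import lives under `transformers.deepspeed`):

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# is_deepspeed_zero3_enabled() is the same check behind the ValueError above,
# so it tells us whether a device_map may be passed at all.
model_kwargs = {}
if not is_deepspeed_zero3_enabled():
    # DDP case: pin the full model to this process's GPU.
    model_kwargs["device_map"] = {"": Accelerator().local_process_index}
# ZeRO-3 case: leave device_map unset so DeepSpeed can partition the weights.

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder for the actual checkpoint
    **model_kwargs,
)
```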
So I removed this parameter, but then it ran OOM:
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
dpo_trainer = DPOTrainer(
File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 469.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Process 70018 has 4.62 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 214, in <module>
dpo_trainer = DPOTrainer(
File "/nas/lili/codes/pt/ft/trl/trl/trainer/dpo_trainer.py", line 234, in __init__
model = prepare_model_for_kbit_training(model, **prepare_model_kwargs)
File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/peft/utils/other.py", line 105, in prepare_model_for_kbit_training
param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 477.00 MiB is free. Process 70022 has 4.62 GiB memory in use. Process 70020 has 4.61 GiB memory in use. Process 70021 has 4.62 GiB memory in use. Including non-PyTorch memory, this process has 4.61 GiB memory in use. Process 70019 has 4.62 GiB memory in use. Process 70015 has 5.59 GiB memory in use. Process 70017 has 4.62 GiB memory in use. Process 70016 has 5.59 GiB memory in use. Of the allocated memory 3.99 GiB is allocated by PyTorch, and 145.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
I have 8 A100 40GB GPUs, which I think is enough for llama2-7b. I also checked the yaml config: it has `zero3_init_flag: true`, so I expected each rank to load only its own shard of the parameters rather than the whole model ending up on a single GPU/device.
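For context, the relevant fields of the accelerate config look roughly like this (I launch with something like `accelerate launch --config_file deepspeed_zero3.yaml dpo_llama2.py ...`; the actual deepspeed_zero3.yaml in the repo may differ in the other fields):

```yaml
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 8
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true
  offload_optimizer_device: none
  offload_param_device: none
```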
But in `peft/utils/other.py`, `prepare_model_for_kbit_training` does this:

```python
if not is_gptq_quantized:
    # cast all non INT8 parameters to fp32
    for param in model.parameters():
        if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
            param.data = param.data.to(torch.float32)
```
so it seems PEFT needs to cast the bf16 parameters to fp32. While it ran, I saw 8 processes all running on GPU 0; GPU 0's memory was exhausted and the run failed.
So I guess PEFT doesn't keep the parameters sharded across the 8 GPUs but loads them all onto a single GPU.
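To confirm where the weights actually end up, this is a small diagnostic I'd drop in right after `from_pretrained` (the helper name is mine; `ds_id` is the attribute DeepSpeed attaches to parameters it has partitioned under `zero.Init`):

```python
import torch.distributed as dist

def report_sharding(model):
    # Parameters partitioned by DeepSpeed ZeRO-3 (zero.Init) carry a `ds_id`
    # attribute; if none have it, or every parameter sits on cuda:0, the model
    # was not actually sharded before PEFT's fp32 upcast ran.
    params = list(model.parameters())
    n_partitioned = sum(hasattr(p, "ds_id") for p in params)
    devices = {str(p.device) for p in params}
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {n_partitioned}/{len(params)} params ZeRO-3 partitioned, devices: {devices}")
```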