
StackLLaMA 2 DPO training failed: 8-bit model can't train with multiple GPUs #1348

Closed
fancyerii opened this issue Feb 22, 2024 · 6 comments

Comments

@fancyerii
Contributor

fancyerii commented Feb 22, 2024

I am following https://github.com/huggingface/trl/tree/main/examples/research_projects/stack_llama_2/scripts.

I ran with:

accelerate launch --config_file 7b.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py     --model_name_or_path="sft/final_checkpoint"     --output_dir="dpo"

7b.yaml file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

error message:

Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 213, in <module>
    dpo_trainer.train()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1227, in prepare
    result = tuple(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1228, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1104, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1330, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

I searched for this issue, and I am not sure whether it is using naive pipeline parallelism (naive PP).

my environment:

transformers             4.37.2
accelerate               0.26.1
peft                     0.8.2
bitsandbytes             0.43.0.dev0 # latest built from source
trl                      0.7.11.dev0 # latest built from source
torch                    2.2.0
python 3.9.18
@younesbelkada
Contributor

Hi @fancyerii
Thanks for the issue!
You first need to install the latest version of accelerate from PyPI (pip install -U accelerate) and load the model with device_map={"": Accelerator().process_index}, similarly to:

device_map={"": Accelerator().local_process_index},

If that works, would you be happy to submit a fix through a PR on that DPO script?
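
A minimal sketch of how that suggestion could look where the script loads the policy model (the load_in_8bit flag and the checkpoint path are illustrative assumptions, not necessarily the exact arguments in dpo_llama2.py):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Place each process's quantized model on that process's own GPU instead of
# sharding it across devices; sharding is what triggers the ValueError above
# when training with DDP on multiple GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "sft/final_checkpoint",  # checkpoint path from the command above
    load_in_8bit=True,       # assumption: the quantization flag used by the script
    device_map={"": Accelerator().local_process_index},
)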

@fancyerii
Contributor Author

fancyerii commented Feb 22, 2024

Hi @fancyerii Thanks for the issue! You first need to install the latest version of accelerate from PyPI (pip install -U accelerate) and load the model with device_map={"": Accelerator().process_index}, similarly to:

device_map={"": Accelerator().local_process_index},

If that works, would you be happy to submit a fix through a PR on that DPO script?

I upgraded accelerate to 0.27.2, but it failed with a new error:

Traceback (most recent call last):
  File "/nas/lili/codes/pt/ft/trl/examples/research_projects/stack_llama_2/scripts/dpo_llama2.py", line 215, in <module>
    dpo_trainer.train()
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/transformers/trainer.py", line 2781, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/accelerate/accelerator.py", line 1966, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/ft-zSqjAXBp-py3.9/lib/python3.9/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

@fancyerii
Contributor Author

I searched for this issue, and after I added the following line, it worked:

training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)
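
For context, a minimal sketch of how this setting fits around the script's TrainingArguments (output_dir and gradient_checkpointing are shown here as illustrative assumptions):

from transformers import TrainingArguments

# Non-reentrant activation checkpointing avoids the DDP error above, where the
# same parameter is marked ready twice during a single backward pass.
training_args = TrainingArguments(
    output_dir="dpo",  # assumption: matches the --output_dir passed on the command line
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)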

@younesbelkada
Contributor

Indeed! Adding that line should solve it!
Would you like to submit a PR with all the fixes?

fancyerii added a commit to fancyerii/trl that referenced this issue Feb 22, 2024
fix "ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on."
see huggingface#1348
@fancyerii
Contributor Author

Indeed! Adding that line should solve it! Would you like to submit a PR with all the fixes?

I have submitted a PR.

@younesbelkada
Contributor

Thanks so much @fancyerii !

fancyerii pushed a commit to fancyerii/trl that referenced this issue Feb 22, 2024
younesbelkada pushed a commit that referenced this issue Feb 23, 2024
* fix 8-bit multi-gpu training bug see #1348

* Update dpo_llama2.py

make gradient_checkpointing_kwargs configurable.

* Update dpo_llama2.py

remove unnecessary config of device_map

* format with make precommit

---------

Co-authored-by: ubuntu <[email protected]>
kashif closed this as completed Feb 23, 2024
lapp0 pushed a commit to lapp0/trl that referenced this issue May 10, 2024