
Error when using DPOTrainer with Multiple GPUs and 8-bit Precision Models #659

Closed
munhouiani opened this issue Aug 18, 2023 · 4 comments

@munhouiani

munhouiani commented Aug 18, 2023

Description:

Hi there,

I've encountered an issue while trying to run DPO training on an AWS g5 instance equipped with 4 A10 GPUs. My training setup closely follows the procedure outlined in the dpo_llama2.py script, except that I use the Llama-2-7B-Chat model rather than the SFT model with PEFT.

The models are loaded using the following code snippet:

import torch
from transformers import AutoModelForCausalLM

# Policy model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
    trust_remote_code=True,
)

# Reference model on GPU 1
ref_model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": 1},
    trust_remote_code=True,
)

I explicitly loaded the models onto two separate GPUs, as they are too large to fit within a single A10 GPU.

However, upon attempting to create a DPOTrainer instance, I encountered the following error message:

You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example
`device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`
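For reference, what that hint amounts to is loading the quantized model onto whichever GPU the current process owns instead of pinning it to a fixed index. A minimal sketch of that loading pattern, assuming the script is launched with accelerate launch (the device lookup here is illustrative, not taken from the script above):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Each process loads its own copy of the 8-bit model onto the GPU it owns,
# as the error message above suggests.
device_index = Accelerator().local_process_index  # or torch.cuda.current_device()

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": device_index},
    trust_remote_code=True,
)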

Query:
What would be the correct approach to implementing DPO with multiple GPUs?

For reference, here is a list of the relevant installed packages:

accelerate==0.21.0
bitsandbytes==0.41.1
einops==0.6.1
transformers==4.31.0
trl==0.5.0

Thank you for your assistance.

@lvwerra
Member

lvwerra commented Aug 18, 2023

Having the active and ref model on different GPUs is not supported as far as I know; it would lead to all sorts of additional issues, since we would need to move tensors around. However, with #640 you should be able to load just one model and activate/deactivate the adapters to switch between the active and reference model. Hope this helps!
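For context, the adapter-switching trick works because with a LoRA/PEFT model the frozen base weights can double as the reference model: disabling the adapter gives the reference forward pass. A rough sketch of the idea (peft_model and input_ids are placeholders, and this shows the concept rather than the exact TRL internals):

import torch

# Adapter enabled: forward pass of the trainable policy model.
policy_logits = peft_model(input_ids).logits

# Adapter disabled: the same underlying weights act as the frozen reference model.
with torch.no_grad(), peft_model.disable_adapter():
    ref_logits = peft_model(input_ids).logits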

@munhouiani
Author

munhouiani commented Aug 18, 2023

Thank you for your prompt response! After installing trl from the main branch and setting ref_model to None, a new error has surfaced.

I have now switched to loading the model in 4-bit, and I can successfully create a DPOTrainer instance:

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import DPOTrainer

# 4-bit NF4 quantization for the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# LoRA adapter configuration
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "out_proj",
        "fc_in",
        "fc_out",
        "wte",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# ref_model is left as the default None, so the adapter-disabling path
# from #640 supplies the reference model.
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    beta=beta,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=max_prompt_length,
    max_length=max_length,
)

However, upon executing dpo_trainer.train(), the following error message is displayed:

Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

After setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL, it shows more detail:

Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

@lvwerra
Member

lvwerra commented Aug 21, 2023

Could be related to #480 @younesbelkada ?

@lewtun
Member

lewtun commented Sep 1, 2023

@munhouiani do you have gradient checkpointing activated? If yes, you can try disabling it to bypass the above error (at the expense of needing ~2x more VRAM).
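For completeness, a minimal sketch of what that change looks like on the TrainingArguments side (the values here are placeholders; only gradient_checkpointing matters for this error):

from transformers import TrainingArguments

# Turning gradient checkpointing off avoids the reentrant backward pass that
# makes DDP see the same LoRA parameter as "ready" twice, at the cost of
# roughly doubling activation memory.
training_args = TrainingArguments(
    output_dir="dpo_output",          # placeholder
    per_device_train_batch_size=1,    # placeholder
    gradient_checkpointing=False,     # the change suggested above
    remove_unused_columns=False,      # commonly set to False in the DPO examples
)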
