
Error when using DPOTrainer with Multiple GPUs and 8-bit Precision Models #659

Closed
munhouiani opened this issue Aug 18, 2023 · 4 comments

@munhouiani

munhouiani commented Aug 18, 2023

Description:

Hi there,

I've encountered an issue while trying to run DPO training on an AWS g5 instance equipped with 4 A10 GPUs. My training setup closely follows the procedure outlined in the dpo_llama2.py script, except that I use the Llama-2-7B-Chat model rather than the SFT model with PEFT.

The models are loaded using the following code snippet:

import torch
from transformers import AutoModelForCausalLM

# Policy model on GPU 0
model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
    trust_remote_code=True,
)

# Reference model on GPU 1
ref_model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": 1},
    trust_remote_code=True,
)

I explicitly loaded the models onto two separate GPUs, as they are too large to fit within a single A10 GPU.

However, upon attempting to create a DPOTrainer instance, I encountered the following error message:

You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example
`device_map={'': torch.cuda.current_device()}` or `device_map={'': torch.xpu.current_device()}`
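For reference, what that hint amounts to is loading the quantized model onto whichever GPU the current process owns instead of pinning it to a fixed index. A minimal sketch of that loading pattern, assuming the script is launched with accelerate launch (the device lookup here is illustrative, not taken from the script above):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Each process loads its own copy of the 8-bit model onto the GPU it owns,
# as the error message above suggests.
device_index = Accelerator().local_process_index  # or torch.cuda.current_device()

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map={"": device_index},
    trust_remote_code=True,
)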

Query:
What would be the correct approach to implementing DPO with multiple GPUs?

For reference, here is a list of the relevant installed packages:

accelerate==0.21.0
bitsandbytes==0.41.1
einops==0.6.1
transformers==4.31.0
trl==0.5.0

Thank you for your assistance.

@lvwerra
Member

lvwerra commented Aug 18, 2023

Having the active and ref model on different GPUs is not supported as far as I know; it would lead to all sorts of additional issues, since we would need to move tensors around. However, with #640 you should be able to load just one model and activate/deactivate the adapters to switch between the active and reference model. Hope this helps!
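For context, the adapter-switching trick works because with a LoRA/PEFT model the frozen base weights can double as the reference model: disabling the adapter gives the reference forward pass. A rough sketch of the idea (peft_model and input_ids are placeholders, and this shows the concept rather than the exact TRL internals):

import torch

# Adapter enabled: forward pass of the trainable policy model.
policy_logits = peft_model(input_ids).logits

# Adapter disabled: the same underlying weights act as the frozen reference model.
with torch.no_grad(), peft_model.disable_adapter():
    ref_logits = peft_model(input_ids).logits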

@munhouiani
Author

munhouiani commented Aug 18, 2023

Thank you for your prompt response! After installing trl from the main branch and setting ref_model to None, a new error has surfaced.

I have now switched to loading the model in 4-bit, and I can successfully create a DPOTrainer instance:

import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import DPOTrainer

# 4-bit NF4 quantization for the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# LoRA adapter configuration
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "out_proj",
        "fc_in",
        "fc_out",
        "wte",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# ref_model is left as the default None, so the adapter-disabling path
# from #640 supplies the reference model.
dpo_trainer = DPOTrainer(
    model,
    args=training_args,
    beta=beta,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_prompt_length=max_prompt_length,
    max_length=max_length,
)

However, upon executing dpo_trainer.train(), the following error message is displayed:

Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

After setting the environment variable TORCH_DISTRIBUTED_DEBUG=DETAIL, it shows more detail:

Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration.

@lvwerra
Member

lvwerra commented Aug 21, 2023

Could be related to #480 @younesbelkada ?

@lewtun
Member

lewtun commented Sep 1, 2023

@munhouiani do you have gradient checkpointing activated? If yes, you can try disabling it to bypass the above error (at the expense of needing ~2x more VRAM).
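For completeness, a minimal sketch of what that change looks like on the TrainingArguments side (the values here are placeholders; only gradient_checkpointing matters for this error):

from transformers import TrainingArguments

# Turning gradient checkpointing off avoids the reentrant backward pass that
# makes DDP see the same LoRA parameter as "ready" twice, at the cost of
# roughly doubling activation memory.
training_args = TrainingArguments(
    output_dir="dpo_output",          # placeholder
    per_device_train_batch_size=1,    # placeholder
    gradient_checkpointing=False,     # the change suggested above
    remove_unused_columns=False,      # commonly set to False in the DPO examples
)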
