
How to Properly Resume Multi-GPU Training with accelerate launch Without OOM or Loss Issues? #3260

Open
tqxg2018 opened this issue Nov 25, 2024 · 0 comments


I encountered an issue while running multi-GPU training with accelerate launch. I am training on 4 GPUs, and during training I save the training state with:

accelerator.save_state(state_path)
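
For context, my checkpointing step looks roughly like this (simplified sketch; step and save_every are placeholder names from my script):

if step % save_every == 0:
    # barrier so no rank is still mid-step when the checkpoint is written
    accelerator.wait_for_everyone()
    # called on every rank; as I understand it, Accelerate decides internally
    # which rank(s) actually write which files
    accelerator.save_state(state_path)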

Later, I attempt to resume training by loading the model parameters with:

accelerator.load_state(state_path)

However, when I resume training, several extra processes appear on the first GPU, which causes an out-of-memory (OOM) error, as shown in the attached screenshots.

To address this, I tried guarding the call to accelerator.load_state(state_path) so that only the main process loads the checkpoint. The updated code looks like this:

if self.accelerator.is_main_process:
    self.accelerator.load_state(state_path)

I then used:

accelerator.wait_for_everyone()

afterward, expecting it to synchronize the model state across all four GPUs. This resolved the extra processes on the first GPU, but the model's loss increases significantly after resuming; it seems the restored weights are not actually propagated to all GPUs. The combined resume logic is sketched below.
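
Putting both changes together, my resume path currently looks like this (simplified):

# current (problematic) resume logic, simplified
if self.accelerator.is_main_process:
    # only rank 0 restores the checkpoint here
    self.accelerator.load_state(state_path)
# as far as I can tell this is only a barrier -- it does not broadcast the
# restored weights, so the other ranks may keep their un-restored parameters
self.accelerator.wait_for_everyone()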

Could anyone please suggest how to correctly resume training in a multi-GPU setup with accelerate launch, ensuring the model weights are properly loaded and synchronized across all devices? Thank you!
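
For reference, the pattern I originally expected to work is roughly the following (sketch only; model, optimizer, train_loader, and resume_path are placeholders from my script):

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

# my understanding from the Accelerate docs is that load_state should be
# called on every rank, after prepare(), so each process restores its own
# copy on its own device
if resume_path is not None:
    accelerator.load_state(resume_path)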

[Two screenshots attached]
