
How to Properly Resume Multi-GPU Training with accelerate launch Without OOM or Loss Issues? #3260

Open
tqxg2018 opened this issue Nov 25, 2024 · 0 comments


I encountered an issue while running multi-GPU training with accelerate launch. I am training on 4 GPUs, and during training I save the training state with:

accelerator.save_state(state_path)
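
For context, my checkpointing step looks roughly like this (simplified sketch; step and save_every are placeholder names from my script):

if step % save_every == 0:
    # barrier so no rank is still mid-step when the checkpoint is written
    accelerator.wait_for_everyone()
    # called on every rank; as I understand it, Accelerate decides internally
    # which rank(s) actually write which files
    accelerator.save_state(state_path)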

Later, I attempt to resume training by loading the model parameters with:

accelerator.load_state(state_path)

However, when I resume training, several extra processes appear on the first GPU, which causes an out-of-memory (OOM) error, as shown in the attached screenshots.

To address this, I tried guarding the call to accelerator.load_state(state_path) so that only the main process loads the checkpoint. The updated code looks like this:

if self.accelerator.is_main_process:
    self.accelerator.load_state(state_path)

I then used:

accelerator.wait_for_everyone()

afterward, expecting it to synchronize the model state across all four GPUs. This resolved the extra processes on the first GPU, but the model's loss increases significantly after resuming; it seems the restored weights are not actually propagated to all GPUs. The combined resume logic is sketched below.
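
Putting both changes together, my resume path currently looks like this (simplified):

# current (problematic) resume logic, simplified
if self.accelerator.is_main_process:
    # only rank 0 restores the checkpoint here
    self.accelerator.load_state(state_path)
# as far as I can tell this is only a barrier -- it does not broadcast the
# restored weights, so the other ranks may keep their un-restored parameters
self.accelerator.wait_for_everyone()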

Could anyone please suggest how to correctly resume training in a multi-GPU setup with accelerate launch, ensuring the model weights are properly loaded and synchronized across all devices? Thank you!
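
For reference, the pattern I originally expected to work is roughly the following (sketch only; model, optimizer, train_loader, and resume_path are placeholders from my script):

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

# my understanding from the Accelerate docs is that load_state should be
# called on every rank, after prepare(), so each process restores its own
# copy on its own device
if resume_path is not None:
    accelerator.load_state(resume_path)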

[Two screenshots attached]
