I encountered an issue while running multi-GPU training using accelerate launch. I am using 4 GPUs for training, and during the process, I save my model state using:
accelerator.save_state(state_path)
Later, I attempt to resume training by loading the model parameters with:
accelerator.load_state(state_path)
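For context, the overall pattern I'm following looks roughly like this (a simplified sketch with a toy model; checkpoint_dir stands in for my actual state_path, and the real training loop is omitted):

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and optimizer just to illustrate the flow; my real setup is larger.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

checkpoint_dir = "checkpoints/step_1000"  # placeholder for state_path

# During training: every process calls save_state, and Accelerate writes the
# model, optimizer, and RNG states into checkpoint_dir.
accelerator.save_state(checkpoint_dir)

# When resuming in a new run: build and prepare() the same objects first,
# then call load_state on every process.
accelerator.load_state(checkpoint_dir)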
However, when I start training again, I observe multiple strange processes appearing on the first GPU, which cause an OOM (out-of-memory) error, as shown in the attached figure.
To address this, I tried adding an extra line before loading the state, and then used a call afterward to synchronize the model state across all four GPUs. While this resolved the issue of multiple processes on the first GPU, the model's loss increases significantly after resuming. It seems that the trained weights are not being properly synchronized across all GPUs.
Could anyone please suggest how to correctly resume training in a multi-GPU setup with accelerate launch, ensuring the model weights are properly loaded and synchronized across all devices? Thank you!