Fine-tuning the Mistral instruct model with trl v0.7.2 causes a CUDA OOM error on an A100 GPU on Colab regardless of batch size, gradient accumulation steps, or precision (fp16 or bf16). Leaving all hyperparameters unchanged but reverting trl to v0.7.1, I'm able to run the fine-tuning job without any errors.
Hi @mallaham
I believe #725 caused the issue. Before that PR, gradient checkpointing was enabled by default; that is no longer the case, so you need to explicitly pass gradient_checkpointing=True in the TrainingArguments. Can you double-check this and report back here? 🙏
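A minimal sketch of what this looks like, assuming a typical SFTTrainer setup similar to the original fine-tuning script; the model name, dataset, output directory, and sequence length below are placeholders, not the reporter's actual configuration:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Placeholder dataset for illustration only.
dataset = load_dataset("imdb", split="train")

training_args = TrainingArguments(
    output_dir="./mistral-sft",          # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    # Re-enable gradient checkpointing explicitly, since it is no longer
    # turned on by default after the change referenced above.
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # placeholder model id
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```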
Hi, I had the same issue. Passing gradient_checkpointing=True does solve the memory issue, but afterwards I am unable to continue a previous run: even when I point it at an explicit checkpoint, training restarts from scratch.
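For context, the resume attempt presumably uses the standard transformers mechanism, something like the call below (the checkpoint path is a placeholder):

```python
# Ask the trainer to pick up from a saved checkpoint directory.
trainer.train(resume_from_checkpoint="./mistral-sft/checkpoint-500")
```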