
trl v0.7.2 is causing CUDA out of memory for Mistral Instruct on A100 GPU (on Colab) #876

Closed
mallaham opened this issue Oct 15, 2023 · 4 comments

Comments

@mallaham

Fine-tuning the Mistral Instruct model with trl v0.7.2 causes a CUDA OOM error on an A100 GPU on Colab, regardless of batch size, gradient accumulation steps, or precision (fp16 or bf16). Leaving all hyperparameters unchanged but reverting trl to v0.7.1, I can run the fine-tuning job without any errors.

@younesbelkada
Contributor

Hi @mallaham
I believe #725 caused the issue. Before that PR, gradient checkpointing was enabled by default; that is no longer the case, so you now need to explicitly pass gradient_checkpointing=True in the TrainingArguments. Can you double-check this and report back here? 🙏
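
For reference, a minimal sketch of what that could look like with SFTTrainer. The model name, dataset, output path, and batch-size settings below are placeholders, not the reporter's actual configuration:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Placeholder dataset with a plain "text" column; swap in your own data.
train_dataset = load_dataset("imdb", split="train")

# Gradient checkpointing is no longer on by default after #725,
# so it has to be requested explicitly via TrainingArguments.
training_args = TrainingArguments(
    output_dir="mistral-instruct-sft",   # placeholder output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    gradient_checkpointing=True,         # the flag discussed above
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # or an already-loaded model object
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=training_args,
)
trainer.train()
```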

@Pclanglais

Hi. I had the same issue. gradient_checkpointing=True does solve the memory issue, but afterwards I am unable to resume a previous run; even with explicit checkpoints, training restarts from scratch.
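
For context, resuming is normally done through the underlying transformers Trainer API that SFTTrainer inherits from; a hypothetical sketch (the checkpoint path is a placeholder):

```python
# Passing True resumes from the latest checkpoint found in output_dir.
trainer.train(resume_from_checkpoint=True)

# A path string resumes from a specific checkpoint directory instead.
trainer.train(resume_from_checkpoint="mistral-instruct-sft/checkpoint-500")
```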

@mallaham
Author

@younesbelkada sorry for the late response. Indeed, with gradient_checkpointing set to True, the issue has been resolved.

@lvwerra
Member

lvwerra commented Oct 26, 2023

Ok closing the issue then :)

lvwerra closed this as completed Oct 26, 2023