
trl v0.7.2 is causing CUDA out of memory for Mistral Instruct on A100 GPU (on Colab) #876

Closed
mallaham opened this issue Oct 15, 2023 · 4 comments

Comments

@mallaham

Fine-tuning the Mistral Instruct model with trl v0.7.2 causes a CUDA OOM error on an A100 GPU on Colab, regardless of batch size, gradient accumulation steps, or precision (fp16 or bf16). Leaving all hyperparameters unchanged but reverting trl to v0.7.1, I can run the fine-tuning job without any errors.

@younesbelkada
Contributor

Hi @mallaham
I believe #725 caused the issue. Before that PR, gradient checkpointing was enabled by default; that is no longer the case, so you now need to explicitly pass gradient_checkpointing=True in the TrainingArguments. Can you double-check this and report back here? 🙏
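
For reference, a minimal sketch of what that could look like with SFTTrainer. The model name, dataset, output path, and batch-size settings below are placeholders, not the reporter's actual configuration:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Placeholder dataset with a plain "text" column; swap in your own data.
train_dataset = load_dataset("imdb", split="train")

# Gradient checkpointing is no longer on by default after #725,
# so it has to be requested explicitly via TrainingArguments.
training_args = TrainingArguments(
    output_dir="mistral-instruct-sft",   # placeholder output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    gradient_checkpointing=True,         # the flag discussed above
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # or an already-loaded model object
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=training_args,
)
trainer.train()
```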

@Pclanglais

Hi. I had the same issue. gradient_checkpointing=True does solve the memory issue, but afterwards I am unable to resume a previous run; even with explicit checkpoints, training restarts from scratch.
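
For context, resuming is normally done through the underlying transformers Trainer API that SFTTrainer inherits from; a hypothetical sketch (the checkpoint path is a placeholder):

```python
# Passing True resumes from the latest checkpoint found in output_dir.
trainer.train(resume_from_checkpoint=True)

# A path string resumes from a specific checkpoint directory instead.
trainer.train(resume_from_checkpoint="mistral-instruct-sft/checkpoint-500")
```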

@mallaham
Author

@younesbelkada sorry for the late response. Indeed, with gradient_checkpointing set to True, the issue has been resolved.

@lvwerra
Member

lvwerra commented Oct 26, 2023

Ok closing the issue then :)

lvwerra closed this as completed Oct 26, 2023