update to prepare_model_for_kbit_training
#728
Conversation
The documentation is not available anymore as the PR was closed or merged.
Update to `prepare_model_for_kbit_training` from the deprecated `prepare_model_for_int8_training`, and add `use_gradient_checkpointing=args.gradient_checkpointing` to automatically follow the gradient checkpointing choice. This is also the workaround for huggingface#694.
973429d to 16ef09b
Calling model.gradient_checkpointing_enable() twice causes issues. This workaround calls it in prepare_model_for_kbit_training and then changes the arg to false to make sure it isn't called again in the Hugging Face Trainer inner loop. Also changes the stack_llama_2 SFT trainer to use the correct device map for DDP training so that you can test this issue.
I've realized this fix actually causes an issue. Calling `model.gradient_checkpointing_enable()` twice causes problems, since it gets called once in `prepare_model_for_kbit_training` and then again by the `Trainer`. I've made a workaround here in this PR: after calling `prepare_model_for_kbit_training` with `use_gradient_checkpointing=args.gradient_checkpointing`, the `gradient_checkpointing` arg is set to `False` so it isn't called again in the trainer's inner loop. To demonstrate the issue, I fixed the stack_llama_2 SFT trainer to use the correct device map for DDP training.
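In other words, the pattern looks roughly like this. This is only a sketch of the idea, not the exact stack_llama_2 script: the model id, output dir, and quantization config below are placeholders, and the flag is kept on a `TrainingArguments` dataclass for simplicity.

```python
# Sketch of the workaround: enable gradient checkpointing once inside
# prepare_model_for_kbit_training, then flip the flag off so the trainer
# does not enable it a second time. Model id and output_dir are placeholders.
import dataclasses

import torch
from accelerate import Accelerator
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

training_args = TrainingArguments(
    output_dir="./sft-output",
    gradient_checkpointing=True,  # the user-facing choice lives here
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    # for DDP, place the whole model on the local rank instead of device_map="auto"
    device_map={"": Accelerator().local_process_index},
)

# gradient_checkpointing_enable() is called once inside this helper
# when use_gradient_checkpointing=True
model = prepare_model_for_kbit_training(
    model, use_gradient_checkpointing=training_args.gradient_checkpointing
)

# flip the flag off so the Trainer/SFTTrainer does not call
# gradient_checkpointing_enable() a second time in its inner loop
training_args = dataclasses.replace(training_args, gradient_checkpointing=False)
# ...then pass `model` and `training_args` to the trainer as usual
```

Gradient checkpointing stays active because it was already enabled inside `prepare_model_for_kbit_training`; replacing the flag only prevents the trainer from enabling it again.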
This looks good to me! Thanks for deep-diving and explaining. I left one question, let me know what you think.
model, use_gradient_checkpointing=args.gradient_checkpointing
)

args = dataclasses.replace(args, gradient_checkpointing=False)
why this change here and not above?
We do want to call `gradient_checkpointing_enable` once, we just don't want to call it twice. We will call it in `prepare_model_for_kbit_training`, but this change makes sure we don't call it in `Trainer`.
Perfect, makes sense!
@mnoukhov thanks again for your work on this! Would you be happy to fix the merge conflicts? After that we should be good to merge!
Pulled and should be ready to merge!
Thanks a lot for this great effort!
* update to `prepare_model_for_kbit_training` from deprecated `prepare_model_for_int8_training` and add `use_gradient_checkpointing=args.gradient_checkpointing` to automatically follow the gradient checkpointing choice; this is also the workaround for huggingface#694
* workaround for gradient checkpointing issue: calling model.gradient_checkpointing_enable() twice causes issues. This workaround calls it in prepare_model_for_kbit_training and then changes the arg to false to make sure it isn't called again in the Hugging Face Trainer inner loop. Also changes the stack_llama_2 SFT trainer to use the correct device map for DDP training so that you can test this issue.
Update to `prepare_model_for_kbit_training` since `peft` has deprecated `prepare_model_for_int8_training`. Also add `use_gradient_checkpointing=args.gradient_checkpointing` to automatically follow the gradient checkpointing choice in training args. For RewardTrainer, this is the workaround to #480 proposed by #694.
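As a rough illustration of the migration, not the exact diff: the helper name `prepare_quantized_model` below is made up for the example; only `prepare_model_for_kbit_training` and its `use_gradient_checkpointing` argument come from `peft`.

```python
from peft import prepare_model_for_kbit_training
from transformers import PreTrainedModel, TrainingArguments


def prepare_quantized_model(
    model: PreTrainedModel, training_args: TrainingArguments
) -> PreTrainedModel:
    """Stand-in for the deprecated prepare_model_for_int8_training call.

    prepare_model_for_kbit_training covers both 8-bit and 4-bit models, and
    use_gradient_checkpointing lets the script follow the single flag in the
    training arguments instead of always enabling checkpointing.
    """
    return prepare_model_for_kbit_training(
        model, use_gradient_checkpointing=training_args.gradient_checkpointing
    )
```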
Concurrently @lewtun is working on #726 which adds `use_gradient_checkpointing` for RewardTrainer. I'm happy to wait until it is merged to merge this.