auto_wrap_policy for PEFT with FSDP #2253
Conversation
I also added a test called test_auto_wrap_policy_peft.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot for improving the interop of accelerate with PEFT in an FSDP setting. I have a few small comments, please check.
Regarding the change in general, I have a question. I don't have experience with FSDP, so maybe it's obvious, but what is special about PEFT that requires this extra policy? Is it just that PEFT results in a mixture of params with and without requires_grad, or is there something else that's special about PEFT?
Yes, the reason is the mixture of params with and without requires_grad. And PEFT does not make sense with the use_orig_params=True configuration. That's why this is necessary.
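To make that mixture concrete, here is a minimal, hypothetical illustration using a plain torch module with a manually frozen layer as a stand-in for a PEFT model (no PEFT or FSDP required; the module layout and names are made up for the example):

```python
import torch.nn as nn

# Stand-in for a PEFT model: a frozen "base" layer plus a small trainable "adapter" layer.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))
for p in model[0].parameters():
    p.requires_grad_(False)  # freeze the "base" layer

for name, param in model.named_parameters():
    print(name, param.requires_grad)
# 0.weight False
# 0.bias   False
# 1.weight True
# 1.bias   True

# With use_orig_params=False, FSDP cannot flatten such a mixture into a single
# FlatParameter, so trainable and frozen modules have to land in separate FSDP units.
```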
Thanks for explaining. I wonder if this could be detached from PEFT, WDYT? This would also allow avoiding the …
What we could do is check whether there are parameters with requires_grad=False in the model and, if so, run the wrapping. I don't know if there are cases where you would not want the extra wrapping (using use_orig_params=True). @pacman100 what do you think?
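Such a check could be as simple as the following sketch (a hypothetical helper, not code from this PR):

```python
def needs_param_level_wrapping(model) -> bool:
    """Hypothetical check: the model contains frozen parameters (e.g. it is a PEFT model)."""
    return any(not p.requires_grad for p in model.parameters())
```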
Hello, use_orig_params=True is required if one wants to prepare the model and optimizer together in a single accelerator.prepare call; without it, one needs to prepare/wrap the model in FSDP before creating the optimizer. See PR #2177 for more information.
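If I read that constraint correctly, the two orderings look roughly like the sketch below (the placeholder model, optimizer class, and learning rate are only for illustration; an FSDP config via accelerate launch is assumed):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()        # assumes an FSDP config, e.g. via `accelerate launch`
model = torch.nn.Linear(8, 8)      # placeholder for the real (PEFT) model
use_orig_params = True             # mirrors the fsdp_config setting

if use_orig_params:
    # Model and optimizer can be prepared together in a single call.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model, optimizer = accelerator.prepare(model, optimizer)
else:
    # The model must be wrapped by FSDP first; the optimizer is then created
    # from the wrapped model's parameters and prepared afterwards.
    model = accelerator.prepare(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    optimizer = accelerator.prepare(optimizer)
```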
With respect to the auto wrap policy for PEFT when using FSDP, it is already present in PEFT here: https://github.com/huggingface/peft/blob/482a2a6d9aaa01d534b1240e8c1ab6d346eb278f/src/peft/utils/other.py#L354 and used in the example here: https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_fsdp.py
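For reference, hooking that policy into accelerate looks roughly like this in the linked example (a sketch, please double-check against the example script; `model` is assumed to be an already-constructed PEFT model):

```python
from accelerate import Accelerator
from peft.utils.other import fsdp_auto_wrap_policy  # defined at the first link above

accelerator = Accelerator()  # assumes an FSDP config, e.g. via `accelerate launch`

# `model` is assumed to be a PEFT model, e.g. built with get_peft_model(...).
if getattr(accelerator.state, "fsdp_plugin", None) is not None:
    accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)

model = accelerator.prepare(model)
```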
Can you please provide a minimal reproducer for us to deep dive into the memory increase?
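For illustration, one way such a minimal reproducer could look is sketched below. The base model, LoRA settings, and batch shape are placeholders, and the script is assumed to be launched via accelerate launch with an FSDP config, once with use_orig_params: true and once with false, to compare peak memory; with use_orig_params: false it is expected to hit the FSDP error discussed above unless the wrapping from this PR is applied.

```python
# repro.py -- hypothetical sketch, launch with:
#   accelerate launch --config_file fsdp_config.yaml repro.py
import torch
from accelerate import Accelerator
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# Small base model purely for illustration; the memory gap grows with model size.
base = AutoModelForCausalLM.from_pretrained("gpt2")
model = get_peft_model(base, LoraConfig(r=8, task_type=TaskType.CAUSAL_LM))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# One dummy training step so activations, gradients, and optimizer state are allocated.
batch = torch.randint(0, base.config.vocab_size, (4, 128), device=accelerator.device)
loss = model(input_ids=batch, labels=batch).loss
accelerator.backward(loss)
optimizer.step()

# Per-rank peak memory; compare the two runs (use_orig_params true vs. false).
accelerator.print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```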
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
While fine-tuning a PEFT model with FSDP, I noticed excessive memory usage caused by the use_orig_params parameter, which leads to additional memory being allocated for frozen parameters. When trying to disable use_orig_params, FSDP threw an error.
The reason for this is documented in the torch library.
I implemented a check in the auto_wrap_policy function of the FullyShardedDataParallelPlugin and wrapped the unfrozen parameters individually if necessary.
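The idea can be sketched with torch's lambda_auto_wrap_policy: give every leaf module that only holds trainable parameters its own FSDP unit, so trainable and frozen parameters never share a FlatParameter. This is only an illustration of the approach under that assumption, not the exact code added in this PR:

```python
import functools

from torch.distributed.fsdp.wrap import lambda_auto_wrap_policy


def _wrap_trainable_leaves(module) -> bool:
    # Wrap leaf modules whose own (direct) parameters are all trainable, so they
    # are not flattened together with the frozen base-model parameters.
    is_leaf = len(list(module.named_children())) == 0
    own_params = list(module.parameters(recurse=False))
    return is_leaf and len(own_params) > 0 and all(p.requires_grad for p in own_params)


# Hypothetical usage: pass this as the auto wrap policy when use_orig_params=False.
peft_wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=_wrap_trainable_leaves)
```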
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
I wrote an additional test called test_auto_wrap_policy_peft, which checks that FSDP does not fail when using PEFT with use_orig_params=False.
Who can review?
@pacman100
@muellerzr