### Tasks

- One of the scripts in the `examples/` folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)

### Reproduction
I am running a slightly modified version of the Flux ControlNet training script in diffusers; the script is linked below. I have one trainable model, `FluxControlNetModel`, and two frozen models, `AutoencoderKL` and `FluxTransformer2DModel`. The machine has 8×A100 GPUs, and I am using FSDP with `FULL_SHARD`. The accelerate config is attached below.
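For context, the relevant part of the script looks roughly like this (a simplified sketch, not the attached script verbatim; `model_id`, the optimizer settings, and the dummy dataloader are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator
from diffusers import AutoencoderKL, FluxControlNetModel, FluxTransformer2DModel

accelerator = Accelerator()  # FSDP / FULL_SHARD settings come from the accelerate config

# Load the two frozen models and build the trainable ControlNet from the
# transformer ("model_id" and the dataloader below are placeholders, not
# the values used in the attached script).
model_id = "black-forest-labs/FLUX.1-dev"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
transformer = FluxTransformer2DModel.from_pretrained(model_id, subfolder="transformer")
controlnet = FluxControlNetModel.from_transformer(transformer)

vae.requires_grad_(False)
transformer.requires_grad_(False)
controlnet.train()

optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)
train_dataloader = DataLoader(TensorDataset(torch.zeros(8, 3, 64, 64)), batch_size=2)

# Prepare everything, frozen models included, so that FSDP shards all of them:
controlnet, vae, transformer, optimizer, train_dataloader = accelerator.prepare(
    controlnet, vae, transformer, optimizer, train_dataloader
)
```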
I want to shard the weights of all of the models, not just the trainable one. When I use the script as is, I get the following error when `accelerator.save_state()` is called:
```
[rank7]:   File "/opt/conda/envs/flux_cn/lib/python3.10/site-packages/torch/distributed/utils.py", line 166, in _p_assert
[rank7]:     raise AssertionError(s)
[rank7]: AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
```
However, the issue goes away when I apply the following changes to the provided script. Basically, the frozen models must not be passed to `accelerator.prepare()`:
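In essence the change is the following (a simplified sketch using the same placeholder names as above, not the literal diff):

```python
# Before (fails later in accelerator.save_state()):
# controlnet, vae, transformer, optimizer, train_dataloader = accelerator.prepare(
#     controlnet, vae, transformer, optimizer, train_dataloader
# )

# After (works, but the frozen models are no longer sharded -- each rank
# now keeps a full replica of the VAE and the transformer):
controlnet, optimizer, train_dataloader = accelerator.prepare(
    controlnet, optimizer, train_dataloader
)
vae.to(accelerator.device)
transformer.to(accelerator.device)
```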
Command used to run the script:

Link to the script: https://pastebin.com/SdQZcQR8

### Expected behavior

I want to be able to shard both the frozen and the trainable models. Calling `accelerator.prepare()` on the frozen models breaks `accelerator.save_state()`, which forces me to skip preparing them with accelerate; as a result, these models are replicated on each GPU. I would expect to be able to call `accelerator.prepare()` on all models and then save the state with `accelerator.save_state()` without any issues.
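That is, after preparing everything as in the first sketch, checkpointing should just work (`output_dir` is a placeholder):

```python
output_dir = "checkpoints/step_0"  # placeholder path

# Expected: saving the (FSDP-sharded) training state succeeds for all
# prepared models, frozen ones included.
accelerator.save_state(output_dir)
```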