
Non-root FSDP instance's _is_root should not have been set yet or should have been set to False while saving state with FSDP #3258

enesmsahin opened this issue on Nov 25, 2024
System Info

- `Accelerate` version: 1.2.0.dev0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- `accelerate` bash location: /opt/conda/envs/flux_cn/bin/accelerate
- Python version: 3.10.10
- Numpy version: 2.1.3
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1338.63 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running a slightly modified version of the Flux ControlNet training script in diffusers. The script is attached below. Basically, I have 1 trainable model, FluxControlNetModel, and 2 frozen models, AutoencoderKL and FluxTransformer2DModel. I have a machine with 8xA100 GPUs and I am using FSDP with FULL_SHARD. The accelerate config is attached below.

I want to shard the model weights for all of the models, not just the trainable one. When I use the script as is, I get the following error when accelerator.save_state() is called:

[rank7]:   File "/opt/conda/envs/flux_cn/lib/python3.10/site-packages/torch/distributed/utils.py", line 166, in _p_assert
[rank7]:     raise AssertionError(s)
[rank7]: AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
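
For reference, a minimal sketch of the pattern the script follows is shown below. Variable names match the attached script, but the ControlNet construction is approximate and the FSDP settings come entirely from the config file passed to accelerate launch, so treat this as an illustration rather than the exact code:

import torch
from accelerate import Accelerator
from diffusers import AutoencoderKL, FluxControlNetModel, FluxTransformer2DModel

accelerator = Accelerator(mixed_precision="bf16")
weight_dtype = torch.bfloat16

# Frozen models
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae"
)
flux_transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer"
)
vae.requires_grad_(False)
flux_transformer.requires_grad_(False)

# Trainable model (construction approximate; see the attached script for the exact call)
flux_controlnet = FluxControlNetModel.from_transformer(
    flux_transformer, num_layers=4, num_single_layers=0
)
optimizer = torch.optim.AdamW(flux_controlnet.parameters(), lr=1e-5)

# The trainable model and optimizer are prepared as usual ...
flux_controlnet, optimizer = accelerator.prepare(flux_controlnet, optimizer)

# ... and the frozen models are prepared as well, so that FSDP shards them too.
vae = accelerator.prepare(vae)
flux_transformer = accelerator.prepare(flux_transformer)

vae.to(accelerator.device, dtype=weight_dtype)
flux_transformer.to(accelerator.device, dtype=weight_dtype)

# ... training loop ...

accelerator.save_state("./training_output/checkpoint-1")  # raises the AssertionError above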

However, the issue is gone when I apply the following changes to the provided script. In other words, the workaround is to not call accelerator.prepare() on the frozen models:

diff --git a/train_controlnet_flux_minimum_working_example.py b/train_controlnet_flux_minimum_working_example_diff.py
index 4587f77c2..503e77791 100644
--- a/train_controlnet_flux_minimum_working_example.py
+++ b/train_controlnet_flux_minimum_working_example_diff.py
@@ -1164,9 +1164,6 @@ def main(args):
     )
     flux_controlnet.to(accelerator.device, dtype=weight_dtype)
 
-    vae = accelerator.prepare(vae)
-    flux_transformer = accelerator.prepare(flux_transformer)
-
     vae.to(accelerator.device, dtype=weight_dtype)
     flux_transformer.to(accelerator.device, dtype=weight_dtype)
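
With this change, only the trainable ControlNet (together with the optimizer, dataloader and LR scheduler) goes through accelerator.prepare() and gets FSDP-wrapped, while the frozen models are merely moved to the device, roughly as sketched below (the exact prepare call in the script may differ):

# Frozen models bypass accelerator.prepare() entirely and are only moved to the
# local device in the desired dtype, so each rank keeps a full replica of the
# VAE and the transformer instead of an FSDP shard.
vae.to(accelerator.device, dtype=weight_dtype)
flux_transformer.to(accelerator.device, dtype=weight_dtype)

# Only the trainable model is prepared (and therefore FSDP-wrapped).
flux_controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    flux_controlnet, optimizer, train_dataloader, lr_scheduler
)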

Command used to run the script:

accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=fusing/fill50k \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="./training_output/" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=1 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42

Accelerate config (default_config_fsdp.yaml):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_min_num_params: 1000
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Link to the script: https://pastebin.com/SdQZcQR8

Expected behavior

I want to be able to shard both the frozen and the trainable models. Calling accelerator.prepare() on the frozen models causes the failure in accelerator.save_state(), which forces me to skip preparing them with Accelerate; as a result, these models are replicated on every GPU instead of being sharded. I would expect to be able to call accelerator.prepare() on all models and then save the state with accelerator.save_state() without any issues.
