
Non-root FSDP instance's _is_root should not have been set yet or should have been set to False while saving state with FSDP #3258

enesmsahin opened this issue on Nov 25, 2024
System Info

- `Accelerate` version: 1.2.0.dev0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- `accelerate` bash location: /opt/conda/envs/flux_cn/bin/accelerate
- Python version: 3.10.10
- Numpy version: 2.1.3
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1338.63 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running a slightly modified version of the Flux ControlNet training script in diffusers. The script is attached below. Basically, I have 1 trainable model, FluxControlNetModel, and 2 frozen models, AutoencoderKL and FluxTransformer2DModel. I have a machine with 8xA100 GPUs and I am using FSDP with FULL_SHARD. The accelerate config is attached below.

I want to shard the model weights for all of the models, not just the trainable one. When I use the script as is, I get the following error when accelerator.save_state() is called:

[rank7]:   File "/opt/conda/envs/flux_cn/lib/python3.10/site-packages/torch/distributed/utils.py", line 166, in _p_assert
[rank7]:     raise AssertionError(s)
[rank7]: AssertionError: Non-root FSDP instance's `_is_root` should not have been set yet or should have been set to `False`
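
For reference, a minimal sketch of the pattern the script follows is shown below. Variable names match the attached script, but the ControlNet construction is approximate and the FSDP settings come entirely from the config file passed to accelerate launch, so treat this as an illustration rather than the exact code:

import torch
from accelerate import Accelerator
from diffusers import AutoencoderKL, FluxControlNetModel, FluxTransformer2DModel

accelerator = Accelerator(mixed_precision="bf16")
weight_dtype = torch.bfloat16

# Frozen models
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae"
)
flux_transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer"
)
vae.requires_grad_(False)
flux_transformer.requires_grad_(False)

# Trainable model (construction approximate; see the attached script for the exact call)
flux_controlnet = FluxControlNetModel.from_transformer(
    flux_transformer, num_layers=4, num_single_layers=0
)
optimizer = torch.optim.AdamW(flux_controlnet.parameters(), lr=1e-5)

# The trainable model and optimizer are prepared as usual ...
flux_controlnet, optimizer = accelerator.prepare(flux_controlnet, optimizer)

# ... and the frozen models are prepared as well, so that FSDP shards them too.
vae = accelerator.prepare(vae)
flux_transformer = accelerator.prepare(flux_transformer)

vae.to(accelerator.device, dtype=weight_dtype)
flux_transformer.to(accelerator.device, dtype=weight_dtype)

# ... training loop ...

accelerator.save_state("./training_output/checkpoint-1")  # raises the AssertionError above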

However, the issue is gone when I apply the following changes to the provided script. In other words, the workaround is to not call accelerator.prepare() on the frozen models:

diff --git a/train_controlnet_flux_minimum_working_example.py b/train_controlnet_flux_minimum_working_example_diff.py
index 4587f77c2..503e77791 100644
--- a/train_controlnet_flux_minimum_working_example.py
+++ b/train_controlnet_flux_minimum_working_example_diff.py
@@ -1164,9 +1164,6 @@ def main(args):
     )
     flux_controlnet.to(accelerator.device, dtype=weight_dtype)
 
-    vae = accelerator.prepare(vae)
-    flux_transformer = accelerator.prepare(flux_transformer)
-
     vae.to(accelerator.device, dtype=weight_dtype)
     flux_transformer.to(accelerator.device, dtype=weight_dtype)
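
With this change, only the trainable ControlNet (together with the optimizer, dataloader and LR scheduler) goes through accelerator.prepare() and gets FSDP-wrapped, while the frozen models are merely moved to the device, roughly as sketched below (the exact prepare call in the script may differ):

# Frozen models bypass accelerator.prepare() entirely and are only moved to the
# local device in the desired dtype, so each rank keeps a full replica of the
# VAE and the transformer instead of an FSDP shard.
vae.to(accelerator.device, dtype=weight_dtype)
flux_transformer.to(accelerator.device, dtype=weight_dtype)

# Only the trainable model is prepared (and therefore FSDP-wrapped).
flux_controlnet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    flux_controlnet, optimizer, train_dataloader, lr_scheduler
)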

Command used to run the script:

accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=fusing/fill50k \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="./training_output/" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=1 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42

Accelerate config (default_config_fsdp.yaml):

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
  fsdp_min_num_params: 1000
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Link to the script: https://pastebin.com/SdQZcQR8

Expected behavior

I want to be able to shard both the frozen and the trainable models. Calling accelerator.prepare() on the frozen models causes the failure in accelerator.save_state(), which forces me to skip preparing them with Accelerate; as a result, these models are replicated on every GPU instead of being sharded. I would expect to be able to call accelerator.prepare() on all models and then save the state with accelerator.save_state() without any issues.
