This issue was moved to a discussion.

Flux ControlNet Training Multi-GPU DeepSpeed Stage-3 doesn't reduce memory compared to Single GPU #10026

Closed
enesmsahin opened this issue Nov 26, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@enesmsahin

Describe the bug

I am running a slightly modified version of the Flux ControlNet training script in diffusers. The script is linked below. I am using DeepSpeed Stage-3 with the accelerate config below.

When I use only 1 GPU (configured via the accelerate config file below), training uses around 42 GB of GPU memory. When I use all 8 GPUs in a single node, it still uses around 42 GB per GPU.
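
To make the comparison between the 1-GPU and 8-GPU runs concrete, a minimal sketch of per-rank peak-memory logging follows. The helper names (`gib`, `format_peak_memory`, `log_peak_memory`) are hypothetical and not part of the training script. It assumes PyTorch's CUDA caching-allocator statistics are the quantity of interest; the ~42 GB figure above may instead come from a tool such as `nvidia-smi`, which reports reserved memory plus CUDA context overhead and so can read higher than the allocated figure.

```python
def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


def format_peak_memory(rank: int, allocated: int, reserved: int) -> str:
    """Build a one-line, per-rank summary of peak GPU memory."""
    return (
        f"[rank {rank}] peak allocated: {gib(allocated):.2f} GiB, "
        f"peak reserved: {gib(reserved):.2f} GiB"
    )


def log_peak_memory(rank: int) -> None:
    """Print peak memory for the current CUDA device (needs PyTorch with CUDA)."""
    import torch  # imported lazily so the helpers above work without a GPU

    print(
        format_peak_memory(
            rank,
            torch.cuda.max_memory_allocated(),  # peak bytes held by tensors
            torch.cuda.max_memory_reserved(),   # peak bytes held by the allocator
        )
    )
```

One possible usage, assuming an `accelerator` object as in the diffusers training scripts: call `torch.cuda.reset_peak_memory_stats()` before a training step and `log_peak_memory(accelerator.process_index)` after it, once with `num_processes: 1` and once with `num_processes: 8`.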

I don't know the parallelization details of DeepSpeed, but I would expect DeepSpeed Stage-3 to shard the model weights across GPUs and so reduce per-GPU memory usage with 8 GPUs compared to the single-GPU case.

PS: I am not sure whether this issue is related to the ControlNet training script in diffusers or to accelerate. I have opened the same issue in accelerate.
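
The expectation can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a description of what DeepSpeed does in this script. It assumes the commonly cited mixed-precision Adam accounting of about 16 bytes per trainable parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and optimizer moments), and it covers only the parameters that DeepSpeed manages. The function name and defaults are hypothetical. It ignores activations, CPU offload, and any frozen modules kept outside the DeepSpeed engine.

```python
def zero_model_state_bytes(
    num_params: int,
    num_gpus: int,
    stage: int,
    bytes_param: int = 2,   # assumed: bf16 weights
    bytes_grad: int = 2,    # assumed: bf16 gradients
    bytes_optim: int = 12,  # assumed: fp32 master weights + two Adam moments
) -> float:
    """Estimate per-GPU bytes of model states for a given ZeRO stage."""
    params = num_params * bytes_param
    grads = num_params * bytes_grad
    optim = num_params * bytes_optim
    if stage >= 1:  # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the parameters themselves
        params /= num_gpus
    return params + grads + optim


# Example: 1B trainable parameters, 1 GPU vs. 8 GPUs under stage 3.
single = zero_model_state_bytes(1_000_000_000, num_gpus=1, stage=3)
sharded = zero_model_state_bytes(1_000_000_000, num_gpus=8, stage=3)
print(f"{single / 1e9:.1f} GB -> {sharded / 1e9:.1f} GB per GPU")
```

Under these assumptions only the model-state term shrinks with the number of GPUs. If most of the 42 GB were activations or weights that are not partitioned, little change per GPU would be expected; whether that is the case here is not established in this report.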

Reproduction

Link to the script: https://pastebin.com/SdQZcQR8

Command used to run the script:

accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=fusing/fill50k \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="./training_output/" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=1 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42

Accelerate Config File

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
gpu_ids: all # "0"
num_machines: 1
num_processes: 8 # 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Logs

No response

System Info

  • 🤗 Diffusers version: 0.32.0.dev0
  • Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
  • Running on Google Colab?: No
  • Python version: 3.10.10
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.26.2
  • Transformers version: 4.46.3
  • Accelerate version: 1.2.0.dev0
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
    NVIDIA A100-SXM4-80GB, 81920 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@PromeAIpro @sayakpaul

@enesmsahin enesmsahin added the bug Something isn't working label Nov 26, 2024
@enesmsahin enesmsahin changed the title Flux ControlNet Multi-GPU DeepSpeed Stage-3 doesn't reduce memory compared to Single GPU Flux ControlNet Training Multi-GPU DeepSpeed Stage-3 doesn't reduce memory compared to Single GPU Nov 26, 2024
@sayakpaul
Member

This doesn't seem like a diffusers-specific issue; it is more of a discussion. Hence converting.

@huggingface huggingface locked and limited conversation to collaborators Nov 26, 2024
@sayakpaul sayakpaul converted this issue into discussion #10027 Nov 26, 2024
