
Multi-GPU DeepSpeed Stage-3 doesn't reduce memory compared to Single GPU #3264

Open · 2 of 4 tasks · enesmsahin opened this issue on Nov 26, 2024 · 0 comments

System Info

- `Accelerate` version: 1.2.0.dev0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- `accelerate` bash location: /opt/conda/envs/flux_cn/bin/accelerate
- Python version: 3.10.10
- Numpy version: 2.1.3
- PyTorch version (GPU?): 2.4.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1338.63 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: NO
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running a slightly modified version of the Flux ControlNet training script from diffusers; the script is linked below. I am using DeepSpeed Stage-3 with the accelerate config below.

When I use only 1 GPU (configured via the accelerate config file below), training takes around 42 GB of GPU memory. When I use all 8 GPUs on a single node, it still takes around 42 GB per GPU.

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
gpu_ids: all # "0"
num_machines: 1
num_processes: 8 # 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
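
For reference, a minimal sketch of the same DeepSpeed settings expressed programmatically through accelerate's DeepSpeedPlugin (this is not how the attached script builds its Accelerator; it is launched with the YAML config above):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Mirrors the YAML config above: ZeRO stage 3 with optimizer and
# parameter offload to CPU, bf16 mixed precision.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
    zero3_init_flag=True,
    zero3_save_16bit_model=True,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)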

Command used to run the script:

accelerate launch --config_file "./default_config_fsdp.yaml" train_controlnet_flux_minimum_working_example.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=fusing/fill50k \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="./training_output/" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=1 \
    --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42

Link to the script: https://pastebin.com/SdQZcQR8
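
A minimal sketch of how the per-GPU numbers above could be confirmed from inside the training loop (the helper below is illustrative and not part of the attached script):

import torch

def log_peak_gpu_memory(accelerator, tag):
    # Peak CUDA memory for this rank since the last reset, in GiB.
    allocated = torch.cuda.max_memory_allocated() / 1024**3
    reserved = torch.cuda.max_memory_reserved() / 1024**3
    print(f"[rank {accelerator.process_index}] {tag}: "
          f"allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")
    torch.cuda.reset_peak_memory_stats()

# e.g. call log_peak_gpu_memory(accelerator, f"step {step}") once per
# training step; with working ZeRO-3 sharding the per-rank peak should
# drop as more GPUs are added.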

Expected behavior

I would expect DeepSpeed Stage-3 to shard the model weights across GPUs and reduce the per-GPU memory usage in the 8-GPU run compared to the single-GPU case.
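
As a sanity check, a sketch of how one might verify that the parameters were actually partitioned after accelerator.prepare, assuming DeepSpeed attaches its ds_numel attribute to ZeRO-3-partitioned parameters (flux_controlnet here stands for whatever the prepared ControlNet model is named in the attached script):

# Sketch: count parameters carrying DeepSpeed's ZeRO-3 attributes.
# If nothing is partitioned, every rank holds the full model and the
# per-GPU memory will not shrink as GPUs are added.
partitioned = [p for p in flux_controlnet.parameters() if hasattr(p, "ds_numel")]
full_numel = sum(p.ds_numel for p in partitioned)
# p.numel() should be near zero outside a gathered context when
# partitioning is active, since the full tensor has been freed locally.
local_numel = sum(p.numel() for p in partitioned)
print(f"[rank {accelerator.process_index}] partitioned params: {len(partitioned)}, "
      f"logical numel: {full_numel:,}, local numel: {local_numel:,}")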
