flux_example.py gets stuck #363
Comments
Trying to find out why it gets stuck; NCCL times out:
Same error.
I used an A100 instead of an L20, and it succeeds now.
I suggest you check your NCCL environment settings related to PCIe.
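As a starting point (an illustrative sketch, not from this thread): running with NCCL_DEBUG=INFO prints which transport NCCL picks, and PyTorch can query whether the driver reports peer-to-peer access between the two GPUs at all. On a PCIe-only box without NVLink, peer access may be unavailable, which forces NCCL to stage transfers through host memory.

```python
import torch

# Minimal P2P sanity check (illustrative; not from the thread).
# If the driver reports no peer access between the two GPUs,
# NCCL cannot use direct GPU-to-GPU copies for this pair.
assert torch.cuda.device_count() >= 2, "need at least two GPUs"
for src in range(2):
    for dst in range(2):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst} peer access: {ok}")
```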
I set NCCL_P2P_DISABLE=0 but it still gets stuck in the same place. I checked the code against the stack information, and it looks like it is stuck in this operation: maybe something is wrong with NCCL. Does the L20 not support it?
Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion: it uses asynchronous P2P, which may have bugs leading to deadlocks. Particularly on the L20, cross-NUMA communication might be routed through the CPU's QPI link. @Lay2000 will help debug this issue.
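To illustrate the failure mode being suggested here (a standalone sketch under assumed conditions, not xFuser's actual code): if the receiving rank blocks in a synchronous shape exchange while the sending rank has already posted an asynchronous isend that the receiver never matches, both ranks wait on each other forever. The mismatched tags below stand in for whatever mismatch a real bug would cause; the gloo backend is used so the sketch runs on CPU. It deliberately hangs when launched.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

# Standalone sketch of an async-P2P pipeline deadlock (NOT xFuser's code).
# Rank 0 blocks receiving shape metadata; rank 1 blocks in an async send
# whose matching recv is never posted, because the tags never match.
def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        shape = torch.empty(1, dtype=torch.long)
        dist.recv(shape, src=1, tag=1)  # waits for a message that never comes
    else:
        req = dist.isend(torch.ones(4), dst=0, tag=0)
        req.wait()                      # never completes: no matching recv
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

The observed stacks match this shape: one rank inside `_communicate_shapes` waiting on a recv, the other inside `isend`.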
Thanks for the help. If you make any progress, please let me know; I'd like to know the reason too.
I use this command:
torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"
but it gets stuck in _async_pipeline. This is what it looks like:
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793]
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:00<00:00, 11.46it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:00<00:00, 11.47it/s]
Loading checkpoint shards: 100%|████████████████████████| 2/2 [00:00<00:00, 11.71it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.98it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.85it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.19.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.20.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.21.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.22.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.23.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.24.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.25.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.26.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.27.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.35it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.41s/it]
5%|███████▋
It always gets stuck at 5%. I dumped the stack, and it looks like:
Thread 71079 (active): "MainThread"
synchronize (torch/cuda/__init__.py:954)
_communicate_shapes (xfuser/core/distributed/group_coordinator.py:865)
_check_shape_and_buffer (xfuser/core/distributed/group_coordinator.py:796)
recv_next (xfuser/core/distributed/group_coordinator.py:938)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:572)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71098 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
The other process was:
Thread 71080 (idle): "MainThread"
isend (torch/distributed/distributed_c10d.py:2062)
_pipeline_isend (xfuser/core/distributed/group_coordinator.py:968)
pipeline_isend (xfuser/core/distributed/group_coordinator.py:921)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:634)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71099 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)
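For anyone else who wants to capture dumps like the two above (they look like py-spy output): Python's built-in faulthandler can produce similar per-thread stacks without extra tooling. A small sketch, assuming you can edit the entry script:

```python
import faulthandler
import signal

# Register a handler so that `kill -USR1 <pid>` dumps every thread's
# Python stack to stderr while the process is hung.
faulthandler.register(signal.SIGUSR1)
```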
Maybe something is going wrong; please give me a little help.