
flux_example.py get stuck #363

Open · fy1214 opened this issue Nov 25, 2024 · 7 comments
Labels: bug (Something isn't working)

fy1214 commented Nov 25, 2024

I use this command:

torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

but it gets stuck in _async_pipeline. This is what it looks like:

W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793]
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.46it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.47it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.71it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.98it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.85it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.19.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.20.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.21.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.22.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.23.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.24.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.25.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.26.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.27.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.35it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.41s/it]
5%|███████▋

It always gets stuck at 5%, and when I dump the stack it looks like this:

Thread 71079 (active): "MainThread"
synchronize (torch/cuda/__init__.py:954)
_communicate_shapes (xfuser/core/distributed/group_coordinator.py:865)
_check_shape_and_buffer (xfuser/core/distributed/group_coordinator.py:796)
recv_next (xfuser/core/distributed/group_coordinator.py:938)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:572)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71098 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)

The other process was:
Thread 71080 (idle): "MainThread"
isend (torch/distributed/distributed_c10d.py:2062)
_pipeline_isend (xfuser/core/distributed/group_coordinator.py:968)
pipeline_isend (xfuser/core/distributed/group_coordinator.py:921)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:634)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71099 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)

Maybe something is going wrong. Please give me some help.

fy1214 commented Nov 25, 2024

I found out why it gets stuck: NCCL times out:
[rank0]:[E1125 17:44:44.587302788 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=SEND, NumelIn=6291456, NumelOut=6291456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank0]:[E1125 17:44:44.588313115 ProcessGroupNCCL.cpp:1785] [PG ID 4 PG GUID 11 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10, last enqueued NCCL work: 11, last completed NCCL work: 11.
[rank1]:[W1125 17:44:44.646952859 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) returned 0, likely a timeout
[rank1]:[W1125 17:44:44.647828424 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) timed out after 600000ms
0%| | 0/20 [10:00<?, ?it/s]
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//12//cuda//0:1
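
The 600000 ms in the watchdog message is the default 10-minute timeout PyTorch applies to NCCL process groups; the send itself is deadlocked, so the watchdog simply fires once that window expires. For illustration only (flux_example.py lets xfuser initialize the distributed environment, so this is not a command-line option of the example), the timeout is the argument normally passed when the process group is created, and lengthening it only buys time to attach a debugger or dump stacks:

    # Sketch, assuming direct control over init_process_group; the xfuser
    # example does not expose this knob. A longer NCCL watchdog window does
    # not fix the deadlock, it only delays the abort.
    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=30),  # NCCL default is 10 minutes
    )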

@yinfan98

same error

fy1214 commented Nov 25, 2024

same error

I used an A100 instead of the L20, and it succeeds now.

@feifeibear (Collaborator)

I suggest you check your NCCL environment settings related to PCIe,
for example, NCCL_P2P_DISABLE=0.
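
One note on the variable itself (not from this thread): NCCL_P2P_DISABLE=1 is the value that actually turns the CUDA peer-to-peer transport off; 0 is the default and leaves it on, so setting it to 0 changes nothing. To rule out a broken PCIe P2P path, the variable has to be set to 1 before the first NCCL communicator is created, either exported in the shell before torchrun or pinned at the very top of the script, as in this hypothetical snippet:

    # Hypothetical snippet, not part of xDiT: these must be set before any
    # NCCL communicator is created, e.g. at the very top of the worker script.
    import os

    os.environ["NCCL_P2P_DISABLE"] = "1"   # 1 disables CUDA P2P; 0 (the default) keeps it on
    os.environ["NCCL_DEBUG"] = "INFO"      # verbose NCCL logs (transport selection, rings)
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P,NET"

    # ...only after this, import and run the pipeline as flux_example.py does.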

fy1214 commented Nov 26, 2024

I suggest you check your NCCL environment settings related to PCIe, for example, NCCL_P2P_DISABLE=0.

I used NCCL_P2P_DISABLE=0 but it still gets stuck in the same place. I checked the code according to the stack information, and it looks like it is stuck in this operation:

    if recv_prev:
        # Post an async recv for the number of dimensions of the tensor
        # coming from the previous pipeline rank.
        recv_prev_dim_tensor = torch.empty(
            (1), device=self.device, dtype=torch.int64
        )
        recv_prev_dim_op = torch.distributed.P2POp(
            torch.distributed.irecv,
            recv_prev_dim_tensor,
            self.prev_rank,
            self.device_group,
        )
        ops.append(recv_prev_dim_op)

    if tensor_send_to_next is not None:
        # Send the number of dimensions of the outgoing tensor to the next
        # pipeline rank.
        send_next_dim_tensor = torch.tensor(
            tensor_send_to_next.dim(), device=self.device, dtype=torch.int64
        )
        send_next_dim_op = torch.distributed.P2POp(
            torch.distributed.isend,
            send_next_dim_tensor,
            self.next_rank,
            self.device_group,
        )
        ops.append(send_next_dim_op)

    if len(ops) > 0:
        # Launch the batched isend/irecv and wait for completion.
        reqs = torch.distributed.batch_isend_irecv(ops)
        for req in reqs:
            req.wait()

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Maybe something is wrong with NCCL? Does the L20 not support this?
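
A minimal standalone check (a sketch, not something shipped with xDiT) can help separate a PipeFusion bug from an NCCL or hardware problem: run the same batched isend/irecv pattern between two ranks outside of xfuser. If this also hangs on the L20 machine, the NCCL P2P path itself is the problem; if it completes, the deadlock is more likely in the pipeline logic.

    # minimal_p2p_test.py -- hypothetical standalone check, not part of xDiT.
    # Launch with: torchrun --nproc_per_node=2 minimal_p2p_test.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        world = dist.get_world_size()
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)

        # Same pattern as _communicate_shapes: batched isend/irecv of a
        # single int64, then a full device synchronize.
        send_buf = torch.full((1,), rank, device=device, dtype=torch.int64)
        recv_buf = torch.empty((1,), device=device, dtype=torch.int64)
        next_rank = (rank + 1) % world
        prev_rank = (rank - 1) % world

        ops = [
            dist.P2POp(dist.isend, send_buf, next_rank),
            dist.P2POp(dist.irecv, recv_buf, prev_rank),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        torch.cuda.synchronize()

        print(f"rank {rank}: got {recv_buf.item()} from rank {prev_rank}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()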

feifeibear commented Nov 26, 2024

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion, as it uses asynchronous P2P, which may have bugs leading to deadlocks. In particular, on the L20, cross-NUMA communication might be routed through the CPU's QPI.

@Lay2000 will help with debugging this issue.
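
A quick way to see whether the two GPUs can reach each other over CUDA peer-to-peer at all (again a sketch, not part of xDiT) is to ask CUDA directly; on PCIe-only machines where the GPUs hang off different NUMA nodes this often reports no peer access, which pushes NCCL onto a slower host-mediated path:

    # Hypothetical helper: report CUDA peer-to-peer capability between GPUs.
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'NO'}")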

feifeibear added the bug label on Nov 26, 2024
fy1214 commented Nov 26, 2024

Thanks for the help. If you have any progress, please let me know. I want to know the reason too.
