
flux_example.py get stuck #363

Open · fy1214 opened this issue Nov 25, 2024 · 7 comments
Labels: bug (Something isn't working)

fy1214 commented Nov 25, 2024

I use this command:

torchrun --nproc_per_node=2 examples/flux_example.py --model /models/FLUX.1-dev --height 1024 --width 1024 --pipefusion_parallel_degree 2 --ulysses_degree 1 --ring_degree 1 --num_inference_steps 20 --warmup_steps 0 --prompt "A small dog"

but it gets stuck in _async_pipeline. This is what it looks like:

W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793]
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1125 17:34:30.620696 71014 site-packages/torch/distributed/run.py:793] *****************************************
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
WARNING 11-25 17:34:35 [args.py:320] Distributed environment is not initialized. Initializing...
DEBUG 11-25 17:34:35 [parallel_state.py:179] world_size=-1 rank=-1 local_rank=-1 distributed_init_method=env:// backend=nccl
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
INFO 11-25 17:34:35 [config.py:163] Pipeline patch number not set, using default value 2
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.46it/s]
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.47it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.71it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.98it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00, 6.85it/s]
WARNING 11-25 17:34:36 [runtime_state.py:63] Model parallel is not initialized, initializing...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_pipeline.py:290] Transformer backbone found, paralleling transformer...
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.10.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.11.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.12.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.13.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.14.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.15.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.16.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.17.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.19.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping transformer_blocks.18.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.20.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.21.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.0.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.22.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.1.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.23.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.2.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.24.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.3.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.25.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.4.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.26.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 1] Wrapping single_transformer_blocks.27.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.5.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.6.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.7.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.8.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_model.py:83] [RANK 0] Wrapping single_transformer_blocks.9.attn in model class FluxTransformer2DModel with xFuserAttentionWrapper
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
INFO 11-25 17:34:36 [base_pipeline.py:335] Scheduler found, paralleling scheduler...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.35it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.41s/it]
5%|███████▋

It always gets stuck at 5%, and when I dump the stack it looks like this:

Thread 71079 (active): "MainThread"
synchronize (torch/cuda/__init__.py:954)
_communicate_shapes (xfuser/core/distributed/group_coordinator.py:865)
_check_shape_and_buffer (xfuser/core/distributed/group_coordinator.py:796)
recv_next (xfuser/core/distributed/group_coordinator.py:938)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:572)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71098 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)

The other process was:
Thread 71080 (idle): "MainThread"
isend (torch/distributed/distributed_c10d.py:2062)
_pipeline_isend (xfuser/core/distributed/group_coordinator.py:968)
pipeline_isend (xfuser/core/distributed/group_coordinator.py:921)
_async_pipeline (xfuser/model_executor/pipelines/pipeline_flux.py:634)
__call__ (xfuser/model_executor/pipelines/pipeline_flux.py:319)
check_naive_forward_fn (xfuser/model_executor/pipelines/base_pipeline.py:186)
data_parallel_fn (xfuser/model_executor/pipelines/base_pipeline.py:166)
wrapper (xfuser/model_executor/pipelines/base_pipeline.py:218)
decorate_context (torch/utils/_contextlib.py:116)
main (flux_example.py:42)
<module> (flux_example.py:81)
Thread 71099 (idle): "Thread-1"
wait (threading.py:324)
wait (threading.py:600)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1009)
_bootstrap (threading.py:966)

Maybe something is going wrong. Please give me some help.

fy1214 commented Nov 25, 2024

I found out why it gets stuck: NCCL times out:
[rank0]:[E1125 17:44:44.587302788 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10, OpType=SEND, NumelIn=6291456, NumelOut=6291456, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank0]:[E1125 17:44:44.588313115 ProcessGroupNCCL.cpp:1785] [PG ID 4 PG GUID 11 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10, last enqueued NCCL work: 11, last completed NCCL work: 11.
[rank1]:[W1125 17:44:44.646952859 socket.cpp:462] [c10d] waitForInput: poll for socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) returned 0, likely a timeout
[rank1]:[W1125 17:44:44.647828424 socket.cpp:487] [c10d] waitForInput: socket SocketImpl(fd=21, addr=[::ffff:127.0.0.1]:39518, remote=[::ffff:127.0.0.1]:29500) timed out after 600000ms
0%| | 0/20 [10:00<?, ?it/s]
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: wait timeout after 600000ms, keys: //worker/attempt_0/default_pg/0//12//cuda//0:1
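
The 600000 ms in the watchdog message is the default 10-minute timeout PyTorch applies to NCCL process groups; the send itself is deadlocked, so the watchdog simply fires once that window expires. For illustration only (flux_example.py lets xfuser initialize the distributed environment, so this is not a command-line option of the example), the timeout is the argument normally passed when the process group is created, and lengthening it only buys time to attach a debugger or dump stacks:

    # Sketch, assuming direct control over init_process_group; the xfuser
    # example does not expose this knob. A longer NCCL watchdog window does
    # not fix the deadlock, it only delays the abort.
    from datetime import timedelta
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(minutes=30),  # NCCL default is 10 minutes
    )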

@yinfan98

same error

fy1214 commented Nov 25, 2024

same error

I used an A100 instead of the L20, and it succeeds now.

@feifeibear (Collaborator)

I suggest you check your NCCL environment settings related to PCIe,
for example, NCCL_P2P_DISABLE=0.
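
One note on the variable itself (not from this thread): NCCL_P2P_DISABLE=1 is the value that actually turns the CUDA peer-to-peer transport off; 0 is the default and leaves it on, so setting it to 0 changes nothing. To rule out a broken PCIe P2P path, the variable has to be set to 1 before the first NCCL communicator is created, either exported in the shell before torchrun or pinned at the very top of the script, as in this hypothetical snippet:

    # Hypothetical snippet, not part of xDiT: these must be set before any
    # NCCL communicator is created, e.g. at the very top of the worker script.
    import os

    os.environ["NCCL_P2P_DISABLE"] = "1"   # 1 disables CUDA P2P; 0 (the default) keeps it on
    os.environ["NCCL_DEBUG"] = "INFO"      # verbose NCCL logs (transport selection, rings)
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P,NET"

    # ...only after this, import and run the pipeline as flux_example.py does.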

fy1214 commented Nov 26, 2024

I suggest you check your NCCL environment settings related to PCIe, for example, NCCL_P2P_DISABLE=0.

I used NCCL_P2P_DISABLE=0 but it still gets stuck in the same place. I checked the code according to the stack information, and it looks like it is stuck in this operation:

    if recv_prev:
        # Post an async recv for the number of dimensions of the tensor
        # coming from the previous pipeline rank.
        recv_prev_dim_tensor = torch.empty(
            (1), device=self.device, dtype=torch.int64
        )
        recv_prev_dim_op = torch.distributed.P2POp(
            torch.distributed.irecv,
            recv_prev_dim_tensor,
            self.prev_rank,
            self.device_group,
        )
        ops.append(recv_prev_dim_op)

    if tensor_send_to_next is not None:
        # Send the number of dimensions of the outgoing tensor to the next
        # pipeline rank.
        send_next_dim_tensor = torch.tensor(
            tensor_send_to_next.dim(), device=self.device, dtype=torch.int64
        )
        send_next_dim_op = torch.distributed.P2POp(
            torch.distributed.isend,
            send_next_dim_tensor,
            self.next_rank,
            self.device_group,
        )
        ops.append(send_next_dim_op)

    if len(ops) > 0:
        # Launch the batched isend/irecv and wait for completion.
        reqs = torch.distributed.batch_isend_irecv(ops)
        for req in reqs:
            req.wait()

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Maybe something is wrong with NCCL? Does the L20 not support this?
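
A minimal standalone check (a sketch, not something shipped with xDiT) can help separate a PipeFusion bug from an NCCL or hardware problem: run the same batched isend/irecv pattern between two ranks outside of xfuser. If this also hangs on the L20 machine, the NCCL P2P path itself is the problem; if it completes, the deadlock is more likely in the pipeline logic.

    # minimal_p2p_test.py -- hypothetical standalone check, not part of xDiT.
    # Launch with: torchrun --nproc_per_node=2 minimal_p2p_test.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        world = dist.get_world_size()
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)

        # Same pattern as _communicate_shapes: batched isend/irecv of a
        # single int64, then a full device synchronize.
        send_buf = torch.full((1,), rank, device=device, dtype=torch.int64)
        recv_buf = torch.empty((1,), device=device, dtype=torch.int64)
        next_rank = (rank + 1) % world
        prev_rank = (rank - 1) % world

        ops = [
            dist.P2POp(dist.isend, send_buf, next_rank),
            dist.P2POp(dist.irecv, recv_buf, prev_rank),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        torch.cuda.synchronize()

        print(f"rank {rank}: got {recv_buf.item()} from rank {prev_rank}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()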

feifeibear commented Nov 26, 2024

    # To protect against race condition when using batch_isend_irecv().
    # should take this out once the bug with batch_isend_irecv is resolved.
    torch.cuda.synchronize()

Thank you for the assistance in debugging. I believe the issue is still related to the implementation of P2P in PipeFusion, as it uses asynchronous P2P, which may have bugs leading to deadlocks. In particular, on the L20, cross-NUMA communication might be routed through the CPU's QPI.

@Lay2000 will help with debugging this issue.
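
A quick way to see whether the two GPUs can reach each other over CUDA peer-to-peer at all (again a sketch, not part of xDiT) is to ask CUDA directly; on PCIe-only machines where the GPUs hang off different NUMA nodes this often reports no peer access, which pushes NCCL onto a slower host-mediated path:

    # Hypothetical helper: report CUDA peer-to-peer capability between GPUs.
    import torch

    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'NO'}")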

feifeibear added the bug label on Nov 26, 2024
fy1214 commented Nov 26, 2024

Thanks for the help. If you have any progress, please let me know. I want to know the reason too.
