I'm trying to use torch.distributed.launch to launch multi-node training with oneCCL.
On each node, I installed oneCCL and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh.
The command on the 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on the 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py
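For reference, demo.py follows the usual oneccl_bindings_for_pytorch pattern. The sketch below is a hypothetical reconstruction, assuming the "point N" prints in the log below mark progress through setup; the real script may differ:

import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # registers the 'ccl' backend with torch.distributed

print("point 0")
# torch.distributed.launch exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE,
# so no explicit init_method/rank/world_size arguments are needed here.
dist.init_process_group(backend="ccl")
print("point 1")
rank = dist.get_rank()
print("point 2")
x = torch.ones(4)
print("point 2.1")
print("point 2.2")
# First collective: oneCCL sets up its transport lazily here, which would match
# the CCL_WARN about ATL/OFI appearing right after "point 2.2" in the log.
dist.all_reduce(x)
print("done", rank, x)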
But on both nodes, it hangs after printing these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
I'm wondering how to use torch.distributed.launch to run multi-node training with oneCCL. Are there any specific settings that need to be configured?
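To rule out a rendezvous problem, a debugging snippet like the following (hypothetical, not part of demo.py) can print the environment variables that torch.distributed.launch exports; both nodes should agree on the master address and port (the default port is 29500 when --master_port is not given):

import os

# Exported by torch.distributed.launch before the script runs. Both nodes must
# see the same MASTER_ADDR/MASTER_PORT and consistent RANK/WORLD_SIZE values
# for dist.init_process_group to rendezvous. Depending on the PyTorch version,
# LOCAL_RANK may instead be passed as a --local_rank argument.
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var))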