How to use torch.distributed.launch to run multiple node training with oneccl #48

Open
jenniew opened this issue Jun 23, 2023 · 2 comments

jenniew commented Jun 23, 2023

I'm trying to use torch.distributed.launch to launch multi-node training with oneccl.
On each node, I installed oneccl and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh.
The command on 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py

But on both nodes, it hung after these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi

I'm wondering how to use torch.distributed.launch to run multi-node training with oneccl. Are there any specific settings I need to configure?
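
(The demo.py itself is not included in the issue. For context, a minimal multi-node script for the "ccl" backend generally looks like the sketch below; the model and training loop are placeholders, and it relies on the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables that torch.distributed.launch sets.)

# Minimal sketch of a multi-node DDP script using the oneCCL backend.
# Placeholder model and data; the actual demo.py from this issue may differ.
import torch
import torch.distributed as dist
import torch.nn as nn
import oneccl_bindings_for_pytorch  # noqa: F401 - importing this registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so the env:// rendezvous needs no extra arguments.
    dist.init_process_group(backend="ccl", init_method="env://")
    rank = dist.get_rank()

    model = DDP(nn.Linear(10, 10))  # placeholder CPU model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(5):  # placeholder training loop
        x = torch.randn(32, 10)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()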

zhuhong61 (Contributor) commented

Hi @jenniew, does your workload run on CPU? Could you please try the "gloo" backend first and see whether it works? Thanks!
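
(For reference, trying this suggestion only requires changing the backend string passed to init_process_group; a minimal sketch, run with the same launch commands as above:)

import torch.distributed as dist

# "gloo" ships with PyTorch and runs on CPU, so no oneCCL import or setvars.sh is needed for this check.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with gloo")
dist.destroy_process_group()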


jenniew commented Jun 27, 2023

> Hi @jenniew, does your workload run on CPU? Could you please try the "gloo" backend first and see whether it works? Thanks!

Yes, my workload runs on CPU. I tried the "gloo" backend, and it works.
