How to use torch.distributed.launch to run multiple node training with oneccl #48

Open
jenniew opened this issue Jun 23, 2023 · 2 comments

jenniew commented Jun 23, 2023

I'm trying to use torch.distributed.launch to launch multi-node training with oneccl.
On each node, I installed oneccl and sourced $oneccl_bindings_for_pytorch_path/env/setvars.sh.
The command on 1st node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=0 demo.py
The command on 2nd node is:
CCL_WORKER_COUNT=1 python -m torch.distributed.launch --master_addr=172.168.0.201 --nproc_per_node=1 --nnodes=2 --node_rank=1 demo.py

But on both nodes, it hung after these messages:
2023-06-23 03:36:46,458 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-23 03:36:46,520 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
point 0
point 1
point 2
point 2.1
point 2.2
2023:06:23-03:36:46:(3742406) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi

I'm wondering how to use torch.distributed.launch to run multi-node training with oneccl. Are there any specific settings I need to configure?
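
(The demo.py itself is not included in the issue. For context, a minimal multi-node script for the "ccl" backend generally looks like the sketch below; the model and training loop are placeholders, and it relies on the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables that torch.distributed.launch sets.)

# Minimal sketch of a multi-node DDP script using the oneCCL backend.
# Placeholder model and data; the actual demo.py from this issue may differ.
import torch
import torch.distributed as dist
import torch.nn as nn
import oneccl_bindings_for_pytorch  # noqa: F401 - importing this registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so the env:// rendezvous needs no extra arguments.
    dist.init_process_group(backend="ccl", init_method="env://")
    rank = dist.get_rank()

    model = DDP(nn.Linear(10, 10))  # placeholder CPU model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(5):  # placeholder training loop
        x = torch.randn(32, 10)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if rank == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()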

zhuhong61 (Contributor) commented

Hi @jenniew, does your workload run on CPU? Could you please try the "gloo" backend first and see whether it works? Thanks!
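
(For reference, trying this suggestion only requires changing the backend string passed to init_process_group; a minimal sketch, run with the same launch commands as above:)

import torch.distributed as dist

# "gloo" ships with PyTorch and runs on CPU, so no oneCCL import or setvars.sh is needed for this check.
dist.init_process_group(backend="gloo", init_method="env://")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with gloo")
dist.destroy_process_group()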


jenniew commented Jun 27, 2023

> Hi @jenniew, does your workload run on CPU? Could you please try the "gloo" backend first and see whether it works? Thanks!

Yes, my workload runs on CPU. I tried the "gloo" backend, and it works.
