Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DDP(model) gets stocked in a cluster When run Demo.py manually #46

Open
leonardozcm opened this issue May 5, 2023 · 2 comments
Open

Comments

@leonardozcm
Copy link

leonardozcm commented May 5, 2023

Torch/torch-ccl/ipex version 1.13.0
cluster node: 2
World_size: 2
All nodes have password-less connections set, and mpirun works well as the readme says:

mpirun -f ./hosts -n 2 -ppn 1 -genv OMP_NUM_THREADS=24 python demo.py 

And I try to run it manually by start training in both of the nodes:

# in node 0
RANK=0 WORLD_SIZE=2 python demo.py
# in node 1
RANK=1 WORLD_SIZE=2 python demo.py

This will stock at DDP(model):

/home/cpx/anaconda3/envs/bigdl_test/lib/python3.7/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libc10_cuda.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
1 2
2023-05-05 16:13:22,973 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2023-05-05 16:13:22,984 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023:05:05-16:13:30:(3185397) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi

This will not happen if I set dist.init_process_group(backend='gloo')

@leonardozcm
Copy link
Author

@abhilash1910
Copy link
Contributor

@leonardozcm What backend are you using while initializing torch.init_distributed() ? The recommended backend is "ccl" and judging by the error , you might have set backend="nccl" which is why it is querying for libc10_cuda.so
Could you share a snippet /reproducer is possible?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants