
Deadlock attempting to do concurrent send, receive #72

Open

pspillai opened this issue Sep 24, 2024 · 2 comments

@pspillai
I am trying to implement a concurrent asynchronous send and receive between multiple processes. This results in deadlock. Minimal code to reproduce this is as follows:

import torch
import torch.nn.parallel
import torch.distributed as dist
import intel_extension_for_pytorch as ipex       # enables the xpu device
import oneccl_bindings_for_pytorch               # registers the 'ccl' backend
import os

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

print(os.environ['RANK'], os.environ['WORLD_SIZE'])
backend = 'ccl'
dist.init_process_group(backend)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d  my size = %d" % (my_rank, my_size))

dev = f"xpu:{my_rank}"
torch.xpu.set_device(my_rank)
A = torch.ones(1, 2, dtype=torch.float32).to(dev)
_ = A[0, 0].item()    # .item() forces synchronization with the device
B = torch.zeros(1, 2, dtype=torch.float32).to(dev)
_ = B[0, 0].item()

dist.barrier()

dist.all_reduce(A)

print("START")
# Each rank posts a send and a receive to the other rank, then waits on both.
o1 = dist.isend(A, 1 - my_rank)
o2 = dist.irecv(B, 1 - my_rank)
o1.wait()
o2.wait()

print("DONE")

Run with

mpirun -n 2 python -u test.py

It appears that the isend and irecv on each process are serialized. This particular example can complete if one process does the send first and the other does the recv first, but I think the transfers are still serialized in that case, so the two transfers are not concurrent.

I tried using batch_isend_irecv to define a list of transfers, but this resulted in the same deadlock; a sketch of that variant is below.
Without concurrent transfers, it is almost impossible to implement efficient distributed compute-and-shift algorithms, Cannon's algorithm, and the like.
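
The batched variant is roughly the following, using torch.distributed.P2POp with the same tensors and peer as in the repro above:

p2p_ops = [
    dist.P2POp(dist.isend, A, 1 - my_rank),   # post the send to the peer
    dist.P2POp(dist.irecv, B, 1 - my_rank),   # post the receive from the peer
]
reqs = dist.batch_isend_irecv(p2p_ops)
for req in reqs:
    req.wait()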

@gaopengff
Contributor

Currently torch-ccl only supports having one rank send while the other receives at a given time. If you change the code to

if my_rank == 0:
    o1 = dist.isend(A,1-my_rank)
    o1.wait()
else:
    o2 = dist.irecv(B,1-my_rank)
    o2.wait()

it will work. Did you run your test with CUDA's NCCL backend? If it works with CUDA, I think this is a design issue of torch-ccl.

@pspillai
Author

pspillai commented Oct 9, 2024

Yes, if the send and receive ordering is matched, it will work, but this causes the transmissions to be serialized, wasting half of the available bandwidth. (There should be no reason why the two transfers cannot be done concurrently).
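
For clarity, the matched ordering I mean is roughly the following; it completes, but the two transfers run back to back instead of overlapping:

# Matched ordering workaround: rank 0 posts its send first, rank 1 its recv first.
# This avoids the deadlock, but the two transfers happen one after the other.
if my_rank == 0:
    dist.isend(A, 1 - my_rank).wait()
    dist.irecv(B, 1 - my_rank).wait()
else:
    dist.irecv(B, 1 - my_rank).wait()
    dist.isend(A, 1 - my_rank).wait()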

I have not tested on nccl. However, looking at the sample code for torch.distributed.batch_isend_irecv (https://pytorch.org/docs/stable/distributed.html#torch.distributed.batch_isend_irecv) and the source code (https://pytorch.org/docs/stable/_modules/torch/distributed/distributed_c10d.html#batch_isend_irecv), it looks like batch_isend_irecv just calls the isend/irecv operations in the order provided, which in this example is the same for each rank, so I expect it to work fine on nccl.
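
If I am reading that source correctly, the non-CUDA path reduces to roughly the following (paraphrased, not the exact code): each P2POp is simply issued in list order.

# Rough paraphrase of the non-CUDA branch of batch_isend_irecv:
reqs = []
for p2p_op in p2p_op_list:
    work = p2p_op.op(p2p_op.tensor, p2p_op.peer, p2p_op.group, p2p_op.tag)
    if work:
        reqs.append(work)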

Not surprisingly, batch_isend_irecv locks up with this example using ccl.
