Hi! I use dist.all_to_all_single with torch.distributed and torch_ccl. I found that when the send buffer and recv buffer are large (several gigabytes), a segmentation fault occurs.
Here is the test code:
import torch
import torch.distributed as dist
import numpy as np
import os
import torch_ccl

def init_dist_group():
    world_size = int(os.environ.get("PMI_SIZE", -1))
    rank = int(os.environ.get("PMI_RANK", -1))
    dist_url = "env://"
    dist.init_process_group(backend="ccl", init_method="env://",
                            world_size=world_size, rank=rank)
    assert torch.distributed.is_initialized()
    print(f"dist_info RANK: {dist.get_rank()}, SIZE: {dist.get_world_size()}")
    # number of process in this MPI group
    world_size = dist.get_world_size()
    # mpi rank in this MPI group
    rank = dist.get_rank()
    return (rank, world_size)

# main function
if __name__ == "__main__":
    rank, world_size = init_dist_group()
    # allocate memory for send_buf and recv_buf
    data_size = 250000
    send_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    recv_buf = torch.zeros((data_size * (world_size - 1), 172), dtype=torch.float32)
    send_buf_shape = send_buf.shape
    recv_buf_shape = recv_buf.shape
    print("send_buf.shape = {}, recv_buf.shape = {}".format(send_buf.shape, recv_buf.shape), flush=True)
    send_splits = [data_size for i in range(world_size)]
    recv_splits = [data_size for i in range(world_size)]
    send_splits[rank] = 0
    recv_splits[rank] = 0
    print("rank = {}, send_splits = {}, recv_splits = {}".format(rank, send_splits, recv_splits), flush=True)
    assert(sum(send_splits) == send_buf_shape[0])
    assert(sum(recv_splits) == recv_buf_shape[0])
    assert(len(send_splits) == world_size)
    assert(len(recv_splits) == world_size)
    # all_to_all
    dist.all_to_all_single(recv_buf, send_buf, recv_splits, send_splits)
    print("finish!")
When data_size = 25000, it works well. But when I set data_size = 500000, I get the following error output:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 3564580 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 3564581 RUNNING AT g0118
= KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
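For reference, here is a quick back-of-the-envelope size calculation (my own addition, based only on the numbers above: world_size = 16, 172 float32 columns):

world_size = 16
cols = 172
bytes_per_elem = 4  # float32

for data_size in (25000, 250000, 500000):
    rows = data_size * (world_size - 1)
    total_bytes = rows * cols * bytes_per_elem
    per_rank_bytes = data_size * cols * bytes_per_elem
    print(f"data_size={data_size}: buffer = {total_bytes / 2**30:.2f} GiB, "
          f"per-rank split = {per_rank_bytes / 2**20:.1f} MiB")

# data_size=25000:  buffer = 0.24 GiB, per-rank split = 16.4 MiB
# data_size=250000: buffer = 2.40 GiB, per-rank split = 164.0 MiB
# data_size=500000: buffer = 4.81 GiB, per-rank split = 328.1 MiB

The failing data_size = 500000 case is one where a single buffer crosses 2^31 bytes (2 GiB); whether a 32-bit count/byte limit somewhere in the CCL/MPI path is actually what triggers the segfault is only my guess, not something confirmed here.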
The version of MPI: intel-mpi/2021.8
The version of PyTorch: 1.10.0 (CPU version)
The version of torch_ccl: 1.10.0
The number of MPI processes: 16, each MPI process is mapped to 1 socket.
Each compute node has 2 CPU sockets, and the total memory per compute node is 387 gigabytes. So this benchmark is run on 8 compute nodes (16 CPU sockets).
The command I use to launch MPI is: mpiexec.hydra -n 16 -ppn 2 python test_alltoall.py
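If the problem really is tied to the total message size (again, just an assumption on my part), one workaround I would try is chunking the exchange along the feature dimension so that each all_to_all_single call moves well under 2 GiB. A minimal sketch, reusing the variable names from the test code above; the chunk width is chosen arbitrarily for illustration:

# Hypothetical workaround sketch: split the 172-column feature dimension into
# chunks and run one all_to_all_single per chunk, so each call moves less data.
# send_buf, recv_buf, send_splits, recv_splits are the tensors/lists built above.
col_chunk = 43  # arbitrary; ~1.2 GiB per call at data_size = 500000
num_cols = send_buf.shape[1]
for start in range(0, num_cols, col_chunk):
    end = min(start + col_chunk, num_cols)
    # column slices are not contiguous, so copy them into contiguous buffers
    send_chunk = send_buf[:, start:end].contiguous()
    recv_chunk = torch.empty((recv_buf.shape[0], end - start), dtype=recv_buf.dtype)
    dist.all_to_all_single(recv_chunk, send_chunk, recv_splits, send_splits)
    recv_buf[:, start:end] = recv_chunk

Because the row splits are unchanged, each column chunk is routed to the same ranks as in the single big call, and reassembling the received slices column by column reproduces the full exchange.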