Segmentation fault when the size of send buffer and recv buffer is large #49

zhuangbility111 opened this issue Jul 6, 2023 · 0 comments

zhuangbility111 commented Jul 6, 2023

Hi! I am using dist.all_to_all_single with torch.distributed and torch_ccl. I found that when the send buffer and recv buffer are large (several gigabytes), a segmentation fault occurs.
Here is the test code:

import torch
import torch.distributed as dist
import os
import torch_ccl

def init_dist_group():
    # read rank and world size from the Intel MPI (PMI) environment
    world_size = int(os.environ.get("PMI_SIZE", -1))
    rank = int(os.environ.get("PMI_RANK", -1))
    dist.init_process_group(backend="ccl", init_method="env://",
                            world_size=world_size, rank=rank)
    assert torch.distributed.is_initialized()
    print(f"dist_info RANK: {dist.get_rank()}, SIZE: {dist.get_world_size()}")
    # number of process in this MPI group
    world_size = dist.get_world_size() 
    # mpi rank in this MPI group
    rank = dist.get_rank()
    return (rank, world_size)

# main function
if __name__ == "__main__":
    rank, world_size = init_dist_group()

    # allocate memory for send_buf and recv_buf
    data_size = 250000
    send_buf = torch.zeros((data_size * (world_size-1), 172), dtype=torch.float32)
    recv_buf = torch.zeros((data_size * (world_size-1), 172), dtype=torch.float32)
    send_buf_shape = send_buf.shape
    recv_buf_shape = recv_buf.shape

    print("send_buf.shape = {}, recv_buf.shape = {}".format(send_buf.shape, recv_buf.shape), flush=True)

    send_splits = [data_size for i in range(world_size)]
    recv_splits = [data_size for i in range(world_size)]
    send_splits[rank] = 0
    recv_splits[rank] = 0

    print("rank = {}, send_splits = {}, recv_splits = {}".format(rank, send_splits, recv_splits), flush=True)

    assert(sum(send_splits) == send_buf_shape[0])
    assert(sum(recv_splits) == recv_buf_shape[0])
    assert(len(send_splits) == world_size)
    assert(len(recv_splits) == world_size)

    # all_to_all
    dist.all_to_all_single(recv_buf, send_buf, recv_splits, send_splits)

    print("finish!")

When data_size = 25000, it works well. But when I set data_size = 500000, I get the following error output:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 3564580 RUNNING AT g0118
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 11 PID 3564581 RUNNING AT g0118
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

The version of MPI: intel-mpi/2021.8
The version of PyTorch: 1.10.0 (CPU version)
The version of torch_ccl: 1.10.0
The number of MPI processes: 16, each mapped to one socket.
Each compute node has 2 CPU sockets and 387 GB of memory in total, so this benchmark runs on 8 compute nodes (16 sockets).
The command I use to launch MPI is: mpiexec.hydra -n 16 -ppn 2 python test_alltoall.py
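For reference, here is a rough sketch of the buffer sizes involved, assuming float32 (4-byte) elements and the world_size = 16, feature dimension 172 setup from the test code above. The comparison against INT_MAX is only my guess about where a 32-bit count might overflow inside the MPI/CCL stack, not a confirmed cause:

# rough byte-size check for the send/recv buffers in the test code
INT_MAX = 2**31 - 1
world_size = 16
feature_dim = 172
bytes_per_elem = 4  # float32

for data_size in (25000, 250000, 500000):
    total_bytes = data_size * (world_size - 1) * feature_dim * bytes_per_elem
    per_peer_bytes = data_size * feature_dim * bytes_per_elem
    print(f"data_size={data_size}: buffer = {total_bytes / 2**30:.2f} GiB, "
          f"exceeds INT_MAX bytes: {total_bytes > INT_MAX}, "
          f"per-peer split = {per_peer_bytes / 2**30:.2f} GiB")

With data_size = 25000 each buffer is about 0.24 GiB, while 250000 and 500000 put it at roughly 2.4 GiB and 4.8 GiB, i.e. above 2^31 - 1 bytes, which is why I suspect the failure is related to the total buffer size rather than to the split layout.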
