allgather causes SEGFAULT #56

Open · Iain-S opened this issue Feb 9, 2024 · 6 comments

Iain-S commented Feb 9, 2024

Summary

Calling torch.distributed.all_gather() when using the 'ccl' backend results in a SEGFAULT if the tensors being gathered are larger than a few megabytes.

This problem also seems to occur with gather().

Steps to Reproduce

See my minimal reproducible example repo here: https://github.com/Iain-S/torch-ccl-segfault/tree/main

Using a tensor of around 11MiB is enough to cause a segfault.
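
For reference, the core of the reproduction looks roughly like the sketch below (illustrative only; the exact script, argument parsing and environment-variable handling are in the repo above):

import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  registers the 'xpu' device
import oneccl_bindings_for_pytorch  # noqa: F401  registers the 'ccl' backend

# Rank and world size are taken from Intel MPI's PMI_* variables here;
# the real script may discover them differently.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rank = int(os.environ.get("PMI_RANK", "0"))
world_size = int(os.environ.get("PMI_SIZE", "1"))

dist.init_process_group("ccl", rank=rank, world_size=world_size)

# 3_000_000 float32 elements is roughly 11 MiB, enough to trigger the crash here.
tensor = torch.zeros(3_000_000, dtype=torch.float32).to(f"xpu:{rank}")
gathered = [torch.empty_like(tensor) for _ in range(world_size)]
dist.all_gather(gathered, tensor)  # segfaults with the backtrace shown below
print(f"rank {rank}: gathered {len(gathered)} tensors")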

Expected Behaviour

I would not expect a SEGFAULT to be raised.

Actual Behaviour

I get the following output:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0xff00000002600000)

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
==== backtrace (tid:  43554) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000000cee73 __memmove_avx_unaligned_erms()  :0
 2 0x00000000004b80c3 zeKernelSetIndirectAccessTracing()  ???:0
 3 0x0000000000106cb8 zetGetMetricGroupExpProcAddrTable()  ???:0
 4 0x00000000000fc2da zetGetMetricGroupExpProcAddrTable()  ???:0
 5 0x00000000001967a4 zetGetMetricGroupExpProcAddrTable()  ???:0
 6 0x00000000001a5aa8 zetGetMetricGroupExpProcAddrTable()  ???:0
 7 0x00000000000f1f34 ???()  /lib64/libze_intel_gpu.so.1:0
 8 0x0000000000046103 std::vector<void*, std::allocator<void*> >::resize()  ???:0
 9 0x0000000000012975 zeGetFabricVertexExpProcAddrTable()  ???:0
10 0x000000000052cce9 ze_cmd_memory_copy::ze_call()  :0
11 0x000000000052d933 ze_copy_entry::init_ze_hook()  :0
12 0x000000000050d14d ze_base_entry::init()  :0
13 0x000000000050df3a ze_base_entry::init_entries()  :0
14 0x000000000050e25f ze_base_entry::start()  :0
15 0x0000000000470578 sched_entry::do_progress()  :0
16 0x0000000000482985 ccl_sched::do_progress()  :0
17 0x00000000003fe969 ccl_worker::process_sched_bin()  :0
18 0x00000000003fe540 ccl_worker::process_sched_queue()  :0
19 0x00000000003fd31b ccl_worker::do_work()  :0
20 0x00000000003f84bd ccl_executor::wait()  :0
21 0x00000000003061fb ccl_coll_create()  coll-f53f59.cpp:0
22 0x00000000003056a6 ccl_allgatherv_impl()  :0
23 0x000000000035aa1e ccl_comm::allgatherv_impl()  :0
24 0x000000000036d5cb ccl_comm::allgatherv()  :0
25 0x00000000004c4be1 ccl::v1::allgatherv()  ???:0
26 0x0000000000035fdd oneccl_bindings_for_pytorch::CollectiveAsyncWorkCCL<oneccl_bindings_for_pytorch::XPUCCLStubs::allgather_(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&, c10d::ProcessGroupCCL&)::{lambda(at::Tensor, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, ccl::v1::allgatherv_attr, c
31 0x0000000000030aee c10d::ProcessGroupCCL::allgather()  ???:0
32 0x000000000002a3fa c10d::ops::allgather_xpu_()  ???:0
33 0x000000000003cecf c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long> >, false>::call()  ???:0
34 0x0000000004c8d308 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long), void>::call()  :0
35 0x0000000004c98b45 c10d::ProcessGroup::allgather()  :0
36 0x0000000000b7e8b0 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::alloca
::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::T
ensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  :0
37 0x00000000003975c7 pybind11::cpp_function::dispatcher()  :0
38 0x00000000001639fb PyCFunction_Call()  ???:0
39 0x000000000019046b _PyObject_MakeTpCall()  ???:0
40 0x00000000001c8c17 PyEval_EvalCodeEx()  ???:0
41 0x000000000020bcb1 _PyEval_EvalFrameDefault()  ???:0
42 0x00000000001cb465 _PyFunction_Vectorcall()  ???:0
43 0x00000000001638bb PyObject_Call()  ???:0
44 0x0000000000209030 _PyEval_EvalFrameDefault()  ???:0
45 0x00000000001cb465 _PyFunction_Vectorcall()  ???:0
46 0x000000000020bcb1 _PyEval_EvalFrameDefault()  ???:0
47 0x00000000001cb465 _PyFunction_Vectorcall()  ???:0
48 0x0000000000206f4d _PyEval_EvalFrameDefault()  ???:0
49 0x00000000001c7713 PyList_SetSlice()  ???:0
50 0x00000000001c870f _PyEval_EvalCodeWithName()  ???:0
51 0x00000000001c8743 PyEval_EvalCode()  ???:0
52 0x0000000000279dad _PyImport_FixupBuiltin()  ???:0
53 0x000000000028db0a PyAST_CompileObject()  ???:0
54 0x000000000011d2f6 PyRun_String()  ???:0
55 0x000000000028e325 PyRun_SimpleFileExFlags()  ???:0
56 0x000000000028e7d2 Py_RunMain()  ???:0
57 0x000000000028e919 Py_BytesMain()  ???:0

Versions

  • Python 3.9
  • For Python package versions, see README in my repo.
  • CCL v2021.11.2
  • Running on an Intel(R) Data Center GPU Max 1550
akashdhamasia12 commented

Hello, I am not able to reproduce the segmentation fault. I activated the Intel® oneAPI env (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&distributions=offline)

Installed dependencies in a conda env (conda create -n ipex21 python=3.9):

python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

python -m pip install oneccl_bind_pt==2.1.100+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/us/oneccl-bind-pt/

This is my output:

(ipex21) adhamasi@sdp125072:~/torch-ccl-segfault$ mpirun -n 2 -l python -u allgather.py ccl 3_000_000 xpu
[1] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[1] warn(
[0] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[0] warn(
[0] My guessed rank = 0
[1] My guessed rank = 1
[1] Initialising process group with backend ccl
[0] Initialising process group with backend ccl
[1] TSHAPE=[3000000]
[0] TSHAPE=[3000000]
[0] moving to xpu:0
[1] moving to xpu:1
[0] tensor size 2: 2.861MB
[0] tensor size 1: 11.444MB
[0] making space
[1] tensor size 2: 2.861MB
[1] tensor size 1: 11.444MB
[1] making space
[0] gathering
[1] gathering
[0] gathered
[1] gathered

Could you please let me know which version of the Intel® oneAPI Base Toolkit you are using?

Iain-S (Author) commented Mar 15, 2024

Hi @akashdhamasia12, thanks for running it on your system.

Could you check with progressively larger tensors (4_000_000, 5_000_000 and up) to see whether there is some point after which it segfaults?

> Could you please let me know which version of the Intel® oneAPI Base Toolkit you are using?

I believe it is 2024.0.0.

Edit: the specific versions of the oneAPI components are:

intel-oneapi-ccl/2021.11.1
intel-oneapi-compilers/2024.0.0
intel-oneapi-dpl/2022.3.0
intel-oneapi-inspector/2024.0
intel-oneapi-mkl/2024.0.0
intel-oneapi-mpi/2021.11.0
intel-oneapi-tbb/2021.11.0

akashdhamasia12 commented

Hi, I tried up to 1000_000_000 and still can't reproduce it; you can check the logs below:

(ipex21) adhamasi@sdp716089:/torch-ccl-segfault$ mpirun -n 2 -l python -u allgather.py ccl 50_000_000 xpu
[1] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[1] warn(
[0] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[0] warn(
[0] My guessed rank = 0
[1] My guessed rank = 1
[0] Initialising process group with backend ccl
[1] Initialising process group with backend ccl
[1] TSHAPE=[50000000]
[0] TSHAPE=[50000000]
[1] moving to xpu:1
[0] moving to xpu:0
[0] tensor size 2: 47.684MB
[0] tensor size 1: 190.735MB
[0] making space
[1] tensor size 2: 47.684MB
[1] tensor size 1: 190.735MB
[1] making space
[0] gathering
[1] gathering
[1] gathered
[0] gathered
(ipex21) adhamasi@sdp716089:/torch-ccl-segfault$
(ipex21) adhamasi@sdp716089:/torch-ccl-segfault$ mpirun -n 2 -l python -u allgather.py ccl 100_000_000 xpu
[0] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[0] warn(
[1] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[1] warn(
[0] My guessed rank = 0
[1] My guessed rank = 1
[0] Initialising process group with backend ccl
[1] Initialising process group with backend ccl
[0] TSHAPE=[100000000]
[1] TSHAPE=[100000000]
[0] moving to xpu:0
[1] moving to xpu:1
[1] tensor size 2: 95.367MB
[1] tensor size 1: 381.470MB
[1] making space
[0] tensor size 2: 95.367MB
[0] tensor size 1: 381.470MB
[0] making space
[1] gathering
[0] gathering
[0] gathered
[1] gathered
(ipex21) adhamasi@sdp716089:/torch-ccl-segfault$ mpirun -n 2 -l python -u allgather.py ccl 1000_000_000 xpu
[0] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[0] warn(
[1] /nfs/site/home/adhamasi/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[1] warn(
[0] My guessed rank = 0
[1] My guessed rank = 1
[0] Initialising process group with backend ccl
[1] Initialising process group with backend ccl
[1] TSHAPE=[1000000000]
[0] TSHAPE=[1000000000]
[0] moving to xpu:0
[1] moving to xpu:1
[0] tensor size 2: 953.674MB
[0] tensor size 1: 3814.697MB
[0] making space
[1] tensor size 2: 953.674MB
[1] tensor size 1: 3814.697MB
[1] making space
[0] gathering
[1] gathering
[0] gathered
[1] gathered

akashdhamasia12 commented

Hi, the Intel Max Series 1550 XPU contains 2 tiles per device, and each tile can do processing independently. If your node contains n XPUs, you can spawn n×2 processes to utilize all the tiles.

Could you also please try setting the affinity flags, depending on the number of tiles you are using, before you run your application? For example, for 2 tiles (1 GPU):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1

For 4 tiles (2 GPUs):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1

And so on.
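
If it helps, the mask for N GPUs can be generated with a small Python snippet like this (just an illustrative helper, assuming 2 tiles per GPU as described above):

num_gpus = 2  # set to the number of GPUs you want to use
mask = ",".join(f"{gpu}.{tile}" for gpu in range(num_gpus) for tile in (0, 1))
print(mask)  # prints "0.0,0.1,1.0,1.1" for num_gpus = 2, to pass to ZE_AFFINITY_MASK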

This is my log for 1 GPU (2 tiles):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1

(ipex21) [hpcdham1@pvc-s-191 torch-ccl-segfault]$ mpirun -n 2 -l python -u allgather.py ccl 100_000_000 xpu
[0] /home/hpcdham1/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[0] warn(
[1] /home/hpcdham1/miniconda3/envs/ipex21/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
[1] warn(
[1] My guessed rank = 1
[0] My guessed rank = 0
[0] Initialising process group with backend ccl
[1] Initialising process group with backend ccl
[1] TSHAPE=[100000000]
[0] TSHAPE=[100000000]
[0] moving to xpu:0
[1] moving to xpu:1
[1] tensor size 2: 95.367MB
[1] tensor size 1: 381.470MB
[1] making space
[0] tensor size 2: 95.367MB
[0] tensor size 1: 381.470MB
[0] making space
[1] gathering
[0] gathering
[1] MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
[0] MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
[0] gathered
[1] gathered

Iain-S (Author) commented Apr 2, 2024

> Could you also please try setting the affinity flags, depending on the number of tiles you are using, before you run your application?

Thanks for the tips. I shall give it a go this week.

Iain-S (Author) commented Apr 3, 2024

@akashdhamasia12

Thanks, I can confirm that the segfault does not occur with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE. I have checked with a range of tensor sizes and a range of process counts.

Do you have any idea why ZE_FLAT_DEVICE_HIERARCHY=FLAT or ZE_FLAT_DEVICE_HIERARCHY=COMBINED causes an error that only occurs over a certain tensor size? From a reading of the docs, I can't see why the following (or similar) shouldn't allow PyTorch to gather across two tiles:

sbatch script

#SBATCH --nodes=1
#SBATCH --gpus-per-node=4

export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export ZE_AFFINITY_MASK=0,1,2,3,4,5,6,7

mpirun -n 2 python allgather.py

Do you think this is intended behaviour that I have misunderstood from the docs, or is it a bug? In particular, I'm curious why it works up to a certain size but not above.
