allgather causes SEGFAULT #56
Hello, I am not able to reproduce the segmentation fault. I activated the Intel® oneAPI environment (https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&distributions=offline) and installed the dependencies in a conda env (conda create -n ipex21 python=3.9):

python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
python -m pip install oneccl_bind_pt==2.1.100+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/us/oneccl-bind-pt/

This is my output:

(ipex21) adhamasi@sdp125072:~/torch-ccl-segfault$ mpirun -n 2 -l python -u allgather.py ccl 3_000_000 xpu

Could you please let me know which version of the Intel® oneAPI Base Toolkit you are using?
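As a quick sanity check of that environment before running the mpirun reproduction, something like the following sketch can confirm the XPU builds are active (this is my own illustration, not part of the thread; it assumes the IPEX XPU wheels installed above and the torch.xpu API they provide):

```python
# Minimal environment check (illustrative helper, not from the issue repo):
# confirms the XPU builds of PyTorch / IPEX / the oneCCL bindings import cleanly
# and that at least one XPU device is visible.
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device type
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the 'ccl' backend

print("torch:", torch.__version__)
print("ipex:", ipex.__version__)
print("xpu available:", torch.xpu.is_available())
print("xpu device count:", torch.xpu.device_count())
```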
Hi @akashdhamasia12, thanks for running it on your system. Could you check with progressively larger tensors?

I believe it is … Edit: and the specific versions of the oneAPI components are: …
Hi, I tried up to 1000_000_000 and still can't reproduce. Below you can check the logs:

(ipex21) adhamasi@sdp716089:
Hi, the Intel Max Series 1550 XPU contains 2 tiles per device, and both are capable of processing individually. If your node contains n XPUs, you can spawn n×2 processes to utilize all the tiles. Can you also please try setting the affinity/hierarchy flag, depending on the number of tiles you are using, before you run your application? For example, for 2 tiles (1 GPU):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE

For 4 tiles (2 GPUs):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE

and so on. This is my log for 1 GPU (2 tiles):

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
(ipex21) [hpcdham1@pvc-s-191 torch-ccl-segfault]$ mpirun -n 2 -l python -u allgather.py ccl 100_000_000 xpu
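To make the tile/device mapping concrete, here is a small sketch (my own illustration, not from the thread) of how each MPI rank can check how many XPU devices the runtime exposes under the current ZE_FLAT_DEVICE_HIERARCHY setting and pick one. In COMPOSITE mode each visible device is a whole card; in FLAT mode each tile appears as its own device. It assumes Intel MPI's PMI_RANK variable and the torch.xpu API provided by IPEX:

```python
# Sketch only: per-rank device selection on a multi-tile Max Series node.
# Assumes PMI_RANK is set by mpirun and that IPEX exposes the torch.xpu API.
import os
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device type

rank = int(os.environ.get("PMI_RANK", "0"))
n_devices = torch.xpu.device_count()  # cards under COMPOSITE, tiles under FLAT
device = torch.device(f"xpu:{rank % n_devices}")
torch.xpu.set_device(device)
print(f"rank {rank}: {n_devices} visible XPU device(s), using {device}")
```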
Thanks for the tips. I shall give it a go this week.
Thanks, I can confirm that the segfault does not occur with ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE.

Do you have any idea why it still occurs with the following sbatch script?

#SBATCH --nodes=1
#SBATCH --gpus-per-node=4

export ZE_FLAT_DEVICE_HIERARCHY=FLAT
export ZE_AFFINITY_MASK=0,1,2,3,4,5,6,7

mpirun -n 2 python allgather.py

Is it intended behaviour that I have misunderstood from the docs, or is it a bug, do you think? In particular, I'm curious why it would work up to a certain size but not above.
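One way to pin down the threshold that last question refers to (my own suggestion, not from the thread) is to sweep increasing tensor sizes inside the already-initialized process group and note the last size that succeeds. The sizes and names below are illustrative, and the snippet assumes the 'ccl' backend has been initialized and an XPU device selected:

```python
# Illustrative size sweep (not from the issue repo): assumes dist.init_process_group
# has already been called with the 'ccl' backend and a device has been selected.
import torch
import torch.distributed as dist

def sweep(device, sizes=(1_000_000, 3_000_000, 10_000_000, 30_000_000, 100_000_000)):
    rank, world = dist.get_rank(), dist.get_world_size()
    for numel in sizes:
        x = torch.full((numel,), float(rank), dtype=torch.float32, device=device)
        out = [torch.empty_like(x) for _ in range(world)]
        dist.all_gather(out, x)      # reportedly segfaults above some size in the failing setup
        torch.xpu.synchronize()
        if rank == 0:
            mib = 4 * numel / 2**20  # float32 payload per rank
            print(f"all_gather succeeded at {numel} elements (~{mib:.1f} MiB per rank)")
```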
Summary
Calling torch.distributed.all_gather() when using the 'ccl' backend results in a SEGFAULT if the tensors being gathered are larger than a few megabytes. This problem also seems to occur with gather().

Steps to Reproduce
See my minimal reproducible example repo here: https://github.com/Iain-S/torch-ccl-segfault/tree/main
Using a tensor of around 11MiB is enough to cause a segfault.
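For readers without access to the repo, the reproduction boils down to something like the sketch below. This is my approximation based on the command lines quoted in the thread (e.g. `mpirun -n 2 -l python -u allgather.py ccl 3_000_000 xpu`) and the usual torch-ccl rendezvous pattern; the actual allgather.py in the repo may differ:

```python
# Approximate reproduction sketch (the real allgather.py in the linked repo may differ).
# Usage: mpirun -n 2 -l python -u allgather.py ccl 3_000_000 xpu
import os
import sys
import torch
import intel_extension_for_pytorch  # noqa: F401  # 'xpu' device support
import oneccl_bindings_for_pytorch  # noqa: F401  # 'ccl' backend for torch.distributed
import torch.distributed as dist

backend, numel, device_type = sys.argv[1], int(sys.argv[2]), sys.argv[3]

# Map MPI launcher variables onto the env:// rendezvous expected by PyTorch.
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend=backend)
rank, world = dist.get_rank(), dist.get_world_size()

device = torch.device(f"xpu:{rank}") if device_type == "xpu" else torch.device(device_type)
x = torch.full((numel,), float(rank), dtype=torch.float32, device=device)
out = [torch.empty_like(x) for _ in range(world)]

dist.all_gather(out, x)  # reportedly segfaults for large numel with the 'ccl' backend
if rank == 0:
    print("all_gather finished:", [t[0].item() for t in out])
dist.destroy_process_group()
```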
Expected Behaviour
I would not expect a SEGFAULT to be raised.
Actual Behaviour
I get the following output:
Versions