Missing oneCCL libs in 1.13.100+gpu #43

Open
robogast opened this issue Feb 27, 2023 · 1 comment

robogast commented Feb 27, 2023

Hi! I've installed oneccl_bindings_for_pt==1.13.100+gpu from https://developer.intel.com/ipex-whl-stable-xpu, but after installing I get a "libccl.so.1 not found" error:

$ python
Python 3.10.4 (main, Oct 26 2022, 02:21:10) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import oneccl_bindings_for_pytorch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/gpfs/home5/robertsc/2D-VQ-AE-2/.venv/py310-XPU/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/__init__.py", line 26, in <module>
    from . import _C as ccl_lib
ImportError: libccl.so.1: cannot open shared object file: No such file or directory
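
For anyone reproducing this, the loader failure can be isolated from the Python package itself. The following is a minimal sketch, not part of the original report: the helper name `can_load` is hypothetical, and it only asks the dynamic loader whether a given soname resolves in the current environment.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can resolve and open `soname`.

    Hypothetical helper for illustration; it mirrors what happens when
    the extension module's dependency on libccl.so.1 is resolved.
    """
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False


if __name__ == "__main__":
    print("libccl.so.1 loadable:", can_load("libccl.so.1"))
```

If this prints False, the import error above comes from the environment's library search path rather than from the bindings' Python code.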

It seems like including oneCCL was forgotten in the latest build, because when I check a previous version (1.13.0+cpu) libccl.so.1 is included in oneccl_bindings_for_pytorch:

$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1.0 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so.1 matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872
lib/python3.10/site-packages/oneccl_bind_pt-1.13.0+cpu.dist-info/RECORD:oneccl_bindings_for_pytorch/lib/libccl.so.1.0,sha256=QsFq3umZ-WRQHD69SAZ9ilXdYcEwwZfBVS4b8P48KjQ,4544872
[robertsc@int4 py310-AMX]$ 

But in the 1.13.100+gpu version it's missing:

$ grep -r libccl.so.1
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
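
The same comparison can be scripted instead of grepping. This is a rough sketch under the assumption that the bindings are installed in the active environment's site-packages under `oneccl_bindings_for_pytorch` (the directory name shown in the grep output); `find_bundled_libs` is a hypothetical helper name.

```python
import sysconfig
from pathlib import Path


def find_bundled_libs(root, pattern="libccl.so*"):
    """Return sorted paths (relative to `root`) of files matching `pattern`."""
    root = Path(root)
    return sorted(str(p.relative_to(root)) for p in root.rglob(pattern))


if __name__ == "__main__":
    # Assumption: the package lives in this environment's site-packages.
    site_packages = Path(sysconfig.get_paths()["purelib"])
    pkg_dir = site_packages / "oneccl_bindings_for_pytorch"
    libs = find_bundled_libs(pkg_dir) if pkg_dir.is_dir() else []
    print("bundled oneCCL libraries:", libs or "none found")
```

An empty result for the 1.13.100+gpu wheel and a non-empty one for 1.13.0+cpu would match the grep output above.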

As a temporary fix, I can install oneccl-devel==2021.8.0 from PyPI, which still bundles it:

$ grep -r libccl.so.1
Binary file lib/cpu_gpu_dpcpp/libccl.so.1.0 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so.1 matches
Binary file lib/cpu_gpu_dpcpp/libccl.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch_xpu.so matches
Binary file lib/python3.10/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so matches
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu/libccl.so.1.0,sha256=Mb1k7Cr0EMbtwcPLheTP5ipnzpMYizaUkqVlKC7SJ-s,4847184
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
lib/python3.10/site-packages/oneccl_devel-2021.8.0.dist-info/RECORD:../../cpu_gpu_dpcpp/libccl.so.1.0,sha256=bYQ16wi5o1aOEmM-x3n2G1-3GVXjVzDsL15XpNRu5u0,7543928
Binary file lib/cpu/libccl.so.1.0 matches
Binary file lib/cpu/libccl.so.1 matches
Binary file lib/cpu/libccl.so matches

The default build option is to ship with oneCCL, so perhaps this flag was accidentally set incorrectly while building the latest version?
Could you please re-build with the latest oneCCL version? :)

Edit: the same applies to Intel MPI; its libs and bins are also missing.

Contributor

zhuhong61 commented Feb 28, 2023

Hi @robogast, thanks for the feedback.
For the torch-ccl 1.13.100+gpu release, we didn't bundle the oneCCL libraries into the torch-ccl package. We recommend that users use the oneCCL/MPI libraries from the oneAPI basekit directly.
To use oneCCL from the basekit:
source $basekit_root/ccl/latest/env/vars.sh
To use MPI from the basekit:
source $basekit_root/mpi/latest/env/vars.sh

For more details, see the "Install from Source" section of https://github.com/intel/torch-ccl/blob/master/README.md.
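
After sourcing the scripts above, it may help to confirm that the library is now visible before importing the bindings. A small sketch, assuming the `vars.sh` scripts expose the library directories through `LD_LIBRARY_PATH` (an assumption about the basekit environment, not something stated in this thread); `dirs_with_lib` is a hypothetical helper name.

```python
import os
from pathlib import Path


def dirs_with_lib(soname, search_path=None):
    """Return the search-path entries that contain a file named `soname`.

    `search_path` defaults to the current LD_LIBRARY_PATH (assumed to be
    the variable extended by the basekit's vars.sh scripts).
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    return [
        d for d in search_path.split(os.pathsep)
        if d and (Path(d) / soname).exists()
    ]


if __name__ == "__main__":
    hits = dirs_with_lib("libccl.so.1")
    print("libccl.so.1 found in:", hits or "no LD_LIBRARY_PATH entry")
```

Run it in the same shell session where the `vars.sh` scripts were sourced, since the environment changes do not carry over to other shells.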
