I installed torch-ccl using the pip command shown in the README:
python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu
However, when trying to import it, I get an error:
$ python3.11
Python 3.11.5 (main, Sep 06 2023, 11:21:05) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "/etc/pythonstart", line 7, in <module>
import readline
ModuleNotFoundError: No module named 'readline'
>>> import oneccl_bindings_for_pytorch
terminate called after throwing an instance of 'c10::Error'
what():
Mismatch in kernel C++ signatures
operator: c10d::allreduce_(Tensor[] tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, Tensor? sparse_indices, int timeout) -> (Tensor[], __torch__.torch.classes.c10d.Work)
registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:10
kernel 1: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, c10::optional<at::Tensor> const&, long)
dispatch key: CPU
registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:501
kernel 2: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, long)
dispatch key: HIP
registered at /build/frameworks.ai.pytorch.torch-ccl/src/ProcessGroupCCL.cpp:89
Exception raised from registerKernel at /build/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:120 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9e77527a89 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f9e774e11d4 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x222 (0x7f9e78b35352 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x171 (0x7f9e78b2a191 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & + 0x38e (0x7f9e78b6465e in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x31be5 (0x7f9dd1d06be5 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #6: torch::detail::TorchLibraryInit::TorchLibraryInit(torch::Library::Kind, void (*)(torch::Library&), char const*, c10::optional<c10::DispatchKey>, char const*, unsigned int) + 0xf1 (0x7f9dd1d09f71 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #7: <unknown function> + 0x29842 (0x7f9dd1cfe842 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #8: <unknown function> + 0x111da (0x7f9e8f4061da in /lib64/ld-linux-x86-64.so.2)
frame #9: <unknown function> + 0x112f6 (0x7f9e8f4062f6 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x50 (0x7f9e8e95a11e in /lib64/libc.so.6)
frame #11: <unknown function> + 0x155d6 (0x7f9e8f40a5d6 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #13: <unknown function> + 0x14e0b (0x7f9e8f409e0b in /lib64/ld-linux-x86-64.so.2)
frame #14: <unknown function> + 0x13b6 (0x7f9e8e6013b6 in /lib64/libdl.so.2)
frame #15: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #16: _dl_catch_error + 0x31 (0x7f9e8e95a21f in /lib64/libc.so.6)
frame #17: <unknown function> + 0x1ba5 (0x7f9e8e601ba5 in /lib64/libdl.so.2)
frame #18: dlopen + 0x73 (0x7f9e8e601481 in /lib64/libdl.so.2)
<omitting python frames>
frame #56: __libc_start_main + 0xef (0x7f9e8e8392bd in /lib64/libc.so.6)
frame #57: _start + 0x2c (0x560c8259e7aa in python3.11)
Aborted (core dumped)
Software Details:
Python 3.11
OneCCL 2021.11.1
torch 2.1.0a0+cxx11.abi
intel-extension-for-pytorch 2.1.10+xpu
oneccl-bind-pt 2.0.100+gpu
oneapi/release/2023.12.15.001
intel_compute_runtime/release/stable-736.25
I suspect the package versions I have installed are not compatible with each other. Any suggestions would be helpful. Thanks!
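The two kernel signatures in the trace differ by one argument: kernel 2, registered from ProcessGroupCCL.cpp, lacks the `c10::optional<at::Tensor> const&` parameter that corresponds to `Tensor? sparse_indices` in the operator schema. That is consistent with the binding having been compiled against a different PyTorch release than the one installed. As a rough way to check for this, the sketch below compares the major.minor version of each package against torch. It assumes the three packages are meant to share a major.minor version, which should be confirmed against the compatibility table in the torch-ccl README. The helper names `major_minor`, `find_mismatches`, and `installed_versions` are made up for illustration.

```python
import re
from importlib import metadata

# Distribution names as they appear in the "Software Details" list.
# Assumption: these are the names pip registered in this environment.
PACKAGES = ["torch", "intel-extension-for-pytorch", "oneccl-bind-pt"]


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.1.10+xpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(versions: dict[str, str]) -> dict[str, tuple[int, int]]:
    """Return the packages whose major.minor differs from torch's.

    Assumption: matching major.minor is the compatibility rule; the
    project's own compatibility table is the authoritative source.
    """
    base = major_minor(versions["torch"])
    return {
        name: major_minor(version)
        for name, version in versions.items()
        if major_minor(version) != base
    }


def installed_versions(names: list[str]) -> dict[str, str]:
    """Look up installed versions, skipping packages that are not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    versions = installed_versions(PACKAGES)
    if "torch" not in versions:
        print("torch is not installed in this environment")
    else:
        for name, mm in find_mismatches(versions).items():
            print(f"{name} is {mm[0]}.{mm[1]}, torch is {versions['torch']}")
```

Fed the versions from the list above, `find_mismatches` flags only oneccl-bind-pt (2.0 against torch 2.1). This reads package metadata only and does not import `oneccl_bindings_for_pytorch`, so it does not trigger the abort shown in the trace.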