Import error after building with pip #59

Open
suyashbakshi opened this issue Mar 20, 2024 · 0 comments
I built torch-ccl using the pip command shown in the README: python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu

However, when trying to import it, I get an error:

$ python3.11
Python 3.11.5 (main, Sep 06 2023, 11:21:05) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/etc/pythonstart", line 7, in <module>
    import readline
ModuleNotFoundError: No module named 'readline'

>>> import oneccl_bindings_for_pytorch
terminate called after throwing an instance of 'c10::Error'
  what():
Mismatch in kernel C++ signatures
  operator: c10d::allreduce_(Tensor[] tensors, __torch__.torch.classes.c10d.ProcessGroup process_group, __torch__.torch.classes.c10d.ReduceOp reduce_op, Tensor? sparse_indices, int timeout) -> (Tensor[], __torch__.torch.classes.c10d.Work)
    registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:10
  kernel 1: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, c10::optional<at::Tensor> const&, long)
    dispatch key: CPU
    registered at /build/pytorch/torch/csrc/distributed/c10d/Ops.cpp:501
  kernel 2: std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, c10::intrusive_ptr<c10d::ReduceOp, c10::detail::intrusive_target_default_null_type<c10d::ReduceOp> > const&, long)
    dispatch key: HIP
    registered at /build/frameworks.ai.pytorch.torch-ccl/src/ProcessGroupCCL.cpp:89

Exception raised from registerKernel at /build/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:120 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9e77527a89 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f9e774e11d4 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::impl::OperatorEntry::registerKernel(c10::Dispatcher const&, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x222 (0x7f9e78b35352 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10::Dispatcher::registerImpl(c10::OperatorName, c10::optional<c10::DispatchKey>, c10::KernelFunction, c10::optional<c10::impl::CppSignature>, std::unique_ptr<c10::FunctionSchema, std::default_delete<c10::FunctionSchema> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x171 (0x7f9e78b2a191 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & + 0x38e (0x7f9e78b6465e in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x31be5 (0x7f9dd1d06be5 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #6: torch::detail::TorchLibraryInit::TorchLibraryInit(torch::Library::Kind, void (*)(torch::Library&), char const*, c10::optional<c10::DispatchKey>, char const*, unsigned int) + 0xf1 (0x7f9dd1d09f71 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #7: <unknown function> + 0x29842 (0x7f9dd1cfe842 in /nfs/home/pytorch_ipex/lib64/python3.11/site-packages/oneccl_bindings_for_pytorch/lib/liboneccl_bindings_for_pytorch.so)
frame #8: <unknown function> + 0x111da (0x7f9e8f4061da in /lib64/ld-linux-x86-64.so.2)
frame #9: <unknown function> + 0x112f6 (0x7f9e8f4062f6 in /lib64/ld-linux-x86-64.so.2)
frame #10: _dl_catch_exception + 0x50 (0x7f9e8e95a11e in /lib64/libc.so.6)
frame #11: <unknown function> + 0x155d6 (0x7f9e8f40a5d6 in /lib64/ld-linux-x86-64.so.2)
frame #12: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #13: <unknown function> + 0x14e0b (0x7f9e8f409e0b in /lib64/ld-linux-x86-64.so.2)
frame #14: <unknown function> + 0x13b6 (0x7f9e8e6013b6 in /lib64/libdl.so.2)
frame #15: _dl_catch_exception + 0xbf (0x7f9e8e95a18d in /lib64/libc.so.6)
frame #16: _dl_catch_error + 0x31 (0x7f9e8e95a21f in /lib64/libc.so.6)
frame #17: <unknown function> + 0x1ba5 (0x7f9e8e601ba5 in /lib64/libdl.so.2)
frame #18: dlopen + 0x73 (0x7f9e8e601481 in /lib64/libdl.so.2)
<omitting python frames>
frame #56: __libc_start_main + 0xef (0x7f9e8e8392bd in /lib64/libc.so.6)
frame #57: _start + 0x2c (0x560c8259e7aa in python3.11)

Aborted (core dumped)

Software Details:

  • Python 3.11
  • OneCCL 2021.11.1
  • torch 2.1.0a0+cxx11.abi
  • intel-extension-for-pytorch 2.1.10+xpu
  • oneccl-bind-pt 2.0.100+gpu
  • oneapi/release/2023.12.15.001
  • intel_compute_runtime/release/stable-736.25

I suspect the package versions I have installed are not compatible with each other. Any suggestions would be helpful. Thanks!
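To make the version suspicion easier to check, here is a minimal sketch that reads the installed versions from package metadata, so it never imports the module that aborts. It assumes the three packages are expected to share the same major.minor version as torch; that rule is an assumption for illustration, not something confirmed by the maintainers. The helper names (major_minor, find_mismatches, installed_versions) are hypothetical.

```python
import re
from importlib import metadata

# Distribution names taken from the "Software Details" list above.
PACKAGES = ("torch", "intel-extension-for-pytorch", "oneccl-bind-pt")


def major_minor(version):
    """Return (major, minor) parsed from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_mismatches(versions, reference="torch"):
    """List packages whose major.minor differs from the reference package.

    Assumption: matching major.minor is the compatibility rule.
    """
    expected = major_minor(versions[reference])
    return [
        name
        for name, version in versions.items()
        if name != reference and major_minor(version) != expected
    ]


def installed_versions(packages=PACKAGES):
    """Read versions from package metadata, without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment
    return found


if __name__ == "__main__":
    versions = installed_versions()
    for name, version in versions.items():
        print(f"{name}: {version}")
    if "torch" in versions:
        print("mismatched with torch:", find_mismatches(versions))
```

Fed the versions listed above, find_mismatches flags only oneccl-bind-pt (2.0 against torch 2.1). Under the stated assumption that would be consistent with the signature mismatch in the error, though it is not a confirmed diagnosis.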
