Problem

When using `oneccl_bindings_for_pytorch` together with `intel_extension_for_pytorch` (IPEX) with Intel GPU support, the order of the import statements matters for correct functionality, and this does not appear to be documented in the repository or anywhere else I have found.
`intel_extension_for_pytorch` must be imported before `oneccl_bindings_for_pytorch`; otherwise the GPU collectives are not recognized.
Minimum example to reproduce
Below is a minimal working example that demonstrates the error: `oneccl_bindings_for_pytorch` is imported before IPEX, and the collective fails with an error saying that `allgather` is not implemented on `[xpu]`. The script is launched with `mpirun -n 4 -genvall -bootstrap ssh python ccl_test.py`.
```python
import os

import torch
import torch.distributed as dist

# NOTE: wrong order on purpose -- importing the oneCCL bindings
# before IPEX reproduces the error
import oneccl_bindings_for_pytorch
import intel_extension_for_pytorch as ipex

rank = int(os.environ["PMI_RANK"])
world_size = int(os.environ["PMI_SIZE"])
torch.manual_seed(rank)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "21616"

dist.init_process_group(backend="ccl")

# generate random data on XPU
data = torch.rand(16, 8, device=f"xpu:{rank}")
if dist.get_rank() == 0:
    print(f"Initializing XPU data for rank {rank}")
    print(data)
    print(f"Performing all reduce for {world_size} ranks")

dist.all_reduce(data)
dist.barrier()

if dist.get_rank() == 0:
    print("All reduce done")
    print(data)
```
This also occurs with other collectives (e.g. `allgather`). The code runs successfully if you import IPEX first, followed by the oneCCL bindings.
Proposed solution
Please add documentation about this behavior: the ordering requirement is actually expected, since both IPEX and oneCCL patch `torch` dynamically at import time, but it is not documented and may confuse users.
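Until documentation exists, a small runtime check can catch the wrong ordering early. This is only a sketch with a helper name I made up (`check_ccl_import_order`); it relies on `sys.modules` preserving insertion order (guaranteed in CPython 3.7+):

```python
import sys


def check_ccl_import_order(module_names=None):
    """Return False if oneccl_bindings_for_pytorch was imported before
    intel_extension_for_pytorch; True otherwise (including when either
    module has not been imported at all).

    module_names defaults to sys.modules, whose keys are in import order.
    """
    names = list(sys.modules if module_names is None else module_names)
    try:
        ipex_pos = names.index("intel_extension_for_pytorch")
        ccl_pos = names.index("oneccl_bindings_for_pytorch")
    except ValueError:
        # One of the two modules is absent; nothing to check.
        return True
    return ipex_pos < ccl_pos


# Example with explicit module lists (as sys.modules would record them):
print(check_ccl_import_order(
    ["torch", "intel_extension_for_pytorch", "oneccl_bindings_for_pytorch"]))  # True
print(check_ccl_import_order(
    ["torch", "oneccl_bindings_for_pytorch", "intel_extension_for_pytorch"]))  # False
```

Calling `check_ccl_import_order()` with no argument right after the imports (and raising or warning on `False`) would turn the silent mis-ordering into an immediate, explicit failure.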