
Ordering of Intel extension imports not documented #44

Open

laserkelvin opened this issue Mar 2, 2023 · 2 comments
Problem

When using oneccl_bindings_for_pytorch together with intel_extension_for_pytorch (with Intel GPU support), the ordering of the import statements matters for correct functionality, and it does not appear to be documented in this repository or anywhere else I could find.

intel_extension_for_pytorch needs to be imported before oneccl_bindings_for_pytorch, otherwise the GPU collectives will not be recognized.

Minimum example to reproduce

Below is a minimal working example that demonstrates the error: oneccl_bindings_for_pytorch is imported before IPEX, and the script throws an error saying that allreduce isn't implemented on backend [xpu]. The script is launched with mpirun -n 4 -genvall -bootstrap ssh python ccl_test.py.

import os

import torch
import torch.distributed as dist

# NOTE: the oneCCL bindings are imported *before* IPEX here; this is the
# ordering that triggers the error shown below
import oneccl_bindings_for_pytorch
import intel_extension_for_pytorch as ipex

rank = int(os.environ["PMI_RANK"])
world_size = int(os.environ["PMI_SIZE"])

torch.manual_seed(rank)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "21616"

dist.init_process_group(backend="ccl")

# generate random data on XPU
data = torch.rand(16, 8, device=f"xpu:{rank}")
if dist.get_rank() == 0:
    print(f"Initializing XPU data for rank {rank}")
    print(data)
    print(f"Performing all reduce for {world_size} ranks")

dist.all_reduce(data)
dist.barrier()
if dist.get_rank() == 0:
    print(f"All reduce done")
    print(data)

The error:

Performing all reduce for 4 ranks
Traceback (most recent call last):
  File "ccl_test.py", line 42, in <module>
    dist.all_reduce(data)
  File ".../lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: oneccl_bindings_for_pytorch: allreduce isn't implementd on backend [xpu].

This will also trigger for other collectives (e.g. allgather). The code will run successfully if you import IPEX first, followed by oneCCL.
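For reference, here is a minimal sketch of the working ordering (same environment setup and mpirun launch as in the reproducer above; the setup lines are elided for brevity):

import os

import torch
import intel_extension_for_pytorch as ipex  # IPEX imported right after torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch          # oneCCL bindings imported after IPEX

# ... same rank/world_size setup and dist.init_process_group(backend="ccl") as above;
# with this ordering the XPU collectives are registered and all_reduce succeeds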

Proposed solution

Please add documentation about this behavior: it is actually expected, since IPEX and oneCCL both modify torch dynamically at import time, but it is not documented and may confuse users.

@gujinghui
Contributor

@jingxu10 @tye1 pls help.

@tye1
Contributor

tye1 commented Aug 21, 2023

Thanks @laserkelvin. It has been documented on the IPEX side, see https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html: "Note: Please import intel_extension_for_pytorch right after import torch, prior to importing other packages."

We will update the torch-ccl README to emphasize this requirement too.
