This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

[BUG] Collective group's rank is incorrect #790

Open
2 tasks
ZYHowell opened this issue Dec 1, 2022 · 4 comments
Labels
good first issue (Good for newcomers), known bug (Something isn't working)

Comments

@ZYHowell
Collaborator

ZYHowell commented Dec 1, 2022

Background

Alpa initializes a collective group for each cross-mesh communication pair. The call stack to initialize a collective group is:

  • create_collective_group or init_collective_group in collective.py calls
  • GroupManager.create_collective_group in collective.py, which calls
  • NCCLGroup.__init__, which has two different implementations: one based on cupy, the other based on xla.

An NCCLGroup creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, the call eventually goes through the NCCLGroup. However, in the current implementation, we use node_rank * num_devices_per_node + local_offset to compute the rank of a local GPU with respect to the communication group (an example is here). This is correct in most cases, but when the send mesh has a different number of devices per node than the receive mesh, it is incorrect.
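As a worked illustration (the mesh shapes below are made up for the example, not taken from Alpa's code), suppose the group consists of a send node with 4 GPUs followed by a receive node with 2 GPUs:

```python
# Minimal sketch of the current rank computation, assuming a group built from
# a send mesh (node 0, 4 GPUs) and a receive mesh (node 1, 2 GPUs).
devices_per_node = {0: 4, 1: 2}  # hypothetical shapes; they differ across meshes

def current_rank(node_rank, local_offset):
    # Mirrors node_rank * num_devices_per_node + local_offset, where
    # num_devices_per_node is taken from the local node only.
    return node_rank * devices_per_node[node_rank] + local_offset

print([current_rank(0, i) for i in range(4)])  # [0, 1, 2, 3] -- correct
print([current_rank(1, i) for i in range(2)])  # [2, 3] -- should be [4, 5]
```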

TODO

  • Fix the bug above by adding a start_gpu_rank argument to the initialization of NCCLGroup (see the sketch after this list).
  • Add tests for collective communication among meshes. For a unit test on cross-mesh communication, please refer to this file.
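A minimal sketch of what the start_gpu_rank fix could look like; the class and argument names below are illustrative, not Alpa's exact API:

```python
# Illustrative sketch only: a group member is told where its node's GPUs start
# in the global rank order, instead of deriving that from its own device count.
class NCCLGroupSketch:
    def __init__(self, device_list, start_gpu_rank):
        self.device_list = device_list
        # Total number of GPUs on all nodes that precede this one in the group.
        self.start_gpu_rank = start_gpu_rank

    def global_rank(self, local_offset):
        # Independent of how many devices other nodes have per node.
        return self.start_gpu_rank + local_offset

# Send node with 4 GPUs starts at rank 0; receive node with 2 GPUs starts at rank 4.
send_node = NCCLGroupSketch(device_list=[0, 1, 2, 3], start_gpu_rank=0)
recv_node = NCCLGroupSketch(device_list=[0, 1], start_gpu_rank=4)
print([send_node.global_rank(i) for i in range(4)])  # [0, 1, 2, 3]
print([recv_node.global_rank(i) for i in range(2)])  # [4, 5]
```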
@ZYHowell ZYHowell added known bug Something isn't working good first issue Good for newcomers labels Dec 1, 2022
@ZYHowell
Collaborator Author

ZYHowell commented Dec 1, 2022

cc @jiaodong

@AhmedMAlbreiki

Good day,

I am currently working on this bug.

@zhisbug
Member

zhisbug commented Feb 7, 2023

@AhmedMAlbreiki Please submit a PR so we can help review, thanks!

@AhmedRAlmansoori

Hello, I'm helping out with this issue, and I have some questions about it.

Currently the rank is computed in _get_nccl_collective_communicator, where it is set like so: actual_rank = self.rank * len(device_list) + i. Is this the issue in question, where it needs to be replaced by start_gpu_rank = something magical?

Still trying to fully understand the issue, thanks
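For illustration only (not Alpa's actual code), start_gpu_rank could be derived as a prefix sum over the per-node device counts of the whole group:

```python
# Sketch: derive start_gpu_rank per node as a prefix sum over per-node device
# counts (the counts below are made up for the example).
devices_per_node = [4, 4, 2, 2]  # send mesh: 2 nodes x 4 GPUs, recv mesh: 2 nodes x 2 GPUs

start_gpu_ranks, running = [], 0
for count in devices_per_node:
    start_gpu_ranks.append(running)
    running += count

print(start_gpu_ranks)  # [0, 4, 8, 10]
# The rank of GPU i on node n would then be start_gpu_ranks[n] + i,
# replacing self.rank * len(device_list) + i.
```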
