This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

[BUG] Collective group's rank is incorrect #790

Open
2 tasks
ZYHowell opened this issue Dec 1, 2022 · 4 comments
Labels
good first issue (Good for newcomers), known bug (Something isn't working)

Comments

@ZYHowell
Collaborator

ZYHowell commented Dec 1, 2022

Background

Alpa initializes a collective group for each cross-mesh communication pair. The call stack to initialize a collective group is:

  • create_collective_group or init_collective_group in collective.py calls
  • GroupManager.create_collective_group in collective.py, which calls
  • NCCLGroup.__init__, which has two different implementations: one based on cupy, the other based on xla.

An NCCLGroup creates and manages NCCL communicators for each GPU on its node. When we need to call an NCCL function, the call eventually goes through the NCCLGroup. However, in the current implementation, we use node_rank * num_devices_per_node + local_offset to compute the rank of a local GPU with respect to the communication group (an example is here). This is correct in most cases, but when the send mesh has a different number of devices per node than the receive mesh, it is incorrect.
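As a worked illustration (the mesh shapes below are made up for the example, not taken from Alpa's code), suppose the group consists of a send node with 4 GPUs followed by a receive node with 2 GPUs:

```python
# Minimal sketch of the current rank computation, assuming a group built from
# a send mesh (node 0, 4 GPUs) and a receive mesh (node 1, 2 GPUs).
devices_per_node = {0: 4, 1: 2}  # hypothetical shapes; they differ across meshes

def current_rank(node_rank, local_offset):
    # Mirrors node_rank * num_devices_per_node + local_offset, where
    # num_devices_per_node is taken from the local node only.
    return node_rank * devices_per_node[node_rank] + local_offset

print([current_rank(0, i) for i in range(4)])  # [0, 1, 2, 3] -- correct
print([current_rank(1, i) for i in range(2)])  # [2, 3] -- should be [4, 5]
```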

TODO

  • Fix the bug above by adding a start_gpu_rank argument to the initialization of NCCLGroup (see the sketch after this list).
  • Add tests for collective communication among meshes. For a unit test on cross-mesh communication, please refer to this file.
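A minimal sketch of what the start_gpu_rank fix could look like; the class and argument names below are illustrative, not Alpa's exact API:

```python
# Illustrative sketch only: a group member is told where its node's GPUs start
# in the global rank order, instead of deriving that from its own device count.
class NCCLGroupSketch:
    def __init__(self, device_list, start_gpu_rank):
        self.device_list = device_list
        # Total number of GPUs on all nodes that precede this one in the group.
        self.start_gpu_rank = start_gpu_rank

    def global_rank(self, local_offset):
        # Independent of how many devices other nodes have per node.
        return self.start_gpu_rank + local_offset

# Send node with 4 GPUs starts at rank 0; receive node with 2 GPUs starts at rank 4.
send_node = NCCLGroupSketch(device_list=[0, 1, 2, 3], start_gpu_rank=0)
recv_node = NCCLGroupSketch(device_list=[0, 1], start_gpu_rank=4)
print([send_node.global_rank(i) for i in range(4)])  # [0, 1, 2, 3]
print([recv_node.global_rank(i) for i in range(2)])  # [4, 5]
```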
@ZYHowell ZYHowell added known bug Something isn't working good first issue Good for newcomers labels Dec 1, 2022
@ZYHowell
Collaborator Author

ZYHowell commented Dec 1, 2022

cc @jiaodong

@AhmedMAlbreiki

Good day,

I am currently working on this bug.

@zhisbug
Member

zhisbug commented Feb 7, 2023

@AhmedMAlbreiki Please submit a PR so we can help review, thanks!

@AhmedRAlmansoori

Hello, I'm helping out with this issue, and I have some questions about it.

Currently the rank is computed in _get_nccl_collective_communicator, where it is set like so: actual_rank = self.rank * len(device_list) + i. Is this the issue in question, where it needs to be replaced by start_gpu_rank = something magical?

Still trying to fully understand the issue, thanks
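For illustration only (not Alpa's actual code), start_gpu_rank could be derived as a prefix sum over the per-node device counts of the whole group:

```python
# Sketch: derive start_gpu_rank per node as a prefix sum over per-node device
# counts (the counts below are made up for the example).
devices_per_node = [4, 4, 2, 2]  # send mesh: 2 nodes x 4 GPUs, recv mesh: 2 nodes x 2 GPUs

start_gpu_ranks, running = [], 0
for count in devices_per_node:
    start_gpu_ranks.append(running)
    running += count

print(start_gpu_ranks)  # [0, 4, 8, 10]
# The rank of GPU i on node n would then be start_gpu_ranks[n] + i,
# replacing self.rank * len(device_list) + i.
```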
