Background
Alpa initializes collective groups for each cross-mesh communication pair. The call stack to initialize a collective group is:

create_collective_group or init_collective_group from collective.py
  calls: create_collective_group of the GroupManager class in collective.py
  calls: NCCLGroup.__init__

NCCLGroup has two different implementations: one is based on cupy, while the other is based on xla.
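For orientation, here is a minimal sketch of that layering. The signatures and bodies are simplified illustrations rather than Alpa's actual code; only the names create_collective_group, init_collective_group, GroupManager, and NCCLGroup come from the call stack above.

```python
# Simplified sketch of the initialization layering described above.
# Signatures and bodies are illustrative; they do not reproduce Alpa's actual API.

class NCCLGroup:
    """Stand-in for the backend group class (cupy- and xla-based variants exist)."""

    def __init__(self, world_size, rank, group_name):
        # Creates and manages the NCCL communicators for the GPUs on this node.
        self.world_size = world_size
        self.rank = rank
        self.group_name = group_name


class GroupManager:
    """Stand-in for the GroupManager class in collective.py."""

    def __init__(self):
        self._groups = {}

    def create_collective_group(self, backend, world_size, rank, group_name):
        # Picks a backend implementation and instantiates the NCCLGroup for it.
        group = NCCLGroup(world_size, rank, group_name)
        self._groups[group_name] = group
        return group


_group_mgr = GroupManager()


def init_collective_group(world_size, rank, backend="nccl", group_name="default"):
    # Module-level entry point in collective.py; create_collective_group follows
    # the same path. Both delegate to the GroupManager.
    return _group_mgr.create_collective_group(backend, world_size, rank, group_name)
```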
A NCCLGroup creates and manages NCCL communicators for each GPU on this node. When we need to call an NCCL function, the call ultimately goes through the NCCLGroup. However, in our current implementation, we use node_rank * num_devices_per_node + local_offset to compute the rank of a local GPU w.r.t. the communication group. An example is here. This is correct in most cases, but it is incorrect when the send mesh has a different number of devices per node than the receive mesh.
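To make the failure mode concrete, here is a small worked example. The mesh shapes (a send mesh of 1 node with 8 GPUs and a receive mesh of 2 nodes with 4 GPUs each) and the assumption that the send mesh's devices occupy the lowest group ranks are illustrative choices, not taken from Alpa.

```python
# Hypothetical cross-mesh communication group with 16 ranks in total:
#   send mesh:    1 node  x 8 GPUs  -> intended group ranks 0-7
#   receive mesh: 2 nodes x 4 GPUs  -> intended group ranks 8-11 and 12-15
# Nodes are assumed to be ordered send-mesh-first, so the receive mesh's two
# nodes have node_rank 1 and 2 within the group.

def current_rank(node_rank, num_devices_per_node, local_offset):
    # The formula the issue points at; it implicitly assumes every node in the
    # group has the same number of devices.
    return node_rank * num_devices_per_node + local_offset

# Using the receive mesh's own device count (4), its first node collides with
# the send mesh, which already owns ranks 4-7.
print([current_rank(1, 4, i) for i in range(4)])   # [4, 5, 6, 7]

# Using the send mesh's device count (8), the second receive node falls outside
# the 16-rank group entirely.
print([current_rank(2, 8, i) for i in range(4)])   # [16, 17, 18, 19]

# No single num_devices_per_node maps both meshes onto 0-15 without collisions,
# which is why the formula breaks when devices per node differ across meshes.
```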
TODO

Fix the bug above by adding a start_gpu_rank at the initialization of NCCLGroup (see the sketch after this list).
Add tests for collective communications among meshes. For a unit test on cross-mesh communication, please refer to this file.
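Below is a minimal sketch of what the proposed start_gpu_rank fix could look like, continuing the example above. The class name NCCLGroupSketch, the constructor signature, and the gpu_group_rank helper are assumptions for illustration, not Alpa's actual NCCLGroup API.

```python
class NCCLGroupSketch:
    """Illustrative stand-in for NCCLGroup; not Alpa's actual class."""

    def __init__(self, start_gpu_rank=0):
        # start_gpu_rank: the group rank of this mesh's first GPU, supplied by
        # the caller that knows how many ranks the peer mesh already occupies.
        self.start_gpu_rank = start_gpu_rank

    def gpu_group_rank(self, node_rank_in_mesh, devices_per_node_in_mesh, local_offset):
        # Offsets are computed within this mesh only and then shifted by
        # start_gpu_rank, so the two meshes never have to agree on a single
        # num_devices_per_node value.
        return (self.start_gpu_rank
                + node_rank_in_mesh * devices_per_node_in_mesh
                + local_offset)


# Continuing the example: send mesh = 1 node x 8 GPUs (ranks 0-7),
# receive mesh = 2 nodes x 4 GPUs (ranks 8-11 and 12-15).
send_side = NCCLGroupSketch(start_gpu_rank=0)
recv_side = NCCLGroupSketch(start_gpu_rank=8)

print([send_side.gpu_group_rank(0, 8, i) for i in range(8)])  # [0, 1, ..., 7]
print([recv_side.gpu_group_rank(0, 4, i) for i in range(4)])  # [8, 9, 10, 11]
print([recv_side.gpu_group_rank(1, 4, i) for i in range(4)])  # [12, 13, 14, 15]
```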