forked from rapidsai/raft
Commit
MAINT: Simplify NCCL worker rank identification
This is a follow-up on rapidsaigh-1926, since the rank sorting seemed a bit hard to understand.

It does modify the logic in the sense that the host is now sorted by IP as a way to group based on it. But I don't really think that host sorting was ever a goal?

If the goal is really about being deterministic, then this should be more deterministic (or at least more clearly so) about the order of worker IPs.

OTOH, if the NVML device order doesn't matter, we could just sort the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure. If such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
* At least raise an error, because if NCCL ends up raising the error itself, it will be very confusing.
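
To make the two ideas in the message above concrete, here is a minimal Python sketch of (a) assigning ranks by sorting workers on host IP and then device index, and (b) raising an early error when a rank does not satisfy `r % num_gpus_per_node`. This is not the code from this commit. The function names, the `(address, host_ip, device_index)` tuple layout, and the reading of `num_gpus_per_node` as "number of participating workers on that host" are all assumptions made for illustration.

```python
from collections import defaultdict
from ipaddress import ip_address


def assign_ranks(workers):
    """Return {worker_address: rank} for (address, host_ip, device_index) tuples.

    Hypothetical helper: sorting by numeric IP groups workers of one host
    together, and the device index orders workers within that host.
    """
    def sort_key(worker):
        address, host_ip, device_index = worker
        ip = ip_address(host_ip)
        # (version, int) avoids comparing IPv4 and IPv6 objects directly.
        return (ip.version, int(ip), device_index, address)

    ordered = sorted(workers, key=sort_key)
    return {address: rank for rank, (address, _, _) in enumerate(ordered)}


def check_rank_device_mapping(workers, ranks):
    """Raise ValueError if a rank r is not on device r % num_gpus_per_node.

    Assumption: num_gpus_per_node is taken to be the number of
    participating workers on the worker's host.
    """
    gpus_per_host = defaultdict(int)
    for _, host_ip, _ in workers:
        gpus_per_host[host_ip] += 1

    for address, host_ip, device_index in workers:
        expected = ranks[address] % gpus_per_host[host_ip]
        if device_index != expected:
            raise ValueError(
                f"worker {address} has rank {ranks[address]} on device "
                f"{device_index}, but device {expected} was expected"
            )


workers = [
    ("tcp://10.0.0.2:1", "10.0.0.2", 0),
    ("tcp://10.0.0.10:1", "10.0.0.10", 1),
    ("tcp://10.0.0.10:2", "10.0.0.10", 0),
    ("tcp://10.0.0.2:2", "10.0.0.2", 1),
]
ranks = assign_ranks(workers)
check_rank_device_mapping(workers, ranks)  # passes for this layout
```

Under these assumptions, a job that uses only device 1 on a host would fail the check, which mirrors the first bullet above: guaranteeing the mapping may require initializing workers that are not otherwise involved.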
Showing 1 changed file with 10 additions and 81 deletions.