Fix NCCL initialization when i6pn runc containers are scheduled on the same machine #2332
Resolves MOD-4067
NCCL crashes during initialization when multiple containers using the runc runtime run on the same machine. This happens when the containers of a grouped function are scheduled onto the same machine.
NCCL uses a host hash to determine whether two GPUs are on the same machine.
The host hash evaluates to `$(hostname)$(cat /proc/sys/kernel/random/boot_id)`.
gVisor patches `boot_id` to be unique, but runc does not.
Therefore, under runc, NCCL thinks the two GPUs are on the same machine and crashes when it's not able to access them.
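To make the collision concrete, here is a minimal Python sketch of the identity string described above. The helper names `host_identity` and `local_host_identity` are hypothetical and used only for illustration; NCCL computes this internally and hashes the result, which this sketch does not reproduce.

```python
import socket
from pathlib import Path

# Kernel-provided boot identifier; identical for every runc container on a host.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def host_identity(hostname: str, boot_id: str) -> str:
    """Concatenate hostname and boot_id, as in $(hostname)$(cat .../boot_id)."""
    return f"{hostname}{boot_id.strip()}"


def local_host_identity() -> str:
    """Build the identity string for the current container (Linux only)."""
    return host_identity(socket.gethostname(), BOOT_ID_PATH.read_text())


def looks_like_same_host(identity_a: str, identity_b: str) -> bool:
    """Two peers are treated as co-located when their identity strings match."""
    return identity_a == identity_b
```

If two containers report the same hostname and the same `boot_id`, `looks_like_same_host` returns `True` for them, which is the situation that triggers the crash.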
Luckily, there's a magical `NCCL_HOSTID` environment variable that can override the host hash.
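As a rough illustration of the override, the sketch below sets `NCCL_HOSTID` to a per-container random value before NCCL is initialized. This is not necessarily how this PR sets the variable: the helper name `ensure_unique_nccl_hostid`, the use of a UUID, and the choice to keep an already-set value are all assumptions made for the example.

```python
import os
import uuid
from collections.abc import MutableMapping
from typing import Optional


def ensure_unique_nccl_hostid(env: Optional[MutableMapping] = None) -> str:
    """Set NCCL_HOSTID to a random per-container value unless already set.

    Must run before NCCL initializes, since the variable is read at startup.
    """
    target = os.environ if env is None else env
    if "NCCL_HOSTID" not in target:
        # Assumption: any string unique to this container is sufficient.
        target["NCCL_HOSTID"] = uuid.uuid4().hex
    return target["NCCL_HOSTID"]
```

Calling `ensure_unique_nccl_hostid()` at container startup, before any collective-communication setup, gives each container its own host identity regardless of the shared `boot_id`.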
For more context: https://modal-com.slack.com/archives/C07BWUF5JKW/p1728437710487689?thread_ts=1728428627.889669&cid=C07BWUF5JKW