Fix NCCL initialization when i6pn runc containers are scheduled on the same machine #2332

Merged
thecodingwizard merged 2 commits into main from nathan/nccl-hostid
Oct 16, 2024

thecodingwizard
Contributor

Resolves MOD-4067

NCCL crashes on initialization when multiple containers using the runc runtime run on the same machine. This happens when the containers of a grouped function are scheduled onto the same machine.

NCCL uses a host hash to determine whether two GPUs are on the same machine.

The host hash evaluates to $(hostname)$(cat /proc/sys/kernel/random/boot_id).

gVisor patches boot_id to be unique, but runc does not.

Therefore, under runc, NCCL treats the GPUs in the two containers as being on the same machine and crashes when it is not able to access them.

Luckily there's a magical NCCL_HOSTID environment variable that can override this.
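
As a rough illustration of the collision and the override, here is a minimal Python sketch. It models the behavior described above and is not NCCL's actual code: `host_identity` is a hypothetical helper, the hostname, boot_id, and NCCL_HOSTID values are made up, and it assumes both containers report the same hostname.

```python
def host_identity(hostname: str, boot_id: str, env: dict) -> str:
    """Model of the string that identifies a host (a sketch, not NCCL's code)."""
    # Assumption: when NCCL_HOSTID is set, it replaces the default identity.
    override = env.get("NCCL_HOSTID")
    if override is not None:
        return override
    # Default identity: hostname followed by the kernel boot_id.
    return f"{hostname}{boot_id}"


# Made-up values: two runc containers sharing one kernel see the same boot_id.
SHARED_BOOT_ID = "1b4e28ba-2fa1-11d2-883f-0016d3cca427"
HOSTNAME = "modal"  # assumed to be identical in both containers

# Without an override, both containers produce the same identity.
container_a = host_identity(HOSTNAME, SHARED_BOOT_ID, {})
container_b = host_identity(HOSTNAME, SHARED_BOOT_ID, {})
collides_without_override = container_a == container_b

# With a per-container NCCL_HOSTID, the identities differ.
fixed_a = host_identity(HOSTNAME, SHARED_BOOT_ID, {"NCCL_HOSTID": "container-a"})
fixed_b = host_identity(HOSTNAME, SHARED_BOOT_ID, {"NCCL_HOSTID": "container-b"})
distinct_with_override = fixed_a != fixed_b
```

Any value that is unique per container would serve as the override; the strings used here are placeholders.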

For more context: https://modal-com.slack.com/archives/C07BWUF5JKW/p1728437710487689?thread_ts=1728428627.889669&cid=C07BWUF5JKW

Contributor

@thundergolfer thundergolfer left a comment

LGTM, nice

import socket

hostname = socket.gethostname()
addr_info = socket.getaddrinfo("i6pn.modal.local", None, socket.AF_INET6)[0][4][0]
Contributor

Suggested change
addr_info = socket.getaddrinfo("i6pn.modal.local", None, socket.AF_INET6)[0][4][0]
addr_info = socket.getaddrinfo("i6pn.modal.local", None, socket.AF_INET6)[0][4][0]

nit: would be more readable to unpack or comment the [0][4][0] indexing, even if it costs a couple extra LoC.

Member

the [0][4][0] is just a magic incantation that you need to know, I don't think there's anything smart about it 🤷‍♂️

maybe we can make it a helper function to get the i6pn
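
One possible shape for such a helper, as a sketch only: the function names `first_ipv6_address` and `get_i6pn_address` are hypothetical and not part of the merged change. It unpacks the `socket.getaddrinfo` result by name, so the `[0][4][0]` indexing explains itself.

```python
import socket


def first_ipv6_address(addr_infos):
    """Return the address string from the first getaddrinfo() result."""
    # Each entry is a 5-tuple: (family, type, proto, canonname, sockaddr).
    family, sock_type, proto, canonname, sockaddr = addr_infos[0]
    # For AF_INET6, sockaddr is (address, port, flowinfo, scope_id).
    address, port, flowinfo, scope_id = sockaddr
    return address


def get_i6pn_address(host="i6pn.modal.local"):
    """Resolve the host to its first IPv6 address (hypothetical helper)."""
    return first_ipv6_address(socket.getaddrinfo(host, None, socket.AF_INET6))
```

Splitting out the pure unpacking step keeps it testable without DNS access to `i6pn.modal.local`.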

@thecodingwizard thecodingwizard merged commit 332b196 into main Oct 16, 2024
21 checks passed
@thecodingwizard thecodingwizard deleted the nathan/nccl-hostid branch October 16, 2024 19:08
3 participants