Fix NCCL initialization when i6pn runc containers are scheduled on the same machine #2332

thecodingwizard · 2024-10-09T10:40:18Z

Resolves MOD-4067

NCCL crashes during initialization when multiple containers using the runc runtime run on the same machine. This happens when the containers of a grouped function are scheduled onto the same machine.

NCCL uses a host hash to determine whether two GPUs are on the same machine.

The host hash is computed from $(hostname)$(cat /proc/sys/kernel/random/boot_id).

gVisor patches boot_id to be unique, but runc does not.

Therefore, under runc, NCCL thinks the GPUs in the two containers are on the same machine and crashes when it can't access them.
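To make the collision concrete, here is a minimal sketch of the identity string described above. The helper names (`host_identity`, `current_host_identity`) are hypothetical, and this only mirrors the description in this PR, not NCCL's actual C++ implementation. It assumes a Linux host where the boot_id file exists.

```python
import socket
from pathlib import Path

# Linux-only path named in the description above.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def host_identity(hostname: str, boot_id: str) -> str:
    """Concatenate hostname and boot_id (hypothetical helper, for illustration)."""
    return f"{hostname}{boot_id.strip()}"


def current_host_identity() -> str:
    """Build the identity string for the container this code runs in."""
    return host_identity(socket.gethostname(), BOOT_ID_PATH.read_text())


# Two runc containers on one machine see the same kernel boot_id. If they also
# report the same hostname (an assumption here), their identities are identical.
container_a = host_identity("modal", "1b4e28ba-2fa1-11d2-883f-0016d3cca427\n")
container_b = host_identity("modal", "1b4e28ba-2fa1-11d2-883f-0016d3cca427\n")
identities_collide = container_a == container_b
```

With a per-container boot_id, as under gVisor, the two strings would differ and the containers would be treated as separate hosts.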

Luckily there's a magical NCCL_HOSTID environment variable that can override this.
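A rough sketch of the override, assuming the host ID is built from the hostname plus the container's i6pn address (the two values the diff in this PR computes). The function `set_nccl_host_id` and the exact string format are assumptions for illustration, not necessarily what was merged.

```python
import os
from typing import MutableMapping, Optional


def set_nccl_host_id(
    hostname: str,
    address: str,
    env: Optional[MutableMapping[str, str]] = None,
) -> str:
    """Set NCCL_HOSTID to a value that is unique per container.

    Hypothetical helper: the address is expected to differ between containers,
    so two containers on one machine no longer share a host identity.
    """
    target = os.environ if env is None else env
    host_id = f"{hostname}{address}"
    target["NCCL_HOSTID"] = host_id
    return host_id


# Example with a plain dict standing in for the process environment.
fake_env: dict = {}
set_nccl_host_id("modal", "fdaa::1", fake_env)
```

The variable has to be in the environment before NCCL initializes, so in practice this would run before any communicator is created.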

For more context: https://modal-com.slack.com/archives/C07BWUF5JKW/p1728437710487689?thread_ts=1728428627.889669&cid=C07BWUF5JKW


thundergolfer

LGTM, nice

thundergolfer · 2024-10-09T12:48:31Z

modal/experimental.py

+        import socket
+
+        hostname = socket.gethostname()
+        addr_info = socket.getaddrinfo("i6pn.modal.local", None, socket.AF_INET6)[0][4][0]


Suggested change

addr_info = socket.getaddrinfo("i6pn.modal.local", None, socket.AF_INET6)[0][4][0]


nit: would be more readable to unpack or comment the [0][4][0] indexing, even if it costs a couple extra LoC.

the [0][4][0] is just a magic incantation that you need to know, I don't think there's anything smart about it 🤷‍♂️

maybe we can make it a helper function to get the i6pn
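One way the helper discussed in this thread could look, with the [0][4][0] indexing unpacked. `socket.getaddrinfo` returns a list of `(family, type, proto, canonname, sockaddr)` tuples, and for AF_INET6 the `sockaddr` is `(address, port, flowinfo, scope_id)`. The name `get_i6pn` comes from the later commit in this PR, but the body below is only a sketch, and `first_ipv6_address` is a hypothetical helper.

```python
import socket


def first_ipv6_address(results) -> str:
    """Return the address from the first getaddrinfo() result.

    Equivalent to results[0][4][0], with each index given a name.
    """
    family, socktype, proto, canonname, sockaddr = results[0]  # [0]
    address, port, flowinfo, scope_id = sockaddr               # [4]
    return address                                             # [0]


def get_i6pn(host: str = "i6pn.modal.local") -> str:
    """Resolve the container's i6pn address (sketch, not the merged code)."""
    return first_ipv6_address(socket.getaddrinfo(host, None, socket.AF_INET6))


# A hand-written result in the shape getaddrinfo() returns for AF_INET6.
sample_results = [
    (socket.AF_INET6, socket.SOCK_STREAM, 6, "", ("fdaa::1", 0, 0, 0)),
]
sample_address = first_ipv6_address(sample_results)
```

Splitting the parsing from the lookup keeps the indexing testable without needing the i6pn hostname to resolve.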

fix nccl host id when i6pn runc containers are scheduled on the same machine

8249873

thundergolfer approved these changes Oct 9, 2024


refactor to get_i6pn()

c3cea6c

thecodingwizard merged commit 332b196 into main Oct 16, 2024
21 checks passed

thecodingwizard deleted the nathan/nccl-hostid branch October 16, 2024 19:08

thecodingwizard mentioned this pull request Oct 17, 2024

Fix i6pn main address being None (typo) #2351

Merged

Review comments: thundergolfer (Oct 9, 2024), ekzhang (Oct 14, 2024)
