Allow ports to be reused in gloo (pytorch#97677)
Summary:
Pull Request resolved: pytorch#97677

X-link: facebookincubator/gloo#353

ProcessGroupGloo and gloo open and close sockets without allowing their ports to be reused. In larger training jobs this surfaces as "Address already in use" errors, which we assume happen because all the ephemeral ports have been exhausted.

This diff allows ports to be reused; with it applied, we see fewer ports stuck in the `TIME_WAIT` state.
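As a minimal illustration of the socket-option pattern this diff applies (written here in Python rather than gloo's C++; `make_listener` is a hypothetical helper, not part of the diff):

```python
import socket

def make_listener(port=0):
    """Open a listening TCP socket with SO_REUSEADDR set, mirroring the
    gloo change: with the option enabled, bind() can succeed even if the
    port's previous owner is still in TIME_WAIT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the option *before* bind(); setting it afterwards has no effect
    # on the bind itself.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

listener = make_listener()
# The option can be read back to confirm it is set.
assert listener.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
listener.close()
```

Without `SO_REUSEADDR`, a freshly closed listening port stays unavailable for the duration of `TIME_WAIT`, which is what exhausts ports when many groups are created and torn down in quick succession.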

context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/

another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/

Test Plan: Add a gloo test that creates 4 groups of size 64 using the multithreaded PG + gloo, 256 ranks in total.

Differential Revision: D44029927

fbshipit-source-id: 9c31c38485333602c33e12c12813bea33ccb9438
H-Huang authored and facebook-github-bot committed Mar 30, 2023
1 parent 97fc8ea commit bd9bb36
Showing 2 changed files with 47 additions and 0 deletions.
29 changes: 29 additions & 0 deletions test/distributed/test_multi_threaded_pg.py
@@ -220,5 +220,34 @@ def test_gather(self):
        for i in range(self.world_size):
            self.assertEqual(gather_list[i], torch.ones(3, 3) * i)

class TestLargeWorld(MultiThreadedTestCase):
    @property
    def world_size(self):
        return 64

    def setUp(self):
        super().setUp()
        self._spawn_threads()

    def test_gloo_init(self):
        groups = []
        num_groups = 4
        # create multiple gloo groups with 64 ranks
        for i in range(num_groups):
            group = dist.new_group(backend="gloo")
            groups.append(group)

        # tear down gloo groups
        for i in range(num_groups):
            dist.destroy_process_group(groups[i])
        groups.clear()
        self.assertEqual(len(groups), 0)

        # create multiple gloo groups with 64 ranks
        for i in range(num_groups):
            group = dist.new_group(backend="gloo")
            groups.append(group)

if __name__ == "__main__":
    run_tests()
18 changes: 18 additions & 0 deletions torch/csrc/distributed/c10d/ProcessGroupGloo.cpp
@@ -638,6 +638,24 @@ bool doesHostnameResolveToUsableAddress(const std::string& hostname) {
  struct addrinfo* rp = nullptr;
  for (rp = result; rp != nullptr; rp = rp->ai_next) {
    auto fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
    // Check the fd before touching it; setsockopt on -1 would fail with
    // EBADF and throw instead of moving on to the next address.
    if (fd == -1) {
      continue;
    }

    // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
    int on = 1;
    rv = setsockopt(
        fd,
        SOL_SOCKET,
        SO_REUSEADDR,
        reinterpret_cast<const char*>(&on),
        sizeof(on));
    if (rv == -1) {
#ifdef _WIN32
      closesocket(fd);
#else
      close(fd);
#endif
      logAndThrow("setsockopt: ", strerror(errno));
    }
