Allow ports to be reused in gloo (#97677)
Summary:
X-link: pytorch/pytorch#97677

Pull Request resolved: #353

ProcessGroupGloo and gloo appear to open and close sockets without allowing the port to be reused. In larger training jobs this surfaces as "Address already in use" errors, which we assume occur because all the ephemeral ports are exhausted.

This diff allows ports to be reused; with it, we see fewer ports in the `TIME_WAIT` state.

context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/

another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/

Differential Revision: D44029927

fbshipit-source-id: 4a1483d0eceda01ffd02c7747282129f7f4a2efe
H-Huang authored and facebook-github-bot committed Mar 30, 2023
1 parent 56b221c commit 61f8d06
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions gloo/transport/tcp/device.cc
@@ -101,6 +101,15 @@ static void lookupAddrForHostname(struct attr& attr) {
  struct addrinfo* rp;
  for (rp = result; rp != nullptr; rp = rp->ai_next) {
    auto fd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);

    // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
    int on = 1;
    rv = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, reinterpret_cast<const char*>(&on), sizeof(on));
    if (rv == -1) {
      close(fd);
      GLOO_ENFORCE_NE(rv, -1);
    }

    if (fd == -1) {
      continue;
    }
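
Note that the hunk above calls `setsockopt` before the `fd == -1` check, so a failed `socket()` call would reach `GLOO_ENFORCE_NE` instead of `continue`. The following is a minimal standalone sketch of the same technique with the descriptor check first. It is not gloo code: `makeReusableSocket` and `reuseAddrEnabled` are hypothetical helper names, a `std::runtime_error` stands in for `GLOO_ENFORCE_NE`, and a POSIX socket API is assumed.

```cpp
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of gloo): create a socket and enable
// SO_REUSEADDR on it. Returns -1 if socket() itself fails, so a caller
// looping over getaddrinfo() results can move on to the next candidate.
int makeReusableSocket(int family, int socktype, int protocol) {
  int fd = socket(family, socktype, protocol);
  if (fd == -1) {
    return -1;  // Nothing to configure or close.
  }

  // Set SO_REUSEADDR to signal that reuse of the listening port is OK.
  int on = 1;
  int rv = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                      reinterpret_cast<const char*>(&on), sizeof(on));
  if (rv == -1) {
    int err = errno;  // Save errno before close() can overwrite it.
    close(fd);
    // Stand-in for GLOO_ENFORCE_NE in this sketch.
    throw std::runtime_error(std::string("setsockopt(SO_REUSEADDR): ") +
                             std::strerror(err));
  }
  return fd;
}

// Hypothetical helper: read the option back to confirm it is enabled.
bool reuseAddrEnabled(int fd) {
  int value = 0;
  socklen_t len = sizeof(value);
  if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &value, &len) == -1) {
    return false;
  }
  assert(len == sizeof(value));
  return value != 0;
}
```

A caller would invoke `makeReusableSocket(rp->ai_family, rp->ai_socktype, rp->ai_protocol)` inside the address loop and `continue` when it returns -1. The option has to be set before `bind()` for it to affect whether a port with connections in `TIME_WAIT` can be bound again.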