Allow ports to be reused in gloo #353

Open
wants to merge 1 commit into
base: main

@H-Huang (Member) commented Mar 27, 2023

Summary:
ProcessGroupGloo and gloo appear to open and close sockets without allowing the port to be reused. In larger training jobs this surfaces as "Address already in use" errors, which we assume occur because all the ephemeral ports are exhausted.

This diff allows ports to be reused. With it, we see fewer ports in the `TIME_WAIT` state.
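
For readers unfamiliar with the mechanism, here is a minimal Python sketch of port reuse. It assumes the change amounts to setting `SO_REUSEADDR` before `bind()`; the diff itself is not shown in this conversation, and gloo is C++, so this illustrates the idea rather than the actual patch. The helper names `bind_reusable` and `rebind_after_close` are hypothetical.

```python
import socket


def bind_reusable(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Return a listening TCP socket with SO_REUSEADDR set before bind().

    Hypothetical helper for illustration; port=0 lets the OS pick a port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option must be set before bind() to affect the bind check.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def rebind_after_close() -> bool:
    """Bind a port, use it for one connection, close it, then bind it again."""
    first = bind_reusable()
    port = first.getsockname()[1]

    # Open one connection and close the accepted side first. On typical
    # TCP stacks the side that closes first ends up in TIME_WAIT.
    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = first.accept()
    conn.close()
    client.close()
    first.close()

    # With SO_REUSEADDR set, binding the same port again should succeed
    # even if the old connection is still lingering.
    second = bind_reusable(port=port)
    same_port = second.getsockname()[1] == port
    second.close()
    return same_port
```

Calling `rebind_after_close()` should return `True` on typical Linux and macOS hosts. Without the option, the second `bind()` can fail with "Address already in use" while the old connection lingers in `TIME_WAIT`.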

context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/

another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/

Differential Revision: D44029927

@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D44029927

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 27, 2023
Summary:
X-link: pytorch/pytorch#97677

Pull Request resolved: facebookincubator#353

fbshipit-source-id: 447186248c900f122c56ae824e0a8eb56aab732f

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 27, 2023
fbshipit-source-id: 31f1c91d40686001195bb8b03baec38b88dfbb12

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 29, 2023
fbshipit-source-id: a12967a5c7120677a34efccb7da9bfc340f41a5f
H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 29, 2023
fbshipit-source-id: 1f83e9288776a6ec6e2f2b1ea356739ae057d4a6

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 29, 2023
fbshipit-source-id: c29af50177d52f17a98db73d7a63fc436708626f

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 29, 2023
fbshipit-source-id: cf12f681fd1a42249ffac259d9d34ab076a5ac94

H-Huang added a commit to H-Huang/gloo that referenced this pull request Mar 30, 2023
fbshipit-source-id: 4a1483d0eceda01ffd02c7747282129f7f4a2efe
H-Huang added a commit to H-Huang/pytorch that referenced this pull request Mar 30, 2023
Summary:
Pull Request resolved: pytorch#97677

X-link: facebookincubator/gloo#353

Test Plan: Add a gloo test that creates 4 groups of size 64 using multithreaded PG + gloo, for 256 ranks in total.

Differential Revision: D44029927

fbshipit-source-id: 9c31c38485333602c33e12c12813bea33ccb9438
Summary:
X-link: pytorch/pytorch#97677

Pull Request resolved: facebookincubator#353

fbshipit-source-id: b531d67456d4656ce23e9db7c4cb892d8fc90475