Allow ports to be reused in gloo #353
base: main
This pull request was exported from Phabricator. Differential Revision: D44029927
Summary:
X-link: pytorch/pytorch#97677
Pull Request resolved: facebookincubator#353
ProcessGroupGloo and gloo seem to be opening and closing sockets without allowing the port to be reused. We see this issue pop up in larger training jobs as "Address already in use", and we assume it is because all the ephemeral ports are exhausted.
This diff allows ports to be reused; we see a reduced number of ports in the `TIME_WAIT` state.
context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/
another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/
Differential Revision: D44029927
fbshipit-source-id: 447186248c900f122c56ae824e0a8eb56aab732f
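For readers unfamiliar with port reuse, here is a minimal sketch of the general idea in Python. It assumes the relevant option is `SO_REUSEADDR`; the summary does not say which socket option the diff actually sets in gloo, and `make_reusable_listener` is a hypothetical helper, not part of gloo or PyTorch.

```python
import socket


def make_reusable_listener(host="127.0.0.1", port=0):
    """Open a TCP listener whose address may be rebound while old
    connections on the same port are still in TIME_WAIT.

    Hypothetical helper for illustration; SO_REUSEADDR is an assumption
    about what "allowing ports to be reused" means here.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind() for it to affect the bind.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port=0 lets the kernel pick a free port
    sock.listen()
    return sock
```

Without the option, a `bind()` to a port that still has connections in `TIME_WAIT` can fail with "Address already in use", which matches the error quoted in the summary.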
Force-pushed from 191a459 to 1b3a5de
Force-pushed from 1b3a5de to 942e7de
Force-pushed from 942e7de to 4464366
Force-pushed from 4464366 to b832c03
Force-pushed from b832c03 to 751f4fe
Force-pushed from 751f4fe to 463c263
Force-pushed from 463c263 to 61f8d06
Summary:
Pull Request resolved: pytorch#97677
X-link: facebookincubator/gloo#353
ProcessGroupGloo and gloo seem to be opening and closing sockets without allowing the port to be reused. We see this issue pop up in larger training jobs as "Address already in use", and we assume it is because all the ephemeral ports are exhausted.
This diff allows ports to be reused; we see a reduced number of ports in the `TIME_WAIT` state.
context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/
another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/
Test Plan: Add a gloo test that creates 4 groups of size 64 using multithreaded PG + gloo, for 256 ranks in total.
Differential Revision: D44029927
fbshipit-source-id: 9c31c38485333602c33e12c12813bea33ccb9438
Force-pushed from 61f8d06 to 4adb00a
Summary:
ProcessGroupGloo and gloo seem to be opening and closing sockets without allowing the port to be reused. We see this issue pop up in larger training jobs as "Address already in use", and we assume it is because all the ephemeral ports are exhausted.
This diff allows ports to be reused; we see a reduced number of ports in the `TIME_WAIT` state.
context: https://fb.workplace.com/groups/319878845696681/permalink/5988899781205532/
another issue: https://fb.workplace.com/groups/319878845696681/permalink/958768178474408/
Differential Revision: D44029927
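One way to check the `TIME_WAIT` observation on a Linux host is to count sockets in that state before and after the change. The sketch below is not from the diff; it assumes Linux's `/proc/net/tcp` layout (IPv4 only), where the fourth column holds the state in hex and `06` is `TIME_WAIT`. Both function names are hypothetical.

```python
from pathlib import Path

# Hex code for TCP_TIME_WAIT in the "st" column of /proc/net/tcp (Linux).
TIME_WAIT_STATE = "06"


def count_time_wait(proc_net_tcp_text):
    """Count rows in /proc/net/tcp-formatted text that are in TIME_WAIT."""
    count = 0
    # Skip the header row; each data row is whitespace separated.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT_STATE:
            count += 1
    return count


def count_time_wait_on_host(path="/proc/net/tcp"):
    """Read the live table from this host (Linux only, IPv4 sockets only)."""
    return count_time_wait(Path(path).read_text())
```

Sampling `count_time_wait_on_host()` while a job creates and tears down process groups gives a rough view of how many ports are held in `TIME_WAIT`; IPv6 sockets are listed separately in `/proc/net/tcp6`.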