-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-host GPU tests on GKE #111
Conversation
37893a5
to
9b9e12e
Compare
Change-Id: I8b6ef0f096d965d6538e1de3b9699bb271e15232
Change-Id: I0507c9d9275cf3b920adfa75a065492d66dc0658
Change-Id: I7683d82b3f1ebce32b047f967c66b18b042373ca
Change-Id: Ibfabea0833dfaf3e9f3922438c0d5dca97463592
f535db7
to
5eff685
Compare
Change-Id: I5efbd5b7b7e61a471f2756f81bb0a8d23c8df355
Change-Id: Ia14f9601cf5f254af19b339416e525962e6e1e33
Change-Id: Ieca0e21a6e093f1635b60d1e29bbb663183d76e9
Change-Id: Ib54bd01bd54c5c785beb2dab1cddcaede1720ca9
Change-Id: I0f28c7e5a49ac6c8e2655f59d5b9b9db4a5f09e9
Change-Id: I2a7213cc67d11f4089409cbc52e29b27a0d5fffb
Change-Id: Ie97fc0545a399a8d088907afbe3e570818e5dab0
I'm having some issue with the code formatter not agreeing with the one on GitHub. This PR is otherwise ready for review. |
Change-Id: I76ed5ca9ca0cce013ede30a451039c39690f4f17
Looking great! Left some minor comment. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Will for this feature! This is great!
I have seen a successful run already, just check you also have a failed run that being captured :)
Change-Id: I79cb359c070b5bcd3b4d16188217c5ef9417b0c1
Change-Id: I433000f39c12dfb2d402ab445f9ba938048ebdb2
Change-Id: I5f83e564b15c8a97cb9805702e8162e8e29ef326
Change-Id: Iffc7b7d551e317e130f12c384213df8a5a56c3c9
Change-Id: I4fd2f13f8faa6705c7217f321562a42a9bce5345
Change-Id: Ief7e35acc1dc5a54b2a8f1b2f6a442230d973731
Change-Id: I29eadfea4bb89fb2a77a6c8752b5f7da6635fdef
Yeah, I tested that failing exit codes are propagated by interrupting one of the containers during a test run. |
Change-Id: I4dfcedfbd71bb17b7144254c9e6594468539aecc
Change-Id: Icfd3ec316e068d2d71d120685ac7857d37f43b17
Change-Id: Ic607c47aafb439729268a645e8210a47a7ca7ad2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Will! Overall looks good!
I have one high level question that we should integrate with post_process step to emit default metrics? Or I missed that in the latest test.
Change-Id: If3bc4d60890fb0545d269433c516db9dfd538306
Change-Id: Iad49f733c866f5309b968097cc724571de2d9e8d
Description
torchrun
.JSonnetGpuTest
. I intend to clean this up and generalize the API.gke.get_authenticated_client
Tests
Tested locally. Will post screenshots.
Checklist
Before submitting this PR, please make sure (put X in square brackets):