Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-host GPU tests on GKE #111

Merged
merged 30 commits into from
Feb 22, 2024
Merged

Multi-host GPU tests on GKE #111

merged 30 commits into from
Feb 22, 2024

Conversation

will-cromar
Copy link
Contributor

@will-cromar will-cromar commented Feb 12, 2024

Description

  • Implement multi-host GPU tests via GKE. GKE helps with resource management/coordination and provides a service discovery mechanism that we need for torchrun.
  • Uses legacy test templates, adapting @vanbasten23's PR in the old repository (Add multi-host gpu tests to PyTorch/XLA ml-testing-accelerators#1011) to the new system.
  • New task type currently requires the JSonnetGpuTest. I intend to clean this up and generalize the API.
  • Factor out GKE authentication to gke.get_authenticated_client

Tests

Tested locally. Will post screenshots.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@will-cromar will-cromar force-pushed the wcromar/multi-gpu-test branch from 37893a5 to 9b9e12e Compare February 12, 2024 21:40
@will-cromar
Copy link
Contributor Author

There's still plenty of cleanup to do here, but the basic idea works!

Screenshot:

image

Change-Id: I8b6ef0f096d965d6538e1de3b9699bb271e15232
Change-Id: I0507c9d9275cf3b920adfa75a065492d66dc0658
Change-Id: I7683d82b3f1ebce32b047f967c66b18b042373ca
Change-Id: Ibfabea0833dfaf3e9f3922438c0d5dca97463592
@will-cromar will-cromar force-pushed the wcromar/multi-gpu-test branch from f535db7 to 5eff685 Compare February 15, 2024 23:00
Change-Id: I5efbd5b7b7e61a471f2756f81bb0a8d23c8df355
Change-Id: Ia14f9601cf5f254af19b339416e525962e6e1e33
Change-Id: Ieca0e21a6e093f1635b60d1e29bbb663183d76e9
@will-cromar
Copy link
Contributor Author

Tested again after refactoring:

image

@will-cromar will-cromar mentioned this pull request Feb 16, 2024
4 tasks
Change-Id: Ib1ae393fec7b9e9fe8b88e87812cbd0a7a9e9ecb
Change-Id: Ib54bd01bd54c5c785beb2dab1cddcaede1720ca9
Change-Id: I0f28c7e5a49ac6c8e2655f59d5b9b9db4a5f09e9
Change-Id: Ie32ce56bf74fb31b420d693251c52539d8d3f8c2
Change-Id: I2a7213cc67d11f4089409cbc52e29b27a0d5fffb
Change-Id: Id0c6b916fb97ba4567caff6e7560cde21178e40c
Change-Id: Ie97fc0545a399a8d088907afbe3e570818e5dab0
@will-cromar
Copy link
Contributor Author

I'm having some issue with the code formatter not agreeing with the one on GitHub. This PR is otherwise ready for review.

@will-cromar will-cromar marked this pull request as ready for review February 16, 2024 21:38
@will-cromar will-cromar changed the title [WIP] Multi-GPU tests on GKE Multi-GPU tests on GKE Feb 16, 2024
Change-Id: I76ed5ca9ca0cce013ede30a451039c39690f4f17
xlml/apis/task.py Outdated Show resolved Hide resolved
@vanbasten23
Copy link
Collaborator

Looking great! Left some minor comment. Thanks!

Copy link
Collaborator

@RissyRan RissyRan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Will for this feature! This is great!

I have seen a successful run already, just check you also have a failed run that being captured :)

.github/requirements.txt Show resolved Hide resolved
xlml/apis/task.py Outdated Show resolved Hide resolved
xlml/apis/task.py Outdated Show resolved Hide resolved
xlml/apis/task.py Outdated Show resolved Hide resolved
xlml/apis/task.py Show resolved Hide resolved
xlml/utils/gke.py Outdated Show resolved Hide resolved
Change-Id: I79cb359c070b5bcd3b4d16188217c5ef9417b0c1
Change-Id: I433000f39c12dfb2d402ab445f9ba938048ebdb2
Change-Id: I5f83e564b15c8a97cb9805702e8162e8e29ef326
Change-Id: Iffc7b7d551e317e130f12c384213df8a5a56c3c9
Change-Id: I0b0716b19bfbade576be13df7cec92ef286e0543
Change-Id: I4fd2f13f8faa6705c7217f321562a42a9bce5345
Change-Id: Ief7e35acc1dc5a54b2a8f1b2f6a442230d973731
Change-Id: I29eadfea4bb89fb2a77a6c8752b5f7da6635fdef
@will-cromar
Copy link
Contributor Author

Thanks Will for this feature! This is great!

I have seen a successful run already, just check you also have a failed run that being captured :)

Yeah, I tested that failing exit codes are propagated by interrupting one of the containers during a test run.

Change-Id: I4dfcedfbd71bb17b7144254c9e6594468539aecc
Change-Id: Icfd3ec316e068d2d71d120685ac7857d37f43b17
Change-Id: Ic607c47aafb439729268a645e8210a47a7ca7ad2
Copy link
Collaborator

@vanbasten23 vanbasten23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@will-cromar will-cromar changed the title Multi-GPU tests on GKE Multi-host GPU tests on GKE Feb 21, 2024
Copy link
Collaborator

@RissyRan RissyRan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Will! Overall looks good!

I have one high level question that we should integrate with post_process step to emit default metrics? Or I missed that in the latest test.

xlml/apis/task.py Outdated Show resolved Hide resolved
xlml/apis/task.py Outdated Show resolved Hide resolved
xlml/apis/task.py Show resolved Hide resolved
Change-Id: If3bc4d60890fb0545d269433c516db9dfd538306
Change-Id: Iad49f733c866f5309b968097cc724571de2d9e8d
Change-Id: I04bb3b098d789328bbf7749f70b08e5a056aad01
@will-cromar will-cromar requested a review from RissyRan February 22, 2024 18:30
@will-cromar will-cromar merged commit d56f394 into master Feb 22, 2024
5 checks passed
@will-cromar will-cromar deleted the wcromar/multi-gpu-test branch February 22, 2024 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants