Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable docker image feature with xpk #28

Closed
wants to merge 12 commits into from

Conversation

RissyRan
Copy link
Collaborator

@RissyRan RissyRan commented Nov 30, 2023

New PR.

Description

Enable docker image feature with xpk:

  • Add this as an GKE example, and plan to add more examples for quick start
  • Add TpuGkeTask to execute run_model and post_process task groups
  • Add cluster configs under configs/cluster folder, so Airflow has the access to the created cluster with TPUs. It seems certificate-authority-data is safe to share based on this stackoverflow thread.
  • Update TpuTask to TpuGceTask to distinguish TpuGkeTask, and similar for test config

A few TODO once b/311073979 is approved:

  • Update GCP and cluster configs, and host clusters in cloud-ml-auto-solutions project
  • Will have a separate PR to document how to provision cluster using XPK. It is quite straightforward with 1-2 commands

Tests

Please describe the tests that you ran on Cloud VM to verify changes.

Instruction and/or command lines to reproduce your tests: ...

  • Upload codes to GCS bucket and test

List links for your tests (use go/shortn-gen for any internal link): ...

  • Airflow test: link

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

apis/task.py Outdated Show resolved Hide resolved
apis/test_config.py Outdated Show resolved Hide resolved
configs/cluster/v5e_cluster_config Show resolved Hide resolved
configs/vm_resource.py Show resolved Hide resolved
dags/benchmark/pytorchxla_torchbench.py Outdated Show resolved Hide resolved
implementations/utils/xpk.py Show resolved Hide resolved
implementations/utils/xpk.py Show resolved Hide resolved
implementations/utils/metric.py Show resolved Hide resolved
implementations/utils/xpk.py Outdated Show resolved Hide resolved
dataset_name=metric_config.DatasetOption.XLML_DATASET,
)

run_model_cmds = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this test is actually going to be used, should it be in the example directory? If it is just an example, I would go even simpler and just print jax.device_count() to show the configuration is correct.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. Yes, currently I just want to show an example. Since we are testing models, I was thinking probably it's better to have a hello world E2E example. Do you feel this is a complex one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine if the script prints the topology or number of devices. IMO the main part we want to confirm with the example is that we created the right machine and any metadata needed for initialization is plumbed through correctly to the runtime.

RissyRan and others added 10 commits December 6, 2023 07:47
* Add unit tests to checks

* Update test for auth
* Add auto push after commit merge

* Address comment
* Build JSonnet files as part of the postsubmit

Change-Id: I811493a530d8eed91c975aebe15d714903283678

* remove extra quotes

Change-Id: Iba35dacab442d783304f738a2a42bb0bd350fb63
* Pull the latest submodule commit and add configs

Change-Id: Ia43089a8224e7f258eb1a9b44876ebd8f0c2b20f

* Apply config changes by regenerating

Change-Id: I9418e17389e421695d4511a4979f28f318c224df

* Remove Jsonnet configs and update test name

Change-Id: I4ff436944ed3b5986f7b50f8caa24c673ec15d3b

* Remove configs

Change-Id: I774bbe8d2d03835bef0cbaededff1f5575a45e83

* Add pjrt in the test name

Change-Id: I4847ac74d883def7b51363f5a60f0f99e9665b67
Change-Id: I58ee7d2e9237b686004e0a7454d54cf63f2e5e70
…form#37)

Change-Id: I58947fff8617ce27740b566d2aef90bbeda21152
…rm#38)

Change-Id: I575c2afcdbba6883ccc1569bd38d0a60e6f24f74
Change-Id: I39d23e223b6c0c2c2633b4cee3a3d65909dbf63f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants