Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CI overview to ci.md #7015

Merged
merged 4 commits into from
May 3, 2024
Merged

Add CI overview to ci.md #7015

merged 4 commits into from
May 3, 2024

Conversation

will-cromar
Copy link
Collaborator

@will-cromar will-cromar commented May 1, 2024

We had a few outdated Google docs floating around internally with some conceptual documentation about the CI. I added updated versions of that content to ci.md so we're less likely to forget about it.

I'll add instructions for internal tasks (like key rotation) to our internal team docs.

Small piggyback change: move upstream image build to separate workflow rather than having it show up as skipped on every PR. This should also make it harder to break the upstream image by accident

Changes from before:

  • The build and test workflow is much faster! It now takes ~1.5 hours instead of ~2.5 hours. Most tests will complete before the old build completed.
    • GPU tests use the CUDA plugin, which vastly speeds up the main torch_xla package build.
  • The CI builds used to use a separate build config in .circleci/, which nightly builds used our ansible config. You only have to make changes in one place to update all of our automated builds.
    • The upstream CI still uses the old scripts. Fixing this is left as an exercise to the reader.
  • C++ tests are built with the main build to ensure we don't blow the cache by slightly changing build settings.
    • You can download the pre-built C++ tests to run locally.
  • The TPU test workflow is now part of the main build_and_test.yml and uses the same build, instead of using a separate build script. You can trigger it on PRs by adding the tpuci label now.
  • Docs are generated for every PR so you can preview them. They're still only pushed after PRs are merged.
    • Fixed a bug where we mostly ignored doc changes in release branches.
  • We upload images to upstream's shared ECR instance when the build file is changed in our repo. We do not have to stage images in GCR.
    • We also do not upload images from pending PRs anymore. Changes must be merged before anything is changed for upstream.

@will-cromar will-cromar marked this pull request as ready for review May 2, 2024 21:21
@will-cromar will-cromar requested review from JackCaoG and lsy323 May 2, 2024 21:21
runs-on: ${{ inputs.runner }}
timeout-minutes: 240
runs-on: linux.12xlarge
timeout-minutes: 30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why lower the timeout?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This timeout used to encompass the entire package build. Now we're just building and pushing this base image.

Copy link
Collaborator

@lsy323 lsy323 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@will-cromar will-cromar merged commit aad6a12 into master May 3, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants