Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CI image with develop docker image #5290

Merged
merged 53 commits into from
Oct 12, 2023
Merged

Conversation

lsy323
Copy link
Collaborator

@lsy323 lsy323 commented Jul 7, 2023

Update CI image with the develop docker image.

Plan to take the following steps to update CI image:

  • Enable PyTorch/XLA build in GHA workflows with new docker image.
  • Enable PyTorch/XLA build & tests in GHA workflows with new docker image.
  • Create PR in upstream to update the CI image used in upstream CI for PyTorch/XLA tests: Bump xla_base version tag to v1.1 pytorch#109757

Summary of the change in this PR:

  • The new docker image is built from .circleci/docker/Dockerfile. The image is using PyTorch/XLA develop container as a base image, plus some additional dependencies for CI and PyTorch CI. The new image is at gcr.io/tpu-pytorch/xla_base:dev-3.8_cuda_12.1
  • The new docker image is python3.8 + Debian:Bullseye + CUDA 12.1
  • CI image uses Conda for Python environment (Python 3.8), see comment for detail.

@lsy323 lsy323 added DO_NOT_MERGE_YET For PRs which cannot be merged, despite tests passing DO_NOT_REVIEW_YET labels Jul 7, 2023
@lsy323
Copy link
Collaborator Author

lsy323 commented Jul 12, 2023

Build and CPU tests passed with CUDA dev container.

However, the GPU test TestPjRtDistributedDataParallel.test_ddp_correctness_env_init in test/pjrt/test_ddp.py is failing. It seems that it turned into a bad state, causing the other test got stuck.

Try skipping ddp test to see if any other test is failing in GPU dev container.

@lsy323 lsy323 mentioned this pull request Jul 12, 2023
@lsy323 lsy323 force-pushed the lsiyuan/update-ci-image branch from 58f2a42 to 39045d3 Compare July 27, 2023 21:05
@lsy323 lsy323 force-pushed the lsiyuan/update-ci-image branch 2 times, most recently from 7f8bd10 to 5314548 Compare September 27, 2023 20:09
@lsy323 lsy323 removed DO_NOT_MERGE_YET For PRs which cannot be merged, despite tests passing DO_NOT_REVIEW_YET labels Sep 27, 2023
.circleci/build.sh Outdated Show resolved Hide resolved
python setup.py install

sccache --show-stats
Copy link
Contributor

@yeounoh yeounoh Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also leave this as-is

@@ -26,5 +26,5 @@ function install_torchvision() {
install_torchvision

export GCLOUD_SERVICE_KEY_FILE="$XLA_DIR/default_credentials.json"
export SILO_NAME='cache-silo-ci-gcc-11' # cache bucket for CI
export SILO_NAME='cache-silo-ci-dev-container-python38-bullseye' # cache bucket for CI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Let's just use the image tag as prefix to cache-silo-ci-dev-, so it would look like cache-silo-ci-dev-3.8_cuda_12.1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg, let me update the cache name

.kokoro/Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@yeounoh yeounoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, will approve after testing is complete.

@yeounoh yeounoh added the CI CI related change label Sep 27, 2023
@lsy323
Copy link
Collaborator Author

lsy323 commented Sep 28, 2023

Synced with @yeounoh offline, since we have additional dependencies for CI (e.g. sccache), which are not required by development. So we create an other CI image based on the dev image, installing the necessary dependencies in the CI image. The CI Dockerfile will reuse the .circleci/docker/Dockerfile.

I updated the .circleci/docker/Dockerfile to use dev container as base image, now the CI is passing on CPU and GPU.

I had to make the following changes to Dockerfile to make it work

  • Remove the jenkins user, otherwise I'm seeing permission error when executing the ./install_valgrind.sh. Tried to move jenkins creation at the begining but somehow didn't help. In this case, valgrind installation will be done by root user (need to remove sudo)
  • Removed conda installation, it's not needed anymore

@lsy323
Copy link
Collaborator Author

lsy323 commented Sep 28, 2023

Also we cannot create jenkins user at the end of the Dockerfile, otherwise, we need sudo to install pytorch, pytorch/xla

@lsy323
Copy link
Collaborator Author

lsy323 commented Oct 4, 2023

After offline discussion with @yeounoh, the following changes are made since last checkpoint:

  1. Preserve the jenkins user(As a normal user instead of root), as it's used in upstream CI template
  2. Because of change 1, we need to add --user to python setup.py command, this wasn't needed before, since we were using conda and conda env is private for the jenkins user.
  3. In Dockerfile, change exec permission for bazel and install sudo package
  4. Recover the changes to the paths in previous commit, as we are switching back to jenkins user

@lsy323 lsy323 force-pushed the lsiyuan/update-ci-image branch from 75a1689 to 4ea55d1 Compare October 11, 2023 17:24
@lsy323 lsy323 marked this pull request as ready for review October 12, 2023 03:15
Copy link
Contributor

@yeounoh yeounoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lsy323 lsy323 merged commit e793e86 into master Oct 12, 2023
17 checks passed
@vanbasten23
Copy link
Collaborator

This is awesome. Thanks Siyuan!

zpcore pushed a commit that referenced this pull request Oct 19, 2023
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
ghpvnist pushed a commit to ghpvnist/xla that referenced this pull request Oct 31, 2023
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
@lsy323 lsy323 deleted the lsiyuan/update-ci-image branch March 4, 2024 19:12
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
This PR updates PyTorch/XLA CI image to use dev container based image. The same image is used in upstream CI in #109757
---------

Co-authored-by: Siyuan Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI CI related change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants