Move .torch_pin and handle in ansible (#6920)
will-cromar authored Apr 26, 2024
1 parent 3f5ff0f commit b3be775
Showing 13 changed files with 44 additions and 62 deletions.
19 changes: 0 additions & 19 deletions .circleci/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion .circleci/setup_ci_environment.sh
@@ -58,7 +58,7 @@ sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic d
# How to figure out what the correct versions of these packages are?
# My preferred method is to start a Docker instance of the correct
# Ubuntu version (e.g., docker run -it ubuntu:16.04) and then ask
# apt what the packages you need are. Note that the CircleCI image
# apt what the packages you need are. Note that the CI image
# comes with Docker.
#
# Using 'retry' here as belt-and-suspenders even though we are
19 changes: 19 additions & 0 deletions .github/README.md
@@ -0,0 +1,19 @@
# CI Overview
PyTorch and PyTorch/XLA use CI to lint, build, and test each PR that is submitted. All CI tests should succeed before the PR is merged into master. PyTorch CI pins PyTorch/XLA to a specific commit. On the other hand, PyTorch/XLA CI pulls PyTorch from master unless a pin is manually provided. This README covers the reasons for these pins, how to pin a PyTorch/XLA PR to an upstream PyTorch PR, and how to coordinate merges of breaking PyTorch changes.

## Why does PyTorch CI pin PyTorch/XLA?
As mentioned above, [PyTorch CI pins PyTorch/XLA](https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/common_utils.sh#L119) to a "known good" commit to prevent accidental changes in PyTorch/XLA from breaking PyTorch CI without warning. PyTorch has hundreds of commits each week, and this pin ensures that PyTorch/XLA as a downstream package does not cause failures in PyTorch CI.

## Why does PyTorch/XLA CI pull from PyTorch master?
[PyTorch/XLA CI pulls PyTorch from master](https://github.com/pytorch/xla/blob/f3415929683880192b63b285921c72439af55bf0/.circleci/common.sh#L15) unless a PyTorch pin is manually provided. PyTorch/XLA is a downstream package of PyTorch, and pulling from master ensures that PyTorch/XLA stays up to date and works with the latest PyTorch changes.

## Pinning PyTorch PR in PyTorch/XLA PR
Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test new features, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch master by default, we need to manually provide a PyTorch pin. In a PyTorch/XLA PR, PyTorch can be manually pinned by creating a `.torch_pin` file at the root of the repository. The `.torch_pin` should contain the corresponding PyTorch PR number prefixed by "#". See [this example](https://github.com/pytorch/xla/pull/3792/commits/40f41fb98b0f2386d287eeac0bae86e873d4a9d8). The `.torch_pin` must be deleted before the PyTorch/XLA PR is merged.
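As a minimal sketch of the step above (the PR number 12345 is a purely hypothetical placeholder), creating the pin from the repository root might look like:

```shell
# From the root of the pytorch/xla checkout: write the upstream
# PyTorch PR number, prefixed with '#', into .torch_pin.
# 12345 is a hypothetical PR number used only for illustration.
echo '#12345' > .torch_pin
cat .torch_pin

# Commit it as part of your PR, e.g.:
#   git add .torch_pin && git commit -m "Pin PyTorch to PR #12345"
# and remember to delete the file before the PR is merged.
```

Note that the lint workflow in this commit fails pushes to master while `.torch_pin` is still present, which enforces the deletion requirement.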

## Coordinating merges for breaking PyTorch PRs
When a PyTorch PR introduces a breaking change, its PyTorch/XLA CI tests will fail. The steps for fixing and merging such a breaking change are as follows:
1. Create a PyTorch/XLA PR that fixes the issue, includes a `.torch_pin`, and is rebased on master so the PR is up to date with the latest commit on PyTorch/XLA. Once this PR is created, it will have a commit hash that is used in step 2. If the PR has multiple commits, use the last one's hash. **Important note: rebasing this PR creates a new commit hash and makes the old hash obsolete. Be cautious about rebasing, and if you do rebase, make sure you inform the PyTorch PR's author.**
2. Rebase (or ask the PR owner to rebase) the PyTorch PR on master. Update the PyTorch PR to pin PyTorch/XLA to the commit hash from step 1 by updating `pytorch/.github/ci_commit_pins/xla.txt`.
3. Once CI tests are green on both ends, merge the PyTorch PR.
4. Remove the `.torch_pin` from the PyTorch/XLA PR and merge it. Note that `git commit --amend` should be avoided in this step, because PyTorch CI will keep using the commit hash from step 1 until another PR updates the pin manually or the nightly buildbot updates it automatically.
5. Finally, don't delete your branch for at least two days after the merge. See step 4 for the reason.
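The pin update in step 2 can be sketched as follows; `XLA_COMMIT` below is a hypothetical placeholder for the PyTorch/XLA commit hash produced in step 1:

```shell
# Inside the pytorch/pytorch checkout. XLA_COMMIT is a hypothetical
# placeholder; use the real PyTorch/XLA commit hash from step 1.
XLA_COMMIT=0123456789abcdef0123456789abcdef01234567

# The directory already exists in the real repo; created here only so
# the sketch is self-contained.
mkdir -p .github/ci_commit_pins
echo "$XLA_COMMIT" > .github/ci_commit_pins/xla.txt

# Then commit the pin update, e.g.:
#   git add .github/ci_commit_pins/xla.txt
#   git commit -m "Update pytorch/xla pin"
```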
1 change: 0 additions & 1 deletion .github/workflows/_build_torch_xla.yml
@@ -38,7 +38,6 @@ jobs:
repository: pytorch/pytorch
path: pytorch
submodules: recursive
# TODO: correct pin
- name: Checkout PyTorch/XLA Repo
uses: actions/checkout@v4
with:
2 changes: 1 addition & 1 deletion .github/workflows/lintercheck.yml
@@ -24,7 +24,7 @@ jobs:
if: github.event_name == 'push' && github.event.ref == 'refs/heads/master'
shell: bash
run: |
TORCH_PIN=./torch_patches/.torch_pin
TORCH_PIN=./.torch_pin
if [[ -f "${TORCH_PIN}" ]]; then
echo "Please remove ${TORCH_PIN} before landing."
exit 1
2 changes: 1 addition & 1 deletion OP_LOWERING_GUIDE.md
@@ -25,7 +25,7 @@ All files mentioned below live under the `xla/torch_xla/csrc` folder, with the e
7. `ops/` directory contains all `ir::ops` declaration and definition. Smaller nodes can be put in `ops/ops.h/.cpp`. More complicated nodes can be put into a separate file. All ops inherit from `ir::ops::Node` and provide a way to lower input `ir::Value` to a sequence of `XlaOp`.

## Unit Test
Our CircleCI runs PyTorch native python tests for every change and every day. Those tests will use XLA implementation if we provide a lowering. We usually don’t need to add additional python tests for PyTorch/XLA unless we want to verify some xla behaviors(like dynamic shape) or we skipped the pytorch native test for some reason. The python test should be added to `xla/test/test_operations.py` if it is required. We also need to add CPP tests in `xla/test/cpp/test_aten_xla_tensor.cpp`. This test should call PyTorch c++ API and verify our implementation yields the same result as PyTorch native implementation. We also need to verify if the xla implementation is called when the tensor is a XLA tensor by checking the `aten::op` and `xla::op` counters.
Our CI runs PyTorch native python tests for every change and every day. Those tests will use the XLA implementation if we provide a lowering. We usually don't need to add additional python tests for PyTorch/XLA unless we want to verify some XLA behaviors (like dynamic shape) or we skipped the pytorch native test for some reason. The python test should be added to `xla/test/test_operations.py` if it is required. We also need to add CPP tests in `xla/test/cpp/test_aten_xla_tensor.cpp`. These tests should call the PyTorch C++ API and verify that our implementation yields the same results as the PyTorch native implementation. We also need to verify that the XLA implementation is called when the tensor is an XLA tensor by checking the `aten::op` and `xla::op` counters.

## Tips
The process of lowering is breaking down the PyTorch operations into a sequence of XlaOp. To provide a good lowering of a PyTorch operation, one needs a good grasp of what XLA is capable of. Reading the XlaOp documentation and looking into how similar ops are lowered is the best way to achieve that. You can find a minimal op lowering example in [this pr](https://github.com/pytorch/xla/pull/2969). You can also find a slightly more complicated example with backward lowering in [this pr](https://github.com/pytorch/xla/pull/2972).
2 changes: 1 addition & 1 deletion benchmarks/run_benchmark.sh
@@ -5,7 +5,7 @@ LOGFILE=/tmp/benchmark_test.log

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
6 changes: 3 additions & 3 deletions docs/README.md
@@ -1,12 +1,12 @@
## Publish documentation for a new release.

CircleCI job `pytorch_xla_linux_debian11_and_push_doc` is specified to run on `release/*` branches, but it was not
CI job `pytorch_xla_linux_debian11_and_push_doc` is specified to run on `release/*` branches, but it was not
run on release branches due to the "Only build pull requests" setting. Turning off "Only build pull requests" would result
in a much larger job volume, which is often unnecessary. We're waiting for [this feature request](https://ideas.circleci.com/ideas/CCI-I-215)
to be implemented so that we can override this setting on some branches.

Before the feature is available on the CircleCI side, we'll use a manual process to publish documentation for releases.
[Documentation for master branch](http://pytorch.org/xla/master/) is still updated automatically by the CircleCI job.
[Documentation for master branch](http://pytorch.org/xla/master/) is still updated automatically by the CI job.
But we'll need to manually commit the new versioned doc and point http://pytorch.org/xla to the documentation of new
stable release.

@@ -22,4 +22,4 @@ cd /tmp/xla
git add .
git commit -m "Publish 1.5 documentation."
git push origin gh-pages
```
```
15 changes: 15 additions & 0 deletions infra/ansible/roles/build_srcs/tasks/main.yaml
@@ -1,3 +1,18 @@
- name: Read PyTorch pin
ansible.builtin.command: cat {{ (src_root, 'pytorch/xla/.torch_pin') | path_join }}
register: torch_pin
# Pin may not exist
ignore_errors: true

- name: Checkout PyTorch pin
# ansible.builtin.git wants to fetch the entire history, so check out the pin manually
ansible.builtin.shell:
cmd: |
git fetch origin {{ torch_pin.stdout }}
git checkout --recurse-submodules {{ torch_pin.stdout }}
chdir: "{{ (src_root, 'pytorch') | path_join }}"
when: torch_pin is succeeded

- name: Build PyTorch
ansible.builtin.command:
cmd: python setup.py bdist_wheel
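For readers less familiar with Ansible, the two new tasks above behave roughly like this shell sketch; `SRC_ROOT` is an assumed stand-in for the `src_root` Ansible variable:

```shell
# Rough shell equivalent of the "Read PyTorch pin" and
# "Checkout PyTorch pin" tasks above. SRC_ROOT stands in for the
# Ansible src_root variable (an assumption of this sketch).
SRC_ROOT=${SRC_ROOT:-/src}

# The pin may not exist, mirroring `ignore_errors: true`.
if TORCH_PIN=$(cat "$SRC_ROOT/pytorch/xla/.torch_pin" 2>/dev/null); then
  cd "$SRC_ROOT/pytorch"
  # Fetch just the pinned ref instead of the full history,
  # which is why ansible.builtin.git is not used here.
  git fetch origin "$TORCH_PIN"
  git checkout --recurse-submodules "$TORCH_PIN"
fi
```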
2 changes: 1 addition & 1 deletion scripts/apply_patches.sh
@@ -7,7 +7,7 @@ XDIR=$CDIR/..
PTDIR=$XDIR/..
OPENXLADIR=$XDIR/third_party/xla

TORCH_PIN="$XDIR/torch_patches/.torch_pin"
TORCH_PIN="$XDIR/.torch_pin"
if [ -f "$TORCH_PIN" ]; then
CID=$(cat "$TORCH_PIN")
# If starts with # and it's not merged into master, fetch from origin
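The `#` prefix convention mentioned in the comment above can be illustrated with a short sketch; the pin value 12345 is a hypothetical PR number, and the `pull/<N>/head` ref follows GitHub's layout for pull request refs:

```shell
# Sketch of how a '#'-prefixed .torch_pin value maps to a PR ref.
# GitHub exposes pull requests as refs/pull/<number>/head.
CID='#12345'   # hypothetical pin contents
case "$CID" in
  '#'*)
    PR_NUMBER=${CID#'#'}
    echo "would fetch: pull/${PR_NUMBER}/head"
    ;;
  *)
    echo "would checkout: ${CID}"
    ;;
esac
# prints: would fetch: pull/12345/head
```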
2 changes: 1 addition & 1 deletion test/benchmarks/run_tests.sh
@@ -9,7 +9,7 @@ export PYTHONPATH=$PYTHONPATH:$CDIR/../../benchmarks/

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
2 changes: 1 addition & 1 deletion test/run_tests.sh
@@ -8,7 +8,7 @@ VERBOSITY=2

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
32 changes: 0 additions & 32 deletions torch_patches/README.md

This file was deleted.
