Move .torch_pin and handle in ansible (#6920)
will-cromar authored Apr 26, 2024
1 parent 3f5ff0f commit b3be775
Showing 13 changed files with 44 additions and 62 deletions.
19 changes: 0 additions & 19 deletions .circleci/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion .circleci/setup_ci_environment.sh
@@ -58,7 +58,7 @@ sudo apt-get -y remove linux-image-generic linux-headers-generic linux-generic d
# How to figure out what the correct versions of these packages are?
# My preferred method is to start a Docker instance of the correct
# Ubuntu version (e.g., docker run -it ubuntu:16.04) and then ask
# apt what the packages you need are. Note that the CircleCI image
# apt what the packages you need are. Note that the CI image
# comes with Docker.
#
# Using 'retry' here as belt-and-suspenders even though we are
19 changes: 19 additions & 0 deletions .github/README.md
@@ -0,0 +1,19 @@
# CI Overview
PyTorch and PyTorch/XLA use CI to lint, build, and test each PR that is submitted. All CI tests should succeed before the PR is merged into master. PyTorch CI pins PyTorch/XLA to a specific commit. On the other hand, PyTorch/XLA CI pulls PyTorch from master unless a pin is manually provided. This README covers the reasons for these pins, how to pin a PyTorch/XLA PR to an upstream PyTorch PR, and how to coordinate merges of breaking PyTorch changes.

## Why does PyTorch CI pin PyTorch/XLA?
As mentioned above, [PyTorch CI pins PyTorch/XLA](https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/common_utils.sh#L119) to a "known good" commit to prevent accidental changes in PyTorch/XLA from breaking PyTorch CI without warning. PyTorch has hundreds of commits each week, and this pin ensures that PyTorch/XLA as a downstream package does not cause failures in PyTorch CI.

## Why does PyTorch/XLA CI pull from PyTorch master?
[PyTorch/XLA CI pulls PyTorch from master](https://github.com/pytorch/xla/blob/f3415929683880192b63b285921c72439af55bf0/.circleci/common.sh#L15) unless a PyTorch pin is manually provided. PyTorch/XLA is a downstream package of PyTorch, and pulling from master ensures that PyTorch/XLA stays up to date and works with the latest PyTorch changes.

## Pinning PyTorch PR in PyTorch/XLA PR
Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test new features, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch master by default, we need to manually provide a PyTorch pin. In a PyTorch/XLA PR, PyTorch can be manually pinned by creating a `.torch_pin` file at the root of the repository. The `.torch_pin` should contain the corresponding PyTorch PR number prefixed by "#". See [this example](https://github.com/pytorch/xla/pull/3792/commits/40f41fb98b0f2386d287eeac0bae86e873d4a9d8). The `.torch_pin` must be deleted before the PyTorch/XLA PR is merged.
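As a minimal sketch of the step above (the PR number 12345 is a purely hypothetical placeholder), creating the pin from the repository root might look like:

```shell
# From the root of the pytorch/xla checkout: write the upstream
# PyTorch PR number, prefixed with '#', into .torch_pin.
# 12345 is a hypothetical PR number used only for illustration.
echo '#12345' > .torch_pin
cat .torch_pin

# Commit it as part of your PR, e.g.:
#   git add .torch_pin && git commit -m "Pin PyTorch to PR #12345"
# and remember to delete the file before the PR is merged.
```

Note that the lint workflow in this commit fails pushes to master while `.torch_pin` is still present, which enforces the deletion requirement.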

## Coordinating merges for breaking PyTorch PRs
When a PyTorch PR introduces a breaking change, its PyTorch/XLA CI tests will fail. The steps for fixing and merging such a breaking change are as follows:
1. Create a PyTorch/XLA PR that fixes the issue, includes a `.torch_pin`, and is rebased on master so the PR is up to date with the latest commit on PyTorch/XLA. Once this PR is created, it will have a commit hash that is used in step 2. If the PR has multiple commits, use the last one's hash. **Important note: rebasing this PR creates a new commit hash and makes the old hash obsolete. Be cautious about rebasing, and if you do rebase, make sure you inform the PyTorch PR's author.**
2. Rebase (or ask the PR owner to rebase) the PyTorch PR on master. Update the PyTorch PR to pin PyTorch/XLA to the commit hash from step 1 by updating `pytorch/.github/ci_commit_pins/xla.txt`.
3. Once CI tests are green on both ends, merge the PyTorch PR.
4. Remove the `.torch_pin` from the PyTorch/XLA PR and merge it. Note that `git commit --amend` should be avoided in this step, because PyTorch CI will keep using the commit hash from step 1 until another PR updates the pin manually or the nightly buildbot updates it automatically.
5. Finally, don't delete your branch for at least two days after the merge. See step 4 for the reason.
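The pin update in step 2 can be sketched as follows; `XLA_COMMIT` below is a hypothetical placeholder for the PyTorch/XLA commit hash produced in step 1:

```shell
# Inside the pytorch/pytorch checkout. XLA_COMMIT is a hypothetical
# placeholder; use the real PyTorch/XLA commit hash from step 1.
XLA_COMMIT=0123456789abcdef0123456789abcdef01234567

# The directory already exists in the real repo; created here only so
# the sketch is self-contained.
mkdir -p .github/ci_commit_pins
echo "$XLA_COMMIT" > .github/ci_commit_pins/xla.txt

# Then commit the pin update, e.g.:
#   git add .github/ci_commit_pins/xla.txt
#   git commit -m "Update pytorch/xla pin"
```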
1 change: 0 additions & 1 deletion .github/workflows/_build_torch_xla.yml
@@ -38,7 +38,6 @@ jobs:
repository: pytorch/pytorch
path: pytorch
submodules: recursive
# TODO: correct pin
- name: Checkout PyTorch/XLA Repo
uses: actions/checkout@v4
with:
2 changes: 1 addition & 1 deletion .github/workflows/lintercheck.yml
@@ -24,7 +24,7 @@ jobs:
if: github.event_name == 'push' && github.event.ref == 'refs/heads/master'
shell: bash
run: |
TORCH_PIN=./torch_patches/.torch_pin
TORCH_PIN=./.torch_pin
if [[ -f "${TORCH_PIN}" ]]; then
echo "Please remove ${TORCH_PIN} before landing."
exit 1
2 changes: 1 addition & 1 deletion OP_LOWERING_GUIDE.md
@@ -25,7 +25,7 @@ All files mentioned below live under the `xla/torch_xla/csrc` folder, with the e
7. `ops/` directory contains all `ir::ops` declaration and definition. Smaller nodes can be put in `ops/ops.h/.cpp`. More complicated nodes can be put into a separate file. All ops inherit from `ir::ops::Node` and provide a way to lower input `ir::Value` to a sequence of `XlaOp`.

## Unit Test
Our CircleCI runs PyTorch native python tests for every change and every day. Those tests will use XLA implementation if we provide a lowering. We usually don’t need to add additional python tests for PyTorch/XLA unless we want to verify some xla behaviors(like dynamic shape) or we skipped the pytorch native test for some reason. The python test should be added to `xla/test/test_operations.py` if it is required. We also need to add CPP tests in `xla/test/cpp/test_aten_xla_tensor.cpp`. This test should call PyTorch c++ API and verify our implementation yields the same result as PyTorch native implementation. We also need to verify if the xla implementation is called when the tensor is a XLA tensor by checking the `aten::op` and `xla::op` counters.
Our CI runs PyTorch native python tests for every change and every day. Those tests will use the XLA implementation if we provide a lowering. We usually don't need to add additional python tests for PyTorch/XLA unless we want to verify some XLA behaviors (like dynamic shape) or we skipped the pytorch native test for some reason. The python test should be added to `xla/test/test_operations.py` if it is required. We also need to add CPP tests in `xla/test/cpp/test_aten_xla_tensor.cpp`. These tests should call the PyTorch C++ API and verify that our implementation yields the same results as the PyTorch native implementation. We also need to verify that the XLA implementation is called when the tensor is an XLA tensor by checking the `aten::op` and `xla::op` counters.

## Tips
The process of lowering is breaking down the PyTorch operations into a sequence of XlaOp. To provide a good lowering of a PyTorch operation, one needs a good grasp of what XLA is capable of. Reading the XlaOp documentation and looking into how similar ops are lowered is the best way to achieve that. You can find a minimal op lowering example in [this pr](https://github.com/pytorch/xla/pull/2969). You can also find a slightly more complicated example with backward lowering in [this pr](https://github.com/pytorch/xla/pull/2972).
2 changes: 1 addition & 1 deletion benchmarks/run_benchmark.sh
@@ -5,7 +5,7 @@ LOGFILE=/tmp/benchmark_test.log

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
6 changes: 3 additions & 3 deletions docs/README.md
@@ -1,12 +1,12 @@
## Publish documentation for a new release.

CircleCI job `pytorch_xla_linux_debian11_and_push_doc` is specified to run on `release/*` branches, but it was not
CI job `pytorch_xla_linux_debian11_and_push_doc` is specified to run on `release/*` branches, but it was not
run on release branches due to the "Only build pull requests" setting. Turning off "Only build pull requests" would result
in a much larger job volume, which is often unnecessary. We're waiting for [this feature request](https://ideas.circleci.com/ideas/CCI-I-215)
to be implemented so that we can override this setting on some branches.

Before the feature is available on the CircleCI side, we'll use a manual process to publish documentation for releases.
[Documentation for master branch](http://pytorch.org/xla/master/) is still updated automatically by the CircleCI job.
[Documentation for master branch](http://pytorch.org/xla/master/) is still updated automatically by the CI job.
But we'll need to manually commit the new versioned doc and point http://pytorch.org/xla to the documentation of new
stable release.

@@ -22,4 +22,4 @@ cd /tmp/xla
git add .
git commit -m "Publish 1.5 documentation."
git push origin gh-pages
```
```
15 changes: 15 additions & 0 deletions infra/ansible/roles/build_srcs/tasks/main.yaml
@@ -1,3 +1,18 @@
- name: Read PyTorch pin
ansible.builtin.command: cat {{ (src_root, 'pytorch/xla/.torch_pin') | path_join }}
register: torch_pin
# Pin may not exist
ignore_errors: true

- name: Checkout PyTorch pin
# ansible.builtin.git wants to fetch the entire history, so check out the pin manually
ansible.builtin.shell:
cmd: |
git fetch origin {{ torch_pin.stdout }}
git checkout --recurse-submodules {{ torch_pin.stdout }}
chdir: "{{ (src_root, 'pytorch') | path_join }}"
when: torch_pin is succeeded

- name: Build PyTorch
ansible.builtin.command:
cmd: python setup.py bdist_wheel
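For readers less familiar with Ansible, the two new tasks above behave roughly like this shell sketch; `SRC_ROOT` is an assumed stand-in for the `src_root` Ansible variable:

```shell
# Rough shell equivalent of the "Read PyTorch pin" and
# "Checkout PyTorch pin" tasks above. SRC_ROOT stands in for the
# Ansible src_root variable (an assumption of this sketch).
SRC_ROOT=${SRC_ROOT:-/src}

# The pin may not exist, mirroring `ignore_errors: true`.
if TORCH_PIN=$(cat "$SRC_ROOT/pytorch/xla/.torch_pin" 2>/dev/null); then
  cd "$SRC_ROOT/pytorch"
  # Fetch just the pinned ref instead of the full history,
  # which is why ansible.builtin.git is not used here.
  git fetch origin "$TORCH_PIN"
  git checkout --recurse-submodules "$TORCH_PIN"
fi
```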
2 changes: 1 addition & 1 deletion scripts/apply_patches.sh
@@ -7,7 +7,7 @@ XDIR=$CDIR/..
PTDIR=$XDIR/..
OPENXLADIR=$XDIR/third_party/xla

TORCH_PIN="$XDIR/torch_patches/.torch_pin"
TORCH_PIN="$XDIR/.torch_pin"
if [ -f "$TORCH_PIN" ]; then
CID=$(cat "$TORCH_PIN")
# If starts with # and it's not merged into master, fetch from origin
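The `#` prefix convention mentioned in the comment above can be illustrated with a short sketch; the pin value 12345 is a hypothetical PR number, and the `pull/<N>/head` ref follows GitHub's layout for pull request refs:

```shell
# Sketch of how a '#'-prefixed .torch_pin value maps to a PR ref.
# GitHub exposes pull requests as refs/pull/<number>/head.
CID='#12345'   # hypothetical pin contents
case "$CID" in
  '#'*)
    PR_NUMBER=${CID#'#'}
    echo "would fetch: pull/${PR_NUMBER}/head"
    ;;
  *)
    echo "would checkout: ${CID}"
    ;;
esac
# prints: would fetch: pull/12345/head
```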
2 changes: 1 addition & 1 deletion test/benchmarks/run_tests.sh
@@ -9,7 +9,7 @@ export PYTHONPATH=$PYTHONPATH:$CDIR/../../benchmarks/

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
2 changes: 1 addition & 1 deletion test/run_tests.sh
@@ -8,7 +8,7 @@ VERBOSITY=2

# Note [Keep Going]
#
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CircleCI tests continue on error.
# Set the `CONTINUE_ON_ERROR` flag to `true` to make the CI tests continue on error.
# This will allow you to see all the failures on your PR, not stopping with the first
# test failure like the default behavior.
CONTINUE_ON_ERROR="${CONTINUE_ON_ERROR:-0}"
32 changes: 0 additions & 32 deletions torch_patches/README.md

This file was deleted.
