Merge branch 'pytorch:master' into rpsilva_spmd_lc

pytorch · Dec 5, 2024 · 0f128c5
2 parents 05c996e + 4c99d21
commit 0f128c5
Showing 35 changed files with 565 additions and 194 deletions.
diff --git a/.github/workflows/_tpu_ci.yml b/.github/workflows/_tpu_ci.yml
@@ -25,7 +25,7 @@ jobs:
           pip install rich
           # Jax nightly is needed for pallas tests.
           pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
-          pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
+          pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-wheels/index.html -f https://storage.googleapis.com/libtpu-releases/index.html
           pip install --upgrade protobuf
       - name: Run Tests
         env:
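
The hunk above passes two `-f` (find-links) pages to `pip install` so that `torch_xla[tpu]` can resolve libtpu from either location. A minimal sketch of assembling that command programmatically, assuming a hypothetical helper `build_pip_install_cmd` (not part of this repository); the two URLs are the ones from the hunk:

```python
import shlex

# URLs copied from the hunk above; the constant and helper names are hypothetical.
LIBTPU_FIND_LINKS = [
    "https://storage.googleapis.com/libtpu-wheels/index.html",
    "https://storage.googleapis.com/libtpu-releases/index.html",
]


def build_pip_install_cmd(requirement, find_links):
    """Return an argv list for `pip install` with one -f flag per find-links page."""
    cmd = ["pip", "install", requirement]
    for url in find_links:
        cmd += ["-f", url]
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted version of the command instead of running it.
    print(shlex.join(build_pip_install_cmd("torch_xla[tpu]", LIBTPU_FIND_LINKS)))
```

Returning an argv list rather than a single string keeps the extras bracket in `torch_xla[tpu]` from being interpreted by a shell if the list is later handed to `subprocess.run`.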

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -1,82 +1,104 @@
 # Contribute To PyTorch/XLA
 
-We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance.
-You are very welcome to pick issues from [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels.
+We appreciate all contributions. If you are planning to contribute a bug fix for 
+an open issue, please comment on the thread and we're happy to provide guidance.
+You are welcome to pick issues with [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) 
+and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) 
+labels to get started.
 
-If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
-Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
+If you plan to contribute new features or extensions to this repository, first 
+open an issue and discuss the feature with us. Sending a PR without discussion 
+might result in a rejected PR, because we might be taking the repository in a 
+different direction.
 
 ## Building from source
 
-We recommend you to use our prebuilt Docker image to start your development work using one of the two following methods.
+We recommend you use our prebuilt Docker image to start your development work 
+using either VS Code or a local container:
 
 ### Visual Studio Code Dev Container
 
-* Create an empty directory (optionally on a remote host via SSH) and open it in VSCode. Then, clone
-  PyTorch, TorchVision, and PyTorch/XLA:
+* Create an empty directory for your workspace on your development host. These 
+  instructions assume you are using a remote host and are connecting to it over 
+  SSH.
+
+* Clone PyTorch, TorchVision, and PyTorch/XLA into your workspace directory:
 
-  ```bash
+```bash
   git clone --recursive --depth=1 https://github.com/pytorch/pytorch.git
-  # Optional: install TorchVision if you need to run tests that involve vision modules
+
+  # Install TorchVision if you need to run tests that involve vision modules
   git clone --recursive --depth=1 https://github.com/pytorch/vision.git
+
+  # Clone with HTTPS if you use a GitHub personal access token
   git clone https://github.com/pytorch/xla.git pytorch/xla
-  # Optional: use git@github.com:pytorch/xla.git instead if you prefer to use SSH with key forwarding
-  ```
 
-* Link (or copy) VSCode configuration to your workspace directory:
+  # Or clone with SSH if you prefer:
+  git clone git@github.com:pytorch/xla.git pytorch/xla
+```
+
+* Create links to VS Code configuration files in your workspace directory:
 
-  ```bash
+```bash
   ln -s pytorch/xla/.devcontainer/ .devcontainer
   ln -s pytorch/xla/contrib/vscode/ .vscode
   ln -s pytorch/xla/.style.yapf .style.yapf
   ln -s pytorch/xla/.clang-format .clang-format
-  ```
-
-* From VSCode's command menu, run `Reopen in Container` from the command palette
-  (F1 key) to open your workspace in one of our pre-built Docker containers.
-  Select the correct container config based on your local accelerator (default to
-  `tpu-contributor` if you are not sure).
-
-  * If you cannot find `Reopen in Container`, make sure the `Dev Containers`
-    VSCode extension is installed, then open the `pytorch/xla` folder in VSCode.
-
-* Since you are running as root in this container, teach `git` to recognize the
-  repositories you just cloned (outside of docker) as safe:
+```
 
-  ```bash
+* Start VS Code and ensure you have the [`Remote Development` Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
+  installed. It includes the [`Remote - SSH`](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) and
+  [`Dev Containers`](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
+  extensions.
+
+* From VS Code, connect to your remote host and open your workspace directory. 
+  You will be prompted to reopen your workspace in a container. Choose the
+  appropriate container. Use `tpu-contributor` if you are unsure of which to use. 
+  If you are not prompted to reopen in a container, in the VS Code command 
+  palette, type `Dev Containers: Reopen in Container` to open your workspace in
+  one of our pre-built Docker containers. Select the correct container based on 
+  your local accelerator. If you are unsure, use `tpu-contributor`.
+
+* Open a new terminal window in VS Code. Since you are running as root in this 
+  container, mark the repository directories as safe. The commands below assume
+  your workspace directory is `torch`; update the commands to use your workspace
+  directory.
+
+```bash
   git config --global --add safe.directory /workspaces/torch/pytorch
   git config --global --add safe.directory /workspaces/torch/pytorch/xla
   git config --global --add safe.directory /workspaces/torch/vision
-  ```
-
-* Build PyTorch, TorchVision, and PyTorch/XLA:
+```
+* In the terminal window, run the following commands to build PyTorch, 
+  TorchVision, and PyTorch/XLA:
 
-  ```bash
+```bash
   cd pytorch
   # pytorch/xla requires pytorch wheel to be presented under pytorch/dist
   python setup.py bdist_wheel
   python setup.py install
-  cd ..
-  cd vision
+  cd ../vision
   python setup.py develop
-  cd ..
-  cd pytorch/xla
+  cd ../pytorch/xla
   python setup.py develop
   # Optional: if you're using TPU, install libtpu
-  pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
+  pip install torch_xla[tpu] \
+    -f https://storage.googleapis.com/libtpu-wheels/index.html \
+    -f https://storage.googleapis.com/libtpu-releases/index.html
   ```
 
-* Test your build
+* If you are running on a TPU VM, ensure `torch` and `torch_xla` were built and 
+  installed correctly:
 
-  ```bash
+```bash
   python -c 'import torch_xla as xla; print(xla.device())'
   # Output: xla:0
-  ```
+```
 
-**Subsequent builds**: after setting up the source checkouts and building them
-for the first time, you may find the need to build everything again after e.g.
-`git pull`. You can run `scripts/build_developer.sh` which will build PyTorch,
-TorchVision, and PyTorch/XLA according to the above.
+**Subsequent builds**: after building the packages from source code for the 
+first time, you may need to build everything again, for example, after a
+`git pull`. You can run `scripts/build_developer.sh` which will rebuild PyTorch,
+TorchVision, and PyTorch/XLA.
 
 ### Manually build in Docker container
 

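The rewritten CONTRIBUTING.md verifies the build with `python -c 'import torch_xla as xla; print(xla.device())'`. A small sketch of a slightly more defensive check, assuming a hypothetical helper `missing_packages` (not part of the repository) that reports which packages are not importable before calling into `torch_xla`:

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "torch_xla"])
    if missing:
        print("not importable:", ", ".join(missing))
    else:
        # Same check as in the CONTRIBUTING.md hunk above.
        import torch_xla as xla
        print(xla.device())
```

This only confirms that the packages can be located; it does not prove that the wheels were built against matching versions.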
diff --git a/README.md b/README.md
@@ -26,14 +26,14 @@ started:
 To install PyTorch/XLA stable build in a new TPU VM:
 
 ```
-pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html
+pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
 ```
 
 To install PyTorch/XLA nightly build in a new TPU VM:
 
 ```
 pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
-pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html
+pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
 ```
 
 ### GPU Plugin
@@ -138,6 +138,11 @@ Our comprehensive user guides are available at:
   VM](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
 * [GPU guide](docs/gpu.md)
 
+## Reference implementations
+
+The [AI-Hypercomputer/tpu-recipes](https://github.com/AI-Hypercomputer/tpu-recipes)
+repo contains examples for training and serving many LLM and diffusion models.
+
 ## Available docker images and wheels
 
 ### Python packages
@@ -147,7 +152,9 @@ can now install the main build with `pip install torch_xla`. To also install the
 Cloud TPU plugin corresponding to your installed `torch_xla`, install the optional `tpu` dependencies after installing the main build with
 
 ```
-pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
+pip install torch_xla[tpu] \
+  -f https://storage.googleapis.com/libtpu-wheels/index.html \
+  -f https://storage.googleapis.com/libtpu-releases/index.html
 ```
 
 GPU and nightly builds are available in our public GCS bucket.

diff --git a/docs/source/contribute/configure-environment.md b/docs/source/contribute/configure-environment.md
@@ -87,7 +87,9 @@ via the Command Palette (`Python: Create Environment`).
 Install the latest PyTorch and PyTorch/XLA releases:
 
 ``` bash
-pip install numpy torch torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
+pip install numpy torch torch_xla[tpu] \
+  -f https://storage.googleapis.com/libtpu-wheels/index.html \
+  -f https://storage.googleapis.com/libtpu-releases/index.html
 ```
 
 Create a file `test.py`:

diff --git a/docs/source/learn/troubleshoot.md b/docs/source/learn/troubleshoot.md
@@ -254,7 +254,7 @@ the following resources:
 
 Take a look at:
 
-[examples/train_resnet_benchmark.py](https://github.com/pytorch/xla/blob/master/examples/train_resnet_benchmark.py)
+[examples/debug/train_resnet_benchmark.py](https://github.com/pytorch/xla/blob/master/examples/debug/train_resnet_benchmark.py)
 for how to benchmark a PyTorch/XLA model.
 
 ## Known Performance Caveats

diff --git a/docs/source/learn/xla-overview.md b/docs/source/learn/xla-overview.md
@@ -175,6 +175,11 @@ sudo apt-get install libopenblas-dev -y
 sudo apt-get update && sudo apt-get install libgl1 -y # diffusion specific
 ```
 
+## Reference implementations
+
+The [AI-Hypercomputer/tpu-recipes](https://github.com/AI-Hypercomputer/tpu-recipes)
+repo contains examples for training and serving many LLM and diffusion models.
+
 ## Converting code to PyTorch XLA
 
 General guidelines to modify your code:

diff --git a/experimental/torch_xla2/dev-requirements.txt b/experimental/torch_xla2/dev-requirements.txt
@@ -1,4 +1,4 @@
 -f https://download.pytorch.org/whl/torch
-torch==2.4.0; sys_platform == 'darwin'  # macOS
-torch==2.4.0+cpu; sys_platform != 'darwin' # Non-macOS (CPU-only), like on TPU
+torch==2.5.1; sys_platform == 'darwin'  # macOS
+torch==2.5.1+cpu; sys_platform != 'darwin' # Non-macOS (CPU-only), like on TPU
 ruff~=0.3.5
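
The two `torch` pins in the hunk above are selected by the `sys_platform` environment marker, which pip evaluates against Python's `sys.platform`. A minimal sketch of the same selection logic, assuming a hypothetical function `select_torch_pin` written only for illustration:

```python
import sys


def select_torch_pin(platform=None):
    """Mirror the two marker-guarded lines in dev-requirements.txt.

    `platform` defaults to sys.platform, the value behind the
    `sys_platform` environment marker.
    """
    platform = sys.platform if platform is None else platform
    if platform == "darwin":
        return "torch==2.5.1"      # macOS
    return "torch==2.5.1+cpu"      # non-macOS (CPU-only), like on TPU


if __name__ == "__main__":
    print(select_torch_pin())
```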
diff --git a/experimental/torch_xla2/examples/basic_training.py b/experimental/torch_xla2/examples/basic_training.py
@@ -51,7 +51,8 @@ def matplotlib_imshow(img, one_channel=False):
         plt.imshow(npimg, cmap="Greys")
     else:
         plt.imshow(np.transpose(npimg, (1, 2, 0)))
-
+#torch_xla2.env.config.debug_print_each_op = True
+#torch_xla2.env.config.debug_mixed_tensor = True
 dataiter = iter(training_loader)
 images, labels = next(dataiter)
 
@@ -80,15 +81,15 @@ def forward(self, x):
         return x
 
 
-model = GarmentClassifier()
+model = GarmentClassifier().to('jax')
 
 loss_fn = torch.nn.CrossEntropyLoss()
 
 # NB: Loss functions expect data in batches, so we're creating batches of 4
 # Represents the model's confidence in each of the 10 classes for a given input
-dummy_outputs = torch.rand(4, 10)
+dummy_outputs = torch.rand(4, 10, device='jax')
 # Represents the correct class among the 10 being tested
-dummy_labels = torch.tensor([1, 5, 3, 7])
+dummy_labels = torch.tensor([1, 5, 3, 7], device='jax')
 
 print(dummy_outputs)
 print(dummy_labels)
@@ -110,6 +111,8 @@ def train_one_epoch(epoch_index, tb_writer=None):
         # Every data instance is an input + label pair
         # NEW: Move model to XLA device
         inputs, labels = data
+        inputs = inputs.to('jax')
+        labels = labels.to('jax')
 
         # Zero your gradients for every batch!
         optimizer.zero_grad()
@@ -162,7 +165,9 @@ def train_one_epoch(epoch_index, tb_writer=None):
     # Disable gradient computation and reduce memory consumption.
     with torch.no_grad():
         for i, vdata in enumerate(validation_loader):
-          # NOTE: move to XLA device
+          vinputs, vlabels = vdata
+          vinputs = vinputs.to('jax')
+          vlabels = vlabels.to('jax')
           voutputs = model(vinputs)  # call model's forward
           vloss = loss_fn(voutputs, vlabels)
           running_vloss += vloss
@@ -172,15 +177,11 @@ def train_one_epoch(epoch_index, tb_writer=None):
 
     # Log the running loss averaged per batch
     # for both training and validation
-    writer.add_scalars('Training vs. Validation Loss',
-                    { 'Training' : avg_loss, 'Validation' : avg_vloss },
-                    epoch_number + 1)
-    writer.flush()
-
-    # Track best performance, and save the model's state
-    if avg_vloss < best_vloss:
-        best_vloss = avg_vloss
-        model_path = 'model_{}_{}'.format(timestamp, epoch_number)
-        torch.save(model.state_dict(), model_path)
+
+    # # Track best performance, and save the model's state
+    # if avg_vloss < best_vloss:
+    #     best_vloss = avg_vloss
+    #     model_path = 'model_{}_{}'.format(timestamp, epoch_number)
+    #     torch.save(model.state_dict(), model_path)
 
     epoch_number += 1
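
The training and validation hunks above repeat the same pattern: unpack the batch, then call `.to('jax')` on each element. A dependency-free sketch of factoring that into one helper, assuming a hypothetical function `to_device` and a stand-in `FakeTensor` class used here only so the example runs without `torch` or `torch_xla2` installed:

```python
def to_device(batch, device):
    """Call .to(device) on every element of a batch and return a new tuple."""
    return tuple(item.to(device) for item in batch)


class FakeTensor:
    """Stand-in for a tensor; only models the .to(device) call."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        # Like a real tensor, return a new object rather than mutating in place.
        return FakeTensor(device)


if __name__ == "__main__":
    inputs, labels = to_device((FakeTensor(), FakeTensor()), "jax")
    print(inputs.device, labels.device)
```

With real data loaders, the same call shape would be `inputs, labels = to_device(data, 'jax')`, under the assumption that every element of the batch exposes `.to()`.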