Merge branch 'pytorch:master' into rpsilva_spmd_lc
rpsilva-aws authored Dec 5, 2024
2 parents 05c996e + 4c99d21 commit 0f128c5
Showing 35 changed files with 565 additions and 194 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/_tpu_ci.yml
@@ -25,7 +25,7 @@ jobs:
pip install rich
# Jax nightly is needed for pallas tests.
pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-wheels/index.html -f https://storage.googleapis.com/libtpu-releases/index.html
pip install --upgrade protobuf
- name: Run Tests
env:
106 changes: 64 additions & 42 deletions CONTRIBUTING.md
@@ -1,82 +1,104 @@
# Contribute To PyTorch/XLA

We appreciate all contributions. If you are planning to contribute a bug fix for an open issue, please comment on the thread and we're happy to provide any guidance.
You are very welcome to pick issues from [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22) labels.
We appreciate all contributions. If you are planning to contribute a bug fix for
an open issue, please comment on the thread and we're happy to provide guidance.
You are welcome to pick issues with [good first issue](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
and [help wanted](https://github.com/pytorch/xla/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22)
labels to get started.

If you plan to contribute new features, utility functions or extensions to the core, please first open an issue and discuss the feature with us.
Sending a PR without discussion might end up resulting in a rejected PR, because we might be taking the core in a different direction than you might be aware of.
If you plan to contribute new features or extensions to this repository, first
open an issue and discuss the feature with us. Sending a PR without discussion
might result in a rejected PR, because we might be taking the repository in a
different direction.

## Building from source

We recommend you to use our prebuilt Docker image to start your development work using one of the two following methods.
We recommend you use our prebuilt Docker image to start your development work
using either VS Code or a local container:

### Visual Studio Code Dev Container

* Create an empty directory (optionally on a remote host via SSH) and open it in VSCode. Then, clone
PyTorch, TorchVision, and PyTorch/XLA:
* Create an empty directory for your workspace on your development host. These
instructions assume you are using a remote host and are connecting to it over
SSH.

* Clone PyTorch, TorchVision, and PyTorch/XLA into your workspace directory:

```bash
```bash
git clone --recursive --depth=1 https://github.com/pytorch/pytorch.git
# Optional: install TorchVision if you need to run tests that involve vision modules

# Install TorchVision if you need to run tests that involve vision modules
git clone --recursive --depth=1 https://github.com/pytorch/vision.git

# Clone with HTTPS if you use a GitHub personal access token
git clone https://github.com/pytorch/xla.git pytorch/xla
# Optional: use git@github.com:pytorch/xla.git instead if you prefer to use SSH with key forwarding
```

* Link (or copy) VSCode configuration to your workspace directory:
# Or clone with SSH if you prefer:
git clone git@github.com:pytorch/xla.git pytorch/xla
```

* Create links to VS Code configuration files in your workspace directory:

```bash
```bash
ln -s pytorch/xla/.devcontainer/ .devcontainer
ln -s pytorch/xla/contrib/vscode/ .vscode
ln -s pytorch/xla/.style.yapf .style.yapf
ln -s pytorch/xla/.clang-format .clang-format
```

* From VSCode's command menu, run `Reopen in Container` from the command palette
(F1 key) to open your workspace in one of our pre-built Docker containers.
Select the correct container config based on your local accelerator (default to
`tpu-contributor` if you are not sure).

* If you cannot find `Reopen in Container`, make sure the `Dev Containers`
VSCode extension is installed, then open the `pytorch/xla` folder in VSCode.

* Since you are running as root in this container, teach `git` to recognize the
repositories you just cloned (outside of docker) as safe:
```

```bash
* Start VS Code and ensure you have the [`Remote Development` Extension Pack](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack)
installed. It includes the [`Remote - SSH`](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) and
[`Dev Containers`](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
extensions.

* From VS Code, connect to your remote host and open your workspace directory.
You will be prompted to reopen your workspace in a container. Choose the
appropriate container. Use `tpu-contributor` if you are unsure of which to use.
If you are not prompted to reopen in a container, in the VS Code command
palette, type `Dev Containers: Reopen in Container` to open your workspace in
one of our pre-built Docker containers. Select the correct container based on
your local accelerator. If you are unsure, use `tpu-contributor`.

* Open a new terminal window in VS Code. Since you are running as root in this
container, mark the repository directories as safe. The commands below assume
your workspace directory is `torch`; update them if your workspace directory
has a different name.

```bash
git config --global --add safe.directory /workspaces/torch/pytorch
git config --global --add safe.directory /workspaces/torch/pytorch/xla
git config --global --add safe.directory /workspaces/torch/vision
```

* Build PyTorch, TorchVision, and PyTorch/XLA:
```
* In the terminal window, run the following commands to build PyTorch,
TorchVision, and PyTorch/XLA:

```bash
```bash
cd pytorch
# pytorch/xla requires pytorch wheel to be presented under pytorch/dist
python setup.py bdist_wheel
python setup.py install
cd ..
cd vision
cd ../vision
python setup.py develop
cd ..
cd pytorch/xla
cd ../pytorch/xla
python setup.py develop
# Optional: if you're using TPU, install libtpu
pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
pip install torch_xla[tpu] \
-f https://storage.googleapis.com/libtpu-wheels/index.html \
-f https://storage.googleapis.com/libtpu-releases/index.html
```

* Test your build
* If you are running on a TPU VM, ensure `torch` and `torch_xla` were built and
installed correctly:

```bash
```bash
python -c 'import torch_xla as xla; print(xla.device())'
# Output: xla:0
```
```
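
Before running the device check above, it can help to confirm that both packages are visible to the interpreter at all. The following is a minimal sketch, not part of this commit; the helper name `installed` is hypothetical, and it only uses the standard library's `importlib.util.find_spec`, which locates a module without importing it:

```python
import importlib.util


def installed(*module_names):
    """Map each top-level module name to True if Python can locate it.

    Nothing is imported, so this is safe to run even when no accelerator
    is attached.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    # `torch` and `torch_xla` are the packages built in the steps above.
    for name, found in installed("torch", "torch_xla").items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either package is reported missing, the `python setup.py` steps for that package most likely did not complete.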

**Subsequent builds**: after setting up the source checkouts and building them
for the first time, you may find the need to build everything again after e.g.
`git pull`. You can run `scripts/build_developer.sh` which will build PyTorch,
TorchVision, and PyTorch/XLA according to the above.
**Subsequent builds**: after building the packages from source code for the
first time, you may need to build everything again, for example, after a
`git pull`. You can run `scripts/build_developer.sh` which will rebuild PyTorch,
TorchVision, and PyTorch/XLA.

### Manually build in Docker container

13 changes: 10 additions & 3 deletions README.md
@@ -26,14 +26,14 @@ started:
To install PyTorch/XLA stable build in a new TPU VM:

```
pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html
pip install torch~=2.5.0 torch_xla[tpu]~=2.5.0 -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
```

To install PyTorch/XLA nightly build in a new TPU VM:

```
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html -f https://storage.googleapis.com/libtpu-wheels/index.html
```

### GPU Plugin
@@ -138,6 +138,11 @@ Our comprehensive user guides are available at:
VM](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
* [GPU guide](docs/gpu.md)

## Reference implementations

The [AI-Hypercomputer/tpu-recipes](https://github.com/AI-Hypercomputer/tpu-recipes)
repo contains examples for training and serving many LLM and diffusion models.

## Available docker images and wheels

### Python packages
@@ -147,7 +152,9 @@ can now install the main build with `pip install torch_xla`. To also install the
Cloud TPU plugin corresponding to your installed `torch_xla`, install the optional `tpu` dependencies after installing the main build with

```
pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
pip install torch_xla[tpu] \
-f https://storage.googleapis.com/libtpu-wheels/index.html \
-f https://storage.googleapis.com/libtpu-releases/index.html
```
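
The install commands in this commit pass `-f` (`--find-links`) twice; pip accepts the option more than once and considers candidates from every location given. As a rough sketch of how a setup script might assemble the same command, assuming a hypothetical helper `pip_install_args` that is not part of this repository:

```python
import shlex

# Both URLs appear in the install commands changed by this commit.
FIND_LINKS = [
    "https://storage.googleapis.com/libtpu-wheels/index.html",
    "https://storage.googleapis.com/libtpu-releases/index.html",
]


def pip_install_args(requirement, find_links=FIND_LINKS):
    """Return the argv for `pip install` with one `-f` flag per URL."""
    args = ["pip", "install", requirement]
    for url in find_links:
        args += ["-f", url]
    return args


if __name__ == "__main__":
    # shlex.join quotes `torch_xla[tpu]` so the brackets survive the shell.
    print(shlex.join(pip_install_args("torch_xla[tpu]")))
```

Building the command as a list and quoting it only for display avoids shell globbing on the `[tpu]` extra.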

GPU and nightly builds are available in our public GCS bucket.
4 changes: 3 additions & 1 deletion docs/source/contribute/configure-environment.md
@@ -87,7 +87,9 @@ via the Command Palette (`Python: Create Environment`).
Install the latest PyTorch and PyTorch/XLA releases:

``` bash
pip install numpy torch torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
pip install numpy torch torch_xla[tpu] \
-f https://storage.googleapis.com/libtpu-wheels/index.html \
-f https://storage.googleapis.com/libtpu-releases/index.html
```

Create a file `test.py`:
2 changes: 1 addition & 1 deletion docs/source/learn/troubleshoot.md
@@ -254,7 +254,7 @@ the following resources:

Take a look at:

[examples/train_resnet_benchmark.py](https://github.com/pytorch/xla/blob/master/examples/train_resnet_benchmark.py)
[examples/debug/train_resnet_benchmark.py](https://github.com/pytorch/xla/blob/master/examples/debug/train_resnet_benchmark.py)
for how to benchmark a PyTorch/XLA model.

## Known Performance Caveats
5 changes: 5 additions & 0 deletions docs/source/learn/xla-overview.md
@@ -175,6 +175,11 @@ sudo apt-get install libopenblas-dev -y
sudo apt-get update && sudo apt-get install libgl1 -y # diffusion specific
```

## Reference implementations

The [AI-Hypercomputer/tpu-recipes](https://github.com/AI-Hypercomputer/tpu-recipes)
repo contains examples for training and serving many LLM and diffusion models.

## Converting code to PyTorch XLA

General guidelines to modify your code:
4 changes: 2 additions & 2 deletions experimental/torch_xla2/dev-requirements.txt
@@ -1,4 +1,4 @@
-f https://download.pytorch.org/whl/torch
torch==2.4.0; sys_platform == 'darwin' # macOS
torch==2.4.0+cpu; sys_platform != 'darwin' # Non-macOS (CPU-only), like on TPU
torch==2.5.1; sys_platform == 'darwin' # macOS
torch==2.5.1+cpu; sys_platform != 'darwin' # Non-macOS (CPU-only), like on TPU
ruff~=0.3.5
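
The two `torch` lines above rely on environment markers: `sys_platform` is compared against Python's `sys.platform`, so exactly one line applies on any given machine. A small sketch of the same selection logic in plain Python, assuming a hypothetical function name `torch_requirement` used only for illustration:

```python
import sys


def torch_requirement(platform=None):
    """Mirror the marker-guarded lines: plain wheel on macOS, CPU-only elsewhere."""
    platform = sys.platform if platform is None else platform
    # `sys_platform == 'darwin'` in the requirements file is this comparison.
    return "torch==2.5.1" if platform == "darwin" else "torch==2.5.1+cpu"


if __name__ == "__main__":
    print(torch_requirement())
```

pip evaluates the markers itself when it reads the requirements file; the function only illustrates which line it would pick.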
31 changes: 16 additions & 15 deletions experimental/torch_xla2/examples/basic_training.py
@@ -51,7 +51,8 @@ def matplotlib_imshow(img, one_channel=False):
plt.imshow(npimg, cmap="Greys")
else:
plt.imshow(np.transpose(npimg, (1, 2, 0)))

#torch_xla2.env.config.debug_print_each_op = True
#torch_xla2.env.config.debug_mixed_tensor = True
dataiter = iter(training_loader)
images, labels = next(dataiter)

@@ -80,15 +81,15 @@ def forward(self, x):
return x


model = GarmentClassifier()
model = GarmentClassifier().to('jax')

loss_fn = torch.nn.CrossEntropyLoss()

# NB: Loss functions expect data in batches, so we're creating batches of 4
# Represents the model's confidence in each of the 10 classes for a given input
dummy_outputs = torch.rand(4, 10)
dummy_outputs = torch.rand(4, 10, device='jax')
# Represents the correct class among the 10 being tested
dummy_labels = torch.tensor([1, 5, 3, 7])
dummy_labels = torch.tensor([1, 5, 3, 7], device='jax')

print(dummy_outputs)
print(dummy_labels)
@@ -110,6 +111,8 @@ def train_one_epoch(epoch_index, tb_writer=None):
# Every data instance is an input + label pair
# NEW: Move model to XLA device
inputs, labels = data
inputs = inputs.to('jax')
labels = labels.to('jax')

# Zero your gradients for every batch!
optimizer.zero_grad()
@@ -162,7 +165,9 @@ def train_one_epoch(epoch_index, tb_writer=None):
# Disable gradient computation and reduce memory consumption.
with torch.no_grad():
for i, vdata in enumerate(validation_loader):
# NOTE: move to XLA device
vinputs, vlabels = vdata
vinputs = vinputs.to('jax')
vlabels = vlabels.to('jax')
voutputs = model(vinputs) # call model's forward
vloss = loss_fn(voutputs, vlabels)
running_vloss += vloss
@@ -172,15 +177,11 @@ def train_one_epoch(epoch_index, tb_writer=None):

# Log the running loss averaged per batch
# for both training and validation
writer.add_scalars('Training vs. Validation Loss',
{ 'Training' : avg_loss, 'Validation' : avg_vloss },
epoch_number + 1)
writer.flush()

# Track best performance, and save the model's state
if avg_vloss < best_vloss:
best_vloss = avg_vloss
model_path = 'model_{}_{}'.format(timestamp, epoch_number)
torch.save(model.state_dict(), model_path)

# # Track best performance, and save the model's state
# if avg_vloss < best_vloss:
# best_vloss = avg_vloss
# model_path = 'model_{}_{}'.format(timestamp, epoch_number)
# torch.save(model.state_dict(), model_path)

epoch_number += 1
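
The training and validation loops in this example each move their inputs and labels to the `'jax'` device with separate `.to('jax')` calls. One possible way to factor that out is sketched below. This is not code from the commit: `to_device` is a hypothetical helper, and `FakeTensor` is a stand-in so the sketch runs without `torch` installed; with real tensors only the helper would be needed.

```python
class FakeTensor:
    """Stand-in for a tensor; only models the `.to(device)` call."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        # Like torch.Tensor.to, return a new object on the target device.
        return FakeTensor(device)


def to_device(batch, device="jax"):
    """Return a tuple with every element of `batch` moved to `device`."""
    return tuple(item.to(device) for item in batch)


if __name__ == "__main__":
    data = (FakeTensor(), FakeTensor())  # plays the role of (inputs, labels)
    inputs, labels = to_device(data)
    print(inputs.device, labels.device)
```

In the loops above this would replace the paired `.to('jax')` lines with a single `inputs, labels = to_device(data)`.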