Commit

Add process group documentation for SPMD (#6469)
jonb377 authored Feb 5, 2024
1 parent 4814d64 commit 732a1c7
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions docs/spmd.md
@@ -320,6 +320,28 @@ for step, data in enumerate(dataloader):
print(f'Checkpoint taken at step {step}')
```

### Process Groups
To use `torch.distributed` APIs such as distributed checkpointing, a process
group is required. In SPMD mode, the `xla` backend is not supported since the
compiler is responsible for all collectives.

Instead, a CPU process group such as `gloo` must be used. On TPUs, the `xla://`
init_method is still supported to discover the master IP, global world size,
and host rank. An example initialization is below:

```python
import torch.distributed as dist
# Import to register the `xla://` init_method
import torch_xla.distributed.xla_backend
import torch_xla.runtime as xr

xr.use_spmd()

# The `xla://` init_method will automatically discover master worker IP, rank,
# and global world size without requiring environment configuration on TPUs.
dist.init_process_group('gloo', init_method='xla://')
```

### Virtual Device Optimization

PyTorch/XLA normally transfers tensor data asynchronously from host to device once the tensor is defined. This is to overlap the data transfer with the graph tracing time. However, because GSPMD allows the user to modify the tensor sharding _after_ the tensor has been defined, we need an optimization to prevent unnecessary transfer of tensor data back and forth between host and device. We introduce Virtual Device Optimization, a technique that first places the tensor data on a virtual device, SPMD:0, and uploads it to the physical devices only once all the sharding decisions are finalized. All tensor data in SPMD mode is placed on this virtual device. The virtual device is exposed to the user as an XLA device, XLA:0, with the actual shards on physical devices such as TPU:0, TPU:1, etc.