A torch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead.
- Simple, yet powerful, API
- Easy initialization of `torch.distributed` (see the sketch after this list)
- Distributed checkpointing and metrics
- Extensive logging and diagnostics
- Wandb support
- A wealth of useful utility functions
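As a point of reference, setting up `torch.distributed` by hand involves boilerplate like the following. This is a plain-PyTorch sketch of the kind of setup the initialization helpers take care of, not dmlcloud's own API:

```python
# Generic torch.distributed setup, shown for comparison only -- not dmlcloud code.
import os

import torch
import torch.distributed as dist


def manual_init() -> int:
    # Launchers such as torchrun export these variables for every worker.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL for GPU clusters, gloo as a CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Bind this process to its GPU.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return local_rank
```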
dmlcloud can be installed directly from PyPI:
```
pip install dmlcloud
```
Alternatively, you can install the latest development version directly from GitHub:
```
pip install git+https://github.com/sehoffmann/dmlcloud.git
```
See `examples/barebone_mnist.py` for a minimal, barebones example of distributed MNIST training. To run it on a single node with 4 GPUs, use:
```
dmlrun -n 4 python examples/barebone_mnist.py
```
`dmlrun` is a thin wrapper around `torchrun` that makes development work on a single node easier.
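Under standard `torchrun` semantics, the invocation above should correspond roughly to the following; the exact flags `dmlrun` forwards are an assumption:
```
torchrun --standalone --nproc_per_node 4 examples/barebone_mnist.py
```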
To run your training across multiple nodes on a Slurm cluster instead, you can simply use `srun`:
```
srun --ntasks-per-node [NUM_GPUS] python examples/barebone_mnist.py
```
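For non-interactive runs, the same `srun` line can live inside a batch script. A minimal sketch, assuming two nodes with 4 GPUs each and a cluster that exposes GPUs via `--gpus-per-node` (all resource values are placeholders, adjust to your cluster):
```
#!/bin/bash
#SBATCH --nodes=2              # number of nodes (placeholder)
#SBATCH --ntasks-per-node=4    # one task (process) per GPU
#SBATCH --gpus-per-node=4      # GPUs requested on each node

# srun inherits the task layout from the #SBATCH directives above
srun python examples/barebone_mnist.py
```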
You can find the official documentation at Read the Docs.