A torch library for easy distributed deep learning on HPC clusters. Supports both Slurm and MPI. No unnecessary abstractions or overhead.
- Simple, yet powerful, API
- Easy initialization of `torch.distributed` (see the sketch after this list)
- Distributed checkpointing and metrics
- Extensive logging and diagnostics
- Wandb support
- A wealth of useful utility functions
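As a point of reference, setting up `torch.distributed` by hand involves boilerplate like the following. This is a plain-PyTorch sketch of the kind of setup the initialization helpers take care of, not dmlcloud's own API:

```python
# Generic torch.distributed setup, shown for comparison only -- not dmlcloud code.
import os

import torch
import torch.distributed as dist


def manual_init() -> int:
    # Launchers such as torchrun export these variables for every worker.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # NCCL for GPU clusters, gloo as a CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Bind this process to its GPU.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return local_rank
```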
dmlcloud can be installed directly from PyPI:
```
pip install dmlcloud
```
Alternatively, you can install the latest development version directly from GitHub:
```
pip install git+https://github.com/sehoffmann/dmlcloud.git
```
See `examples/barebone_mnist.py` for a minimal, barebones example of distributed MNIST training. To run it on a single node with 4 GPUs, use:
```
dmlrun -n 4 python examples/barebone_mnist.py
```
`dmlrun` is a thin wrapper around `torchrun` that makes development work on a single node easier.
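Under standard `torchrun` semantics, the invocation above should correspond roughly to the following; the exact flags `dmlrun` forwards are an assumption:
```
torchrun --standalone --nproc_per_node 4 examples/barebone_mnist.py
```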
To run your training across multiple nodes on a Slurm cluster instead, you can simply use `srun`:
```
srun --ntasks-per-node [NUM_GPUS] python examples/barebone_mnist.py
```
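For non-interactive runs, the same `srun` line can live inside a batch script. A minimal sketch, assuming two nodes with 4 GPUs each and a cluster that exposes GPUs via `--gpus-per-node` (all resource values are placeholders, adjust to your cluster):
```
#!/bin/bash
#SBATCH --nodes=2              # number of nodes (placeholder)
#SBATCH --ntasks-per-node=4    # one task (process) per GPU
#SBATCH --gpus-per-node=4      # GPUs requested on each node

# srun inherits the task layout from the #SBATCH directives above
srun python examples/barebone_mnist.py
```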
You can find the official documentation at Read the Docs.