MCR-DL is a HiDL project. We encourage you to visit the HiDL website for additional information, the latest performance numbers, and similar projects on high-performance machine learning and deep learning. For the latest announcements on HiDL projects, register for the HiDL mailing list.
[12/24/2023] The initial release of MCR-DL! For now, we only support basic single-backend communication without going through the PyTorch distributed module. For a full list of new and existing features, please see the MCR-DL feature page.
The initial release of MCR-DL does not include mixed-backend optimizations, but it does allow users to decouple communication backends from PyTorch's distributed module. This enables much faster small-message performance, lets non-NCCL backends be used with PyTorch without messy source builds, enables communication logging and PyTorch communication benchmarks, and greatly simplifies communication optimizations such as compression.
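To make the decoupling concrete, here is a minimal sketch of what driving a communication backend directly through MCR-DL (instead of torch.distributed) might look like. The initializer and op names used below (mcr_dl.init_processes, mcr_dl.all_reduce) and their arguments are illustrative assumptions, not the confirmed interface; consult the MCR-DL API documentation for the actual calls.

```python
import torch
import mcr_dl  # assumes MCR-DL is installed as described below

# Hypothetical initialization: pick a backend (e.g. MPI or NCCL) without
# going through torch.distributed.init_process_group(). The function name
# and signature here are illustrative assumptions.
mcr_dl.init_processes(backend="mpi")

tensor = torch.ones(1024, dtype=torch.float32, device="cuda")

# Hypothetical collective call routed through the selected backend rather
# than through the PyTorch distributed module.
mcr_dl.all_reduce(tensor)
```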
- Python 3.8 or later (for Linux, Python 3.8.1+ is needed).
- Any MPI library (we recommend MVAPICH2-GDR), NCCL, or both
Refer to the MVAPICH2-GDR user guide to install MVAPICH2-GDR.
- PyTorch 1.12.1 or later
Refer to the PyTorch installation guide to install PyTorch from source and configure MVAPICH2-GDR support.
Note: We used the following versions during implementation and testing: Python=3.9.16, CUDA=11.7, GCC=10.3.0, CMake=3.22.2, PyTorch=2.0.1, MVAPICH2-GDR=2.3.7.
cd MCR-DL
python setup.py install
Update mpi, cuda, and nccl paths appropriately in mcr_dl/config.yml
The intent of these benchmarks is to measure the communication latency/bandwidth of MCR-DL and/or PyTorch distributed communication operations at the Python layer (a minimal timing sketch follows this list). These benchmarks are complementary to C-level comms benchmarks such as the OSU Micro-Benchmarks and NCCL Tests in that users can:
- Easily identify which layer of the communication software stack a hang or performance degradation originates from.
- Measure the expected communication performance of either MCR-DL comms or pure PyTorch distributed operations.
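The sketch below is not the benchmark code itself; it only illustrates what "Python-layer" timing of a collective means, here using plain torch.distributed with CUDA events. The tensor size, warmup/iteration counts, and bandwidth formula are illustrative choices.

```python
import torch
import torch.distributed as dist

def time_all_reduce(numel=1024 * 1024, warmups=5, iters=20):
    """Time an all_reduce at the Python layer; assumes the default process
    group is already initialized (e.g. dist.init_process_group("nccl"))."""
    tensor = torch.ones(numel, dtype=torch.float32, device="cuda")

    # Warm up to exclude one-time setup costs from the measurement.
    for _ in range(warmups):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()

    avg_ms = start.elapsed_time(end) / iters
    size_bytes = tensor.numel() * tensor.element_size()
    algbw_gbps = size_bytes / (avg_ms * 1e-3) / 1e9  # algorithmic bandwidth
    return avg_ms, algbw_gbps
```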
To run benchmarks, there are two options:
- Run a single communication operation:
For example, run with a single large message size (calculated to barely fit within GPU memory):
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python all_reduce.py
Scan across message sizes:
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python all_reduce.py --scan
Benchmark pure PyTorch distributed comms (without importing or using MCR-DL) by launching with MPI
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python all_reduce.py --scan --dist="torch"
or Slurm
srun -n 16 python all_reduce.py --scan --dist="torch"
- Run all available communication benchmarks:
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python run_all.py
Like the individual benchmarks, run_all.py supports scanning arguments for the max message size, bw-unit, etc. Simply pass the desired arguments to run_all.py and they'll be propagated to each comm op.
Finally, users can choose specific communication operations to run in run_all.py by passing them as arguments (all operations are run by default). For example:
mpirun -np 16 --hostfile ${HOSTFILE} -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD python run_all.py --scan --all-reduce --all-to-all --broadcast
To add new communication benchmarks, follow this general procedure:
- Copy a similar benchmark file (e.g. to add reduce_scatter, copy all_reduce.py as a template)
- Add a new bw formula in utils.get_bw, a new maximum tensor element formula in utils.max_numel, and a new arg in utils.benchmark_parser (see the sketch after this list)
- Replace comm op calls in the new file with find-replace
- Find a good default mem_factor for use in the run_<collective>_single() function
- Add the new comm op to run_all.py
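As a concrete (and hypothetical) example of the second item above, plus the mem_factor mentioned in the fourth, the sketch below shows the kind of additions one might make for reduce_scatter. The actual helper signatures, argument names, and return conventions in MCR-DL's benchmark utils may differ.

```python
# Sketch only: signatures and return values are assumptions, not MCR-DL's
# exact utils.py interface.
import torch

def get_bw(comm_op, size, duration, world_size):
    if comm_op == "reduce_scatter":
        # nccl-tests convention: algbw = bytes / time, busbw = algbw * (n-1)/n
        algbw = size / duration
        busbw = algbw * (world_size - 1) / world_size
        return algbw, busbw
    # ... existing ops (all_reduce, all_gather, all_to_all, ...) ...

def max_numel(comm_op, dtype, mem_factor, total_gpu_memory):
    if comm_op == "reduce_scatter":
        # The full input tensor lives on every rank, so size it like all_reduce
        # and leave headroom via mem_factor (e.g. 0.8).
        element_size = torch.tensor([], dtype=dtype).element_size()
        return int(total_gpu_memory * mem_factor) // element_size
    # ... existing ops ...

def benchmark_parser(parser):
    # New flag so run_all.py can select the op (flag name is illustrative).
    parser.add_argument("--reduce-scatter", action="store_true",
                        help="run the reduce_scatter benchmark")
    return parser
```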
- Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, and Dhabaleswar Panda. (2023). MCR-DL: Mix-and-Match Communication Runtime for Deep Learning. arXiv:2303.08374. To appear at IPDPS 2023.