RDMA Network Setup for PyTorch Applications

Zhaobo edited this page Jan 9, 2023 · 5 revisions

Prerequisites

  • CX-5 (ConnectX-5) adapter and driver (OFED 5.4)
  • NVIDIA GPU and driver (510)
  • Linux kernel (5.15)

Basic Procedure and Commands

1. Install drivers and dependent libraries

2. Check the devices

  • ibstat
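
As a complement to the command above, the port state can also be read from the kernel's sysfs tree. The following is a minimal sketch, assuming the RDMA driver exposes devices under /sys/class/infiniband with one `state` and one `rate` file per port; the function name `list_rdma_ports` is made up for this example.

```python
from pathlib import Path


def list_rdma_ports(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): {"state": ..., "rate": ...}} read from sysfs."""
    ports = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA devices are visible (for example, the driver is not loaded).
        return ports
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            info = {}
            for name in ("state", "rate"):
                f = port / name
                info[name] = f.read_text().strip() if f.is_file() else None
            ports[(dev.name, port.name)] = info
    return ports


if __name__ == "__main__":
    for (dev, port), info in list_rdma_ports().items():
        print(f"{dev} port {port}: state={info['state']} rate={info['rate']}")
```

A port that is ready for traffic typically reports a state string containing ACTIVE.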

3. Configure an IP address on the CX-5 device
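
One possible way to do this is with the iproute2 `ip` tool. The sketch below only builds the command lines and does not run them. The interface name `ens1f0` and the address `192.168.100.1/24` are placeholders, not values from this page; substitute the network interface that belongs to the CX-5 port and an address from your own subnet.

```python
import ipaddress
import shlex


def ip_config_commands(ifname, cidr):
    """Build the `ip` commands that assign an address and bring the link up."""
    # Validate the address early; raises ValueError on malformed input.
    iface = ipaddress.ip_interface(cidr)
    return [
        ["ip", "addr", "add", str(iface), "dev", ifname],
        ["ip", "link", "set", "dev", ifname, "up"],
    ]


if __name__ == "__main__":
    # Placeholder interface name and address (assumptions for illustration).
    for cmd in ip_config_commands("ens1f0", "192.168.100.1/24"):
        print("sudo " + shlex.join(cmd))
```

Each node needs its own address in the same subnet so that the smoke test in the next step can reach its peer.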

4. Smoke test

  • ib_send_bw -a -b -R -d mlx5_2
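
ib_send_bw runs as a server/client pair: the command above starts the server side, and the client runs the same command with the server's IP address appended. The sketch below builds both command lines. The server address `192.168.100.1` is a placeholder, and `mlx5_2` is simply the device name used on this page.

```python
import shlex


def ib_send_bw_commands(device="mlx5_2", server_ip="192.168.100.1"):
    """Return the server and client command lines for the bandwidth smoke test."""
    # -a: sweep all message sizes, -b: bidirectional,
    # -R: set up the connection with rdma_cm, -d: RDMA device to use.
    base = ["ib_send_bw", "-a", "-b", "-R", "-d", device]
    return {
        "server": list(base),
        "client": base + [server_ip],  # the client also names the server to reach
    }


if __name__ == "__main__":
    cmds = ib_send_bw_commands()
    print("on the server node:", shlex.join(cmds["server"]))
    print("on the client node:", shlex.join(cmds["client"]))
```

Start the server first, then the client; both sides print a bandwidth table when the run completes.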

5. Rebuild the NCCL and PyTorch software stack (NCCL >= 2.14)
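
After the rebuild, it may be worth confirming which NCCL version PyTorch actually links against. The sketch below assumes a recent PyTorch in which `torch.cuda.nccl.version()` returns a `(major, minor, patch)` tuple; the helper names are made up for this example, and the check is skipped when PyTorch is not installed.

```python
def meets_min_nccl(version, minimum=(2, 14)):
    """True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


def installed_nccl_version():
    """Return the NCCL version PyTorch reports as a tuple, or None if unknown."""
    try:
        import torch
        version = torch.cuda.nccl.version()
    except Exception:
        # PyTorch is missing or was built without NCCL support.
        return None
    # Older PyTorch releases returned an integer; treat that as unknown here.
    return tuple(version) if isinstance(version, tuple) else None


if __name__ == "__main__":
    found = installed_nccl_version()
    if found is None:
        print("could not determine the NCCL version")
    else:
        print("NCCL", found, "ok" if meets_min_nccl(found) else "too old, need >= 2.14")
```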

6. PyTorch distributed training test

  • If not already done, change the Docker default runtime to nvidia, then reload the daemon and restart Docker
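
The default runtime is set in Docker's daemon configuration file, usually /etc/docker/daemon.json. The sketch below shows the keys involved by merging them into an existing configuration dictionary. It assumes the NVIDIA container runtime is already installed and reachable as `nvidia-container-runtime`; the function name is made up for this example.

```python
import json


def with_nvidia_default_runtime(config):
    """Return a copy of a daemon.json dict with nvidia as the default runtime."""
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    # Register the nvidia runtime only if it is not configured already.
    runtimes.setdefault(
        "nvidia", {"path": "nvidia-container-runtime", "runtimeArgs": []}
    )
    updated["runtimes"] = runtimes
    updated["default-runtime"] = "nvidia"
    return updated


if __name__ == "__main__":
    # Example: start from an empty configuration and print the result.
    print(json.dumps(with_nvidia_default_runtime({}), indent=2))
```

After writing the result to the daemon configuration file, `sudo systemctl daemon-reload` followed by `sudo systemctl restart docker` applies the change.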

7. RDMA packet and speed verification under a PyTorch application

  • Mellanox tcpdump special container

8. Training time comparison with and without RDMA
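
One way to run this comparison is to launch the same training job twice and toggle NCCL's InfiniBand/RoCE transport through environment variables. The sketch below is an assumption about how such a comparison could be set up, not the procedure used on this page. It relies on the NCCL variables `NCCL_IB_DISABLE`, `NCCL_IB_HCA`, and `NCCL_DEBUG`; the helper names and the timings in the example are placeholders, not measurements.

```python
def nccl_env(use_rdma, hca="mlx5_2"):
    """Environment variables for a run with or without the RDMA transport."""
    env = {
        # INFO-level logging makes NCCL report which network transport it selected.
        "NCCL_DEBUG": "INFO",
        # "1" disables the IB/RoCE transport, so NCCL falls back to sockets.
        "NCCL_IB_DISABLE": "0" if use_rdma else "1",
    }
    if use_rdma:
        env["NCCL_IB_HCA"] = hca  # restrict NCCL to this RDMA device
    return env


def speedup(seconds_without_rdma, seconds_with_rdma):
    """Ratio of the two wall-clock times; values above 1 mean RDMA was faster."""
    if seconds_with_rdma <= 0:
        raise ValueError("training time must be positive")
    return seconds_without_rdma / seconds_with_rdma


if __name__ == "__main__":
    print("with RDMA:   ", nccl_env(True))
    print("without RDMA:", nccl_env(False))
    # Placeholder timings for illustration only, not measured results.
    print("speedup:", speedup(120.0, 80.0))
```

Merge the returned dictionary into the environment of each training process, keep every other setting identical between the two runs, and compare the wall-clock times.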

Advanced: multiple Docker containers sharing one CX-5 adapter

MacVlan Configuration