
Low multi-node performance on SLURM cluster (32x H100) #708

Closed
casper-hansen opened this issue Nov 28, 2024 · 1 comment
casper-hansen commented Nov 28, 2024

I used the default configs provided in TorchTitan to try to replicate the reported performance. I have also tried modifying your launch script, but I cannot achieve reasonable multi-node performance: 4 nodes run only as fast as 1 node. There is one related issue: #676.

Dependencies:

  • CUDA 12.6
  • torch 2.6.0 nightly 20241121

Things I tried:

  • Eliminating pyxis/enroot (and therefore the Docker image) and running directly from conda; this did not solve it.
  • Playing around with SBATCH and NCCL variables, which ultimately did not make it perform any better (see the diagnostic sketch below).
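
As a sanity check before changing anything in the training setup, it can help to measure raw inter-node all-reduce bandwidth with nccl-tests and look at which transport NCCL selects. This is only a sketch, assuming an MPI-enabled build of nccl-tests at ./build/all_reduce_perf and a PMIx-capable srun; adjust the partition and paths for your cluster.

# Print which transport NCCL picks (look for "NET/IB" vs "NET/Socket" in the logs)
export NCCL_DEBUG=INFO
# One task per GPU across all 4 nodes; sweep message sizes from 8 B to 4 GiB
srun -p defq --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 --mpi=pmix \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1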

Performance

All nodes are on DGX Cloud through OCI with 3.2 Tb/s of inter-node bandwidth per node.

multi-node (32x H100)

1: [rank0]:2024-11-27 14:28:39,087 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 32, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
3: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
1: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
2: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%

single-node (8x H100)

0: [rank0]:2024-11-27 14:38:23,621 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 8, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:39:07,085 - root - INFO - step: 20  loss:  1.5182  memory: 37.87GiB(47.87%)  wps: 7,357  mfu: 43.08%

Launch script

#!/bin/bash
#SBATCH --job-name="torchtitan"
#SBATCH --nodes=4
#SBATCH --nodelist=gpu001,gpu002,gpu003,gpu004
#SBATCH --gpus-per-task=8
#SBATCH --output=%x-%j.txt
#SBATCH --error=%x-%j.txt
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH -p defq

DOCKER_IMAGE=docker://<my_container>/torchtitan:torch2.6.0.dev20241121_cuda126

# CUDA variables
export CUDA_HOME=/usr/local/cuda-12.6
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH

# runtime variables
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export HF_HUB_ETAG_TIMEOUT=500
export OMP_NUM_THREADS=8
export CUDA_LAUNCH_BLOCKING=0

# NCCL variables
export NCCL_IB_DISABLE=1
export NCCL_BUFFSIZE=2097152
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_NVLS_ENABLE=0

# get master address and port
export MASTER_PORT=$(python3 slurm/get_free_port.py)
export MASTER_NAME=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_ADDR=$(srun --nodes=1 --ntasks=1 -w "$MASTER_NAME" hostname --ip-address)

echo "The head node name is $MASTER_NAME"
echo "The head node IP is $MASTER_ADDR"

srun -l -w $SLURM_NODELIST \
    --no-container-mount-home \
    --container-image=$DOCKER_IMAGE \
    --container-name=torchtitan_container \
    --container-mounts=/lustre:/lustre,./:/workspace \
    --container-workdir=/workspace \
    /bin/bash -c "
    source /opt/conda/etc/profile.d/conda.sh && conda activate py_3.11
    torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc_per_node 8 \
        --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
        --local-ranks-filter 0 --role rank --tee 3 \
        train.py --job.config_file train_configs/llama3_8b.toml \
        "$@"
    "

casper-hansen commented Nov 28, 2024

This is now fixed. The problem was that some NCCL environment variables were not set properly.

  1. Needed this install: apt install ibverbs-utils
  2. And these environment variables: NCCL_IB_DISABLE=0 and NCCL_NET=IB (a sketch of the updated settings follows below)
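
Concretely, the change amounts to something like the following in the launch environment (a sketch: the ibverbs-utils install would normally go into the container image, and ibv_devinfo is just a quick check that the InfiniBand devices are visible):

# check that the InfiniBand devices are visible (ibv_devinfo ships with ibverbs-utils)
apt install -y ibverbs-utils
ibv_devinfo

# NCCL variables: enable InfiniBand instead of disabling it
export NCCL_IB_DISABLE=0   # was 1, which forced NCCL onto TCP sockets
export NCCL_NET=IB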

@tianyu-l @lessw2020 In my opinion, it would be good to create some docs on multi-node setup and environment variables.
