
Low multi-node performance on SLURM cluster (32x H100) #708

Closed
casper-hansen opened this issue Nov 28, 2024 · 1 comment
casper-hansen commented Nov 28, 2024

I used the default configs provided in TorchTitan to try to replicate the reported performance. I have also tried modifying your launch script, but I cannot achieve reasonable multi-node performance: 4 nodes run only as fast as 1 node. There is one related issue: #676.

Dependencies:

  • CUDA 12.6
  • torch 2.6.0 nightly 20241121

Things I tried:

  • Eliminating pyxis/enroot (and therefore the Docker image) and running directly from conda; this did not solve it.
  • Playing around with SBATCH and NCCL variables, which ultimately did not make it perform any better (see the diagnostic sketch below).
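
As a sanity check before changing anything in the training setup, it can help to measure raw inter-node all-reduce bandwidth with nccl-tests and look at which transport NCCL selects. This is only a sketch, assuming an MPI-enabled build of nccl-tests at ./build/all_reduce_perf and a PMIx-capable srun; adjust the partition and paths for your cluster.

# Print which transport NCCL picks (look for "NET/IB" vs "NET/Socket" in the logs)
export NCCL_DEBUG=INFO
# One task per GPU across all 4 nodes; sweep message sizes from 8 B to 4 GiB
srun -p defq --nodes=4 --ntasks-per-node=8 --gpus-per-node=8 --mpi=pmix \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1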

Performance

All nodes are on DGX Cloud through OCI with 3.2 Tb/s of inter-node bandwidth per node.

multi-node (32x H100)

1: [rank0]:2024-11-27 14:28:39,087 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 32, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
3: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
1: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%
2: [rank0]:2024-11-27 14:29:40,378 - root - INFO - step: 10  loss:  1.5568  memory: 26.47GiB(33.47%)  wps: 1,786  mfu: 10.46%

single-node (8x H100)

0: [rank0]:2024-11-27 14:38:23,621 - root - INFO - Training starts at step 1, with local batch size 1, global batch size 8, sequence length 8192, total steps 1000 (warmup 200)
0: [rank0]:2024-11-27 14:39:07,085 - root - INFO - step: 20  loss:  1.5182  memory: 37.87GiB(47.87%)  wps: 7,357  mfu: 43.08%

Launch script

#!/bin/bash
#SBATCH --job-name="torchtitan"
#SBATCH --nodes=4
#SBATCH --nodelist=gpu001,gpu002,gpu003,gpu004
#SBATCH --gpus-per-task=8
#SBATCH --output=%x-%j.txt
#SBATCH --error=%x-%j.txt
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH -p defq

DOCKER_IMAGE=docker://<my_container>/torchtitan:torch2.6.0.dev20241121_cuda126

# CUDA variables
export CUDA_HOME=/usr/local/cuda-12.6
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH

# runtime variables
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export HF_HUB_ETAG_TIMEOUT=500
export OMP_NUM_THREADS=8
export CUDA_LAUNCH_BLOCKING=0

# NCCL variables
export NCCL_IB_DISABLE=1
export NCCL_BUFFSIZE=2097152
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_NVLS_ENABLE=0

# get master address and port
export MASTER_PORT=$(python3 slurm/get_free_port.py)
export MASTER_NAME=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_ADDR=$(srun --nodes=1 --ntasks=1 -w "$MASTER_NAME" hostname --ip-address)

echo "The head node name is $MASTER_NAME"
echo "The head node IP is $MASTER_ADDR"

srun -l -w $SLURM_NODELIST \
    --no-container-mount-home \
    --container-image=$DOCKER_IMAGE \
    --container-name=torchtitan_container \
    --container-mounts=/lustre:/lustre,./:/workspace \
    --container-workdir=/workspace \
    /bin/bash -c "
    source /opt/conda/etc/profile.d/conda.sh && conda activate py_3.11
    torchrun --nnodes $SLURM_JOB_NUM_NODES --nproc_per_node 8 \
        --rdzv_id 101 --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
        --local-ranks-filter 0 --role rank --tee 3 \
        train.py --job.config_file train_configs/llama3_8b.toml \
        "$@"
    "

casper-hansen commented Nov 28, 2024

This is now fixed. The problem was that some NCCL environment variables were not set properly.

  1. Needed this install: apt install ibverbs-utils
  2. And these environment variables: NCCL_IB_DISABLE=0 and NCCL_NET=IB (a sketch of the updated settings follows below)
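
Concretely, the change amounts to something like the following in the launch environment (a sketch: the ibverbs-utils install would normally go into the container image, and ibv_devinfo is just a quick check that the InfiniBand devices are visible):

# check that the InfiniBand devices are visible (ibv_devinfo ships with ibverbs-utils)
apt install -y ibverbs-utils
ibv_devinfo

# NCCL variables: enable InfiniBand instead of disabling it
export NCCL_IB_DISABLE=0   # was 1, which forced NCCL onto TCP sockets
export NCCL_NET=IB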

@tianyu-l @lessw2020 In my opinion, it would be good to create some docs on multi-node setup and environment variables.
