Very low wps with H200 GPUs #676
Hi @aniltrkkn - hard to say without a trace, but most likely something is amiss in the between-node connections for your setup, and that is what is creating the slowness. You mentioned you are using the multinode_slurm script - are you running this on AWS? There are settings in there that were meant to ensure EFA is used for cross-node comms, but they were never tested with H200s, as AWS did not have them at that point and we no longer have AWS cluster access. AWS offers up to 3200 Gbps for H100/H200, but the settings may need to be adjusted for EFA with the new H200s. A couple of options here:
b - if you are not on AWS, you could adjust the script directly for your hardware; it assumes EFA is available, so it may need tuning to leverage your higher-speed node interconnect (assuming my guess that the likely issue is the between-node network speed is correct).
c - finally, you could also run the same test you ran above but turn on profiling in the toml and get a trace or two; that would confirm where the slowdown is. You can gzip a trace and post it here - that should shrink it down to a small size - and I'm happy to take a look.
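(For readers following along: a minimal sketch of what "turn on profiling in the toml" looks like. The key names below are taken from recent torchtitan configs and may differ across versions, so treat them as assumptions and check your config file.)

```toml
# Sketch: enable the PyTorch profiler via the training toml.
# Key names follow recent torchtitan configs and may vary by version.
[profiling]
enable_profiling = true               # turn on torch.profiler trace collection
save_traces_folder = "profile_trace"  # traces are written here, one per rank
profile_freq = 100                    # capture a trace every 100 training steps
```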
btw, a quick test would be to run the same short run on llama3-8b with FSDP only, using a single node, and see how your wps looks there. That should be quite fast; if it is also slow, then the issue is within-node rather than between-node, which would help ensure we're bisecting the issue properly.
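(A sketch of that single-node sanity check, assuming the repo's run_llama_train.sh entry point and llama3_8b.toml config; the paths and script name may differ in your checkout.)

```bash
# Single-node FSDP-only run of llama3-8b on 8 GPUs: no cross-node traffic involved,
# so low wps here would point at a within-node problem instead.
NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```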
Hi @lessw2020, thank you very much for your response. Here are my answers to your questions: I am getting high wps with single-node 8B trainings. We are not using AWS, so I need to check whether EFA is available in our training datacenter. However, our other multi-node training code works fine on the same cluster. I am attaching the profile trace from one of the multi-node 70B trainings.
Hmm, the slurm script you posted sets CUDA_LAUNCH_BLOCKING=0, but the trace looks like it was captured with CUDA_LAUNCH_BLOCKING=1. Could you double-check this?
Hi @yifuwang, I set it to 0 and wps is still very low. It turns out we don't have EFA support; we use InfiniBand. I tried our usual environment settings, including export CUDA_LAUNCH_BLOCKING=0, but they do not help either. Is EFA absolutely necessary?
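(EFA is an AWS-specific transport; on an InfiniBand cluster NCCL uses its native IB transport instead, so EFA is not required. Below is a sketch of environment adjustments for an IB cluster - the interface and HCA names are assumptions, so verify yours with ibstat / ip link.)

```bash
# Drop the EFA/libfabric provider that the AWS-oriented slurm script exports.
unset FI_PROVIDER

# Point NCCL at the InfiniBand HCAs; "mlx5" matches common Mellanox devices,
# but the right value for your cluster is an assumption - verify with `ibstat`.
export NCCL_IB_HCA=mlx5

# Interface for NCCL's bootstrap/control traffic (assumption - verify with `ip link`).
export NCCL_SOCKET_IFNAME=eth0

# Log which transport NCCL actually selects at init time.
export NCCL_DEBUG=INFO
```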
@aniltrkkn Could you share the new trace? Either way, you are heavily communication-bound (e.g. the FSDP all-gather takes more than 2x longer than the forward compute). Could you try HSDP (no TP) with 8-way sharding, to keep the FSDP all-gather/reduce-scatter within a node?
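(A sketch of what that HSDP layout could look like in the toml for 4 nodes x 8 GPUs. The key names are assumptions based on recent torchtitan configs and may differ in your version.)

```toml
[training]
data_parallel_replicate_degree = 4  # replicate across the 4 nodes (only gradient
                                    # all-reduce crosses the node boundary)
data_parallel_shard_degree = 8      # shard within each node, keeping FSDP
                                    # all-gather/reduce-scatter on NVLink
tensor_parallel_degree = 1          # no TP, per the suggestion above
```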
@aniltrkkn I struggled with the same issue but fixed it by specifying the right environment variables and installing a library. Related:
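(To confirm whether such environment variables actually took effect, one option is NCCL's standard debug logging; these env vars are part of NCCL itself, though the exact log wording varies by NCCL version.)

```bash
# Rerun a short job with NCCL init/network logging enabled.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# In the logs, lines mentioning NET/IB indicate the InfiniBand transport is in use,
# while NET/Socket means NCCL has silently fallen back to plain TCP sockets.
```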
Hello, I am running multinode_trainer.slurm (llama3_70b.toml) on 4 nodes with 32 H200 GPUs in total. However, wps is only around 200. Any ideas what could cause this slowness?
output.txt
multinode_trainer.slurm.txt
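(For context, a multinode script like the attached one is submitted with standard Slurm usage:)

```bash
sbatch multinode_trainer.slurm
```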