Very low wps with H200 GPUs #676
Hi @aniltrkkn - hard to say without a trace, but most likely something is amiss in the between-node connections for your setup, and that is what is creating the slowness. You mentioned you are using the multinode_slurm script - are you running this on AWS? There are settings in there that were meant to ensure EFA is used for cross-node comms, but they were never tested with H200s, as AWS did not have them at that point and we no longer have AWS cluster access. AWS offers up to 3200 Gbps for H100/H200, but the settings may need to be adjusted for EFA with the new H200s. A couple of options here:
b - if you are not on AWS, you could adjust the script directly for your hardware; it assumes EFA is available, so it may need tuning to leverage your higher-speed node interconnect (assuming my guess that the likely issue is the between-node network speed is correct).
c - finally, you could also run the same test you ran above but turn on profiling in the toml and get a trace or two; that would confirm where the slowdown is. You can gzip a trace and post it here - that should shrink it down to a small size - and I'm happy to take a look.
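(For readers following along: a minimal sketch of what "turn on profiling in the toml" looks like. The key names below are taken from recent torchtitan configs and may differ across versions, so treat them as assumptions and check your config file.)

```toml
# Sketch: enable the PyTorch profiler via the training toml.
# Key names follow recent torchtitan configs and may vary by version.
[profiling]
enable_profiling = true               # turn on torch.profiler trace collection
save_traces_folder = "profile_trace"  # traces are written here, one per rank
profile_freq = 100                    # capture a trace every 100 training steps
```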
btw, a quick test would be to run the same short run on llama3-8b with FSDP only, using a single node, and see how your wps looks there. That should be quite fast; if it is also slow, then the issue is within-node rather than between-node, which would help ensure we're bisecting the issue properly.
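(A sketch of that single-node sanity check, assuming the repo's run_llama_train.sh entry point and llama3_8b.toml config; the paths and script name may differ in your checkout.)

```bash
# Single-node FSDP-only run of llama3-8b on 8 GPUs: no cross-node traffic involved,
# so low wps here would point at a within-node problem instead.
NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```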
Hi @lessw2020, thank you very much for your response. Here are my answers to your questions: I am getting high wps with single-node 8B trainings. We are not using AWS, so I need to check whether EFA is available in our training datacenter. However, our other multi-node training code works fine on the same cluster. I am attaching the profile trace from one of the multi-node 70B trainings.
Hmm, the slurm script you posted sets CUDA_LAUNCH_BLOCKING=0, but the trace looks like it was captured with CUDA_LAUNCH_BLOCKING=1. Could you double-check this?
Hi @yifuwang, I set it to 0 and wps is still very low. It turns out we don't have EFA support; we use InfiniBand. I tried our usual environment settings, including export CUDA_LAUNCH_BLOCKING=0, but they do not help either. Is EFA absolutely necessary?
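(EFA is an AWS-specific transport; on an InfiniBand cluster NCCL uses its native IB transport instead, so EFA is not required. Below is a sketch of environment adjustments for an IB cluster - the interface and HCA names are assumptions, so verify yours with ibstat / ip link.)

```bash
# Drop the EFA/libfabric provider that the AWS-oriented slurm script exports.
unset FI_PROVIDER

# Point NCCL at the InfiniBand HCAs; "mlx5" matches common Mellanox devices,
# but the right value for your cluster is an assumption - verify with `ibstat`.
export NCCL_IB_HCA=mlx5

# Interface for NCCL's bootstrap/control traffic (assumption - verify with `ip link`).
export NCCL_SOCKET_IFNAME=eth0

# Log which transport NCCL actually selects at init time.
export NCCL_DEBUG=INFO
```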
@aniltrkkn Could you share the new trace? Either way, you are heavily communication-bound (e.g. the FSDP all-gather takes more than 2x longer than the forward compute). Could you try HSDP (no TP) with 8-way sharding, to keep the FSDP all-gather/reduce-scatter within a node?
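(A sketch of what that HSDP layout could look like in the toml for 4 nodes x 8 GPUs. The key names are assumptions based on recent torchtitan configs and may differ in your version.)

```toml
[training]
data_parallel_replicate_degree = 4  # replicate across the 4 nodes (only gradient
                                    # all-reduce crosses the node boundary)
data_parallel_shard_degree = 8      # shard within each node, keeping FSDP
                                    # all-gather/reduce-scatter on NVLink
tensor_parallel_degree = 1          # no TP, per the suggestion above
```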
@aniltrkkn I struggled with the same issue but fixed it by specifying the right environment variables and installing a library. Related:
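(To confirm whether such environment variables actually took effect, one option is NCCL's standard debug logging; these env vars are part of NCCL itself, though the exact log wording varies by NCCL version.)

```bash
# Rerun a short job with NCCL init/network logging enabled.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# In the logs, lines mentioning NET/IB indicate the InfiniBand transport is in use,
# while NET/Socket means NCCL has silently fallen back to plain TCP sockets.
```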
Hello, I am running multinode_trainer.slurm (llama3_70b.toml) on 4 nodes with 32 H200 GPUs in total. However, wps is only around 200. Any ideas what could cause this slowness?
output.txt
multinode_trainer.slurm.txt
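(For context, a multinode script like the attached one is submitted with standard Slurm usage:)

```bash
sbatch multinode_trainer.slurm
```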