Request to compare NCCL Performance with NSF NCAR Derecho #13
Comments
Hello! These numbers look broadly reasonable, and actually a tiny bit better than what we observe on Perlmutter (e.g. a 2-node, 8-GPU allreduce at large message sizes saturates around 76 GB/s of busbw for us). It's interesting that it is working for you using the newer plugin releases; we've avoided them and stuck with 1.6.0, which is the most recent non-AWS-specific version.
I can add that with the …
Can you share more detail about what needed to be patched in your case?
Sure can. It took a while to get all the build dependencies satisfied on our system, but once we did I was dismayed to see a runtime error complaining that the MPI build was not CUDA-aware. The root cause is that all the CUDA-awareness detection is wrapped in OpenMPI-specific preprocessor checks, so builds against other MPI implementations fall back to assuming no CUDA support. So on Derecho with cray-mpich we carry a small set of patches: https://github.com/benkirk/derecho-pytorch-mpi/tree/main/patches

This is all a work-in-progress, but we've had success running it. Ultimately I'd like to turn those patches into a PR for PyTorch so you can define the fallback assumptions with a configure argument or something for non-OpenMPI builds.
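As a quick way to check whether a given PyTorch build treats MPI as CUDA-aware, a minimal smoke test along these lines can be run under `mpiexec`. This is only a sketch, not one of the patches above; it assumes PyTorch was built with `USE_MPI=1` and that each rank sees at least one GPU (e.g. `mpiexec -n 2 python mpi_cuda_check.py`).

```python
# Minimal check of whether this PyTorch build will pass CUDA tensors to the
# MPI backend. Unpatched builds against non-OpenMPI libraries typically raise
# here, complaining that the MPI library lacks CUDA-aware support.
import torch
import torch.distributed as dist

def main():
    # The MPI backend picks up rank and world size from the MPI runtime.
    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A small allreduce on a device tensor exercises the CUDA-aware path.
    x = torch.ones(1 << 20, device="cuda") * (rank + 1)
    dist.all_reduce(x)  # defaults to SUM

    if rank == 0:
        n = dist.get_world_size()
        expected = n * (n + 1) // 2
        print(f"CUDA allreduce OK: x[0]={x[0].item():.0f} (expected {expected})")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```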
Ah, right, thanks! I build PyTorch with cray-mpich support but never bothered trying to enable CUDA-awareness because of those issues, and figured NCCL would be the more performant option anyway. It's cool to know that it can be made to work, and it's interesting to hear you saw comparable performance. If you have any performance results to share, I'd be interested to see them. If you manage to upstream the "fix", let us know. That would be a nice PyTorch contribution. Thanks again for sharing.
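For sharing numbers in a comparable form, here is a rough sketch (mine, not from either repository) of measuring allreduce bus bandwidth through `torch.distributed` with the NCCL backend. It assumes a `torchrun` launch with one GPU per rank, and the message sizes are only illustrative.

```python
# Rough allreduce bus-bandwidth measurement via the NCCL backend, using the
# same busbw convention as nccl-tests (algbw * 2*(n-1)/n for allreduce).
# Launch e.g.: torchrun --nnodes=2 --nproc_per_node=4 bench_allreduce.py
import os
import time
import torch
import torch.distributed as dist

def allreduce_busbw(nbytes: int, iters: int = 20, warmup: int = 5) -> float:
    n = dist.get_world_size()
    x = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    algbw = nbytes / elapsed / 1e9          # GB/s, message size over time
    return algbw * 2 * (n - 1) / n          # nccl-tests allreduce correction

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl")
    for size in (1 << 24, 1 << 28):         # 16 MiB and 256 MiB payloads
        bw = allreduce_busbw(size)
        if dist.get_rank() == 0:
            print(f"{size / 2**20:.0f} MiB: ~{bw:.1f} GB/s busbw")
    dist.destroy_process_group()
```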
Hello NERSC Team,
HPCD CISL staff at NSF NCAR would like to request a performance comparison of your use of this nccl-ofi-plugin with ours, as copied below. We used a test suite that expands a bit on the test runs found in this forked GitHub repository by @benkirk; please let us know if this is something you can do and turn around within a couple of weeks.
Primarily, we would like to compare the settings in https://github.com/benkirk/nccl-ofi-plugin/blob/main/env_nccl_derecho.sh in order to determine what should be optimal.
To note, the tests below used NCCL 2.22.3-1 and AWS NCCL Plugin 1.7.4. I am comfortable using the latest NCCL version, but suspect it would be difficult to adapt the AWS plugin beyond 1.7.4 given its specific targeting of AWS machines after that version.
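As an aside, when comparing the two environments it may help to record exactly which NCCL version each framework links and which NCCL/OFI-related variables are set at run time. A small generic helper for a PyTorch build might look like the sketch below (it is not taken from env_nccl_derecho.sh); setting `NCCL_DEBUG=INFO` or `NCCL_DEBUG=VERSION` in the job script additionally makes NCCL log whether the net plugin was loaded.

```python
# Generic environment report to attach alongside benchmark results: the NCCL
# version the PyTorch build links, plus any NCCL/libfabric/MPICH variables set.
import os
import torch

def report() -> None:
    print(f"PyTorch {torch.__version__}, NCCL {torch.cuda.nccl.version()}")
    prefixes = ("NCCL_", "FI_", "MPICH_")
    for key in sorted(os.environ):
        if key.startswith(prefixes):
            print(f"{key}={os.environ[key]}")

if __name__ == "__main__":
    report()
```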
Avg bus bandwidth (GB/s) per test suite run:

| Configuration | all-gather | all-reduce | all-to-all | send-recv |
| --- | --- | --- | --- | --- |
| Intra-node (2 GPUs) | 45.3614 | 57.1399 | 49.2425 | 56.3249 |
| Intra-node (4 GPUs) | 127.491 | 154.882 | 135.873 | 63.7496 |
| Inter-node (2 GPUs) | 14.0373 | 16.6409 | 13.1876 | 14.6531 |
| Inter-node (4 GPUs) | 29.4738 | 29.9176 | 18.9097 | 17.0042 |
| Inter-node (8 GPUs) | 50.7769 | 51.8871 | 21.0073 | 16.3461 |
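For reference when reading the table, nccl-tests reports bus bandwidth by scaling the measured algorithm bandwidth (message size over time) with a per-collective factor, so the number reflects hardware utilisation independently of the collective. A small sketch of those factors follows; the helper name and example values are mine, not part of the request.

```python
# Conversion between algorithm bandwidth (message size / time) and the bus
# bandwidth that nccl-tests reports, using its documented per-collective
# scaling factors.

def busbw_factor(collective: str, nranks: int) -> float:
    n = nranks
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "all_to_all": (n - 1) / n,
        "send_recv": 1.0,
    }
    return factors[collective]

if __name__ == "__main__":
    # Example: the 8-rank inter-node all-reduce above at 51.8871 GB/s busbw
    # corresponds to roughly 51.8871 / 1.75 = 29.6 GB/s algbw.
    busbw = 51.8871
    algbw = busbw / busbw_factor("all_reduce", 8)
    print(f"allreduce algbw ~= {algbw:.1f} GB/s for busbw {busbw} GB/s on 8 ranks")
```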
Thanks!