While testing at the upper boundary of the memory limit, we noticed what I suspect is a bug in nccl-tests.
Memory allocations
From what I see, the test initializes in two steps (see the sketch below):
allocate the test buffers in advance, sized by "max bytes"
call ncclCommInitAll(), then start running the test
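A minimal sketch of those two steps, assuming nothing beyond the public CUDA/NCCL APIs (the real code in src/common.cu also allocates an "expected" buffer for datacheck and runs per-GPU threads):

// Sketch only: simplified from the flow described above, with no error
// handling; buffer names are illustrative, not the ones used in common.cu.
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

void setupSketch(int nGpus, size_t maxBytes) {
  std::vector<void*> sendbuff(nGpus), recvbuff(nGpus);
  std::vector<int> devs(nGpus);

  // Step 1: allocate the test buffers up front, sized by "max bytes".
  for (int g = 0; g < nGpus; g++) {
    devs[g] = g;
    cudaSetDevice(g);
    cudaMalloc(&sendbuff[g], maxBytes);
    cudaMalloc(&recvbuff[g], maxBytes);
  }

  // Step 2: only now create the communicators; NCCL carves its own internal
  // buffers out of whatever device memory is still free at this point.
  std::vector<ncclComm_t> comms(nGpus);
  ncclCommInitAll(comms.data(), nGpus, devs.data());
}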
This "max bytes" is determined with a couple of factors:
collective operation we're using (each operation has its own way to calculate memory footprint: AllReduceGetCollByteCount(), ReduceScatterGetCollByteCount(), ...)
total device memory (cudaDeviceProp::totalGlobalMem)
CLI argument
Specifically, the test caps the max bytes at (total device memory - 1 GiB) / 3
(assuming default settings where datacheck is enabled (-c=1)); see src/common.cu, line 915 at c6afef0.
On my machine this works out to a limit of exactly 27971332778 bytes (about 26.05 GiB).
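Roughly, the cap amounts to the following (a sketch with illustrative names, not the exact code at that line):

// Sketch of the cap: with datacheck enabled (-c=1) the test needs three
// buffers of up to "max bytes" each (send, recv, expected), and it reserves
// 1 GiB of headroom for everything else.
#include <cuda_runtime.h>

size_t maxBytesCapSketch(int cudaDev) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, cudaDev);
  size_t headroom = 1ULL << 30;                   // the hard-coded 1 GiB
  return (prop.totalGlobalMem - headroom) / 3;    // cap applied to -e
}

On this machine's totalGlobalMem that works out to the 27971332778 bytes reported above.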
Problem
However, we're seeing errors with high max_bytes (-e) values.
For example, on this A100 with 80 GiB of memory, passing -e 27766464617 (about 25.86 GiB) crashes the all_reduce_perf test during the call to ncclCommInitAll().
The breaking point is somewhere around there: -e 27066464617 (about 25.21 GiB) does not crash.
(Note that all_reduce_perf has one of the highest GPU memory footprints (link: all_reduce.cu). Some other tests divide the payload across ranks, so the same parameter works fine for ./reduce_scatter_perf, for example.)
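Working backwards from the 27971332778-byte cap (so totalGlobalMem here is inferred, not measured), the margin left for NCCL and the CUDA context is quite thin in the failing case:

// Back-of-the-envelope check; totalGlobalMem is inferred from the reported
// cap of 27971332778 bytes, so treat the figures as approximate.
#include <cstdio>

int main() {
  long long totalGlobalMem = 3LL * 27971332778LL + (1LL << 30);  // ~84.99 GB
  long long failing = 27766464617LL, passing = 27066464617LL;

  // With datacheck the test holds three "max bytes" buffers per GPU, so the
  // CUDA context and NCCL's internal buffers must fit in what is left over.
  printf("free at -e %lld: %.2f GiB\n", failing,
         (totalGlobalMem - 3 * failing) / double(1LL << 30));    // ~1.57 GiB
  printf("free at -e %lld: %.2f GiB\n", passing,
         (totalGlobalMem - 3 * passing) / double(1LL << 30));    // ~3.53 GiB
  return 0;
}

So the failing size leaves only about 1.6 GiB per GPU, which on this system is evidently not enough for the CUDA context plus whatever ncclCommInitAll() allocates, while about 3.5 GiB is.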
NCCL trace shows it's an OOM:
$ NCCL_DEBUG=WARN NCCL_DEBUG_SUBSYS=ALL ./all_reduce_perf -b 27766464617 -e 27766464617 -w 0 -n 1 -t 4
# nThread 4 nGpus 1 minBytes 27766464617 maxBytes 27766464617 step: 1048576(bytes) warmup iters: 0 iters: 1 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1206071 on <node name> device 0 [0x00] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 1206071 on <node name> device 1 [0x00] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 1206071 on <node name> device 2 [0x00] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 1206071 on <node name> device 3 [0x00] NVIDIA A100-SXM4-80GB
NCCL version 2.21.5+cuda12.5
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
27766464616 6941616154 float sum -1
<node name>:1206071:1206148 [1] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
<node name>:1206071:1206147 [2] enqueue.cc:1402 NCCL WARN Cuda failure 'out of memory'
<node name>: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. <node name> pid 1206071: Test failure common.cu:377
.. <node name> pid 1206071: Test failure common.cu:413
.. <node name> pid 1206071: Test failure common.cu:603
.. <node name> pid 1206071: Test failure all_reduce.cu:90
.. <node name> pid 1206071: Test failure common.cu:615
With a smaller NCCL_BUFFSIZE, such as 65536 (64 KiB instead of the default 4194304 = 4 MiB), the test actually passes.
I am not sure whether NCCL is really putting as many as 256* buffers into memory,
but with alignment, fragmentation, and whatnot, I wouldn't be surprised.
Maybe NVSwitch is in play as well: more links, more buffers.
* 256 = 1 GiB headroom / 4 MiB default buffsize
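That 256 is just the footnote's arithmetic; as a quick sanity check only (NCCL's real per-channel/per-peer/per-protocol allocations are more involved than a flat buffer count):

// How many default-sized NCCL buffers fit in the 1 GiB the test reserves.
// This only restates the footnote above; it is not a model of NCCL's
// actual allocation pattern.
#include <cstdio>

int main() {
  const long long headroom = 1LL << 30;   // 1 GiB reserved by the test
  const long long buffsize = 4LL << 20;   // default NCCL_BUFFSIZE (4 MiB)
  printf("%lld buffers of 4 MiB fit in 1 GiB\n", headroom / buffsize); // 256
  return 0;
}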
Changing that line (maxMem - (1<<30)) from 1<<30 to 1ull<<31 immediately solved the problem,
and it works in every case, however absurd the max_bytes (-e) value gets.
I think we either need a larger default headroom (possibly scaling with per-GPU memory size, host RAM, or GPU count), make it parameterizable, or warn the user about this.
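For concreteness, here is one shape the parameterized option could take; this is a hypothetical sketch (the NCCL_TESTS_HEADROOM variable name and the surrounding code are invented, not part of nccl-tests):

// Hypothetical sketch: replace the hard-coded (1<<30) headroom with a
// larger default that the user can override. Not actual nccl-tests code.
#include <cstdlib>

static unsigned long long reservedHeadroom() {
  unsigned long long headroom = 2ULL << 30;              // larger default: 2 GiB
  if (const char* env = getenv("NCCL_TESTS_HEADROOM"))   // invented knob
    headroom = strtoull(env, nullptr, 0);                // override, in bytes
  return headroom;
}

// ...and in the cap computation, instead of (maxMem - (1<<30)):
//   maxMem = prop.totalGlobalMem - reservedHeadroom();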
Environment
This is on a node with 8 A100-SXM4-80GB GPUs, connected to 6 NVSwitches.
I'm not familiar with its topology in detail, but I can tell this much from nvidia-smi nvlink -s, /var/log/fabricmanager.log, and the NV12 entries in nvidia-smi topo -m.
NCCL version: 2.21.5
nccl-tests version: v2.13.9 (latest)