$ srun --exclusive -A zam -N 1 -n 1 --cpus-per-gpu=17 --gpus=1 --gpus-per-task=1 --gres=gpu:1 bin/busyring input.json
gpu: yes
threads: 17
mpi: yes
ranks: 1
start=1720081941
cell stats: 2048 cells; 303110 branches; 2831618 compartments;
#cpu=2048 #gpu=0
#cell=2048 #local=2048 #groups=17
model-init=1720081945
running simulation
0% | | 0ms
[1720081945.285889] [jpbot-001-20:559978:0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[jpbot-001-20:559978] *** Process received signal ***
[jpbot-001-20:559978] Signal: Segmentation fault (11)
[jpbot-001-20:559978] Signal code: Address not mapped (1)
[jpbot-001-20:559978] Failing at address: 0x103bcae285ed0
[jpbot-001-20:559978:0:560001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd4e742890)
[jpbot-001-20:559978:1:559978] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3c0310f1d70)
[1720081945.285889] [jpbot-001-20:559978:1] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285927] [jpbot-001-20:559978:2] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285934] [jpbot-001-20:559978:0] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285941] [jpbot-001-20:559978:1] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285931] [jpbot-001-20:559978:3] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285954] [jpbot-001-20:559978:4] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285959] [jpbot-001-20:559978:5] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[1720081945.285953] [jpbot-001-20:559978:2] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285971] [jpbot-001-20:559978:5] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[1720081945.285964] [jpbot-001-20:559978:6] debug.c:1294 UCX WARN ucs_debug_disable_signal: signal 8 was not set in ucs
[jpbot-001-20:559978:2:559992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd031d9a50)
[jpbot-001-20:559978:5:559991] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bd099f5b90)
[jpbot-001-20:559978:3:559987] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc6491bf30)
[jpbot-001-20:559978:6:559988] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc9653bd90)
[jpbot-001-20:559978:4:560002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x103bc925f2e20)
[jpbot-001-20:559978] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffbb9e07f0]
[jpbot-001-20:559978] [ 1] bin/busyring[0x4be968]
[jpbot-001-20:559978] [ 2] bin/busyring[0x4dbbb0]
[jpbot-001-20:559978] [ 3] bin/busyring[0x4fabb0]
[jpbot-001-20:559978] [ 4] bin/busyring[0x460bf0]
[jpbot-001-20:559978] [ 5] bin/busyring[0x467d44]
[jpbot-001-20:559978] [ 6] /p/software/jedi/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xd693c)[0xffffbb6a693c]
[jpbot-001-20:559978] [ 7] /lib64/libc.so.6(+0x80698)[0xffffbb390698]
[jpbot-001-20:559978] [ 8] /lib64/libc.so.6(+0xeabdc)[0xffffbb3fabdc]
[jpbot-001-20:559978] *** End of error message ***
srun: error: jpbot-001-20: task 0: Segmentation fault (core dumped)
The crash sometimes masquerades as an MPI crash.
Example of crash in MPI
Same testcase, different number of tasks per GPU
Different stack trace showing the problem pointing at Arbor: