1. Issue or feature description
Requesting 1 GPU in the pod YAML works fine, but when requesting more than 1, the output of nvidia-smi inside the container is broken, while nvidia-smi on the host machine is fine (see the reproduction sketch under section 2 below).
On another machine with a GeForce RTX 2070 SUPER, requesting 2 GPUs works correctly. However, when I run the application locally, it aborts with:

    [4pdvGPU ERROR (pid:697 thread=140106827071488 context.c:189)]: cuCtxGetDevice Not Found. tid=140106827071488 ctx=0x239601906000:0x23960041a000
    home/limengxuan/work/libcuda_override/src/cuda/context.c:189: cuCtxGetDevice: Assertion `0' failed.
2. Steps to reproduce the issue
Ubuntu 20.04 + MicroK8s + Tesla T4 GPU + NVIDIA 510 driver
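For concreteness, a minimal reproduction sketch follows. The resource name (nvidia.com/gpu), the limit syntax, and the container image are assumptions based on a typical vGPU device-plugin setup; adjust them to match the actual deployment.

```bash
# Apply a pod that requests 2 GPUs (resource name assumed to be nvidia.com/gpu).
cat <<'EOF' | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04   # assumed image; any CUDA base image should do
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 2    # works with 1, fails with more than 1
EOF

# Wait for the pod, then check nvidia-smi inside the container.
microk8s kubectl wait --for=condition=Ready pod/gpu-test --timeout=120s
microk8s kubectl exec gpu-test -- nvidia-smi
```

With the limit set to 1 the same pod reportedly works; raising it to 2 or more reproduces the broken in-container nvidia-smi output described above.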
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- nvidia-smi -a on your host
- /etc/docker/daemon.json on your host:

      {
          "default-runtime": "nvidia",
          "runtimes": {
              "nvidia": {
                  "path": "nvidia-container-runtime",
                  "runtimeArgs": []
              }
          }
      }
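If daemon.json is edited, Docker has to be restarted for the default runtime to take effect. A small check sequence using standard Docker/systemd commands is sketched below (note that MicroK8s normally uses containerd rather than Docker, so this only applies where Docker is actually the container runtime):

```bash
# Validate the JSON before restarting the daemon.
python3 -m json.tool /etc/docker/daemon.json

# Restart Docker so the new default runtime takes effect.
sudo systemctl restart docker

# Confirm that the NVIDIA runtime is listed and set as default.
docker info | grep -i runtime
```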
Additional information that might help better understand your environment and reproduce the bug:
- dmesg output from the host (nvidia-smi segfaults):

      nvidia-smi[2260220]: segfault at 0 ip 00007fde46d051ce sp 00007ffe1ae4c9e8 error 4 in libc-2.31.so[7fde46b9d000+178000]
      [89993.700532] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f
      [90182.697502] nvidia-smi[2265941]: segfault at 0 ip 00007f241971c1ce sp 00007fffff703d08 error 4 in libc-2.31.so[7f24195b4000+178000]
      [90182.697509] Code: fd d7 c9 0f bc d1 c5 fe 7f 27 c5 fe 7f 6f 20 c5 fe 7f 77 40 c5 fe 7f 7f 60 49 83 c0 1f 49 29 d0 48 8d 7c 17 61 e9 c2 04 00 00 <c5> fe 6f 1e c5 fe 6f 56 20 c5 fd 74 cb c5 fd d7 d1 49 83 f8 21 0f
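To collect just the relevant kernel messages, something like the following is usually enough (dmesg -T adds human-readable timestamps; the grep pattern is only a suggestion):

```bash
# Pull NVIDIA-related and segfault lines from the kernel log with readable timestamps.
sudo dmesg -T | grep -iE 'nvidia|nvrm|segfault'
```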
And are memory and fault isolation provided?