
NVIDIA driver breaks inside the container mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public #11

Open
mahmoodn opened this issue Jun 27, 2024 · 0 comments

mahmoodn commented Jun 27, 2024

I have noticed that the NVIDIA driver inside the container breaks at some point. As you can see below, on one date nvidia-smi works, but a day later the same command fails, even though the container has never been exited in between. The host itself keeps working fine, and I can still run nvidia-smi there.

The only solution I have found is to exit the container and run make prebuild again.
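For reference, the recovery cycle looks roughly like the sketch below (the host prompt and checkout path are placeholders, and make prebuild is the MLPerf inference Makefile target that builds and re-enters the container; the relaunched container gets a new hostname):

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Failed to initialize NVML: Unknown Error
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ exit   # leave the broken container
mahmood@host:~/mlperf-inference$ make prebuild                       # rebuild and re-enter the container
(mlperf) mahmood@<new-container>:/work$ nvidia-smi                   # works again, until the next failure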

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Wed Jun 26 19:29:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P8              23W / 370W |    386MiB / 10240MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Failed to initialize NVML: Unknown Error
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ date
Thu Jun 27 09:29:28 ETC 2024

More information about the Docker images is shown below:

$ docker images
REPOSITORY                               TAG                                                       IMAGE ID       CREATED        SIZE
mlperf-inference                         mahmood-x86_64                                           0eb62ccf7eae   2 days ago     46.5GB
mlperf-inference                         mahmood-x86_64-latest                                    fbc8447ed1c2   2 days ago     46.5GB
nvcr.io/nvidia/mlperf/mlperf-inference   mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public   34b056f25fae   4 months ago   14.5GB

Any idea what causes this?
