I have noticed that the NVIDIA driver inside the container breaks at some point. As you can see in the log below, nvidia-smi works on one day, but a day later the same command fails, even though the container was never exited in between. The host itself is also fine, and I can still run nvidia-smi there.
The only solution I have found is to exit the container and run make prebuild again (a sketch of those steps follows the log below).
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Wed Jun 26 19:29:13 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 Off | 00000000:2D:00.0 On | N/A |
| 0% 45C P8 23W / 370W | 386MiB / 10240MiB | 3% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Failed to initialize NVML: Unknown Error
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ date
Thu Jun 27 09:29:28 ETC 2024
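For reference, the exit-and-rebuild workaround mentioned above boils down to roughly the following; the checkout path is only a placeholder, and make prebuild is assumed to be the usual target that builds the image and drops you into a fresh container:

exit                            # leave the broken container; NVML is unusable inside it
cd /path/to/mlperf-inference    # placeholder for wherever the MLPerf Inference code is checked out
make prebuild                   # rebuild/re-launch the container, after which nvidia-smi works again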
More information about the docker images is shown below:
$ docker images
REPOSITORY                               TAG                                                       IMAGE ID       CREATED        SIZE
mlperf-inference                         mahmood-x86_64                                            0eb62ccf7eae   2 days ago     46.5GB
mlperf-inference                         mahmood-x86_64-latest                                     fbc8447ed1c2   2 days ago     46.5GB
nvcr.io/nvidia/mlperf/mlperf-inference   mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public   34b056f25fae   4 months ago   14.5GB
Any idea about that?