
NVIDIA driver breaks inside the container mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public #11

Open
mahmoodn opened this issue Jun 27, 2024 · 0 comments

mahmoodn commented Jun 27, 2024

I have noticed that the NVIDIA driver inside the container breaks at some point. As you can see below, on one date nvidia-smi works, but a day later the same command fails, even though the container has never been exited in between. The host itself keeps working fine, and I can still run nvidia-smi there.

The only solution I have found is to exit the container and run make prebuild again.
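For reference, the recovery cycle looks roughly like the sketch below (the host prompt and checkout path are placeholders, and make prebuild is the MLPerf inference Makefile target that builds and re-enters the container; the relaunched container gets a new hostname):

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Failed to initialize NVML: Unknown Error
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ exit   # leave the broken container
mahmood@host:~/mlperf-inference$ make prebuild                       # rebuild and re-enter the container
(mlperf) mahmood@<new-container>:/work$ nvidia-smi                   # works again, until the next failure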

(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Wed Jun 26 19:29:13 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080        Off | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P8              23W / 370W |    386MiB / 10240MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ nvidia-smi
Failed to initialize NVML: Unknown Error
(mlperf) mahmood@mlperf-inference-mahmood-x86-64-29597:/work$ date
Thu Jun 27 09:29:28 ETC 2024

More information about the Docker images is shown below:

$ docker images
REPOSITORY                               TAG                                                       IMAGE ID       CREATED        SIZE
mlperf-inference                         mahmood-x86_64                                           0eb62ccf7eae   2 days ago     46.5GB
mlperf-inference                         mahmood-x86_64-latest                                    fbc8447ed1c2   2 days ago     46.5GB
nvcr.io/nvidia/mlperf/mlperf-inference   mlpinf-v4.0-cuda12.2-cudnn8.9-x86_64-ubuntu20.04-public   34b056f25fae   4 months ago   14.5GB

Any idea what causes this?
