k8s pod: after running for a while, the GPU can no longer be found in the pod. Failed to initialize NVML: Unknown Error #981

Open
bilbilmyc opened this issue Oct 8, 2024 · 2 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

@bilbilmyc

Environment

k8s

k8s .123.7

docker

root@g007:/var/lib/kubelet# docker -v
Docker version 20.10.24, build 297e128

containerd

root@g007:/var/lib/kubelet# containerd -v
containerd github.com/containerd/containerd v1.6.20 2806fc1057397dbaeefbea0e4e17bddfbd388f38

device-plugin

root@pt13:~# kubectl -n kube-system get ds nvidia-device-plugin-daemonset -o yaml | grep image
      {"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"name":"nvidia-device-plugin-daemonset","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"name":"nvidia-device-plugin-ds"}},"template":{"metadata":{"labels":{"name":"nvidia-device-plugin-ds"}},"spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"no-gpu","operator":"NotIn","values":["enable"]}]}]}}},"containers":[{"env":[{"name":"FAIL_ON_INIT_ERROR","value":"false"}],"image":"nvcr.io/nvidia/k8s-device-plugin:v0.15.0","name":"nvidia-device-plugin-ctr","securityContext":{"allowPrivilegeEscalation":true,"capabilities":{"drop":["ALL"]}},"volumeMounts":[{"mountPath":"/var/lib/kubelet/device-plugins","name":"device-plugin"}]}],"priorityClassName":"system-node-critical","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Exists"}],"volumes":[{"hostPath":{"path":"/var/lib/kubelet/device-plugins"},"name":"device-plugin"}]}},"updateStrategy":{"type":"RollingUpdate"}}}
        image: nvcr.io/nvidia/k8s-device-plugin:v0.16.2
        imagePullPolicy: IfNotPresent

exec pod

root@pt13:~# kubectl -n shuzhifengqiao1 exec -it smiling-viva-6124-864b4f6cd7-r747r bash
tom@smiling-viva-6124-864b4f6cd7-r747r:~$ nvidia-smi
Failed to initialize NVML: Unknown Error
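
When `nvidia-smi` fails like this, one quick check is whether the NVIDIA device nodes are still visible inside the container. The helper below is only a diagnostic sketch: the function name `list_nvidia_nodes` is hypothetical, and it assumes the nodes live under `/dev` with names starting with `nvidia`. A node that is listed may still be inaccessible if the container's device permissions changed after startup, so treat the output as a hint rather than a verdict.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tool): list NVIDIA device
# nodes under a directory, defaulting to /dev.
list_nvidia_nodes() {
    dir="${1:-/dev}"
    found=0
    for node in "$dir"/nvidia*; do
        # An unmatched glob stays literal, so confirm the path exists.
        [ -e "$node" ] || continue
        # Long listing shows permissions and, for device nodes, major/minor numbers.
        ls -ld "$node"
        found=1
    done
    [ "$found" -eq 1 ] || echo "no nvidia device nodes under $dir"
}
```

Run `list_nvidia_nodes` inside the pod once while the GPU still works and again after the error appears, then compare the two listings.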

kubelet

The kubelet's cpuManagerPolicy is static:

root@g007:/var/lib/kubelet# cat /var/lib/kubelet/config.yaml | grep cpuManagerPolicy
cpuManagerPolicy: static
@jrhunger

Check this: NVIDIA/nvidia-container-toolkit#48

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 15, 2025