
The resource requests and limits are not being applied to the pod as expected. #1145

Open
IndhumithaR opened this issue Nov 28, 2024 · 0 comments

IndhumithaR commented Nov 28, 2024

GPU operator version: v24.6.1
Driver version: 535.154.05
Device plugin version: v0.16.2-ubi8

Kubernetes distribution: EKS
Kubernetes version: v1.27.0

Hi,

We attempted to install the NVIDIA driver directly in our node's base image instead of letting the GPU operator manage it. After doing so, the GPU resource requests and limits set on our pods are no longer enforced: every container in the pod can access all of the GPUs on the node.
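Note that scheduling itself still works: the pod below requests nvidia.com/gpu and does get scheduled and started, so the device plugin is registering the resource with the kubelet. The problem appears to be per-container isolation, not scheduling. As a quick sanity check (<node-name> is a placeholder for the g5.48xlarge node running the pod), the node should report nvidia.com/gpu under Capacity and Allocatable:

# Show the GPU resource the node advertises to the scheduler
kubectl describe node <node-name> | grep "nvidia.com/gpu"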

Sample pod spec

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-pod-3
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - g5.48xlarge
  containers:
  - name: nvidia-smi-container
    image: nvidia/cuda:12.6.2-cudnn-devel-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 5
      requests:
        nvidia.com/gpu: 5
    securityContext:
      capabilities:
        add:
        - SYS_NICE
      privileged: true
  tolerations:
  - key: "nvidia.com/gpu"
    value: "true"
    effect: "NoSchedule"

Here I am setting both the request and the limit to 5 GPUs. But when I exec into the container and check, I can see all 8 GPUs.
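For reference, this is roughly how we checked, using the pod name from the spec above (the second command assumes the device plugin's default envvar device-list strategy, under which the allocated GPUs are passed to the runtime via NVIDIA_VISIBLE_DEVICES):

# List the GPUs visible inside the container; all 8 show up instead of 5
kubectl exec nvidia-smi-pod-3 -- nvidia-smi -L

# Show which GPUs the device plugin actually allocated to this container
kubectl exec nvidia-smi-pod-3 -- printenv NVIDIA_VISIBLE_DEVICES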

[Screenshot: nvidia-smi output inside the container, showing all 8 GPUs]

However, when we ran the same pod in a different environment where the same driver version was installed by the GPU operator (instead of being baked into the base image), it worked as expected and the container only saw the GPUs it was allocated.

[Screenshot: nvidia-smi output in the GPU-operator-managed environment, showing the expected GPU allocation]
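When comparing the two environments, the only difference on the Kubernetes side is how the NVIDIA components were deployed. A rough way to compare what is actually running on each cluster (the gpu-operator namespace is the operator's common default; names and namespaces may differ per install):

# Working cluster: components managed by the GPU operator
kubectl get pods -n gpu-operator

# Broken cluster: list whatever NVIDIA components are deployed
kubectl get pods -A | grep -i nvidia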

What could be the problem? Is there a way to fix it?
