
bug: operator anti-pattern, validator pod deployments cause CrashLoopBackOff behaviour #1114

justinthelaw opened this issue Nov 13, 2024

justinthelaw commented Nov 13, 2024

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, amd64
  2. Kubernetes Distribution: K3s, K3d, RKE2
  3. Kubernetes Version: v1.30.4
  4. Host Node GPUs: NVIDIA RTX 4090 and 4070

DESCRIPTION

The NVIDIA GPU Operator validator contains hard-coded deployments of the CUDA validation and Plugin validation pods within the gpu-operator-validator daemonset's container. There is no way to influence how these pods are deployed via the values file, and there is no easy way to manipulate the workload pods via post-deploy actions (e.g., kubectl delete) without causing the validation daemonset to fail.
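
For reference, the baked-in manifests can be inspected from a running validator pod. This is a hedged example; the label and pod names are illustrative and may differ by chart version:

# List the hard-coded workload manifests shipped inside the validator container
# (label/pod names are illustrative; adjust to the chart version in use):
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator
kubectl -n gpu-operator exec <validator-pod-name> -- ls /var/nvidia/manifests
# Expected output, per the Dockerfile further below:
#   cuda-workload-validation.yaml
#   plugin-workload-validation.yaml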

This is a Kubernetes Operator anti-pattern for these reasons:

  1. Declarative Mismatch: Hardcoding breaks Kubernetes’ declarative model, reducing flexibility and forcing redeployments for changes.
  2. Reduced Flexibility: Users can’t easily customize pods without modifying the operator itself.
  3. Operator Role: Operators should automate operational knowledge, not act as static YAML deployment tools.
  4. Maintenance Complexity: Embedded manifests complicate testing, maintenance, and reusability.

PROBLEM STATEMENT

This anti-pattern also led to issues in our secure runtime stack. Our service mesh, Istio, must be implemented to secure ingress/egress using NetPols via internal CRs, adding another layer of defense to all services within the cluster. There are no exceptions to this rule: all namespaces must have Istio injection enabled, with explicit and justified pod-level exclusions (e.g., certain Jobs).
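
A minimal sketch of the namespace-level side of this policy, assuming the standard Istio injection label; the pod-level opt-out is the sidecar.istio.io/inject: "false" label shown in the example manifest further below:

# Sketch: namespace-wide Istio sidecar injection (standard Istio label);
# individual pods then opt out explicitly where justified.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    istio-injection: enabled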

To this end, we had to modify the existing gpu-operator-validator Dockerfile and the validation workload pod manifests to explicitly, rather than broadly, exclude sidecar injection in the validation pods; otherwise they would hang indefinitely. There was no way to apply a post-deployment patch that terminated the validation pods so that the deployment could continue. Our attempts (e.g., post-deploy CronJobs) led either to the gpu-operator-validator daemonset entering a pseudo-CrashLoopBackOff, where it would re-deploy the validation workload pods every 5 minutes, or to an actual CrashLoopBackOff, halting the overall deployment altogether.
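
For anyone reproducing this, the failure mode can be observed with something like the following (resource names are illustrative and may differ by release):

# Watch the validator daemonset and the validation workload pods it spawns:
kubectl -n gpu-operator get daemonset
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator --watch
# Deleting the validation pods out-of-band only causes the validator to
# re-create them on its next cycle, or drives the daemonset into CrashLoopBackOff.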

Ultimately, the modifications described above allowed the validations to run to completion and the NVIDIA GPU Operator deployment to finish without further issues, all while keeping Istio injection enabled in the gpu-operator namespace.

As a bonus issue, resource requests and limits are applied inconsistently to the initContainers and containers within both validation workload pods.

RECOMMENDATIONS

  1. Use CRDs, ConfigMaps, or plain Helm chart templates to drive resource creation dynamically, for flexibility and separation of concerns (a rough values-driven sketch follows this list).
  2. Modify the existing hard-coded deployment manifests for the validation pods to allow lower-level templating of the metadata, security context, and resources, among other fields.
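
As a purely illustrative sketch of recommendation 1, the validation workloads could be driven from the Helm values file. These keys are hypothetical and do not exist in the current chart; the resource values are copied from the example manifest below:

# Hypothetical values.yaml keys (not in the current chart) that a templated
# validation workload could consume for metadata, security context, and resources:
validator:
  cudaValidation:
    podLabels:
      sidecar.istio.io/inject: "false"
    resources:
      requests:
        cpu: 50m
        memory: 32Mi
      limits:
        cpu: 100m
        memory: 64Mi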

ADDITIONAL CONTEXT

Please note that it may also be possible to post-patch a ConfigMap containing these particular hard-coded manifests into the daemonset as well.
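
A rough, untested sketch of that ConfigMap approach, reusing the mount path from the Dockerfile below and assuming the validator daemonset already defines volumes and volumeMounts arrays:

# Untested sketch: package the customized manifests in a ConfigMap...
kubectl -n gpu-operator create configmap validator-manifests \
  --from-file=cuda-workload-validation.yaml \
  --from-file=plugin-workload-validation.yaml

# ...then mount it over the image's hard-coded copies at /var/nvidia/manifests
# (daemonset name is illustrative; adjust to your release):
kubectl -n gpu-operator patch daemonset nvidia-operator-validator --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "validator-manifests", "configMap": {"name": "validator-manifests"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "validator-manifests", "mountPath": "/var/nvidia/manifests"}}
]'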

Modified Dockerfile

ARG OPERATOR_VALIDATOR_IMAGE="nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2"

FROM $OPERATOR_VALIDATOR_IMAGE

RUN rm -rf /var/nvidia/manifests/cuda-workload-validation.yaml /var/nvidia/manifests/plugin-workload-validation.yaml

COPY ./src/validator-image/manifests/cuda-workload-validation.yaml /var/nvidia/manifests
COPY ./src/validator-image/manifests/plugin-workload-validation.yaml /var/nvidia/manifests

ENTRYPOINT ["/bin/bash"]
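
For completeness, a hedged sketch of building this image and pointing the operator at it. The registry is a placeholder, and the validator.* values keys should be verified against the chart version in use:

# Build and push the customized validator image (registry is a placeholder):
docker build -t registry.example.com/gpu-operator-validator:v24.6.2-custom .
docker push registry.example.com/gpu-operator-validator:v24.6.2-custom

# Point the chart at the custom image:
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
  --set validator.repository=registry.example.com \
  --set validator.image=gpu-operator-validator \
  --set validator.version=v24.6.2-custom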

Example CUDA validation workload manifest

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nvidia-cuda-validator
    sidecar.istio.io/inject: "false" # added line of most importance, other additions are secondary
  generateName: nvidia-cuda-validator-
  namespace: "FILLED_BY_THE_VALIDATOR"
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  restartPolicy: OnFailure
  serviceAccountName: nvidia-operator-validator
  initContainers:
    - name: cuda-validation
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c"]
      args: ["vectorAdd"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      securityContext:
        privileged: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
  containers:
    - name: nvidia-cuda-validator
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      # override command and args as validation is already done by initContainer
      command: ["sh", "-c"]
      args: ["echo cuda workload validation is successful"]
      securityContext:
        privileged: true
        readOnlyRootFilesystem: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
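
After deploying with the modified image, the exclusion can be verified by confirming the validation pod runs without an istio-proxy sidecar and still completes (illustrative commands):

# Confirm the validation pod has no injected istio-proxy container and completes:
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator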