
bug: operator anti-pattern, validator pod deployments cause CrashLoopBackOff behaviour #1114

justinthelaw opened this issue Nov 13, 2024

justinthelaw commented Nov 13, 2024

HOST INFORMATION

  1. OS and Architecture: Ubuntu 22.04, amd64
  2. Kubernetes Distribution: K3s, K3d, RKE2
  3. Kubernetes Version: v1.30.4
  4. Host Node GPUs: NVIDIA RTX 4090 and 4070

DESCRIPTION

The NVIDIA GPU Operator validator contains hard-coded deployments of the CUDA validation and Plugin validation pods within the gpu-operator-validator daemonset's container. There is no way to influence how these pods are deployed via the values file, and there is no easy way to manipulate the workload pods via post-deploy actions (e.g., kubectl delete) without causing the validation daemonset to fail.
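
For reference, the baked-in manifests can be inspected from a running validator pod. This is a hedged example; the label and pod names are illustrative and may differ by chart version:

# List the hard-coded workload manifests shipped inside the validator container
# (label/pod names are illustrative; adjust to the chart version in use):
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator
kubectl -n gpu-operator exec <validator-pod-name> -- ls /var/nvidia/manifests
# Expected output, per the Dockerfile further below:
#   cuda-workload-validation.yaml
#   plugin-workload-validation.yaml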

This is a Kubernetes Operator anti-pattern for these reasons:

  1. Declarative Mismatch: Hardcoding breaks Kubernetes’ declarative model, reducing flexibility and forcing redeployments for changes.
  2. Reduced Flexibility: Users can’t easily customize pods without modifying the operator itself.
  3. Operator Role: Operators should automate operational knowledge, not act as static YAML deployment tools.
  4. Maintenance Complexity: Embedded manifests complicate testing, maintenance, and reusability.

PROBLEM STATEMENT

This anti-pattern also led to issues in our secure runtime stack. Our service mesh, Istio, must be implemented to secure ingress/egress using NetPols via internal CRs, adding another layer of defense to all services within the cluster. There are no exceptions to this rule: all namespaces must have Istio injection enabled, with explicit and justified pod-level exclusions (e.g., certain Jobs).
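
A minimal sketch of the namespace-level side of this policy, assuming the standard Istio injection label; the pod-level opt-out is the sidecar.istio.io/inject: "false" label shown in the example manifest further below:

# Sketch: namespace-wide Istio sidecar injection (standard Istio label);
# individual pods then opt out explicitly where justified.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    istio-injection: enabled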

To this end, we had to modify the existing gpu-operator-validator Dockerfile and the validation workload pod manifests to explicitly, rather than broadly, exclude sidecar injection in the validation pods; otherwise they would hang indefinitely. There was no way to apply a post-deployment patch that terminated the validation pods so that the deployment could continue. Our attempts (e.g., post-deploy CronJobs) led either to the gpu-operator-validator daemonset entering a pseudo-CrashLoopBackOff, where it would re-deploy the validation workload pods every 5 minutes, or to an actual CrashLoopBackOff, halting the overall deployment altogether.
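
For anyone reproducing this, the failure mode can be observed with something like the following (resource names are illustrative and may differ by release):

# Watch the validator daemonset and the validation workload pods it spawns:
kubectl -n gpu-operator get daemonset
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator --watch
# Deleting the validation pods out-of-band only causes the validator to
# re-create them on its next cycle, or drives the daemonset into CrashLoopBackOff.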

Ultimately, the modifications described above allowed the validations to run to completion and the NVIDIA GPU Operator deployment to finish without further issues, all while keeping Istio injection enabled in the gpu-operator namespace.

As a bonus issue, resource requests and limits are applied inconsistently to the initContainers and containers within both validation workload pods.

RECOMMENDATIONS

  1. Use CRDs, ConfigMaps, or plain Helm chart templates to drive resource creation dynamically, for flexibility and separation of concerns (a rough values-driven sketch follows this list).
  2. Modify the existing hard-coded deployment manifests for the validation pods to allow lower-level templating of the metadata, security context, and resources, among other fields.
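
As a purely illustrative sketch of recommendation 1, the validation workloads could be driven from the Helm values file. These keys are hypothetical and do not exist in the current chart; the resource values are copied from the example manifest below:

# Hypothetical values.yaml keys (not in the current chart) that a templated
# validation workload could consume for metadata, security context, and resources:
validator:
  cudaValidation:
    podLabels:
      sidecar.istio.io/inject: "false"
    resources:
      requests:
        cpu: 50m
        memory: 32Mi
      limits:
        cpu: 100m
        memory: 64Mi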

ADDITIONAL CONTEXT

Please note that it may also be possible to post-patch a ConfigMap containing these particular hard-coded manifests into the daemonset as well.
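
A rough, untested sketch of that ConfigMap approach, reusing the mount path from the Dockerfile below and assuming the validator daemonset already defines volumes and volumeMounts arrays:

# Untested sketch: package the customized manifests in a ConfigMap...
kubectl -n gpu-operator create configmap validator-manifests \
  --from-file=cuda-workload-validation.yaml \
  --from-file=plugin-workload-validation.yaml

# ...then mount it over the image's hard-coded copies at /var/nvidia/manifests
# (daemonset name is illustrative; adjust to your release):
kubectl -n gpu-operator patch daemonset nvidia-operator-validator --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "validator-manifests", "configMap": {"name": "validator-manifests"}}},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"name": "validator-manifests", "mountPath": "/var/nvidia/manifests"}}
]'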

Modified Dockerfile

ARG OPERATOR_VALIDATOR_IMAGE="nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2"

FROM $OPERATOR_VALIDATOR_IMAGE

RUN rm -rf /var/nvidia/manifests/cuda-workload-validation.yaml /var/nvidia/manifests/plugin-workload-validation.yaml

COPY ./src/validator-image/manifests/cuda-workload-validation.yaml /var/nvidia/manifests
COPY ./src/validator-image/manifests/plugin-workload-validation.yaml /var/nvidia/manifests

ENTRYPOINT ["/bin/bash"]
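
For completeness, a hedged sketch of building this image and pointing the operator at it. The registry is a placeholder, and the validator.* values keys should be verified against the chart version in use:

# Build and push the customized validator image (registry is a placeholder):
docker build -t registry.example.com/gpu-operator-validator:v24.6.2-custom .
docker push registry.example.com/gpu-operator-validator:v24.6.2-custom

# Point the chart at the custom image:
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
  --set validator.repository=registry.example.com \
  --set validator.image=gpu-operator-validator \
  --set validator.version=v24.6.2-custom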

Example CUDA validation workload manifest

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nvidia-cuda-validator
    sidecar.istio.io/inject: "false" # added line of most importance, other additions are secondary
  generateName: nvidia-cuda-validator-
  namespace: "FILLED_BY_THE_VALIDATOR"
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  restartPolicy: OnFailure
  serviceAccountName: nvidia-operator-validator
  initContainers:
    - name: cuda-validation
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c"]
      args: ["vectorAdd"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      securityContext:
        privileged: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
  containers:
    - name: nvidia-cuda-validator
      image: "FILLED_BY_THE_VALIDATOR"
      imagePullPolicy: IfNotPresent
      # override command and args as validation is already done by initContainer
      command: ["sh", "-c"]
      args: ["echo cuda workload validation is successful"]
      securityContext:
        privileged: true
        readOnlyRootFilesystem: true
      # add resources and limits
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 100m
          memory: 64Mi
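
After deploying with the modified image, the exclusion can be verified by confirming the validation pod runs without an istio-proxy sidecar and still completes (illustrative commands):

# Confirm the validation pod has no injected istio-proxy container and completes:
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
kubectl -n gpu-operator get pods -l app=nvidia-cuda-validator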