[tfy-gpu-operator] Drop ns override, update container toolkit, configure service monitors, expand generic values (#826)

* Drop namespace override in tfy-gpu-operator

* Update README.md with readme-generator-for-helm

Signed-off-by: chiragjn <[email protected]>

* Add defaults for generic cluster

* Enable mig manager

* Update README.md with readme-generator-for-helm

Signed-off-by: chiragjn <[email protected]>

* Enable service monitors

* Fix doc strings

* Update README.md with readme-generator-for-helm

Signed-off-by: chiragjn <[email protected]>

* Disable service monitors by default

* Update README.md with readme-generator-for-helm

Signed-off-by: chiragjn <[email protected]>

---------

Signed-off-by: chiragjn <[email protected]>
Co-authored-by: chiragjn <[email protected]>
chiragjn and chiragjn authored Nov 26, 2024
1 parent 0ac937b commit 23eb954
Showing 3 changed files with 348 additions and 21 deletions.
2 changes: 1 addition & 1 deletion charts/tfy-gpu-operator/Chart.yaml
@@ -1,6 +1,6 @@
apiVersion: v2
name: tfy-gpu-operator
-version: 0.1.22
+version: 0.1.23
description: "Truefoundry GPU Operator"
maintainers:
- name: truefoundry
51 changes: 44 additions & 7 deletions charts/tfy-gpu-operator/README.md
@@ -27,7 +27,7 @@ Tfy-gpu-operator is a Helm chart that facilitates the deployment and management
| `aws-eks-gpu-operator.operator.resources.limits.memory` | Memory limit for the operator. | `300Mi` |
| `aws-eks-gpu-operator.driver.enabled` | Enable/Disable driver installation. | `false` |
| `aws-eks-gpu-operator.toolkit.enabled` | Enable/Disable nvidia container toolkit installation. | `true` |
-| `aws-eks-gpu-operator.toolkit.version` | Version of the toolkit. | `v1.17.0-ubi8` |
+| `aws-eks-gpu-operator.toolkit.version` | Version of the toolkit. | `v1.17.2-ubi8` |
| `aws-eks-gpu-operator.devicePlugin.enabled` | Enable/Disable nvidia device plugin installation. | `true` |
| `aws-eks-gpu-operator.node-feature-discovery.enableNodeFeatureApi` | Enable/Disable node feature api in node-feature-discovery. | `true` |
| `aws-eks-gpu-operator.node-feature-discovery.master.resources.requests.cpu` | CPU request for master node feature discovery. | `10m` |
@@ -84,7 +84,6 @@ Tfy-gpu-operator is a Helm chart that facilitates the deployment and management
| `gcp-gke-standard-dcgm-exporter.resources.requests.memory` | Memory request for the DCGM Exporter. | `300Mi` |
| `gcp-gke-standard-dcgm-exporter.resources.limits.cpu` | CPU limit for the DCGM Exporter. | `50m` |
| `gcp-gke-standard-dcgm-exporter.resources.limits.memory` | Memory limit for the DCGM Exporter. | `400Mi` |
-| `gcp-gke-standard-dcgm-exporter.namespaceOverride` | Namespace override for the DCGM Exporter. | `tfy-gpu-operator` |
| `gcp-gke-standard-dcgm-exporter.serviceMonitor.enabled` | Enable or disable ServiceMonitor for DCGM Exporter. | `false` |
| `gcp-gke-standard-dcgm-exporter.mapPodsMetrics` | Enable mapping of pod metrics. | `true` |
| `gcp-gke-standard-dcgm-exporter.securityContext.privileged` | Set the container to privileged mode. | `true` |
@@ -123,7 +122,7 @@ Tfy-gpu-operator is a Helm chart that facilitates the deployment and management
| `azure-aks-gpu-operator.daemonsets.priorityClassName` | Priority class for Daemonsets | `system-node-critical` |
| `azure-aks-gpu-operator.driver.enabled` | Enable/Disable driver installation. | `false` |
| `azure-aks-gpu-operator.toolkit.enabled` | Enable/Disable nvidia container toolkit installation. | `true` |
-| `azure-aks-gpu-operator.toolkit.version` | Version of the toolkit. Note for Aure Linux change `-ubuntu20.04` to `-ubi8`. However at the time of writing Azure Linux only supports V100 and T4 GPUs | `v1.17.0-ubuntu20.04` |
+| `azure-aks-gpu-operator.toolkit.version` | Version of the toolkit. Note for Aure Linux change `-ubuntu20.04` to `-ubi8`. However at the time of writing Azure Linux only supports V100 and T4 GPUs | `v1.17.2-ubuntu20.04` |
| `azure-aks-gpu-operator.mig.strategy` | migStrategy for mig node, single or mixed | `mixed` |
| `azure-aks-gpu-operator.devicePlugin.enabled` | Enable/Disable nvidia device plugin installation. | `true` |
| `azure-aks-gpu-operator.dcgm.enabled` | Enabled/Disable standalone DCGM. | `false` |
@@ -184,7 +183,45 @@ Tfy-gpu-operator is a Helm chart that facilitates the deployment and management

### generic-gpu-operator Configuration for the GPU Operator. This section will only be used when clusterType.generic is set to true.

-| Name | Description | Value |
-| ------------------------------------------ | ------------------------------ | ------ |
-| `generic-gpu-operator.operator.upgradeCRD` | upgrade CRD on chart upgrade | `true` |
-| `generic-gpu-operator.operator.cleanupCRD` | cleanup CRD on chart uninstall | `true` |
+| Name | Description | Value |
+| ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- | ------------------------- |
+| `generic-gpu-operator.nfd.enabled` | Enable/Disable node feature discovery. | `true` |
+| `generic-gpu-operator.gfd.enabled` | Enable/Disable gpu feature discovery. | `true` |
+| `generic-gpu-operator.operator.upgradeCRD` | upgrade CRD on chart upgrade | `true` |
+| `generic-gpu-operator.operator.cleanupCRD` | cleanup CRD on chart uninstall | `true` |
+| `generic-gpu-operator.operator.resources.requests.cpu` | CPU request for the operator. | `10m` |
+| `generic-gpu-operator.operator.resources.requests.memory` | Memory request for the operator. | `100Mi` |
+| `generic-gpu-operator.operator.resources.limits.cpu` | CPU limit for the operator. | `50m` |
+| `generic-gpu-operator.operator.resources.limits.memory` | Memory limit for the operator. | `300Mi` |
+| `generic-gpu-operator.node-feature-discovery.enableNodeFeatureApi` | Enable/Disable node feature api in node-feature-discovery. | `true` |
+| `generic-gpu-operator.node-feature-discovery.master.resources.requests.cpu` | CPU request for master node feature discovery. | `10m` |
+| `generic-gpu-operator.node-feature-discovery.master.resources.requests.memory` | Memory request for master node feature discovery. | `200Mi` |
+| `generic-gpu-operator.node-feature-discovery.worker.resources.requests.cpu` | CPU request for worker node feature discovery. | `10m` |
+| `generic-gpu-operator.node-feature-discovery.worker.resources.requests.memory` | Memory request for worker node feature discovery. | `100Mi` |
+| `generic-gpu-operator.node-feature-discovery.worker.resources.limits.cpu` | CPU limit for worker node feature discovery. | `50m` |
+| `generic-gpu-operator.node-feature-discovery.worker.resources.limits.memory` | Memory limit for worker node feature discovery. | `300Mi` |
+| `generic-gpu-operator.node-feature-discovery.gc.enable` | Enable node feature discovery garbage collector. | `true` |
+| `generic-gpu-operator.node-feature-discovery.gc.interval` | Interval between two garbage collection runs. | `30m` |
+| `generic-gpu-operator.node-feature-discovery.gc.resources.requests.cpu` | CPU request for node feature discovery garbage collector. | `10m` |
+| `generic-gpu-operator.node-feature-discovery.gc.resources.requests.memory` | Memory request for node feature discovery garbage collector. | `100Mi` |
+| `generic-gpu-operator.daemonsets.updateStrategy` | Update Strategy for Daemonsets - one of ["OnDelete", "RollingUpdate"] | `OnDelete` |
+| `generic-gpu-operator.daemonsets.priorityClassName` | Priority class for Daemonsets | `system-node-critical` |
+| `generic-gpu-operator.driver.enabled` | Enable/Disable driver installation. | `true` |
+| `generic-gpu-operator.toolkit.enabled` | Enable/Disable nvidia container toolkit installation. | `true` |
+| `generic-gpu-operator.toolkit.version` | Version of the toolkit. | `v1.17.2-ubuntu20.04` |
+| `generic-gpu-operator.mig.strategy` | migStrategy for mig node, single or mixed | `mixed` |
+| `generic-gpu-operator.devicePlugin.enabled` | Enable/Disable nvidia device plugin installation. | `true` |
+| `generic-gpu-operator.dcgm.enabled` | Enabled/Disable standalone DCGM. | `false` |
+| `generic-gpu-operator.dcgm.version` | Image tag for DCGM container. Find all image tags at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/dcgm/tags | `3.3.8-1-ubuntu22.04` |
+| `generic-gpu-operator.dcgm.resources.requests.cpu` | CPU request for standalone DCGM container | `10m` |
+| `generic-gpu-operator.dcgm.resources.requests.memory` | Memory request for standalone DCGM container | `100Mi` |
+| `generic-gpu-operator.dcgm.resources.limits.cpu` | CPU limit for standalone DCGM container | `100m` |
+| `generic-gpu-operator.dcgm.resources.limits.memory` | Memory limit for standalone DCGM container | `1000Mi` |
+| `generic-gpu-operator.dcgmExporter.enabled` | Enabled/Disable DCGM Exporter. | `true` |
+| `generic-gpu-operator.dcgmExporter.version` | Image tag version for DCGM Exporter. Find all tags at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags | `3.3.8-3.6.0-ubuntu22.04` |
+| `generic-gpu-operator.dcgmExporter.serviceMonitor.enabled` | Enable or disable ServiceMonitor for DCGM Exporter. | `false` |
+| `generic-gpu-operator.dcgmExporter.resources.requests.cpu` | CPU request for the DCGM Exporter. | `10m` |
+| `generic-gpu-operator.dcgmExporter.resources.requests.memory` | Memory request for the DCGM Exporter. | `100Mi` |
+| `generic-gpu-operator.dcgmExporter.resources.limits.cpu` | CPU limit for the DCGM Exporter. | `100m` |
+| `generic-gpu-operator.dcgmExporter.resources.limits.memory` | Memory limit for the DCGM Exporter. | `1000Mi` |
+| `generic-gpu-operator.dcgmExporter.args` | Arguments for the DCGM Exporter. | `["-c","5000"]` |
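
For readers who want to try the expanded generic values, the fragment below is a minimal sketch of a values override, not part of this commit. It assumes the dotted names in the table map directly to nested YAML keys (as readme-generator-for-helm tables normally do), and the file name `values-generic.yaml` is hypothetical. Every key and default shown comes from the table above; the only deviation from the defaults is turning the ServiceMonitor on, which presumably requires the Prometheus Operator CRDs to be present in the cluster.

```yaml
# values-generic.yaml (hypothetical file name, illustration only)
clusterType:
  generic: true                 # the generic-gpu-operator section is only used when this is true

generic-gpu-operator:
  toolkit:
    enabled: true
    version: v1.17.2-ubuntu20.04   # default after this commit
  mig:
    strategy: mixed                # "single" or "mixed"
  dcgmExporter:
    enabled: true
    serviceMonitor:
      enabled: true                # chart default is false; assumes Prometheus Operator CRDs exist
    args: ["-c", "5000"]
```

Keys left out of the override keep the chart defaults listed in the table.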