Documentation for GPU Support. #192

Merged
merged 7 commits into from
May 8, 2024
1 change: 1 addition & 0 deletions docs/make.jl
@@ -28,6 +28,7 @@ makedocs(
"Operating Systems" => "overview/os.md",
"Kubernetes Integration" => "overview/kubernetes.md",
"Isolated Kubernetes" => "overview/isolated-kubernetes.md",
"GPU Support" => "overview/gpu-support.md",
"Storage" => "overview/storage.md",
"Comparison" => "overview/comparison.md",
],
2 changes: 1 addition & 1 deletion docs/src/development/proposals/MEP1/README.md
@@ -72,7 +72,7 @@ One exception is the `metal-console` service which must have the partition in it

### State

In order to replicate certain data which must be available across all partitions we can use on of the existing open source databases which enable such kind of setup. There are a few avaible out there, the following uncomplete list will highlight the pro´s and cons of each.
In order to replicate certain data which must be available across all partitions we can use on of the existing open source databases which enable such kind of setup. There are a few available out there, the following incomplete list will highlight the pro´s and cons of each.

- RethinkDB

2 changes: 1 addition & 1 deletion docs/src/development/proposals/MEP11/README.md
@@ -9,7 +9,7 @@ In this proposal we want to introduce a flexible and low-maintenance approach fo
In general our auditing logs will be collected by a request interceptor or middleware. Every request and response will be processed and eventually logged to Meilisearch.
Meilisearch will be configured to regularly create chunks of the auditing logs. These finished chunks will be backed up to a S3 compatible storage with a read-only option enabled.

Of course sensitve data like session keys or passwords will be redacted before logging. We want to track relevant requests and responses. If auditing the request fails, the request itself will be aborted and will not be processed further. The requests and responses that will be audited will be annotated with a correlation id.
Of course sensitive data like session keys or passwords will be redacted before logging. We want to track relevant requests and responses. If auditing the request fails, the request itself will be aborted and will not be processed further. The requests and responses that will be audited will be annotated with a correlation id.

Transferring the meilisearch auditing data chunks to the S3 compatible storage will be done by a sidecar cronjob that is executed periodically.
To avoid data manipulation the S3 compatible storage will be configured to be read-only.
2 changes: 1 addition & 1 deletion docs/src/development/proposals/MEP12/partitioning.md
@@ -40,7 +40,7 @@ _footer: ""

- Fully independent locations with own storage and own node networks
- Clusters can only be created independent in every location
- Failover mechanism for deployed applications requires duplicated deployments, which can serve indepedently
- Failover mechanism for deployed applications requires duplicated deployments, which can serve independently
- Failover through BGP
- If cluster nodes are spread across partitions (not implemented yet), nodes will not be able to reach each other
- Would require an overlay network for inter-node-communication
2 changes: 1 addition & 1 deletion docs/src/development/proposals/MEP5/README.md
@@ -40,7 +40,7 @@ Firewalls that access shared networks need to:

![Advanced Setup](./shared_advanced.png)

## Getting internet acccess
## Getting internet access

Machines contained in a shared network can access the internet with different scenarios:

2 changes: 1 addition & 1 deletion docs/src/development/roadmap.md
@@ -28,7 +28,7 @@ We incorporate community feedback into the roadmap. If you think that important
- Autoscaler for metal control plane components
- CI dashboard and public integration testing
- Cilium as the default CNI for metal-stack on Gardener K8s clusters
- Improved release and deploy processes (GitOps, [Spinnaker](https://spinnaker.io/), [Flux](https://www.weave.works/oss/flux/))
- Improved release and deploy processes (GitOps, [Spinnaker](https://spinnaker.io/), [Flux](https://fluxcd.io/))
- Machine internet without firewalls
- metal-stack dashboard (UI)
- Offer our metal-stack extensions as enterprise products (accounting, cluster-api, S3) (neither of them will ever be required for running metal-stack, they just add extra value for certain enterprises)
8 changes: 4 additions & 4 deletions docs/src/external/firewall-controller/README.md
@@ -13,7 +13,7 @@ Additional an IDS is managed on the firewall to detect known network anomalies.

For every `Service` of type `LoadBalancer` in the cluster, the corresponding ingress rules will be automatically generated.

If `loadBalancerSourceRanges` is not specified, incomig traffic to this service will be allowed for any source ip adresses.
If `loadBalancerSourceRanges` is not specified, incomig traffic to this service will be allowed for any source ip addresses.

## Configuration

@@ -29,16 +29,16 @@ metadata:
namespace: firewall
name: firewall
spec:
# Interval of reconcilation if nftables rules and network traffic accounting
# Interval of reconciliation if nftables rules and network traffic accounting
interval: 10s
# Ratelimits specify on which physical interface, which maximum rate of traffic is allowed
ratelimits:
# The name of the interface visible with ip link show
- interface: vrf104009
# The maximum rate in MBits/s
rate: 10
# Internalprefixes defines a list of prefixes where the traffic going to, or comming from is considered internal, e.g. not leaving into external networks
# given the archictecture picture above this would be:
# Internalprefixes defines a list of prefixes where the traffic going to, or coming from is considered internal, e.g. not leaving into external networks
# given the architecture picture above this would be:
internalprefixes:
- "1.2.3.0/24
- "172.17.0.0/16"
2 changes: 1 addition & 1 deletion docs/src/external/mini-lab/README.md
@@ -23,7 +23,7 @@ The mini-lab is a small, virtual setup to locally run the metal-stack. It deploy
- kvm as hypervisor for the VMs (you can check through the `kvm-ok` command)
- [docker](https://www.docker.com/) >= 20.10.13 (for using kind and our deployment base image)
- [kind](https://github.com/kubernetes-sigs/kind/releases) == v0.20.0 (for hosting the metal control plane)
- [containerlab](https://containerlab.srlinux.dev/install/) >= v0.47.1
- [containerlab](https://containerlab.dev/install/) >= v0.47.1
- the lab creates a docker network on your host machine (`172.17.0.1`), this hopefully does not overlap with other networks you have
- (recommended) haveged to have enough random entropy (only needed if the PXE process does not work)

4 changes: 2 additions & 2 deletions docs/src/installation/deployment.md
@@ -612,7 +612,7 @@ The Edgerouters has to fulfill some requirements including:

### Management Servers

The second bastion hosts are the management servers. They are the main bootstrapping components of the Out-Of-Band-Network. They also act as jump hosts for all components in a partition. Once they are installed and deployed, we are able to bootstrap all the other components. To bootstrap the management servers, we generate an ISO image which will automatically install an OS and an ansible user with ssh keys. It is preconfigured with a preseed file to allow an unattended OS installation for our needs. This is why we need remote access to the IPMI interface of the management servers: The generated ISO is attached via the virtual media function of the BMC. Ater that, all we have to do is boot from that virtual CD-ROM and wait for the installation to finish. Deployment jobs (Gitlab-CI) in a partition are delegated to the appropriate management servers, therefore we need a CI runner active on each management server.
The second bastion hosts are the management servers. They are the main bootstrapping components of the Out-Of-Band-Network. They also act as jump hosts for all components in a partition. Once they are installed and deployed, we are able to bootstrap all the other components. To bootstrap the management servers, we generate an ISO image which will automatically install an OS and an ansible user with ssh keys. It is preconfigured with a preseed file to allow an unattended OS installation for our needs. This is why we need remote access to the IPMI interface of the management servers: The generated ISO is attached via the virtual media function of the BMC. After that, all we have to do is boot from that virtual CD-ROM and wait for the installation to finish. Deployment jobs (Gitlab-CI) in a partition are delegated to the appropriate management servers, therefore we need a CI runner active on each management server.

After the CI runner has been installed, you can trigger your Playbooks from the the CI. The Ansible-Playbooks have to make sure that these functionalities are present on the management servers:

@@ -657,7 +657,7 @@ You can find installation instructions for Gardener on the Gardener website bene
1. Add a [cloud profile](https://github.com/gardener/gardener/blob/v1.3.3/example/30-cloudprofile.yaml) called `metal` containing all your machine images, machine types and regions (region names can be chosen freely, the zone names need to match your partition names) together with our metal-stack-specific provider config as defined [here](https://github.com/metal-stack/gardener-extension-provider-metal/blob/v0.9.1/pkg/apis/metal/v1alpha1/types_cloudprofile.go)
1. Register the [gardener-extension-provider-metal](https://github.com/metal-stack/gardener-extension-provider-metal) controller by deploying the [controller-registration](https://github.com/metal-stack/gardener-extension-provider-metal/blob/v0.9.1/example/controller-registration.yaml) into your Gardener cluster, parametrize the embedded chart in the controller registration's values section if necessary ([this](https://github.com/metal-stack/gardener-extension-provider-metal/tree/v0.9.1/charts/provider-metal) is the corresponding values file)
1. metal-stack does not provide an own backup storage infrastructure for now. If you want to enable ETCD backups (which you should do because metal-stack also does not have persistent storage out of the box, which makes these backups even more valuable), you should deploy an extension-provider of another cloud provider and configure it to only reconcile the backup buckets (you can reference this backup infrastructure used for the metal shoot in the shoot spec)
1. Register the [os-extension-provider-metal](https://github.com/metal-stack/os-metal-extension) controller by deploying the [controller-registration](https://github.com/metal-stack/os-metal-extension/blob/v0.4.1/example/controller-registration.yaml) into your Gardener cluster, this controller can transform the operating system configuration from Gardener into Ingition user data
1. Register the [os-extension-provider-metal](https://github.com/metal-stack/os-metal-extension) controller by deploying the [controller-registration](https://github.com/metal-stack/os-metal-extension/blob/v0.4.1/example/controller-registration.yaml) into your Gardener cluster, this controller can transform the operating system configuration from Gardener into Ignition user data
1. You need to use the Gardener's [networking-calico](https://github.com/gardener/gardener-extension-networking-calico) controller for setting up shoot CNI, you will have to put specific provider configuration into the shoot spec to make it work with metal-stack:
```yaml
networking:
2 changes: 1 addition & 1 deletion docs/src/overview/architecture.md
@@ -106,7 +106,7 @@ The following figure shows several partitions connected to a single metal contro
Some notes on this picture:

- By design, a partition only has very few ports open for incoming-connections from the internet. This contributes to a smaller attack surface and higher security of your infrastructure.
- With the help of NSQ, it is not required to have connections from the metal control plane to the metal-core. The metal-core instances register at the message bus and can then consume partition-specfic topics, e.g. when a machine deletion gets issued by a user.
- With the help of NSQ, it is not required to have connections from the metal control plane to the metal-core. The metal-core instances register at the message bus and can then consume partition-specific topics, e.g. when a machine deletion gets issued by a user.

## Machine Provisioning Sequence

62 changes: 62 additions & 0 deletions docs/src/overview/gpu-support.md
@@ -0,0 +1,62 @@
# GPU Support

```@contents
Pages = ["gpu-support.md"]
Depth = 5
```

For workloads that require GPU acceleration, support for GPUs in bare metal servers was added with metal-stack.io v0.18.0.

## GPU Operator installation

With the nvidia image, a worker node has basic GPU support. This means that the required kernel driver, the containerd shim and the matching containerd configuration are already installed and configured.
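
If you have shell access to such a worker node, the presence of these pieces can be verified with generic checks like the following (these commands are only illustrative and not specific to metal-stack):

```bash
# the kernel driver is loaded and the GPU is visible
nvidia-smi

# containerd is configured with the nvidia runtime (may require root)
grep -A 3 "nvidia" /etc/containerd/config.toml
```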

To enable `Pods` that require GPU support to be scheduled on a worker node with a GPU, the `gpu-operator` must be installed.
This has to be done by the cluster owner after the cluster is up and running.

The simplest way to install this operator is as follows:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

helm install --wait \
--generate-name \
--namespace gpu-operator \
--create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
```
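
Before inspecting the node, it can be useful to verify that the operator components came up (a generic check, not part of the original instructions):

```bash
# all gpu-operator pods should eventually reach the Running or Completed state
kubectl -n gpu-operator get pods
```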

After that, `kubectl describe node` should show the GPU in the node's capacity like so:

```plain
...
Capacity:
cpu: 64
ephemeral-storage: 100205640Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263802860Ki
nvidia.com/gpu: 1
pods: 510
...
```

With this basic installation, the worker node is ready to process GPU workloads.
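
As a quick scheduling check, a minimal test `Pod` that requests the GPU resource could look like the following sketch; the pod name and image tag are placeholders, any CUDA-enabled image works:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04    # example tag, substitute your own workload image
      command: ["nvidia-smi"]                       # prints the visible GPU if scheduling worked
      resources:
        limits:
          nvidia.com/gpu: 1                         # request exactly one GPU
```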

!!! warning
However, there is a caveat: only one `Pod` can access the GPU. If this is all you need, no additional configuration is required.
On the other hand, if you are planning to deploy multiple applications that require GPU support, and there are not that many GPUs available, you will need to configure the `gpu-operator` to allow the GPU to be shared between multiple `Pods`.

There are several approaches to sharing GPUs; please consult the official NVIDIA documentation for further reference. A minimal sketch of one such approach (time-slicing) follows the links below.

[https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes](https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes)
[https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html)
[https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html)
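
As an illustration only (based on the NVIDIA documentation linked above, not on metal-stack specifics), enabling time-slicing roughly amounts to providing a device-plugin configuration like the following and referencing it in the operator's `ClusterPolicy` under `devicePlugin.config`; the ConfigMap name and replica count are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # expose the single physical GPU as 4 shareable slices
```

With such a configuration active, the node advertises `nvidia.com/gpu: 4` instead of `1`, so up to four `Pods` can share the physical GPU; consult the NVIDIA links above for the authoritative procedure.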

With this, happy AI processing.
20 changes: 18 additions & 2 deletions docs/src/overview/hardware.md
@@ -6,19 +6,33 @@ We came up with a repository called [go-hal](https://github.com/metal-stack/go-h

## Servers

At the moment we support the following server types:
The following server types are officially supported and verified by the metal-stack project:

| Vendor | Series | Model | Board Type | Status |
|------------|-------------|------------------|:---------------|:-------|
| Supermicro | Big-Twin | SYS-2029BT-HNR | X11DPT-B | stable |
| Supermicro | Big-Twin | SYS-220BT-HNTR | X12DPT-B6 | stable |
| Supermicro | SuperServer | SSG-5019D8-TR12P | X11SDV-8C-TP8F | stable |
| Supermicro | SuperServer | 2029UZ-TN20R25M | X11DPU | stable |
| Supermicro | SuperServer | SYS-621C-TN12R | X13DDW-A | stable |
| Supermicro | Microcloud | 5039MD8-H8TNR | X11SDD-8C-F | stable |
| Lenovo | ThinkSystem | SD530 | | alpha |

Other server series and models might work but were not reported to us.

## GPUs

The following GPU types are officially supported and verified by the metal-stack project:

| Vendor | Model | Status |
| ------ | -------- | :----- |
| NVIDIA | RTX 6000 | stable |

Other GPU models might work but were not reported to us. For a detailed description of how to use GPU support in a Kubernetes cluster, please check the [GPU support documentation](gpu-support.md).

## Switches

At the moment we support the following switch types:
The following switch types are officially supported and verified by the metal-stack project:

| Vendor | Series | Model | OS | Status |
| :-------- | :------------ | :--------- | :------------- | :----- |
@@ -27,6 +41,8 @@ At the moment we support the following switch types:
| Edge-Core | AS7700 Series | AS7712-32X | Edgecore SONiC | stable |
| Edge-Core | AS7700 Series | AS7726-32X | Edgecore SONiC | stable |

Other switch series and models might work but were not reported to us.

!!! warning

On our switches we run [SONiC](https://sonicfoundation.dev). The metal-core writes network configuration specifically implemented for this operating system. Please also consider running SONiC on your switches if you do not want to run into any issues with networking.