Commit 1ef4b9b: Update in-place upgrade design
abhinavmpandey08 committed Jan 10, 2024 (1 parent 077a6d2)
Showing 1 changed file with 70 additions and 27 deletions: designs/in-place-upgrades-eksa.md

## Introduction
At present, the only supported upgrade strategy in EKS-A is rolling update. However, for certain use cases (such as Single-Node Clusters with no spare capacity, or Multi-Node Clusters with VM/OS customizations), upgrading a cluster via a Rolling Update strategy can be either infeasible or costly (requiring new hardware to be added, customizations to be re-applied, etc.).

In-place upgrades aim to solve this problem by allowing users to perform Kubernetes node upgrades without replacing the underlying machines.

In a [previous proposal](in-place-upgrades-capi.md) we defined how a pluggable upgrade strategy makes it possible to implement in-place upgrades with Cluster API. In this doc we'll describe how we will leverage and implement that architecture to offer in-place upgrades in EKS-A.

* External etcd

## Overview of the Solution
**TLDR**: the eks-a controller manager will watch KCP and MD objects for the "in-place-upgrade-needed" annotation and implement 3 new controllers that will orchestrate the upgrade. These controllers will schedule privileged pods on each node to be upgraded, which will execute the upgrade logic as a sequence of containers.

### High level view
Following the CAPI external upgrade strategy idea, we can start with the following diagram.

These Hooks will only be responsible for accepting/rejecting the upgrade request (by looking at the computed difference between current and new machine spec) and creating the corresponding CRDs to "trigger" a CP/workers in-place upgrade.

In the first iteration of in-place upgrades, we won't rely on runtime extensions but rather use annotations to trigger in-place upgrades as described in the section below. We will migrate to the runtime extensions path once it's implemented in CAPI upstream.

### Triggering in-place upgrade
The EKS-A controller will watch KCP and MD objects for the "in-place-upgrade-needed" annotation. If it sees the annotation, it creates a corresponding `ControlPlaneUpgrade` or `MachineDeploymentUpgrade` object, which kicks off the in-place upgrade.
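
For illustration, the watch-side filtering could be a small controller-runtime predicate like the sketch below; the exact annotation key and helper name are assumptions, not the final implementation.

```go
// Sketch only: event filter for the KCP/MD watches, using
// sigs.k8s.io/controller-runtime/pkg/predicate and pkg/client.
// The annotation key below is illustrative; the real key may be prefixed.
const inPlaceUpgradeNeededAnnotation = "in-place-upgrade-needed"

// inPlaceUpgradeNeeded only lets through objects that carry the annotation,
// so the controller reconciles KCP/MD objects only when an in-place upgrade
// has been requested.
func inPlaceUpgradeNeeded() predicate.Funcs {
    return predicate.NewPredicateFuncs(func(obj client.Object) bool {
        _, ok := obj.GetAnnotations()[inPlaceUpgradeNeededAnnotation]
        return ok
    })
}
```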

### Upgrading Control Planes
We will have a `ControlPlaneUpgrade` CRD and implement a controller to reconcile it. This controller will be responsible for orchestrating the upgrade of the different CP nodes: controlling the node sequence, defining the upgrade steps required for each node, and updating the CAPI objects (`Machine`, `KubeadmConfig`, etc.) after each node is upgraded.
- The controller will upgrade CP nodes one by one.
- The controller will create `NodeUpgrade` objects which will go and upgrade the nodes.

This `ControlPlaneUpgrade` should contain information about the new component versions that will be installed in the nodes and a status that allows tracking the progress of the upgrade. Example:

```go
type ControlPlaneUpgradeSpec struct {
    MachinesRequireUpgrade []corev1.ObjectReference `json:"machinesRequireUpgrade"`
    KubernetesVersion      string                   `json:"kubernetesVersion"`
    KubeletVersion         string                   `json:"kubeletVersion"`
    EtcdVersion            *string                  `json:"etcdVersion,omitempty"`
    CoreDNSVersion         *string                  `json:"coreDNSVersion,omitempty"`
    KubeadmClusterConfig   string                   `json:"kubeadmClusterConfig"`
    ControlPlane           corev1.ObjectReference   `json:"controlPlane"`
}

type ControlPlaneUpgradeStatus struct {
    RequireUpgrade       int64                  `json:"requireUpgrade"`
    Upgraded             int64                  `json:"upgraded"`
    Ready                bool                   `json:"ready"`
    MachineUpgradeStatus []MachineUpgradeStatus `json:"machineUpgradeStatus"`
}

type MachineUpgradeStatus struct {
    Name     string `json:"name"`
    Upgraded bool   `json:"upgraded"`
}
```

More thought is required in this area, so it will be addressed in a follow-up.

### Upgrading MachineDeployments
We will have a `MachineDeploymentUpgrade` CRD and implement a controller to reconcile it. This controller will be responsible for orchestrating the upgrade of the worker nodes: controlling the node sequence, defining the upgrade steps required for each node, and updating the CAPI objects (`Machine`, `KubeadmConfig`, etc.) after each node is upgraded.
- The controller will upgrade worker nodes in the same `MachineDeployment` one by one.
- The controller will create `NodeUpgrade` objects which will go and upgrade the nodes.

This `MachineDeploymentUpgrade` should contain a reference to the `MachineDeployment` that needs to be upgraded and a status that allows tracking the progress of the upgrade.

```go
type MachineDeploymentUpgradeSpec struct {
    MachineDeployment corev1.ObjectReference `json:"machineDeployment"`
}

// MachineDeploymentUpgradeStatus defines the observed state of MachineDeploymentUpgrade.
type MachineDeploymentUpgradeStatus struct {
    RequireUpgrade int64          `json:"requireUpgrade,omitempty"`
    Upgraded       int64          `json:"upgraded,omitempty"`
    Ready          bool           `json:"ready,omitempty"`
    MachineState   []MachineState `json:"machineState,omitempty"`
}
```
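
Both the CP and MD controllers upgrade nodes one at a time. As an illustration of that sequencing for the control plane case (the helper name and shape are assumptions, not the final controller code), the next machine to upgrade can be derived from the spec and status above:

```go
// nextMachineToUpgrade returns the name of the next control plane machine that
// still needs an in-place upgrade, or "" once every machine has been upgraded.
// The controller would create a NodeUpgrade for that machine and wait for it to
// complete before moving to the next one.
func nextMachineToUpgrade(spec ControlPlaneUpgradeSpec, status ControlPlaneUpgradeStatus) string {
    upgraded := make(map[string]bool, len(status.MachineUpgradeStatus))
    for _, m := range status.MachineUpgradeStatus {
        upgraded[m.Name] = m.Upgraded
    }
    for _, machine := range spec.MachinesRequireUpgrade {
        if !upgraded[machine.Name] {
            return machine.Name
        }
    }
    return ""
}
```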

#### Upgrading nodes
We will have a `NodeUpgrade` CRD and implement a controller to reconcile it. This controller will be responsible for scheduling a pod on the specified workload cluster node with `initContainers` whose commands differ depending on whether the node is a control plane node or a worker node. It will track their progress and bubble up any error/success to the `NodeUpgrade` status.

```go
type NodeUpgradeSpec struct {
    Machine           corev1.ObjectReference `json:"machine"`
    KubernetesVersion string                 `json:"kubernetesVersion"`
    EtcdVersion       *string                `json:"etcdVersion,omitempty"`
    NodeType          NodeType               `json:"nodeType"`
    // FirstNodeToBeUpgraded signifies that the Node is the first node to be upgraded.
    // This flag is only valid for control plane nodes and ignored for worker nodes.
    // +optional
    FirstNodeToBeUpgraded bool `json:"firstNodeToBeUpgraded,omitempty"`
}

// NodeUpgradeStatus defines the observed state of NodeUpgrade.
type NodeUpgradeStatus struct {
    // +optional
    Conditions []Condition `json:"conditions,omitempty"`
    // +optional
    Completed bool `json:"completed,omitempty"`
    // ObservedGeneration is the latest generation observed by the controller.
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
```

![in-place-container-diagram](images/in-place-eks-a-container.png)

The node upgrade process we need to perform, although different depending on the type of node (CP vs worker), consists roughly of the following steps:
2. Upgrade containerd.
3. Upgrade CNI plugins.
4. Update kubeadm binary and run the `kubeadm upgrade` process.
5. (Optional) Drain the node.
6. Update `kubectl`/`kubelet` binaries and restart the kubelet service.
7. Uncordon the node.

Each of these steps will be executed as an init container in a privileged pod. For the commands that need to run "on the host", we will use `nsenter` to execute them in the host namespace. All these steps will be idempotent, so the full pod can be recreated and execute all the containers from the start, or only a subset of them if required.

Draining and uncordoning the node could run in either the container or the host namespace. However, we will run them in the container namespace to be able to leverage the injected credentials for the `ServiceAccount`. This way we don't depend on having a kubeconfig on the host disk. This not only allows us to easily limit the RBAC permissions that the `kubectl` command will use, but it's especially useful for worker nodes, since these don't have a kubeconfig with enough permissions to perform these actions (CP nodes have an admin kubeconfig).
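
Putting the two previous paragraphs together, below is a rough sketch of the upgrader pod the `NodeUpgrade` controller could create. The image, upgrader binary path, step names, namespace, and `ServiceAccount` are illustrative assumptions, and how the upgrader binary becomes visible to the host namespaces is out of scope of the sketch; host-level steps are wrapped in `nsenter`, while drain/uncordon run in the container namespace.

```go
// buildUpgraderPod sketches the privileged pod scheduled on the node to be
// upgraded. Each upgrade step is an init container, executed in order.
// corev1/metav1 are the usual k8s.io/api/core/v1 and apimachinery meta/v1 packages.
func buildUpgraderPod(nodeName, upgraderImage string) *corev1.Pod {
    privileged := true
    // hostCmd wraps a command in nsenter so it runs in the host namespaces
    // (--target 1 enters the namespaces of the host init process).
    hostCmd := func(args ...string) []string {
        return append([]string{"nsenter", "--target", "1", "--mount", "--uts", "--ipc", "--net", "--pid", "--"}, args...)
    }
    step := func(name string, cmd []string) corev1.Container {
        return corev1.Container{
            Name:            name,
            Image:           upgraderImage,
            Command:         cmd,
            SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
        }
    }
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "node-upgrader-" + nodeName, Namespace: "eksa-system"},
        Spec: corev1.PodSpec{
            NodeName:           nodeName,
            HostPID:            true,
            ServiceAccountName: "node-upgrader", // grants the RBAC needed for drain/uncordon
            RestartPolicy:      corev1.RestartPolicyOnFailure,
            InitContainers: []corev1.Container{
                step("upgrade-containerd", hostCmd("/upgrader", "upgrade-containerd")),
                step("upgrade-cni-plugins", hostCmd("/upgrader", "upgrade-cni-plugins")),
                step("kubeadm-upgrade", hostCmd("/upgrader", "kubeadm-upgrade")),
                // Drain/uncordon run in the container namespace to use the
                // injected ServiceAccount credentials.
                step("drain", []string{"/upgrader", "drain"}),
                step("upgrade-kubelet", hostCmd("/upgrader", "upgrade-kubelet")),
                step("uncordon", []string{"/upgrader", "uncordon"}),
            },
            // Placeholder main container: a pod needs at least one regular
            // container; "done" is a no-op command in the upgrader binary.
            Containers: []corev1.Container{step("done", []string{"/upgrader", "done"})},
        },
    }
}
```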

In order to codify the logic of each step (the ones that require logic, like the kubeadm upgrade), we will build a single Go binary with multiple commands (one per step). The `ControlPlaneUpgrade` and `MachineDeploymentUpgrade` controllers will just reference these commands when building the init containers spec.
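
A minimal sketch of that binary's entry point, assuming plain subcommand dispatch (the real implementation may use a CLI framework and different command names):

```go
package main

import (
    "fmt"
    "os"
)

// steps maps a subcommand name to the logic for one upgrade step. Bodies are
// stubs here; every real step must be idempotent so the pod can be recreated
// and re-run from the start.
var steps = map[string]func() error{
    "upgrade-containerd":  func() error { /* upgrade containerd on the host */ return nil },
    "upgrade-cni-plugins": func() error { /* upgrade the CNI plugin binaries */ return nil },
    "kubeadm-upgrade":     func() error { /* run the kubeadm upgrade flow */ return nil },
    "drain":               func() error { /* drain the node through the API server */ return nil },
    "upgrade-kubelet":     func() error { /* replace kubelet/kubectl and restart kubelet */ return nil },
    "uncordon":            func() error { /* uncordon the node */ return nil },
    "done":                func() error { /* no-op, used by the pod's main container */ return nil },
}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: upgrader <step>")
        os.Exit(1)
    }
    step, ok := steps[os.Args[1]]
    if !ok {
        fmt.Fprintf(os.Stderr, "unknown step %q\n", os.Args[1])
        os.Exit(1)
    }
    if err := step(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}
```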

#### Optimizations
The previous section assumes all cluster topologies need the same steps, for the sake of simplicity. However, this is not always true, and knowing this can lead to further optimizations of the process:
We will build an image containing everything required for all upgrade steps:

This way, the only dependency for air-gapped environments is to have an available container image registry where they can mirror these images (the same dependency we have today). The tradeoff is we need to build one image per eks-a + eks-d combo we support.

We will maintain a mapping inside the cluster (using a `ConfigMap`) to go from eks-d version to upgrader image. This `ConfigMap` will be updated when the management cluster components are updated (when a new Bundle is made available). The information will be included in the Bundles manifest and just extracted and simplified so the in-place upgrade controllers don't depend on the full EKS-A Bundle.
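
As a sketch (the `ConfigMap` name, namespace, and key format are assumptions), resolving the upgrader image for a node could look like this:

```go
// upgraderImageForEKSD resolves the upgrader image to use for a given eks-d
// version from an in-cluster ConfigMap whose data maps eks-d versions to
// image URIs (c is a sigs.k8s.io/controller-runtime client.Client).
func upgraderImageForEKSD(ctx context.Context, c client.Client, eksdVersion string) (string, error) {
    cm := &corev1.ConfigMap{}
    key := client.ObjectKey{Namespace: "eksa-system", Name: "in-place-upgrade-images"}
    if err := c.Get(ctx, key, cm); err != nil {
        return "", err
    }
    image, ok := cm.Data[eksdVersion]
    if !ok {
        return "", fmt.Errorf("no upgrader image registered for eks-d version %q", eksdVersion)
    }
    return image, nil
}
```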

## Customer experience
### API
The EKS-A cluster status should reflect the upgrade process as for any other upgrade, with the message for `ControlPlaneReady` and `WorkersReady` describing the reason why they are not ready. However, users might need more granular insight into the process, both for slow upgrades and for troubleshooting.

The `ControlPlaneUpgrade` and `MachineDeploymentUpgrade` will reflect in their status the number of nodes to upgrade and how many have been upgraded. In addition, they will bubble up errors that happen for any of the node upgrades they control.

In addition, `NodeUpgrade` will reflect in the status any error that occurs during the upgrade, indicating the step at which it failed. It will also reflect the steps that have been completed successfully. If there is an error and the user needs to access the logs, they can just use `kubectl logs` for the failed container. Once the issue is identified and fixed, they can delete the pod and our controller will recreate them, restarting the upgrade process.

The upgrader pod won't contain `sh`, so users won't be able to obtain a shell in it. If interactive debugging is required, they can always use a different image or SSH directly into the node.

