
Roll out new nodes for OSImageURL change in the spec without changing K8s version #8656

Merged: 3 commits merged into aws:main from fix/rollout-osimageurl-upgrade on Sep 4, 2024

Conversation

@rahulbabu95 (Member) commented Aug 23, 2024

Issue #, if available:
On Bare Metal, a user might want to patch the nodes with an updated OS image without upgrading the K8s version. Similarly, a user might want to roll out nodes for any spec change that doesn't involve a K8s version upgrade but still warrants a node rollout, e.g. an API server flags update. Issue: https://github.com/aws/eks-anywhere-internal/issues/2443

Description of changes:

The upgrade flow currently rolls out nodes on the Tinkerbell provider only for a K8s version change or when the Control Plane or Workers are scaled up. This is primarily because node rollouts on Tinkerbell have hardware requirements: if EKS-A misses any of those validations during the upgrade and doesn't set an error message on the corresponding object (KCP/MD), then once the spec is applied CAPT just gets stuck waiting for additional hardware to become available for the rollout. That failure surfaces much later in the stack and is inconvenient to debug. Today we perform a number of hardware validations in the CLI and controller, covering cases like rolling upgrades, modular upgrades, and scaling operations on Control Plane and Worker node groups.

The existing validations are implemented in a place shared by the CLI and controller, using a common interface, ValidatableCluster. As we open up more flags/features for Control Plane/Worker groups, this approach won't scale, because the interface would need more and more customized methods depending on the field of interest. This change introduces a new hardware validation method, ExtraHardwareAvailableAssertionForNodeRollOut, that can be used to validate any node rollout across Control Plane and Workers. As more feature gates are opened up through the spec, further validations can be added for Control Plane or Worker specific changes from the CLI. On the controller side, the only time CAPI rolls out a new node is when the underlying machine template reference (Spec.MachineTemplate.InfrastructureRef) changes on the KCP or MD object; this change adds validation for that condition. A sketch of both ideas follows below.
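As an illustration of the kind of check such an assertion performs, here is a minimal, self-contained Go sketch. The types and names are hypothetical, not the actual eks-anywhere implementation: it counts the free hardware matching each node group's selector and compares that against the number of nodes the group would roll out. In the real code the catalogue, selectors, and counts come from the Tinkerbell hardware catalogue and the cluster spec; here they are hard-coded for brevity.

```go
package main

import "fmt"

// Hardware is a simplified, hypothetical stand-in for a Tinkerbell hardware entry.
type Hardware struct {
	Labels map[string]string
	InUse  bool
}

// matches reports whether the hardware satisfies every label in the selector.
func (h Hardware) matches(selector map[string]string) bool {
	for k, v := range selector {
		if h.Labels[k] != v {
			return false
		}
	}
	return true
}

// assertExtraHardwareForRollout checks that enough free hardware exists for the
// node groups about to be rolled out. requested maps a group name to the number
// of replacement nodes it needs; selectors maps a group name to its hardware selector.
func assertExtraHardwareForRollout(catalogue []Hardware, requested map[string]int, selectors map[string]map[string]string) error {
	for group, needed := range requested {
		free := 0
		for _, hw := range catalogue {
			if !hw.InUse && hw.matches(selectors[group]) {
				free++
			}
		}
		if free < needed {
			return fmt.Errorf("group %q needs %d free hardware for rollout, found %d", group, needed, free)
		}
	}
	return nil
}

func main() {
	catalogue := []Hardware{
		{Labels: map[string]string{"type": "cp"}, InUse: true},
		{Labels: map[string]string{"type": "cp"}},
		{Labels: map[string]string{"type": "worker"}},
	}
	err := assertExtraHardwareForRollout(
		catalogue,
		map[string]int{"control-plane": 1, "md-0": 1},
		map[string]map[string]string{
			"control-plane": {"type": "cp"},
			"md-0":          {"type": "worker"},
		},
	)
	fmt.Println(err) // <nil>: one free cp node and one free worker node are available
}
```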

The Tinkerbell reconciler overrides any generated template during an upgrade via the omitTinkerbellMachineTemplates method, which runs on every reconcile and reuses the existing machine templates fetched from the KCP and MD objects when their version doesn't change. This has caused problems when users update only the TinkerbellTemplateConfig object in the cluster spec and expect nodes to be upgraded. Since this change adds hardware validations in the controller for node rollouts, it is safe to remove that function, and the controller will now fail early instead.
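For context, the controller-side trigger described above effectively boils down to comparing machine template references. A rough Go sketch of that comparison, with illustrative types and names rather than the reconciler's actual code:

```go
package main

import "fmt"

// objectRef is a minimal stand-in for a Kubernetes object reference such as
// KubeadmControlPlane's Spec.MachineTemplate.InfrastructureRef.
type objectRef struct {
	Kind string
	Name string
}

// rolloutPending reports whether applying the newly generated machine template
// would make CAPI roll out new nodes: because machine templates are immutable,
// a content change produces a template with a new name, so the reference changes.
func rolloutPending(current, desired objectRef) bool {
	return current.Kind != desired.Kind || current.Name != desired.Name
}

func main() {
	current := objectRef{Kind: "TinkerbellMachineTemplate", Name: "cp-template-1"}
	desired := objectRef{Kind: "TinkerbellMachineTemplate", Name: "cp-template-2"}

	if rolloutPending(current, desired) {
		// In the real flow this is the point where the extra-hardware validation
		// runs before the spec is applied, so CAPT never ends up waiting on
		// hardware that does not exist.
		fmt.Println("machine template changed: validate spare hardware before applying")
	}
}
```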

Testing (if applicable):
Tested upgrades for control plane and worker node groups separately using both CLI and controller workflows.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Aug 23, 2024
@rahulbabu95 force-pushed the fix/rollout-osimageurl-upgrade branch from 9e8956a to cede053 on August 23, 2024 21:57
Signed-off-by: Rahul Ganesh <[email protected]>

codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.

Project coverage is 73.56%. Comparing base (a4e9dec) to head (e256558).
Report is 25 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/providers/tinkerbell/reconciler/reconciler.go | 64.44% | 10 Missing and 6 partials ⚠️
pkg/providers/tinkerbell/upgrade.go | 77.14% | 4 Missing and 4 partials ⚠️
pkg/providers/tinkerbell/assert.go | 90.90% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8656      +/-   ##
==========================================
+ Coverage   73.53%   73.56%   +0.03%     
==========================================
  Files         578      578              
  Lines       36557    36612      +55     
==========================================
+ Hits        26882    26934      +52     
+ Misses       7955     7952       -3     
- Partials     1720     1726       +6     


Signed-off-by: Rahul Ganesh <[email protected]>
@eks-distro-bot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label on Aug 24, 2024
@Cajga (Contributor) commented Aug 30, 2024

@rahulbabu95, this affects us at the moment. We want to make an important change to the config of the nodes. Is there any way I can force a new rollout other than waiting for the next Kubernetes version (we are on 1.30)?

@vivek-koppuru (Member):

/cherrypick release-0.20

@eks-distro-pr-bot (Contributor):

@vivek-koppuru: once the present PR merges, I will cherry-pick it on top of release-0.20 in a new PR and assign it to you.

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pokearu (Member) left a comment:

/lgtm

@rahulbabu95 (Member, Author):

/approve

@eks-distro-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rahulbabu95

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rahulbabu95 merged commit 36abaa6 into aws:main on Sep 4, 2024
10 of 12 checks passed
@eks-distro-pr-bot (Contributor):

@vivek-koppuru: new pull request created: #8707

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Cajga (Contributor) commented Sep 5, 2024

@rahulbabu95, just to report here that I successfully managed to roll out new images by pausing the eks-a cluster controller and then using clusterctl alpha rollout restart on the kubeadmcontrolplane and the machinedeployment.

Thanks for the hint!

@rahulbabu95 (Member, Author):

@Cajga great, we also have a v0.20.5 release out that includes this fix. You will not have to roll out nodes manually going forward!

@Cajga (Contributor) commented Sep 10, 2024

@rahulbabu95 great news. Thanks for the support.
