
Roll out new nodes for OSImageURL change in the spec without changing K8s version #8656

Merged: 3 commits merged into aws:main from fix/rollout-osimageurl-upgrade on Sep 4, 2024

Conversation

@rahulbabu95 (Member) commented Aug 23, 2024

Issue #, if available:
On Bare Metal, a user might want to patch the nodes with an updated OS image without upgrading the K8s version. Similarly, a user might want to roll out nodes for any spec change that doesn't involve a K8s version upgrade but still warrants a node rollout, e.g. an API server flags update. Issue: https://github.com/aws/eks-anywhere-internal/issues/2443

Description of changes:

The upgrade flow currently rolls out nodes on the Tinkerbell provider only for a K8s version change or when the Control Plane or Workers are scaled up. This is primarily because node rollouts on Tinkerbell have hardware requirements: if EKS-A misses any of those validations during the upgrade and doesn't set an error message on the corresponding object (KCP/MD), then once the spec is applied CAPT just gets stuck waiting for additional hardware to become available for the rollout. That failure surfaces much later in the stack and is inconvenient to debug. Today we perform a number of hardware validations in the CLI and controller, covering cases like rolling upgrades, modular upgrades, and scaling operations on Control Plane and Worker node groups.

The existing validations are implemented in a place shared by the CLI and controller, using a common interface, ValidatableCluster. As we open up more flags/features for Control Plane/Worker groups, this approach won't scale, because the interface would need more and more customized methods depending on the field of interest. This change introduces a new hardware validation method, ExtraHardwareAvailableAssertionForNodeRollOut, that can be used to validate any node rollout across Control Plane and Workers. As more feature gates are opened up through the spec, further validations can be added for Control Plane or Worker specific changes from the CLI. On the controller side, the only time CAPI rolls out a new node is when the underlying machine template reference (Spec.MachineTemplate.InfrastructureRef) changes on the KCP or MD object; this change adds validation for that condition. A sketch of both ideas follows below.
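As an illustration of the kind of check such an assertion performs, here is a minimal, self-contained Go sketch. The types and names are hypothetical, not the actual eks-anywhere implementation: it counts the free hardware matching each node group's selector and compares that against the number of nodes the group would roll out. In the real code the catalogue, selectors, and counts come from the Tinkerbell hardware catalogue and the cluster spec; here they are hard-coded for brevity.

```go
package main

import "fmt"

// Hardware is a simplified, hypothetical stand-in for a Tinkerbell hardware entry.
type Hardware struct {
	Labels map[string]string
	InUse  bool
}

// matches reports whether the hardware satisfies every label in the selector.
func (h Hardware) matches(selector map[string]string) bool {
	for k, v := range selector {
		if h.Labels[k] != v {
			return false
		}
	}
	return true
}

// assertExtraHardwareForRollout checks that enough free hardware exists for the
// node groups about to be rolled out. requested maps a group name to the number
// of replacement nodes it needs; selectors maps a group name to its hardware selector.
func assertExtraHardwareForRollout(catalogue []Hardware, requested map[string]int, selectors map[string]map[string]string) error {
	for group, needed := range requested {
		free := 0
		for _, hw := range catalogue {
			if !hw.InUse && hw.matches(selectors[group]) {
				free++
			}
		}
		if free < needed {
			return fmt.Errorf("group %q needs %d free hardware for rollout, found %d", group, needed, free)
		}
	}
	return nil
}

func main() {
	catalogue := []Hardware{
		{Labels: map[string]string{"type": "cp"}, InUse: true},
		{Labels: map[string]string{"type": "cp"}},
		{Labels: map[string]string{"type": "worker"}},
	}
	err := assertExtraHardwareForRollout(
		catalogue,
		map[string]int{"control-plane": 1, "md-0": 1},
		map[string]map[string]string{
			"control-plane": {"type": "cp"},
			"md-0":          {"type": "worker"},
		},
	)
	fmt.Println(err) // <nil>: one free cp node and one free worker node are available
}
```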

The Tinkerbell reconciler overrides any generated template during an upgrade via the omitTinkerbellMachineTemplates method, which runs on every reconcile and reuses the existing machine templates fetched from the KCP and MD objects when their version doesn't change. This has caused problems when users update only the TinkerbellTemplateConfig object in the cluster spec and expect nodes to be upgraded. Since this change adds hardware validations in the controller for node rollouts, it is safe to remove that function, and the controller will now fail early instead.
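For context, the controller-side trigger described above effectively boils down to comparing machine template references. A rough Go sketch of that comparison, with illustrative types and names rather than the reconciler's actual code:

```go
package main

import "fmt"

// objectRef is a minimal stand-in for a Kubernetes object reference such as
// KubeadmControlPlane's Spec.MachineTemplate.InfrastructureRef.
type objectRef struct {
	Kind string
	Name string
}

// rolloutPending reports whether applying the newly generated machine template
// would make CAPI roll out new nodes: because machine templates are immutable,
// a content change produces a template with a new name, so the reference changes.
func rolloutPending(current, desired objectRef) bool {
	return current.Kind != desired.Kind || current.Name != desired.Name
}

func main() {
	current := objectRef{Kind: "TinkerbellMachineTemplate", Name: "cp-template-1"}
	desired := objectRef{Kind: "TinkerbellMachineTemplate", Name: "cp-template-2"}

	if rolloutPending(current, desired) {
		// In the real flow this is the point where the extra-hardware validation
		// runs before the spec is applied, so CAPT never ends up waiting on
		// hardware that does not exist.
		fmt.Println("machine template changed: validate spare hardware before applying")
	}
}
```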

Testing (if applicable):
Tested upgrades for control plane and worker node groups separately using both CLI and controller workflows.

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Aug 23, 2024
@rahulbabu95 force-pushed the fix/rollout-osimageurl-upgrade branch from 9e8956a to cede053 on August 23, 2024 21:57
Signed-off-by: Rahul Ganesh <[email protected]>

codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 73.11828% with 25 lines in your changes missing coverage. Please review.

Project coverage is 73.56%. Comparing base (a4e9dec) to head (e256558).
Report is 25 commits behind head on main.

Files with missing lines | Patch % | Lines
pkg/providers/tinkerbell/reconciler/reconciler.go | 64.44% | 10 Missing and 6 partials ⚠️
pkg/providers/tinkerbell/upgrade.go | 77.14% | 4 Missing and 4 partials ⚠️
pkg/providers/tinkerbell/assert.go | 90.90% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8656      +/-   ##
==========================================
+ Coverage   73.53%   73.56%   +0.03%     
==========================================
  Files         578      578              
  Lines       36557    36612      +55     
==========================================
+ Hits        26882    26934      +52     
+ Misses       7955     7952       -3     
- Partials     1720     1726       +6     


Signed-off-by: Rahul Ganesh <[email protected]>
@eks-distro-bot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files.) and removed the size/L label on Aug 24, 2024
@Cajga (Contributor) commented Aug 30, 2024

@rahulbabu95, this affects us at the moment. We want to make an important change to the config of the nodes. Is there any way I can force a new rollout other than waiting for the next Kubernetes version (we are on 1.30)?

@vivek-koppuru (Member):

/cherrypick release-0.20

@eks-distro-pr-bot (Contributor):

@vivek-koppuru: once the present PR merges, I will cherry-pick it on top of release-0.20 in a new PR and assign it to you.

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pokearu (Member) left a comment:

/lgtm

@rahulbabu95 (Member, Author):

/approve

@eks-distro-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rahulbabu95

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rahulbabu95 merged commit 36abaa6 into aws:main on Sep 4, 2024
10 of 12 checks passed
@eks-distro-pr-bot (Contributor):

@vivek-koppuru: new pull request created: #8707

In response to this:

/cherrypick release-0.20

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Cajga (Contributor) commented Sep 5, 2024

@rahulbabu95, just to report here that I successfully managed to roll out new images by pausing the eks-a cluster controller and then using clusterctl alpha rollout restart on the kubeadmcontrolplane and the machinedeployment.

Thanks for the hint!

@rahulbabu95 (Member, Author):

@Cajga great, we also have a v0.20.5 release out that includes this fix. You will not have to roll out nodes manually going forward!

@Cajga (Contributor) commented Sep 10, 2024

@rahulbabu95 great news. Thanks for the support.
