Rollout new nodes for OSImageURL change on spec without changing K8s version #8656
Conversation
Signed-off-by: Rahul Ganesh <[email protected]>
(force-pushed from 9e8956a to cede053)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8656      +/-   ##
==========================================
+ Coverage   73.53%   73.56%   +0.03%
==========================================
  Files         578      578
  Lines       36557    36612      +55
==========================================
+ Hits        26882    26934      +52
+ Misses       7955     7952       -3
- Partials     1720     1726       +6
```

☔ View full report in Codecov by Sentry.
@rahulbabu95, this affects us at the moment. We want to make an important change to the config of the nodes. Is there any way I can force a new rollout other than waiting for the next Kubernetes version (we are on 1.30)?
/cherrypick release-0.20
@vivek-koppuru: once the present PR merges, I will cherry-pick it on top of release-0.20 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rahulbabu95. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@Cajga right now, we do not support a rollout without changing the K8s version. The only other way I can think of is to make CAPI force a node rollout instead: https://cluster-api.sigs.k8s.io/clusterctl/commands/alpha-rollout.html. You might need to keep the eks-a cluster controller paused during that process: https://release-0-19.anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#:~:text=During%20the%20workload%20cluster%20upgrade%20process%2C%20EKS%20Anywhere%20pauses%20the%20cluster%20controller%20reconciliation%20by%20adding%20the%20paused%20annotation%20anywhere.eks.amazonaws.com/paused%3A%20true%20to%20the%20EKS%20Anywhere%20cluster%2C%20provider%20datacenterconfig%20and%20machineconfig%20resources%2C%20before%20the%20components%20upgrade
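Roughly, that workaround could look like the sketch below. This is an untested outline: it assumes the EKS Anywhere cluster object is named my-cluster in the default namespace and the CAPI objects live in eksa-system; adjust the names to your cluster.

```sh
# Pause EKS Anywhere reconciliation (the linked docs also pause the provider
# datacenterconfig and machineconfig resources with the same annotation).
kubectl annotate clusters.anywhere.eks.amazonaws.com my-cluster \
  anywhere.eks.amazonaws.com/paused=true

# Force CAPI to roll the control plane nodes.
clusterctl alpha rollout restart kubeadmcontrolplane/my-cluster -n eksa-system

# Resume reconciliation once the rollout finishes.
kubectl annotate clusters.anywhere.eks.amazonaws.com my-cluster \
  anywhere.eks.amazonaws.com/paused-
```

For worker node groups, the same clusterctl alpha rollout restart subcommand accepts machinedeployment resources.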
@vivek-koppuru: new pull request created: #8707 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@rahulbabu95, just to report here that I successfully managed to roll out new images by pausing the eks-a cluster controller and then using the suggested clusterctl command. Thanks for the hint!
@Cajga great, we also have a v0.20.5 release out which includes this fix. You will not have to manually roll out nodes going forward!
@rahulbabu95 great news. Thanks for the support.
Issue #, if available:
For Bare Metal, users might want to patch nodes with an updated OSImage without upgrading the K8s version. More generally, users might want to roll out nodes for any config change that doesn't involve a K8s version upgrade but still warrants a node rollout, e.g. an API server flags update. Issue: https://github.com/aws/eks-anywhere-internal/issues/2443
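For instance, a user might change only the OS image URL on the datacenter config and expect new nodes to be rolled out. A hypothetical spec excerpt (all names and values here are placeholders):

```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: my-cluster                # placeholder name
spec:
  tinkerbellIP: "203.0.113.10"    # placeholder value
  # Updating only this field (with kubernetesVersion unchanged on the
  # Cluster object) should now trigger a node rollout.
  osImageURL: "https://example.com/ubuntu-2004-kube-v1-30-patched.gz"
```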
Description of changes:
The upgrade flow only rolls out nodes on the Tinkerbell provider for a K8s version change or when the control plane or workers are scaled up. This is primarily because node rollouts on Tinkerbell have hardware requirements: if EKS-A misses any of those validations during the upgrade and doesn't set an error message on the corresponding object (e.g. KCP/MD), then once the spec is applied, CAPT would just get stuck waiting for additional hardware to become available for the rollout. That failure surfaces much later in the stack and is inconvenient to debug. Currently we perform a number of hardware validations in the CLI and the controller, covering cases like rolling upgrades, modular upgrades, and scaling operations on control plane and worker node groups.
The existing validations are implemented in a common place for the CLI and the controller, achieved through a common interface, ValidatableCluster. As we open up more flags/features for control plane and worker groups, this method of validating won't scale, because the interface would have to implement more customized methods depending on the field of interest. This change introduces a new hardware validation method, ExtraHardwareAvailableAssertionForNodeRollOut, that can be used to validate any node rollout for the control plane and workers. As more feature gates open up through our spec, more validations can be added for control-plane- or worker-specific changes from the CLI. On the controller side, the only time CAPI rolls out a new node is when the underlying machine template reference Spec.MachineTemplate.InfrastructureRef changes on the KCP or MD object. This change adds a validation for that condition, illustrated in the sketch below.
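As a rough illustration (not the actual implementation), the controller-side condition boils down to comparing the machine template reference on the current and desired objects. The helper name here is hypothetical; the types are the upstream CAPI v1beta1 API:

```go
package validations

import (
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

// kcpNeedsNodeRollout is a hypothetical helper sketching the condition
// described above: CAPI rolls out new control plane nodes only when the
// machine template reference on the KubeadmControlPlane changes, so extra
// hardware must be validated whenever current and desired references differ.
func kcpNeedsNodeRollout(current, desired *controlplanev1.KubeadmControlPlane) bool {
	return current.Spec.MachineTemplate.InfrastructureRef.Name !=
		desired.Spec.MachineTemplate.InfrastructureRef.Name
}
```

An analogous check against a MachineDeployment's Spec.Template.Spec.InfrastructureRef would cover worker node groups.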
The Tinkerbell reconciler overrides any generated template during an upgrade via the omitTinkerbellMachineTemplates method, which is called on every reconcile and fetches the existing machine templates from the KCP and MD objects when the version on KCP and MD doesn't change. This has caused problems when users update only the TinkerbellTemplateConfig object in the cluster spec and expect nodes to be upgraded. Since this change adds hardware validations in the controller for a node rollout, it is safe to remove this function, and the controller will fail early instead.

Testing (if applicable):
Tested upgrades for control plane and worker node groups separately using both CLI and controller workflows.
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.