[Enhancement] Consider node conditions apart from node leases to take more informed decisions for scale down #110
Comments
We discussed 2 options: Option 1 and Option 2.
Caveat: If the lease for a node never gets renewed and the node status conditions also do not change, then MCM will never replace this node. One way to alleviate this problem would be to define a timeout until which a node can skip replacement. Once that timeout expires, DWD should remove that annotation, thus allowing MCM to act upon it.
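A minimal sketch of that timeout idea, assuming a hypothetical annotation (the key below is invented for illustration and is not an existing DWD or MCM annotation) whose value records when the node was marked to skip replacement; once the timeout elapses, DWD would remove the annotation so that MCM can act on the node again:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key, invented for this sketch only.
const skipReplacementSinceAnnotation = "example.gardener.cloud/skip-replacement-since"

// skipReplacementExpired reports whether the node's skip-replacement window
// has elapsed, i.e. whether DWD should remove the annotation again.
func skipReplacementExpired(node *corev1.Node, timeout time.Duration, now time.Time) bool {
	raw, ok := node.Annotations[skipReplacementSinceAnnotation]
	if !ok {
		return false
	}
	since, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		// Unparsable value: treat the window as expired so MCM can act.
		return true
	}
	return now.Sub(since) > timeout
}

func main() {
	node := &corev1.Node{}
	node.Annotations = map[string]string{
		skipReplacementSinceAnnotation: time.Now().Add(-2 * time.Hour).Format(time.RFC3339),
	}
	fmt.Println(skipReplacementExpired(node, time.Hour, time.Now())) // true: window elapsed
}
```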
Another issue that exists even today is that if no lease ever gets renewed, which means the number of available nodes does not increase above 40% (considering 60% was the threshold), then KCM is never scaled back up. A manual mitigation exists where you can annotate KCM to work around this. Should we solve this issue?
We have decided not to solve this as we do not see this as a problem. We can monitor how frequently this happens and decide on it later if needed.
How to categorize this issue?
/area control-plane
/kind enhancement
/priority 3
What would you like to be added:
Today, DWD checks whether the percentage of expired node leases is above a configured threshold and, if so, scales down the configured dependent resources (currently KCM, MCM and CA). What it lacks is the ability to distinguish whether the kubelet was unable to renew its lease due to network problems or due to the health of the kubelet or the node itself. If the kubelet or the node is unhealthy, then DWD should not scale down MCM and KCM, and should instead let them collaborate in replacing the unhealthy node.
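To make the current behaviour concrete, here is a minimal sketch of a threshold check of this kind; the function names, the 40-second lease duration and the 60% threshold are illustrative assumptions, not DWD's actual implementation:

```go
package main

import (
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
)

// leaseExpired reports whether a node lease has not been renewed within
// the given lease duration.
func leaseExpired(lease coordinationv1.Lease, leaseDuration time.Duration, now time.Time) bool {
	if lease.Spec.RenewTime == nil {
		return true
	}
	return now.Sub(lease.Spec.RenewTime.Time) > leaseDuration
}

// shouldScaleDown mirrors today's behaviour: if the fraction of expired
// leases exceeds the threshold, the dependents (KCM, MCM, CA) are scaled down.
func shouldScaleDown(leases []coordinationv1.Lease, leaseDuration time.Duration, threshold float64) bool {
	if len(leases) == 0 {
		return false
	}
	expired := 0
	now := time.Now()
	for _, l := range leases {
		if leaseExpired(l, leaseDuration, now) {
			expired++
		}
	}
	return float64(expired)/float64(len(leases)) > threshold
}

func main() {
	// With no leases there is nothing to act on.
	fmt.Println(shouldScaleDown(nil, 40*time.Second, 0.6))
}
```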
An unhealthy Node is determined by looking at node conditions, for example MemoryPressure, DiskPressure or PIDPressure. This list is not comprehensive and should therefore be made configurable.
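As a rough illustration of such a check, the sketch below deduces node health from a configurable set of condition types; the chosen conditions (Ready reported as False, plus MemoryPressure, DiskPressure and PIDPressure) are assumptions for this example, not the definitive list this issue has in mind:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// unhealthyWhenTrue is a (configurable) set of condition types that, when
// True, indicate a problem with the node itself rather than with the network.
// The entries below are examples only.
var unhealthyWhenTrue = map[corev1.NodeConditionType]struct{}{
	corev1.NodeMemoryPressure: {},
	corev1.NodeDiskPressure:   {},
	corev1.NodePIDPressure:    {},
}

// nodeUnhealthy deduces node health from its conditions: Ready reported as
// False by the kubelet, or any configured condition reported as True.
func nodeUnhealthy(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionFalse {
			return true
		}
		if _, watched := unhealthyWhenTrue[cond.Type]; watched && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	node := &corev1.Node{Status: corev1.NodeStatus{Conditions: []corev1.NodeCondition{
		{Type: corev1.NodeDiskPressure, Status: corev1.ConditionTrue},
	}}}
	fmt.Println(nodeUnhealthy(node)) // true: DiskPressure marks the node unhealthy
}
```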
We therefore wish to introduce the following: for all leases that are about to expire, check if the respective Node is present. If it is present, then check the Conditions on the Node object. If it is deduced that the node conditions indicate an unhealthy node, then that node should not be counted in the set of nodes which have expired leases.
Let us take an example to explain this better (a sketch of this counting follows the example):
Assume that 7 nodes have leases that are about to expire, and that the node conditions of 2 of those nodes indicate that they are unhealthy.
What happens today: since 7 nodes have leases that are about to expire, all dependent resources are scaled down (this includes MCM and KCM). This results in the 2 unhealthy nodes not being replaced by MCM.
What we wish to have: the 2 unhealthy nodes should be replaced by MCM, but the remaining 5 should not be replaced.
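A minimal sketch of the proposed counting, reusing the leaseExpired and nodeUnhealthy helpers (and their imports) from the sketches above; all names are illustrative. An expired lease only counts towards the scale-down threshold if there is no corresponding Node or its conditions do not mark it as unhealthy:

```go
// expiredLeaseFraction counts expired leases, but skips nodes whose
// conditions indicate they are unhealthy, leaving those to MCM/KCM.
func expiredLeaseFraction(leases []coordinationv1.Lease, nodes map[string]*corev1.Node, leaseDuration time.Duration) float64 {
	if len(leases) == 0 {
		return 0
	}
	counted := 0
	now := time.Now()
	for _, l := range leases {
		if !leaseExpired(l, leaseDuration, now) {
			continue
		}
		// Node leases share the node's name; look up the corresponding Node.
		if node, ok := nodes[l.Name]; ok && nodeUnhealthy(node) {
			// Unhealthy node: do not count it, so MCM/KCM stay up and replace it.
			continue
		}
		counted++
	}
	return float64(counted) / float64(len(leases))
}
```

With the example above, only 5 of the 7 expiring leases would count towards the threshold, which can keep DWD from scaling down MCM and KCM so that they can replace the 2 unhealthy nodes.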