Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] Consider node conditions apart from node leases to take more informed decisions for scale down #110

Open
unmarshall opened this issue Apr 8, 2024 · 6 comments
Assignees
Labels
area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension priority/1 Priority (lower number equals higher priority)

Comments

@unmarshall
Copy link
Contributor

unmarshall commented Apr 8, 2024

How to categorize this issue?

/area control-plane
/kind enhancement
/priority 3

What would you like to be added:
DWD today checks if the % of expired node leases is above a configured threshold then it will scale down the configured dependent resources (today that is KCM, MCM and CA). What it lacks is an ability to distinguish if the kubelet was unable to renew its lease due to network problems or was its own health or node's health. If the kubelet or the node is unhealthy then DWD should not scale down MCM and KCM and let them collaborate in replacing the unhealthy node.

Unhealthy Node is determined by looking at node conditions. Some of the node conditions are as below:

  • DiskPressure
  • KernelDeadlock
  • ReadOnlyFileSystem
  • KubeletUnavailable (this is set by node-problem-detector - currently this health checker plugin which sets this condition has not been enabled in g/g)

The above list is not comprehensive and therefore this list should be made configurable.

We therefore wish to introduce the following:
For all leases that are about to expire, check if the respective Node is present. If it is present then check the Conditions on the Node object. If it is deduced that the node conditions indicate an unhealthy node then that node should not be counted in the set of nodes which have expired leases.

Let us take an example to explain this better:

  • Consider a total of 10 nodes in a shoot cluster.
  • Threshold number of expired leases: 60% = 6 nodes (If there are 6 or more nodes having leases that are about to expire then DWD should consider scale down of dependent resources).
  • Let us assume that 7 nodes have leases that are about to expire.
  • 2 out of these 7 nodes are unhealthy.

What happens today: Since 7 nodes have leases that are about to expire, all dependent resources are scaled down (this includes MCM and KCM). This results in 2 unhealthy nodes not being replaced by MCM.

What we wish to have: 2 unhealthy nodes should be replaced by MCM but rest 5 should not be replaced.

@unmarshall unmarshall added the kind/enhancement Enhancement, improvement, extension label Apr 8, 2024
@gardener-prow gardener-prow bot added area/control-plane Control plane related priority/3 Priority (lower number equals higher priority) labels Apr 8, 2024
@unmarshall
Copy link
Contributor Author

We discussed 2 options:

Option 1
Do not consider scaling down dependent resources if majority (% can be defined later, but lets assume 80%) of nodes which have their leases close to expiry are also unhealthy (as defined above). To explain that further consider the above example where only 2 nodes are unhealthy where as a total of 7 nodes have their leases expired. In this case if we consider all 7 nodes then the % of nodes having expired leases till 70% > 60%. You will then scale down MCM and KCM. This is fine in this case because at the expense of 2 unhealthy nodes we prevent replacement of remaining 5 nodes. But consider a case where bulk of the nodes having expired leases are unhealthy. In that case the ratios have been reversed. To consider the same example above - lets assume out of 7 nodes 6 of them are unhealthy then at the expense of 1 node which is not really unhealthy you replace 6 nodes which are unhealthy. So the impact is minimised at the same time it ensures that unhealthy nodes are placed by MCM.

Option 2
Only consider scaling down KCM but never MCM. Add an additional annotation on the machine which is not unhealthy and should not be replaced. In this case nodes which are unhealthy still will get replaced by MCM (as MCM will check the node conditions) but the other nodes which only have their leases close to expiry or expired will not be replaced (since it cannot be ascertained with certainty the reason for node lease renewal failure). MCM should check the Machine object and if it finds this annotation present then it should not transition this machine to Unknown state and should preserve its last known state.
If and when the lease is renewed or if the node condition changes marking the node as unhealthy then DWD should remove this annotation letting MCM replace this node.

Caveat: If the lease for a node never gets renewed and the node status conditions also do not change then MCM will never replace this node. One way to alleviate this problem would be to define a timeout till which a node can skip replacement. Once that timeout expires then DWD should remove that annotation thus allowing MCM to act upon it.

@unmarshall
Copy link
Contributor Author

Another issue that exists even today is that if no lease ever gets renewed which means that the number of available nodes do not increase above 40% (considering 60% was the threshold) then KCM is never scaled back. Manual mitigation exists where you can annotate KCM with dependency-watchdog.gardener.cloud/ignore-scaling=true and scale it back up. Additional effort is required to remove this annotation when things are back to normal.

Should we solve this issue?

@rishabh-11 rishabh-11 self-assigned this Apr 12, 2024
@rishabh-11
Copy link
Collaborator

Should we solve this issue?

We have decided not to solve this as we do not see this as a problem. We can monitor how frequently this happens and decide on it later if needed.

@rishabh-11 rishabh-11 added priority/1 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels Apr 16, 2024
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all issues.
This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Mark this issue as rotten with /lifecycle rotten
  • Close this issue with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2024
@unmarshall
Copy link
Contributor Author

/remove-lifecycle stale

@gardener-prow gardener-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all issues.
This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Mark this issue as rotten with /lifecycle rotten
  • Close this issue with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2024
@rishabh-11 rishabh-11 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 19, 2024
@rishabh-11 rishabh-11 assigned aaronfern and unassigned rishabh-11 Dec 5, 2024
@rishabh-11 rishabh-11 removed their assignment Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related kind/enhancement Enhancement, improvement, extension priority/1 Priority (lower number equals higher priority)
Projects
None yet
Development

No branches or pull requests

4 participants