Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prober should not consider Failed or Terminating Machines, Unhealthy Nodes. #124

Merged
merged 19 commits into from
Nov 26, 2024

Conversation

rishabh-11
Copy link
Contributor

@rishabh-11 rishabh-11 commented Jul 1, 2024

What this PR does / why we need it:
This PR does the following:-

  1. Prober will no longer consider Failed or Terminating Machines in candidate lease calculation.

  2. Prober will no longer consider Unhealthy Nodes in candidate lease calculation. The criteria for a node to be Unhealthy is as follows:-
    a.) The Shoot resource has a section to define node Conditions per worker to be checked by MCM for marking a machine Unknown See:- https://github.com/gardener/gardener/blob/98fb871eb6045f69715d9f8b39dd84f9637f1fa0/example/90-shoot.yaml#L138. If not specified DWD will use the default node conditions that MCM uses (See :- https://github.com/gardener/machine-controller-manager/blob/11c2983db70c3c13a7a95d51f9de9896e69d4599/pkg/util/provider/app/options/options.go#L60).
    b.) If any of these node conditions is set to True, then DWD will consider the node Unhealthy and not consider it in candidate lease calculation.

  3. Replace mocks with fake clients for unit tests.

  4. If there is only one candidate lease present in the cluster, DWD probe will not perform any scaling operations.

  5. Enabled Pprof endpoint to allow profiling of DWD.

Which issue(s) this PR fixes:
Fixes #123 #126
Fixes part of #110

Special notes for your reviewer:

Release note:

Replaced mocks with fake clients for unit tests
DWD prober no longer considers `Failed` or `Terminating` Machines in scaling decisions for the dependent resources.
DWD prober no longer considers `Unhealthy` Nodes in scaling decisions for the dependent resources.
DWD prober no longer performs scaling operations for clusters with only one candidate lease.

@rishabh-11 rishabh-11 requested a review from unmarshall July 1, 2024 06:25
@gardener-prow gardener-prow bot added the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Jul 1, 2024
@gardener-prow gardener-prow bot requested a review from ashwani2k July 1, 2024 06:25
@gardener-prow gardener-prow bot added cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 1, 2024
@rishabh-11
Copy link
Contributor Author

/kind enhancement

@gardener-prow gardener-prow bot added kind/enhancement Enhancement, improvement, extension and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jul 1, 2024
@rishabh-11
Copy link
Contributor Author

/test pull-dependency-watchdog-check-vulnerabilities

@gardener-prow gardener-prow bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 17, 2024
@rishabh-11 rishabh-11 removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 18, 2024
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Mark this PR as rotten with /lifecycle rotten
  • Close this PR with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 2, 2024
@gardener-ci-robot
Copy link

The Gardener project currently lacks enough active contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:

  • After 15d of inactivity, lifecycle/stale is applied
  • After 15d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 7d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close

/lifecycle rotten

@gardener-prow gardener-prow bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 17, 2024
@rishabh-11 rishabh-11 removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 18, 2024
@@ -58,30 +60,30 @@ func testDeletePredicateFunc(g *WithT, numWorkers int) bool {
func TestUpdatePredicateFunc(t *testing.T) {
tests := []struct {
title string
oldNumWorkers int
newNumWorkers int
oldWorkerNames []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just write a small function that generates the worker names and retain the old way as that is more flexible and easier to change. Same elsewhere as well.

@@ -150,9 +150,10 @@ func testWeederDedicatedEnvTest(t *testing.T) {
for _, test := range tests {
ctx, cancelFn := context.WithCancel(context.Background())
testEnv, reconciler := setupWeederEnv(ctx, t, test.apiServerFlags)
ns := testutil.CreateTestNamespace(ctx, g, reconciler.Client, strings.ToLower(test.name))
testNs := testutil.GenerateRandomAlphanumericString(g, 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to use this utility function any longer. We should perhaps remove this now.
Use https://pkg.go.dev/k8s.io/apimachinery/pkg/util/rand#String instead.
Change this at all places where this utility function is used.

* Check the role/rolebinding in the shoot for the dependency-watchdog-probe related service account in the `kube-system`
namespace. Add rules for listing leases in the `kube-node-lease` namespace.
* Scale down the DWD prober deployment in the garden namespace (in the shooted-seed) and start a local DWD process by
providing the prober config and the kubeconfig of the shooted seed as command line flags -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to add a new line for commands that are enclosed within bash else the formatting is not going to be correct.

> process. To do that you will have to the following:
> * Create a new `DNSRecord` containing the new shoot Kube API server endpoint. This will create a provider specific
route (e.g. In case of AWS it will create a AWS-Route53 endpoint)
> * To ensure that a call to this endpoint is routed to the Kube API server of the shoot do the following:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all this? Please revisit and change the docs.

// ProbeError is the error type for probe errors. It contains the error code, the cause of the error, and the error message.
// It is used by prober to record its last error and is currently only used for unit tests.
type ProbeError struct {
Code ErrorCode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc strings for all exposed fields


func (c *fakeClient) getRecordedObjectCollectionError(method ClientMethod, namespace string, labelSelector labels.Selector, objGVK schema.GroupVersionKind) error {
for _, errRecord := range c.errorRecords {
if errRecord.method == method && errRecord.resourceNamespace == namespace {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can combine these into a single IF condition and perhaps add some comments as well.

if errRecord.method == method &&
	errRecord.resourceNamespace == namespace &&
	(errRecord.resourceGVK == objGVK || // if it is a direct GVK match
		(labelSelector == nil && errRecord.labels == nil) || // if there is no label selector and no labels in the error record
		labelSelector != nil && labelSelector.Matches(errRecord.labels)) { // if there is a label selector, and it matches the labels in the error record
	return errRecord.err
}

}

func TestAPIServerProbeFailure(t *testing.T) {
testCases := []probeTestCase{
{name: "Unknown error is returned by api server", discoveryError: errFoo},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you seems to have removed this test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not needed. The other cases are sufficient.

}

// GenerateRandomAlphanumericString generates a random alphanumeric string of the given length.
func GenerateRandomAlphanumericString(g *WithT, length int) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is no longer required.

Copy link
Contributor

@unmarshall unmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Nov 26, 2024
Copy link

gardener-prow bot commented Nov 26, 2024

LGTM label has been added.

Git tree hash: e83c81d405913128845165aff0a9e9367cc3f0f8

Copy link

gardener-prow bot commented Nov 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: unmarshall

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
@unmarshall unmarshall merged commit e54d69f into gardener:master Nov 26, 2024
6 of 7 checks passed
@rishabh-11 rishabh-11 deleted the refactor branch November 26, 2024 17:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exclude terminating and Failed machines/nodes from DWD probe calculation of Failed Leases
6 participants