feat: turn healthCheckHttpClient timeout from 500ms to 3s #1321
Conversation
Hi @batleforc. Thanks for your PR. I'm waiting for a devfile member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Set up in a working environment, with the linked PR's work.
@batleforc Thank you for the PR :)
I assume you're submitting this PR with the hopes of getting it into upstream DWO (and Che) rather than just making changes for your own testing, correct?
Rather than change the default timeout as part of your PR, my gut instinct is we should instead expose a configuration option in the DevWorkspaceOperatorConfig that would allow users to customize the healthCheckHttpClient timeout. Then you could configure the timeout to your desired value from there.
If you're okay with reworking your PR to do this, let me know and I can help guide you further.
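For illustration, the kind of knob being suggested might be wired in roughly like the sketch below. This is only a sketch under assumptions: the workspaceConfig struct and its HealthCheckTimeoutMs field are hypothetical stand-ins, not existing fields of the DevWorkspaceOperatorConfig API.

package workspace

import (
	"net/http"
	"time"
)

// Hypothetical stand-in for a DevWorkspaceOperatorConfig section that would
// let admins override the health check client timeout, in milliseconds.
type workspaceConfig struct {
	HealthCheckTimeoutMs *int64
}

// healthCheckTimeout falls back to the current 500ms default when no override is set.
func healthCheckTimeout(cfg *workspaceConfig) time.Duration {
	if cfg == nil || cfg.HealthCheckTimeoutMs == nil {
		return 500 * time.Millisecond
	}
	return time.Duration(*cfg.HealthCheckTimeoutMs) * time.Millisecond
}

// newHealthCheckClient mirrors the client construction from setupHttpClients,
// but takes the timeout from the (hypothetical) config instead of a constant.
func newHealthCheckClient(cfg *workspaceConfig, transport http.RoundTripper) *http.Client {
	return &http.Client{
		Transport: transport,
		Timeout:   healthCheckTimeout(cfg),
	}
}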
controllers/workspace/http.go (outdated)
@@ -70,7 +70,7 @@ func setupHttpClients(k8s client.Client, logger logr.Logger) {
 	}
 	healthCheckHttpClient = &http.Client{
 		Transport: healthCheckTransport,
-		Timeout: 500 * time.Millisecond,
+		Timeout: 3 * 500 * time.Millisecond,
This would actually be 1500 ms (1.5s) instead of 3s.
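For reference, a literal 3-second timeout (what the author apparently intended) would read as below; a minimal sketch reusing the names from the diff above, not the final shape of this PR.

healthCheckHttpClient = &http.Client{
	Transport: healthCheckTransport,
	Timeout:   3 * time.Second, // a flat 3s; 3 * 500 * time.Millisecond evaluates to 1.5s
}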
Without a doubt this came from checking out the wrong stash; the test that I deployed was set to a hard-coded 3s.
/ok-to-test
Hi @AObuchow,
Instead of increasing the timeout, what about returning a reconcile.Result that requeues the reconcile when the health check fails? @batleforc does that work for your use case? We do something similar when waiting for the workspace deployment to be ready:
devworkspace-operator/controllers/workspace/devworkspace_controller.go, lines 485 to 489 (at 0055cb6)
I think that could fix the problem and totally answers my case, @dkwon17, and it would remove the need to change the Che operator source code.
And your approach wouldn't block the operator on this action, which could free up the process for future actions.
@dkwon17, I've tried to set it up, but I'm not sure it's correct. I chose to requeue the reconcile after 1 second (it could be less, but I don't know).
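For context, the requeue-based approach being discussed looks roughly like the sketch below. It is only an illustration under assumptions: requeueOnHealthCheckError is a made-up helper name, and the real change lives inside the controller's server-status handling rather than a standalone function.

package workspace

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueOnHealthCheckError is a hypothetical helper: when the health check
// against the workspace's main URL fails (for example with a timeout), ask
// controller-runtime to run this reconcile again shortly instead of returning
// the error and marking the workspace as failed.
func requeueOnHealthCheckError(healthCheckErr error) (reconcile.Result, error) {
	if healthCheckErr != nil {
		return reconcile.Result{RequeueAfter: 1 * time.Second}, nil
	}
	return reconcile.Result{}, nil
}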
Thank you for the update to your PR.
@batleforc Have you checked whether your latest changes resolve the issue you were encountering?
In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a "Main URL server not ready" message -- are you not seeing that message?
-	return reconcile.Result{}, err
+	if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+		reqLogger.Info("Waiting for DevWorkspace to be ready")
+		reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for DevWorkspace to be ready") // No sure of the conditions.DeploymentReady is the right one
dw.DevWorkspaceReady may be a more appropriate condition (we use it when the health check fails).
Thank you @batleforc !
After this change, this PR looks good to me :)
I've actually experienced the opposite: after handling all of the queued reconciles, a retry doesn't happen after a failed health check.
@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned: devworkspace-operator/controllers/workspace/devworkspace_controller.go, lines 497 to 501 (at 533d1f0)
This PR is to handle the case where there is a timeout when trying to access the health check endpoint.
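To make that distinction concrete, here is a rough, hypothetical illustration (the classifyHealthCheck helper is made up and is not the operator's actual code): a 502 arrives as a response with a bad status code, while a timeout surfaces as an error from the HTTP client before any response exists.

package workspace

import (
	"errors"
	"net/http"
	"net/url"
)

// classifyHealthCheck distinguishes the two failure modes discussed above.
func classifyHealthCheck(resp *http.Response, err error) string {
	if err != nil {
		var urlErr *url.Error
		if errors.As(err, &urlErr) && urlErr.Timeout() {
			return "timeout: endpoint not reachable within the client timeout"
		}
		return "transport error: no response was received"
	}
	if resp.StatusCode/100 != 2 {
		return "bad status: endpoint reachable but not ready (e.g. 502)"
	}
	return "healthy"
}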
@dkwon17 thank you for the confirmation, yes you're right - I believe that's the case I was thinking about.
@batleforc this looks just about good to me to merge 😁 Thank you so much for your contribution.
One last request: Could you please re-organize your commits from this PR?
Here are my suggestions:
- Squash all 3 of your commits into a single commit, as they are all iterations on the change introduced by this PR
- Change the final commit message to feat: queue workspace reconcile if workspace health check encounters an error. In the description, add fix #1325.
The final commit message should resemble something like:
feat: queue workspace reconcile if workspace health check encounters an error
fix #1325
Signed-off-by: Max batleforc <[email protected]>
Sounds great to me :) I hope this resolves your issue. FWIW: This PR is approved and can be merged (after my final request on cleaning up the commit log) :)
Force-pushed from 03e8d12 to 8b49c22
So, I squashed the commits, but I still need to test it on the last environment that hits this problem quite frequently (a Dev Spaces 3.16.1 instance updated recently).
@batleforc Thank you for updating your PR, sounds good to me. Keep us updated when you have a moment :) It's greatly appreciated.
Some last minute comments, but looks great overall @batleforc :)
-	return reconcile.Result{}, err
+	if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+		reqLogger.Info("Waiting for DevWorkspace to be ready")
+		reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for DevWorkspace to be ready") // No sure of the conditions.DeploymentReady is the right one
@dkwon17 @batleforc Any thoughts on whether we should change the DevWorkspace status message & log here to something more specific to the health check timing out, instead of "Waiting for DevWorkspace to be ready"?
Maybe something like:
reqLogger.Info("Waiting for DevWorkspace health check endpoint to become available")
reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for workspace health check to become available")
Force-pushed (feat: queue workspace reconcile if workspace health check encounters an error, fix devfile#1325, Signed-off-by: Max batleforc <[email protected]>) from 8b49c22 to 7df914e
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AObuchow, batleforc, dkwon17, ibuziuk
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
After 5 days of testing, the problem didn't appear again, so it's a fix.
I'm glad to hear this PR seems to resolve your issue 😁 Merging this PR now.
What does this PR do?
It changes the timeout of the healthCheckHttpClient from 500ms to 3s.
What issues does this PR fix or reference?
Linked to eclipse-che/che#23067
Fixes #1325
Is it tested? How?
In progress
PR Checklist
- Comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger:
  - v8-devworkspace-operator-e2e: DevWorkspace e2e test
  - v8-che-happy-path: Happy path for verification integration with Che