feat: turn healthCheckHttpClient timeout from 500ms to 3s #1321
Conversation
Hi @batleforc. Thanks for your PR. I'm waiting for a devfile member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Set up in a working environment, with the linked PR's work.
@batleforc Thank you for the PR :)
I assume you're submitting this PR with the hopes of getting it into upstream DWO (and Che) rather than just making changes for your own testing, correct?
Rather than change the default timeout as part of your PR, my gut instinct is we should instead expose a configuration option in the DevWorkspaceOperatorConfig that would allow users to customize the healthCheckHttpClient timeout. Then you could configure the timeout to your desired value from there.
If you're okay with reworking your PR to do this, let me know and I can help guide you further.
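For illustration, the kind of knob being suggested might be wired in roughly like the sketch below. This is only a sketch under assumptions: the workspaceConfig struct and its HealthCheckTimeoutMs field are hypothetical stand-ins, not existing fields of the DevWorkspaceOperatorConfig API.

package workspace

import (
	"net/http"
	"time"
)

// Hypothetical stand-in for a DevWorkspaceOperatorConfig section that would
// let admins override the health check client timeout, in milliseconds.
type workspaceConfig struct {
	HealthCheckTimeoutMs *int64
}

// healthCheckTimeout falls back to the current 500ms default when no override is set.
func healthCheckTimeout(cfg *workspaceConfig) time.Duration {
	if cfg == nil || cfg.HealthCheckTimeoutMs == nil {
		return 500 * time.Millisecond
	}
	return time.Duration(*cfg.HealthCheckTimeoutMs) * time.Millisecond
}

// newHealthCheckClient mirrors the client construction from setupHttpClients,
// but takes the timeout from the (hypothetical) config instead of a constant.
func newHealthCheckClient(cfg *workspaceConfig, transport http.RoundTripper) *http.Client {
	return &http.Client{
		Transport: transport,
		Timeout:   healthCheckTimeout(cfg),
	}
}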
controllers/workspace/http.go (outdated)
@@ -70,7 +70,7 @@ func setupHttpClients(k8s client.Client, logger logr.Logger) {
 	}
 	healthCheckHttpClient = &http.Client{
 		Transport: healthCheckTransport,
-		Timeout: 500 * time.Millisecond,
+		Timeout: 3 * 500 * time.Millisecond,
This would actually be 1500 ms (1.5s) instead of 3s.
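For reference, a literal 3-second timeout (what the author apparently intended) would read as below; a minimal sketch reusing the names from the diff above, not the final shape of this PR.

healthCheckHttpClient = &http.Client{
	Transport: healthCheckTransport,
	Timeout:   3 * time.Second, // a flat 3s; 3 * 500 * time.Millisecond evaluates to 1.5s
}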
Without a doubt this came from checking out the wrong stash; the test that I deployed was set to a hard-coded 3s.
/ok-to-test
Hi @AObuchow,
Instead of increasing the timeout, what about returning a reconcile.Result that requeues the reconcile when the health check fails? @batleforc does that work for your use case? We do something similar when waiting for the workspace deployment to be ready:
devworkspace-operator/controllers/workspace/devworkspace_controller.go, lines 485 to 489 (at 0055cb6)
I think that could fix the problem and totally answers my case, @dkwon17, and it would remove the need to change the Che operator source code.
And your approach wouldn't block the operator on this action, which could free up the process for future actions.
@dkwon17, I've tried to set it up, but I'm not sure it's correct. I chose to requeue the reconcile after 1 second (it could be less, but I don't know).
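For context, the requeue-based approach being discussed looks roughly like the sketch below. It is only an illustration under assumptions: requeueOnHealthCheckError is a made-up helper name, and the real change lives inside the controller's server-status handling rather than a standalone function.

package workspace

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueOnHealthCheckError is a hypothetical helper: when the health check
// against the workspace's main URL fails (for example with a timeout), ask
// controller-runtime to run this reconcile again shortly instead of returning
// the error and marking the workspace as failed.
func requeueOnHealthCheckError(healthCheckErr error) (reconcile.Result, error) {
	if healthCheckErr != nil {
		return reconcile.Result{RequeueAfter: 1 * time.Second}, nil
	}
	return reconcile.Result{}, nil
}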
Thank you for the update to your PR.
@batleforc Have you checked whether your latest changes resolve the issue you were encountering?
In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a "Main URL server not ready" message -- are you not seeing that message?
-	return reconcile.Result{}, err
+	if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+		reqLogger.Info("Waiting for DevWorkspace to be ready")
+		reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for DevWorkspace to be ready") // No sure of the conditions.DeploymentReady is the right one
dw.DevWorkspaceReady may be a more appropriate condition (we use it when the health check fails).
Thank you @batleforc !
After this change, this PR looks good to me :)
I've actually experienced the opposite: after handling all of the queued reconciles, a retry doesn't happen after a failed health check.
@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned: devworkspace-operator/controllers/workspace/devworkspace_controller.go, lines 497 to 501 (at 533d1f0)
This PR is to handle the case where there is a timeout when trying to access the health check endpoint.
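To make that distinction concrete, here is a rough, hypothetical illustration (the classifyHealthCheck helper is made up and is not the operator's actual code): a 502 arrives as a response with a bad status code, while a timeout surfaces as an error from the HTTP client before any response exists.

package workspace

import (
	"errors"
	"net/http"
	"net/url"
)

// classifyHealthCheck distinguishes the two failure modes discussed above.
func classifyHealthCheck(resp *http.Response, err error) string {
	if err != nil {
		var urlErr *url.Error
		if errors.As(err, &urlErr) && urlErr.Timeout() {
			return "timeout: endpoint not reachable within the client timeout"
		}
		return "transport error: no response was received"
	}
	if resp.StatusCode/100 != 2 {
		return "bad status: endpoint reachable but not ready (e.g. 502)"
	}
	return "healthy"
}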
@dkwon17 thank you for the confirmation, yes you're right - I believe that's the case I was thinking about.
@batleforc this looks just about good to me to merge 😁 Thank you so much for your contribution.
One last request: Could you please re-organize your commits from this PR?
Here are my suggestions:
- Squash all 3 of your commits into a single commit, as they are all iterations on the change introduced by this PR
- Change the final commit message to feat: queue workspace reconcile if workspace health check encounters an error. In the description, add fix #1325.
The final commit message should resemble something like:
feat: queue workspace reconcile if workspace health check encounters an error
fix #1325
Signed-off-by: Max batleforc <[email protected]>
Sounds great to me :) I hope this resolves your issue. FWIW: This PR is approved and can be merged (after my final request on cleaning up the commit log) :)
Force-pushed from 03e8d12 to 8b49c22
So, I squashed the commits, but I still need to test it on the last environment that hits this problem quite frequently (a Dev Spaces 3.16.1 instance updated recently).
@batleforc Thank you for updating your PR, sounds good to me. Keep us updated when you have a moment :) It's greatly appreciated.
Some last minute comments, but looks great overall @batleforc :)
-	return reconcile.Result{}, err
+	if shouldReturn, reconcileResult, reconcileErr := r.checkDWError(workspace, err, "Error checking server status", metrics.ReasonInfrastructureFailure, reqLogger, &reconcileStatus); shouldReturn {
+		reqLogger.Info("Waiting for DevWorkspace to be ready")
+		reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for DevWorkspace to be ready") // No sure of the conditions.DeploymentReady is the right one
@dkwon17 @batleforc Any thoughts on whether we should change the DevWorkspace status message & log here to something more specific to the health check timing out, instead of "Waiting for DevWorkspace to be ready"?
Maybe something like:
reqLogger.Info("Waiting for DevWorkspace health check endpoint to become available")
reconcileStatus.setConditionFalse(dw.DevWorkspaceReady, "Waiting for workspace health check to become available")
Force-pushed (feat: queue workspace reconcile if workspace health check encounters an error, fix devfile#1325, Signed-off-by: Max batleforc <[email protected]>) from 8b49c22 to 7df914e
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AObuchow, batleforc, dkwon17, ibuziuk
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
After 5 days of testing, the problem didn't appear again, so it's a fix.
I'm glad to hear this PR seems to resolve your issue 😁 Merging this PR now.
What does this PR do?
It changes the timeout of the healthCheckHttpClient from 500ms to 3s.
What issues does this PR fix or reference?
Linked to eclipse-che/che#23067
Fixes #1325
Is it tested? How?
In progress
PR Checklist
- Comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger:
  - v8-devworkspace-operator-e2e: DevWorkspace e2e test
  - v8-che-happy-path: Happy path for verification integration with Che