enhance ensureWork to continue processing clusters after individual failures #6061
Conversation
Codecov Report
Attention: Patch coverage is
@@ Coverage Diff @@
## master #6061 +/- ##
==========================================
- Coverage 48.36% 48.35% -0.02%
==========================================
Files 666 666
Lines 54842 54880 +38
==========================================
+ Hits 26527 26537 +10
- Misses 26595 26617 +22
- Partials 1720 1726 +6
Force-pushed from 90823d3 to b403bed.
Force-pushed from b403bed to 42d1f47.
Not the real-world use case.
That is not an attempt to describe a real use case; rather, it is a quick and convenient reproduction method, intended for testing and validation. The corresponding real use case is:
The effect is that when an abnormal cluster fails to create work, it affects the creation of work in the other, healthy clusters, even though we expect the work in those healthy clusters to be created successfully. So, the patch is merely an attempt to simulate a residual work.
/assign
Has the deployment been successfully deleted on all clusters except member1? If I understand correctly, the member1 cluster won't block the deletion on the other clusters.
Yes, the member1 cluster won't block the deletion on the other clusters, so the deployment has been successfully deleted on all clusters except member1.
OK.
Before recreating the deployment, what will happen to the residual work after member1 is recovered? Will it remain permanently, or will the system clean it up eventually?
The system cleans it up eventually (after about 1000s).
So, the value of this PR is:
Yes.
Force-pushed from 42d1f47 to e1c6cee.
Commit: enhance ensureWork to continue processing clusters after individual failures (Signed-off-by: chaosi-zju <[email protected]>)
Force-pushed from e1c6cee to af610cc.
/lgtm
wait for test report
Latest test result is OK:
$ karmadactl --karmada-context karmada-apiserver get deploy --operation-scope=members
NAME    CLUSTER   READY   UP-TO-DATE   AVAILABLE   AGE   ADOPTION
nginx   member2   1/1     1            1           9s    Y
nginx   member3   1/1     1            1           9s    Y
By running the above test method, after recreating the deployment we can see that it is quickly rolled out to all clusters except member1.
I want to see the error message.
One cluster fails:
E0124 01:30:20.001513 1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: work karmada-es-member1/nginx-687f7fb96f is being deleted.
Two clusters fail:
E0124 01:58:13.499553 1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: [work karmada-es-member1/nginx-687f7fb96f is being deleted, work karmada-es-member2/nginx-687f7fb96f is being deleted].
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
What type of PR is this?
/kind feature
What this PR does / why we need it:
When ensureWork fails for one cluster, it prevents the other clusters from syncing. This PR enhances ensureWork to continue processing clusters after individual failures.
real use case:
- the member1 cluster is disconnected, while member2/member3 are normal
- the deployment is deleted, but the work of the member1 cluster remains due to the finalizer
- the member1 cluster recovered, while the residual work has not yet been cleaned up
cause:
when ensureWork failed on the member1 cluster because the old work still exists and is being deleted, we fast-failed without continuing to try the remaining clusters. The problem persists until the old work in member1 is successfully deleted by the reconciliation process.
benefit:
when recreating the deployment, it can be quickly rolled out to all clusters except member1. However, the deployment to the member1 cluster may still be blocked for up to about 1000s.
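As a rough illustration of the cause and fix described above, here is a minimal Go sketch, not the actual Karmada implementation: syncWorkToCluster and ensureWorkForClusters are made-up names standing in for the real logic. The idea is to record per-cluster errors and aggregate them after all clusters have been attempted, instead of returning on the first failure; the aggregated form also matches the error messages quoted earlier in this conversation.

```go
package main

import (
	"fmt"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// syncWorkToCluster stands in for the real per-cluster work-creation logic.
func syncWorkToCluster(cluster string) error {
	if cluster == "member1" {
		// Simulate the failure from the logs: the old work is still being deleted.
		return fmt.Errorf("work karmada-es-%s/nginx-687f7fb96f is being deleted", cluster)
	}
	return nil
}

// ensureWorkForClusters keeps processing the remaining clusters even when one
// of them fails, then returns all failures as a single aggregated error.
func ensureWorkForClusters(clusters []string) error {
	var errs []error
	for _, cluster := range clusters {
		if err := syncWorkToCluster(cluster); err != nil {
			// Record the error but keep going: healthy clusters still get their work.
			errs = append(errs, err)
		}
	}
	return utilerrors.NewAggregate(errs)
}

func main() {
	clusters := []string{"member1", "member2", "member3"}
	if err := ensureWorkForClusters(clusters); err != nil {
		fmt.Println("Failed to transform resourceBinding(default/nginx-deployment) to works. Error:", err)
	}
}
```

Without this change, the loop body would return the error immediately, so member2/member3 would never be processed in that reconcile pass.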
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
reproduce testing method:
- add a finalizer to the work in the karmada-es-member1 namespace to prevent that work from being deleted (see the hedged kubectl sketch after this list)
- delete the deployment (the work in karmada-es-member1 will remain)
- recreate the deployment; without this patch, no work will be created for any cluster (a more reasonable result is that member1's work creation fails, while the other clusters succeed)
just like:
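To make the first reproduce step concrete, here is a hedged kubectl sketch. The context name karmada-apiserver follows the karmadactl command shown earlier, the Work name is taken from the error logs above (the real name depends on the deployment), and the finalizer string example.com/block-deletion is made up purely for this test.

```bash
# Append a blocking finalizer to the Work in the member1 execution namespace so the
# Work cannot be fully deleted (simulating the residual Work of a broken cluster).
# Assumes the Work already has a metadata.finalizers list.
kubectl --context karmada-apiserver patch work nginx-687f7fb96f -n karmada-es-member1 \
  --type=json \
  -p='[{"op": "add", "path": "/metadata/finalizers/-", "value": "example.com/block-deletion"}]'
```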
Does this PR introduce a user-facing change?: