
enhance ensureWork to continue processing clusters after individual failures #6061

Merged
merged 1 commit
Jan 24, 2025

Conversation

chaosi-zju
Member

@chaosi-zju chaosi-zju commented Jan 17, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

In ensureWork, a failure on one cluster would prevent the remaining clusters from syncing. This PR enhances ensureWork to continue processing clusters after individual failures.

real user case:

  1. the member1 cluster is disconnected, while member2/member3 are normal
  2. delete the deployment from the control plane; it is successfully deleted on all clusters except member1, but the work for the member1 cluster remains due to its finalizer
  3. then the member1 cluster recovers
  4. recreate the deployment with the same name; deployments to all clusters may be blocked for up to 1000s

cause:

When ensureWork failed on the member1 cluster because the old work still exists and is being deleted, we fast-failed without trying the remaining clusters. The problem persists until the old work in member1 is successfully deleted by the reconciliation process.

benefit:

When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

reproduce testing method:

  • create any simple deployment and propagation policy (pp)
  • add a finalizer to the work in the karmada-es-member1 namespace to prevent that work from being deleted
  • delete the deployment (the residual work in karmada-es-member1 will remain)
  • recreate the deployment; then no work is created at all (a more reasonable result is that member1's work creation fails while the other clusters succeed)

just like:

kubectl create deploy nginx --image=nginx
kubectl get deploy nginx -o yaml | karmadactl apply --all-clusters=true -f -

kubectl patch work nginx-687f7fb96f --type='json' -p '[{"op": "add", "path": "/metadata/finalizers/1", "value": "test"}]' -n karmada-es-member1

kubectl delete deploy nginx
kubectl create deploy nginx --image=nginx

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 17, 2025
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 17, 2025
@codecov-commenter

codecov-commenter commented Jan 17, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 44.82759% with 16 lines in your changes missing coverage. Please review.

Project coverage is 48.35%. Comparing base (ce41488) to head (af610cc).
Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
pkg/controllers/binding/common.go 44.82% 15 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6061      +/-   ##
==========================================
- Coverage   48.36%   48.35%   -0.02%     
==========================================
  Files         666      666              
  Lines       54842    54880      +38     
==========================================
+ Hits        26527    26537      +10     
- Misses      26595    26617      +22     
- Partials     1720     1726       +6     
Flag Coverage Δ
unittests 48.35% <44.82%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2025
@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2025
@chaosi-zju
Member Author

@RainbowMango
Member

add a finalizer to the work in karmada-es-member1 namespace to prevent that work from being deleted

Not the real-world use case.
Please tell us why this patch is necessary, and what the effect is without it.

@chaosi-zju
Member Author

chaosi-zju commented Jan 21, 2025

Not the real-world use case.
Please tell us why this patch is necessary, and what the effect is without it.

That is not an attempt to describe a real use case; rather, it is a quick and convenient method for reproduction, intended for testing and validation.

The corresponding real use case is:

  1. the member1 cluster is disconnected, while member2/member3 are normal
  2. delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer
  3. then the member1 cluster recovers
  4. recreate the deployment with the same name; it will then fail to create the work in ensureWork because the old work still exists and is being deleted, see:

     if !runtimeObject.DeletionTimestamp.IsZero() {
         return fmt.Errorf("work %s/%s is being deleted", runtimeObject.GetNamespace(), runtimeObject.GetName())
     }

The effect is that when an abnormal cluster fails to create its work, it blocks work creation in the other healthy clusters, even though we expect the work in the healthy clusters to be created successfully.

So, the finalizer patch is merely an attempt to simulate a residual work in that cluster.

Member

@RainbowMango RainbowMango left a comment


/assign

@RainbowMango
Member

delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer

Has the deployment been successfully deleted on all clusters except member1? If I understand correctly, the member1 cluster won't block the deletion on the other clusters.

@chaosi-zju
Member Author

chaosi-zju commented Jan 23, 2025

delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer

Has the deployment been successfully deleted on all clusters except member1?

Yes, the member1 cluster won't block deletion on the other clusters, so the deployment is successfully deleted on all clusters except member1.

@RainbowMango
Member

OK.

recreate the deployment with the same name; it will then fail to create the work in ensureWork because the old work still exists and is being deleted, see:

Before recreating the deployment, what will happen to the residual work after member1 recovers? Will it remain permanently, or will the system clean it up eventually?

@chaosi-zju
Member Author

Before recreating the deployment, what will happen to the residual work after member1 recovers? Will it remain permanently, or will the system clean it up eventually?

The system cleans it up eventually (after about 1000s).

@RainbowMango
Member

So, the value of this PR is:
When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.
Right?

@chaosi-zju
Member Author

When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.
Right?

yes

Member

@RainbowMango RainbowMango left a comment


/lgtm

wait for test report

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2025
@chaosi-zju
Member Author

The latest test result is OK:

$ karmadactl --karmada-context karmada-apiserver get deploy --operation-scope=members
NAME    CLUSTER   READY   UP-TO-DATE   AVAILABLE   AGE   ADOPTION
nginx   member2   1/1     1            1           9s    Y
nginx   member3   1/1     1            1           9s    Y

By running the above test method, after recreating the deployment we can see that it is quickly rolled out to all clusters except member1.

@RainbowMango
Member

I want to see the error message.

@chaosi-zju
Member Author

I want to see the error message.

One cluster fails:

E0124 01:30:20.001513       1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: work karmada-es-member1/nginx-687f7fb96f is being deleted.

Two clusters fail:

E0124 01:58:13.499553       1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: [work karmada-es-member1/nginx-687f7fb96f is being deleted, work karmada-es-member2/nginx-687f7fb96f is being deleted].

@RainbowMango RainbowMango added this to the v1.13 milestone Jan 24, 2025
Member

@RainbowMango RainbowMango left a comment


/lgtm
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2025
@karmada-bot karmada-bot merged commit d24b2b9 into karmada-io:master Jan 24, 2025
21 checks passed