
enhance ensureWork to continue processing clusters after individual failures #6061

Merged
merged 1 commit
Jan 24, 2025

Conversation

chaosi-zju
Member

@chaosi-zju chaosi-zju commented Jan 17, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

In ensureWork, a failure on one cluster would prevent the remaining clusters from syncing. This PR enhances ensureWork to continue processing clusters after individual failures.

real user case:

  1. the member1 cluster is disconnected, while member2/member3 are normal
  2. delete the deployment from the control plane; it is successfully deleted on all clusters except member1, but the work for the member1 cluster remains due to its finalizer
  3. then the member1 cluster recovers
  4. recreate the deployment with the same name; deployments to all clusters may be blocked for up to 1000s

cause:

When ensureWork failed on the member1 cluster because the old work still exists and is being deleted, we fast-failed without trying the remaining clusters. The problem persists until the old work in member1 is successfully deleted by the reconciliation process.

benefit:

When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

reproduce testing method:

  • create any simple deployment and propagation policy (pp)
  • add a finalizer to the work in the karmada-es-member1 namespace to prevent that work from being deleted
  • delete the deployment (the residual work in karmada-es-member1 will remain)
  • recreate the deployment; then no work is created at all (a more reasonable result is that member1's work creation fails while the other clusters succeed)

just like:

kubectl create deploy nginx --image=nginx
kubectl get deploy nginx -o yaml | karmadactl apply --all-clusters=true -f -

kubectl patch work nginx-687f7fb96f --type='json' -p '[{"op": "add", "path": "/metadata/finalizers/1", "value": "test"}]' -n karmada-es-member1

kubectl delete deploy nginx
kubectl create deploy nginx --image=nginx

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 17, 2025
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 17, 2025
@codecov-commenter

codecov-commenter commented Jan 17, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 44.82759% with 16 lines in your changes missing coverage. Please review.

Project coverage is 48.35%. Comparing base (ce41488) to head (af610cc).
Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
pkg/controllers/binding/common.go 44.82% 15 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6061      +/-   ##
==========================================
- Coverage   48.36%   48.35%   -0.02%     
==========================================
  Files         666      666              
  Lines       54842    54880      +38     
==========================================
+ Hits        26527    26537      +10     
- Misses      26595    26617      +22     
- Partials     1720     1726       +6     
Flag Coverage Δ
unittests 48.35% <44.82%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2025
@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2025
@chaosi-zju
Member Author

@RainbowMango
Member

add a finalizer to the work in karmada-es-member1 namespace to prevent that work from being deleted

Not the real-world use case.
Please tell us why this patch is necessary, and what the effect is without it.

@chaosi-zju
Member Author

chaosi-zju commented Jan 21, 2025

Not the real-world use case.
Please tell us why this patch is necessary, and what the effect is without it.

That is not an attempt to describe a real use case; rather, it is a quick and convenient method for reproduction, intended for testing and validation.

The corresponding real use case is:

  1. the member1 cluster is disconnected, while member2/member3 are normal
  2. delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer
  3. then the member1 cluster recovers
  4. recreate the deployment with the same name; it will then fail to create the work in ensureWork because the old work still exists and is being deleted, see:

     if !runtimeObject.DeletionTimestamp.IsZero() {
         return fmt.Errorf("work %s/%s is being deleted", runtimeObject.GetNamespace(), runtimeObject.GetName())
     }

The effect is that when an abnormal cluster fails to create its work, it blocks work creation in the other healthy clusters, even though we expect the work in the healthy clusters to be created successfully.

So, the finalizer patch is merely an attempt to simulate a residual work in that cluster.

Member

@RainbowMango RainbowMango left a comment


/assign

@RainbowMango
Member

delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer

Has the deployment been successfully deleted on all clusters except member1? If I understand correctly, the member1 cluster won't block the deletion on the other clusters.

@chaosi-zju
Member Author

chaosi-zju commented Jan 23, 2025

delete the deployment from the control plane; due to the inability to delete resources in the member1 cluster, the corresponding work will remain because of the finalizer

Has the deployment been successfully deleted on all clusters except member1?

Yes, the member1 cluster won't block deletion on the other clusters, so the deployment is successfully deleted on all clusters except member1.

@RainbowMango
Member

OK.

recreate the deployment with the same name; it will then fail to create the work in ensureWork because the old work still exists and is being deleted, see:

Before recreating the deployment, what will happen to the residual work after member1 recovers? Will it remain permanently, or will the system clean it up eventually?

@chaosi-zju
Member Author

Before recreating the deployment, what will happen to the residual work after member1 recovers? Will it remain permanently, or will the system clean it up eventually?

The system cleans it up eventually (after about 1000s).

@RainbowMango
Member

So, the value of this PR is:
When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.
Right?

@chaosi-zju
Member Author

When recreating the deployment, it can be quickly rolled out to all clusters except member1. However, deployments to the member1 cluster may still be blocked for up to 1000s.
Right?

yes

Member

@RainbowMango RainbowMango left a comment


/lgtm

wait for test report

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2025
@chaosi-zju
Member Author

The latest test result is OK:

$ karmadactl --karmada-context karmada-apiserver get deploy --operation-scope=members
NAME    CLUSTER   READY   UP-TO-DATE   AVAILABLE   AGE   ADOPTION
nginx   member2   1/1     1            1           9s    Y
nginx   member3   1/1     1            1           9s    Y

By running the above test method, after recreating the deployment we can see that it is quickly rolled out to all clusters except member1.

@RainbowMango
Member

I want to see the error message.

@chaosi-zju
Member Author

I want to see the error message.

One cluster fails:

E0124 01:30:20.001513       1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: work karmada-es-member1/nginx-687f7fb96f is being deleted.

Two clusters fail:

E0124 01:58:13.499553       1 binding_controller.go:131] Failed to transform resourceBinding(default/nginx-deployment) to works. Error: [work karmada-es-member1/nginx-687f7fb96f is being deleted, work karmada-es-member2/nginx-687f7fb96f is being deleted].

@RainbowMango RainbowMango added this to the v1.13 milestone Jan 24, 2025
Member

@RainbowMango RainbowMango left a comment


/lgtm
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2025
@karmada-bot karmada-bot merged commit d24b2b9 into karmada-io:master Jan 24, 2025
21 checks passed