Sync only the target filesystem at the end of mover tasks #1466

Open
wants to merge 2 commits into main

Conversation

evandeaubl

Describe what this PR does
This PR fixes a bug I observed in a cluster where I had many rsync-tls privileged volsync jobs running against Longhorn volumes, and at least one was doing a large transfer. After a short time, the load average on the node with the large transfer destination mover would shoot really high (around 15), and the node started having problems at the kubelet level (pods failing to start with CreateContainerError was the easiest to observe condition)

Digging deeper, I saw that the other destination mover pods on that node (unrelated to the large transfer) were hanging at the end of their transfers, when they should have been exiting and being recycled by the operator.

Comparing the logs to the script showed that the only thing that could have been hanging was the sync command at the end of the script. Execing into the container and running a manual sync hung in the same way, confirming that was the issue. If I instead limited the sync to just the target filesystem using `sync -f /data`, the hang did not occur. Building a custom rsync-tls mover with the sync command limited to that one filesystem resolved the issue.
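
Roughly, the manual check looked like this (the namespace and pod name below are placeholders, not taken from this PR):

```bash
# Reproduce the hang from inside a running destination mover pod;
# <namespace> and <mover-pod> are placeholders.
kubectl exec -n <namespace> <mover-pod> -- sync          # hangs under heavy node I/O
kubectl exec -n <namespace> <mover-pod> -- sync -f /data # returns promptly
```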

My understanding of how sync works in containers is that a bare sync asks the kernel to flush every mounted filesystem, not just what is mounted in the container itself, so combined with an active large transfer it was causing things to bind up.
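
As a minimal sketch of the change (illustrative, not the exact mover script; /data is the data volume mount referenced above):

```bash
#!/bin/bash
# Sketch of the end-of-transfer step in a mover script (simplified).

# ... rsync transfer completes here ...

# Before: a bare `sync` issues the sync(2) syscall, which flushes every
# filesystem the kernel knows about. In a privileged container that means
# the whole node, so it can block behind unrelated heavy I/O.
#sync

# After: `sync -f` issues syncfs(2) instead, flushing only the filesystem
# that contains the given path.
sync -f /data
```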

Is there anything that requires special attention?
I am only running rsync-tls at the moment, but the rclone and rsync-ssh movers also contained sync commands that could exhibit the same problem; I fixed those as well but did not test them.

Related issues:
None

Contributor

openshift-ci bot commented Nov 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: evandeaubl
Once this PR has been reviewed and has the lgtm label, please assign screeley44 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Nov 23, 2024

Hi @evandeaubl. Thanks for your PR.

I'm waiting for a backube member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: Tesshu Flower <[email protected]>
@openshift-ci openshift-ci bot added size/S and removed size/XS labels Nov 26, 2024

sonarcloud bot commented Nov 26, 2024

@tesshuflower
Contributor

@evandeaubl thanks for this great analysis and your PR! I do feel like we've occasionally heard about problems with running sync; hopefully this will address most of them. I agree it makes sense to only run sync against the mounted volume.

I've just pushed a small change with updates to some of the scripts.

@evandeaubl
Author

@tesshuflower Thanks for the script cleanup, and for catching that one sync I missed. It's easy to miss a sync in a project named volsync. 😄
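
(For anyone doing a similar audit, a word-boundary grep sidesteps the project name; the path and file glob are illustrative:)

```bash
# Find bare `sync` calls in the shell scripts; -w requires word boundaries,
# so "volsync" and "rsync" don't match.
grep -rnw --include='*.sh' 'sync' .
```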

Also, I've been testing this change over the past few days on my homelab cluster doing some pretty large transfers under the same conditions as described. The change has been working great, and hasn't caused any regressions that I've observed.


codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.2%. Comparing base (6822ac2) to head (75c4ab3).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1466     +/-   ##
=======================================
- Coverage   66.3%   66.2%   -0.2%     
=======================================
  Files         57      57             
  Lines       7474    7474             
=======================================
- Hits        4959    4950      -9     
- Misses      2230    2236      +6     
- Partials     285     288      +3     

see 1 file with indirect coverage changes
