Sync only the target filesystem at the end of mover tasks #1466
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: evandeaubl. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Hi @evandeaubl. Thanks for your PR. I'm waiting for a backube member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Signed-off-by: Evan Deaubl <[email protected]>
Signed-off-by: Tesshu Flower <[email protected]>
Quality Gate passed
@evandeaubl thanks for this great analysis and your PR! I do feel like we've heard about occasional problems with running sync, so I hope this will address most of them. I agree it makes sense to only run the sync against the mounted volume. I've just pushed a small change with updates to some of the scripts.
@tesshuflower Thanks for the script cleanup, and catching that one. Also, I've been testing this change over the past few days on my homelab cluster, doing some pretty large transfers under the same conditions as described. The change has been working great and hasn't caused any regressions that I've observed.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@ Coverage Diff @@
##            main   #1466     +/-   ##
=======================================
- Coverage   66.3%   66.2%    -0.2%
=======================================
  Files         57      57
  Lines       7474    7474
=======================================
- Hits        4959    4950       -9
- Misses      2230    2236       +6
- Partials     285     288       +3
Describe what this PR does
This PR fixes a bug I observed in a cluster where I had many privileged rsync-tls volsync jobs running against Longhorn volumes, at least one of which was doing a large transfer. After a short time, the load average on the node hosting the large transfer's destination mover would spike very high (around 15), and the node started having problems at the kubelet level (pods failing to start with CreateContainerError was the easiest condition to observe).
Digging deeper, I saw that the other destination mover pods on that node (not related to the large transfer) were hanging at the end of their transfers, when they should have been exiting and being recycled by the operator.
Comparing the logs to the script showed that the only thing that could have been hanging was the `sync` command at the end of the script. Execing into the container and running a manual sync hung in the same way, confirming that was the issue. If I instead limited the sync to just the target filesystem using `sync -f /data`, the hang did not occur. Building a custom rsync-tls mover with the `sync` command limited to that one filesystem resolved the issue.

My understanding of how sync works in containers makes me think the general `sync` was trying to sync much more than just what was in the container itself, and, combined with an active large transfer on the node, this was causing things to bind up.
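For illustration, here is a minimal sketch of the kind of change, assuming a destination mover that writes into `/data` and finishes by flushing caches. The variable name and script layout below are illustrative, not the exact mover script in this repository:

```bash
#!/bin/bash
# Illustrative sketch only -- not the exact VolSync mover script.
# Assumes the destination mover has just finished writing into /data.

DATA_DIR="/data"   # mount point of the target PVC inside the mover container

# ... the rsync transfer into "${DATA_DIR}" happens above this point ...

# Before: a bare `sync` asks the kernel to flush writeback for every
# filesystem visible on the node, so it can block behind heavy I/O on
# unrelated volumes (e.g. another mover doing a large transfer).
#   sync

# After: `sync -f` flushes only the filesystem containing the given path,
# i.e. just the target volume this mover actually wrote to.
sync -f "${DATA_DIR}"
```

The underlying difference is that a plain `sync` maps to the sync(2) system call, which covers all mounted filesystems, while GNU coreutils' `sync -f PATH` maps to syncfs(2) and is scoped to the single filesystem containing PATH.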
Is there anything that requires special attention?

I am only running rsync-tls at the moment, but there were `sync` commands in the rclone and rsync-ssh movers that could exhibit the same problem; I fixed those as well, but did not test them.

Related issues:
None