Sync specific folders only or exclude folders from the sync #959

Open
carukc opened this issue Oct 25, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@carukc

carukc commented Oct 25, 2023

Describe the feature you'd like to have.
volsync appears to replicate only the entire volume. There are situations where only certain folders within the volume need to be replicated. Perhaps there could be a feature that allows specific folders to be specified.

There are also situations where individual nodes may have private (temp) files that are not required by all nodes. This would prevent them from being synchronized or from interfering with another node's files (in situations where everything is stored in a single volume).

For the rsync synchronizations, the exclude parameter might be used to specifically exclude files, folders or file patterns. I'm not sure how this would be done for syncthing, but if possible it would be enough to sync (or exclude) based on folder names.
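
As a rough illustration of the rsync case, here is a minimal sketch of how a list of patterns could be turned into `--exclude` arguments. `--exclude` is a standard rsync option, but the `EXCLUDES` list, the `build_rsync_args` helper and the paths are hypothetical names used only for this example; nothing like this is exposed by volsync today.

```shell
#!/usr/bin/env bash
# Sketch only: EXCLUDES and build_rsync_args are hypothetical names,
# not part of volsync. --exclude is a standard rsync option.
set -euo pipefail

# Folder names or patterns that should not be replicated (example values).
EXCLUDES=("tmp/" "node-private/" "*.cache")

build_rsync_args() {
    # Start from archive mode, then add one --exclude per pattern.
    RSYNC_ARGS=(-a)
    local pattern
    for pattern in "${EXCLUDES[@]}"; do
        RSYNC_ARGS+=("--exclude=${pattern}")
    done
}

build_rsync_args
# Print the command instead of running it, so the sketch is safe to try.
echo rsync "${RSYNC_ARGS[@]}" /source/ /destination/
```

Keeping the patterns in an array means each one is passed to rsync as a single argument, so patterns containing spaces or glob characters are not expanded by the shell first.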

What is the value to the end user? (why is it a priority?)
This will accelerate synchronizations and ensure that space is not wasted on files that do not need to be synchronized.

How will we know we have a good solution? (acceptance criteria)
The destination would match the contents of the initial volume, less any files or folders specifically excluded or not specifically included.

Additional context
No additional context

@carukc carukc added the enhancement New feature or request label Oct 25, 2023
@onedr0p
Contributor

onedr0p commented Nov 14, 2023

It looks like this feature is supported with restic too. Maybe there could be a way to pass custom args to the CLI apps like restic or rclone in the ReplicationSource?

https://restic.readthedocs.io/en/stable/040_backup.html#excluding-files
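
For reference, a minimal sketch of the exclusion options the linked restic docs describe. `--exclude` and `--exclude-file` are existing restic flags; the patterns and the temporary file are made-up examples, and none of this is wired into a ReplicationSource today.

```shell
#!/usr/bin/env bash
# Sketch only: patterns and file location are illustrative assumptions.
set -euo pipefail

# One pattern per line, which is the format --exclude-file reads.
EXCLUDE_FILE="$(mktemp)"
cat > "${EXCLUDE_FILE}" <<'EOF'
lost+found
tmp
*.log
EOF

# Combine a pattern file with a single inline pattern.
RESTIC_CMD=(restic backup --exclude-file "${EXCLUDE_FILE}" --exclude 'cache' .)

# Printed rather than executed, since running it needs a configured repository.
printf '%s\n' "${RESTIC_CMD[*]}"
```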

@erenfro

erenfro commented Dec 9, 2023

A big one as well. ext4 filesystems always make lost+found. This is owned by the user root, and the group root, and is intended to always be present. However, this causes a mover running as a less privileged, non-root user to fail to back up because lost+found exists and is owned by root:root.

This is quite literally a huge show-stopper when it causes backups to outright fail due to this issue.

@tesshuflower
Contributor

> A big one as well. ext4 filesystems always make lost+found. This is owned by the user root, and the group root, and is intended to always be present. However, this causes a mover running as a less privileged, non-root user to fail to back up because lost+found exists and is owned by root:root.
>
> This is quite literally a huge show-stopper when it causes backups to outright fail due to this issue.

@erenfro Could you create a separate issue (bug) for this? Please provide details of the mover you are using.

@erenfro

erenfro commented Dec 17, 2023

> @erenfro Could you create a separate issue (bug) for this? Please provide details of the mover you are using.

I mean, I'm using restic, it's finding lost+found owned as root:root, it has no permission. Not much more to add here, and this is basic understanding of ext4.

@tesshuflower
Contributor

> > @erenfro Could you create a separate issue (bug) for this? Please provide details of the mover you are using.
>
> I mean, I'm using restic, it's finding lost+found owned as root:root, it has no permission. Not much more to add here, and this is basic understanding of ext4.

Created issue: #1033

@svengreb

svengreb commented Nov 23, 2024

Is there any way to help to make this feature possible?
I'm currently facing the problem that some applications (e.g. Home Assistant) expect exactly one "configuration" directory where everything is stored: the SQLite database files, the whole code of "custom integrations" (a kind of plugin, kept in a custom_components directory), images, configuration files and so on. All configuration files are mounted as ConfigMaps and Secrets into this directory to achieve a full GitOps setup, so they are not relevant for the backup with restic. The custom_components directory, as well as all mounted configuration and secret files, should not be included in the backup.

Not having a way to configure which files should actually be included in a snapshot is also sadly a show-stopper for me. My only idea was to implement a custom application/script that runs as a CronJob, manually takes a snapshot of the relevant files, and then point my restic-based ReplicationSource to this temporary volume (which must be cleaned up on the next CronJob run to ensure a clean state). This is more than hacky and error-prone, so I'd really like to avoid it and use the "native" ways that restic already supports.

I quickly checked the mover runner scripts and saw that the parameters for restic are already configured as a shell array…
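
To show what that workaround would involve, here is a minimal sketch of the staging step such a CronJob could run. The `stage_backup` helper, the directory names and the file names are all hypothetical; it only handles top-level entries of the source directory and is not something volsync ships.

```shell
#!/usr/bin/env bash
# Sketch only: stage_backup and all paths are hypothetical examples.
set -euo pipefail

# stage_backup SRC DEST ITEM...
# Recreates DEST from scratch and copies only the listed top-level items from SRC.
stage_backup() {
    local src="$1" dest="$2"
    shift 2
    # Clean state: drop whatever the previous run left behind.
    rm -rf "${dest:?destination required}"
    mkdir -p "${dest}"
    local item
    for item in "$@"; do
        if [[ -e "${src}/${item}" ]]; then
            # -a preserves permissions and timestamps; works for files and directories.
            cp -a "${src}/${item}" "${dest}/"
        fi
    done
}

# Demo with throwaway directories standing in for the real volumes.
DEMO="$(mktemp -d)"
mkdir -p "${DEMO}/config/custom_components" "${DEMO}/config/images"
touch "${DEMO}/config/app.db" "${DEMO}/config/custom_components/plugin.py"
stage_backup "${DEMO}/config" "${DEMO}/staging" app.db images
ls "${DEMO}/staging"
```

The ReplicationSource would then point at the staging volume instead of the application volume, which is exactly the extra moving part that native include/exclude support would remove.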

RESTIC=("restic")
if [[ -n "${CUSTOM_CA}" ]]; then
    echo "Using custom CA."
    RESTIC+=(--cacert "${CUSTOM_CA}")
fi

…and also already include a "hard-coded" exclusion for lost+found directories.

"${RESTIC[@]}" backup --host "${RESTIC_HOST}" --exclude='lost+found' .

Extending the CRD, as @onedr0p already mentioned above, with a way to pass any CLI parameter to restic would make it possible for users to use all features of restic. I can understand that this could allow users to "break their own leg" by passing invalid or mismatching parameters, but a new CRD field like this should be seen (and documented) as a "use at your own risk" feature.

In the end, this limitation currently makes it impossible to back up volumes that cannot simply be split up into multiple volumes or mounted resources, which causes massive disk space problems or the risk of backing up sensitive data that, even if encrypted, should/must not leave the system.
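
To make the idea concrete, here is a minimal sketch of how the runner script could append user-supplied exclusions to the backup command. `RESTIC_EXCLUDES` (one pattern per line) is a hypothetical variable that a new CRD field would have to populate, and `BACKUP_ARGS` and the default values are likewise made up for the example; this is not the actual mover script.

```shell
#!/usr/bin/env bash
# Sketch only: RESTIC_EXCLUDES is a hypothetical variable, not an existing volsync setting.
set -euo pipefail

# Example defaults so the sketch runs on its own.
DEFAULT_EXCLUDES=$'custom_components\n*.tmp'
RESTIC_EXCLUDES="${RESTIC_EXCLUDES:-${DEFAULT_EXCLUDES}}"
RESTIC_HOST="${RESTIC_HOST:-example-host}"

RESTIC=("restic")
# Keep the existing hard-coded exclusion, then add the user-supplied ones.
BACKUP_ARGS=(backup --host "${RESTIC_HOST}" --exclude='lost+found')
while IFS= read -r pattern; do
    # Skip blank lines so they add no arguments.
    if [[ -n "${pattern}" ]]; then
        BACKUP_ARGS+=("--exclude=${pattern}")
    fi
done <<< "${RESTIC_EXCLUDES}"
BACKUP_ARGS+=(.)

# Printed rather than executed, since running it needs a configured repository.
echo "${RESTIC[@]}" "${BACKUP_ARGS[@]}"
```

Because every pattern becomes its own array element prefixed with `--exclude=`, a value in this field could only ever add exclusions and not inject other restic flags, which might be a narrower alternative to fully free-form arguments.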

@tesshuflower
Contributor

@svengreb Thanks for the detailed explanation in your comment.

I think generally we're a bit wary of allowing any parameters as this makes debugging issues a lot harder, and we also have to be careful not to break existing user scenarios. We also always have to be aware the user who is able to create CRs is not necessarily the cluster administrator, so having more control of the exact commands run in the mover job can be beneficial so we don't expose some unwanted behaviour.

Going back to your specific issue however, I think there is still part of it I don't understand.

Are you saying you have something like your data PVC mounted as, say, /data, and then are mounting other ConfigMaps or Secrets as subdirectories under /data?

Say something like:

DataPVC: mounted as /data
ConfigmapA: mounted as /data/custom_config

In this case your data PVC still does not contain /data/custom_config (unless the contents of the ConfigMap are copied into /data at runtime), and when we snapshot and back up the data PVC, it will not contain custom_config.

Maybe there is something about your custom_components dir that still needs to be kept separate? I guess I'm not sure how this is part of your data PVC but not intended for backup. Maybe I just need a clearer picture of the mounts and how they are set up.

5 participants