My trigger schedule doesn't seem to be honored (v0.10.0) #1415
Comments
@samip5 Is the replication completing successfully? I can't tell from the container log. Do you have the full replicationsource (including the status) from your system? It should update the …
It seems to be the case, and it's a somewhat frequently occurring theme, that it gets to a state where it's locked by an older backup job (which is nowhere to be found anymore) and requires manual intervention to resolve. It was stuck for quite a while, and it seems the next sync time is also still wrong?
I'm not sure I understand this part - is the repo getting locked after a failed job? Is the mover pod getting killed while backing up (from OOM or something else)? If the backup doesn't complete due to an error, volsync will keep retrying, which is potentially why you see it keep attempting.
It looks to me like the next sync time is set correctly? In this case the replicationsource is still trying to perform the backup; as you can see, the lastSyncStart time is also set to 2am:
If it completes successfully, then it should set the next sync start time to the next one in the schedule.
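For reference, a rough, hand-written sketch of what a healthy status block might look like after a successful sync (field names as discussed above; the timestamps are invented, so compare against your actual object):

```yaml
# Illustrative only - not taken from this cluster.
status:
  lastSyncStartTime: "2024-06-10T02:00:04Z"   # when the mover job last started
  lastSyncTime: "2024-06-10T02:06:31Z"        # only updated when a sync completes successfully
  lastSyncDuration: "6m27s"
  nextSyncTime: "2024-06-11T02:00:00Z"        # advanced to the next slot in the cron schedule
```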
I really wish I knew how it happens (but it does require me to run unlock), as the pod is nowhere to be found, even if it failed, so I cannot figure it out. It does, however, result in many pods when it fails: volsync keeps on trying and trying, but fails due to the lock, and before we know it there are 5+ pods trying to do it. And as the backup itself is not instant, it takes more than 3 minutes to do... The plot thickens.
VolSync's mover uses a Job, and should only be running 1 pod at a time. You mention unlock - do you have logs from any of the failed pods indicating the lock is an issue? If you need to run an unlock, there is an unlock option now for a restic job you could look at - see the …
I think what @samip5 is saying is that when the job fails, a new pod will take its place, fail, create a new lock, and stay in an errored state; then another new one will take its place, fail, create a new lock, stay in an errored state, and so on. New pods will be created by the job until the restic repo is in a healthy state without any locks. Once there is a successful backup, all the errored pods will be deleted as the job is cleaned up by volsync.

From what I can tell reading here and here, that is the current way the job is set up. It looks like …

FWIW, with the restic mover I don't know if it's even worth trying to start a new pod after the previous one failed - is there any way to tweak this?
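To illustrate the retry behaviour being described (this is not VolSync's actual mover Job manifest, just a generic Kubernetes Job showing the two fields under discussion), a Job configured like this keeps creating replacement pods after each failure until backoffLimit is exhausted:

```yaml
# Generic illustration of the Job settings being discussed; names and image are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-mover-job          # placeholder, not the real VolSync job name
spec:
  backoffLimit: 8                  # up to 8 failed pods before the Job itself is marked Failed
  template:
    spec:
      restartPolicy: Never         # each failure leaves the old pod in Error and creates a new one
      containers:
        - name: restic
          image: example.org/restic-mover:latest   # placeholder image
```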
Yes, that's what I was trying to say, but onedr0p did a better job explaining it.
I do see restartPolicy is Never, but backoffLimit is set to 8. Ideally I would like this to be 0, so it doesn't try again until the next schedule and gives me time to unlock it using the spec.restic.unlock option.
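For anyone following along, a rough sketch of how that unlock option is used, going by the field path mentioned above (the value is a placeholder; check the VolSync docs for your version before relying on this): setting or changing spec.restic.unlock to a new string asks the mover to run a restic unlock before the next sync.

```yaml
# Sketch only: the relevant fragment of a ReplicationSource spec.
spec:
  restic:
    unlock: "unlock-1"   # bump to a new value (e.g. "unlock-2") each time another unlock is needed
```

Applied, for example, with kubectl patch replicationsource <name> --type merge -p '{"spec":{"restic":{"unlock":"unlock-1"}}}' (object name is a placeholder).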
@tesshuflower I wonder if we could make use of the pod failure policy feature and restic's exit codes for a locked repo? e.g. set this on the job...

```yaml
podFailurePolicy:
  rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11]
```

That should mark the job as failed and not be retried until the next volsync schedule, instead of the same job retrying over and over on a locked restic repo.

Edit: created a new issue on this #1429
I think we never passed in the restic …

@samip5 To debug, I'd suggest you run with a manual trigger and look at the pod logs to see if there is any more detail. A restic lock doesn't normally get left behind if restic exits normally, so this sounds like a crash of some sort, or possibly the pod itself getting killed for OOM or some other reason. Another possibility, I suppose, is disk space issues on the repo or the cache.

To debug in more detail, you could try adding the following annotation to the ReplicationSource:
When the next synchronization runs (you can use a manual trigger to run it immediately), the pod will now sit in "debug" mode (it will be idle and not run any backup), and you can exec into it. The pod log has instructions; the mover script will have been copied to /tmp, where you can run it. You could then manually run the script, which will invoke the backup, and see if you get any more details. You can even modify the script in /tmp before running it and add …
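In case it helps, a small sketch of the manual trigger mentioned above (spec.trigger.manual takes an arbitrary string; the value here is a placeholder):

```yaml
# Sketch: swap the cron schedule for a one-off manual trigger while debugging.
spec:
  trigger:
    manual: debug-run-1   # any new string starts another one-off sync
```

Once status.lastManualSync matches the string you set, that sync has completed.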
Describe the bug
I have specified a trigger schedule to back up at 2 AM, but it seems it's not honored?
Steps to reproduce
Set a ReplicationSource with a trigger schedule of 0 2 * * * (a minimal sketch is shown below).
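For illustration, a minimal ReplicationSource along these lines (all names, the repository Secret, and the retention settings are placeholders rather than the reporter's actual manifest):

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: example-backup                    # placeholder
  namespace: default
spec:
  sourcePVC: example-pvc                  # placeholder PVC to back up
  trigger:
    schedule: "0 2 * * *"                 # every day at 02:00
  restic:
    repository: example-restic-secret     # placeholder Secret holding the restic repo config
    copyMethod: Snapshot
    pruneIntervalDays: 7
    retain:
      daily: 7
```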
Expected behavior
I expected it to honor my trigger.
Actual results
It doesn't seem to honor it.
Additional context
I'm using Volume Populator, and my replication source is here:
Logs that show the number of snapshots: https://gist.github.com/samip5/b7a9bdf8345a5957cb659718812ffd6c
It seems that, for whatever reason, it's not following the trigger, as you can clearly see from:
So why would that be?