Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed BackupSession is still marked as "running" #1488

Open
sgielen opened this issue Sep 30, 2022 · 6 comments
Open

Failed BackupSession is still marked as "running" #1488

sgielen opened this issue Sep 30, 2022 · 6 comments

Comments

@sgielen
Copy link

sgielen commented Sep 30, 2022

I have the following in my stash container logs:

I0930 05:51:15.930910       1 commands.go:120] Backing up target data
[golang-sh]$ /bin/restic backup /mnt/data/timemachine --quiet --json --host samba-0 --cache-dir /tmp/restic-cache --cleanup-cache
E0930 05:53:15.949682       1 leaderelection.go:367] Failed to update lock: context deadline exceeded
I0930 05:53:17.302915       1 leaderelection.go:283] failed to renew lease samba/lock-statefulset-samba-backup: timed out waiting for the condition
I0930 05:53:17.509894       1 backupsession.go:398] Lost leadership
I0930 05:53:18.075803       1 backupsession.go:187] Sync/Add/Update for Backup Session backup-1664245947
I0930 05:53:19.934081       1 backupsession.go:214] Skip processing BackupSession samba/backup-1664245947. Reason: BackupSession has been processed already for host "samba-0"
{"message_type":"error","error":{"Op":"lstat","Path":"/mnt/data/timemachine/sjors/[snip]","Err":2},"during":"archival","item":"/mnt/data/timemachine/sjors/[snip]"}
{"message_type":"error","error":{"Op":"lstat","Path":"/mnt/data/timemachine/sjors/[snip]","Err":2},"during":"archival","item":"/mnt/data/timemachine/sjors/[snip]"}
Warning: at least one source file could not be read
W0930 09:48:21.666033       1 backupsession.go:442] Failed to take backup for BackupSession backup-1664505145. Reason: Warning: at least one source file could not be read
I0930 09:48:21.693846       1 status.go:99] Updating post backup status.......

Inside the stash container, only /stash is running, nothing restic.

However, the BackupSession is still Phase: Running:

Status:
  Conditions:
    Last Transition Time:  2022-09-30T02:32:32Z
    Message:               Repository exist in the backend.
    Reason:                BackendRepositoryFound
    Status:                True
    Type:                  BackendRepositoryInitialized
  Phase:                   Running
  Session Deadline:        2022-10-02T02:32:25Z

Could this be a bug?

@hossainemruz
Copy link
Contributor

@hmsayem and @piyush1146115 Please take a look when you are available.

@hmsayem
Copy link
Member

hmsayem commented Sep 30, 2022

Hello @sgielen!
Did the BackupSession get stuck in the Running phase?

@hmsayem
Copy link
Member

hmsayem commented Oct 3, 2022

Stash runs backup as a non-root user. Thus, Stash fails to read the targeted data, If the target data directory is not readable to all the users. Solution is to run the backup process as the same user as the targeted application or run the backup process as the root user.
Please check the troubleshooting guide for more details:
https://stash.run/docs/v2022.09.29/guides/troubleshooting/source-data-read-failed/

@sgielen
Copy link
Author

sgielen commented Oct 3, 2022

Hello @sgielen!
Did the BackupSession get stuck in the Running phase?

Yes, it was in the Running phase and got stuck.

Stash runs backup as a non-root user. Thus, Stash fails to read the targeted data, If the target data directory is not readable to all the users. Solution is to run the backup process as the same user as the targeted application or run the backup process as the root user.

It does not seem to me to be an access issue: the backup runs as root (securityContext runAsUser 0 and runAsGroup 0), has succeeded multiple times before, and only 2 out of the thousands of files could not be read this time. The files are on a remote volume, so it is more likely to me that a network issue caused the backup to fail. A failing backup would then be expected, but not one stuck in Running.

In the meantime, the backup timeout expired so the Running backup was removed, but now backups are failing because the repository is locked (repository is already locked by... for a PID that is not running). I know how to fix this -- I'm mentioning it because perhaps this could give an idea why the backup was stuck in Running -- maybe restic did not exit in the expected way?

@hmsayem
Copy link
Member

hmsayem commented Oct 3, 2022

Restic creates a lock at the beginning of its operation. Since the last backup couldn't get completed and failed due to timeout, Restic coudn't remove the lock.

We need to check the BackupSession that failed due to timeout. Please share the BackupSession with us, if that happens again.

@sgielen
Copy link
Author

sgielen commented Oct 3, 2022

When I created my issue here, restic already wasn't running anymore according to ps aux, I reported this in the original issue. So, at that point, the Backup was still Running (see the partial output of the kubectl describe in the original issue) while restic was already exited.

Unfortunately indeed the timed out BackupSession has been removed from the cluster since then, but I will update this issue if it happens again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants