Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add restore command #244

Merged
merged 31 commits into from
Feb 2, 2024
Merged

Add restore command #244

merged 31 commits into from
Feb 2, 2024

Conversation

seoseonyu
Copy link
Contributor

Context

  • Add the restore command to restore a backup file created using the backup command.
  • Add the usage of the resotre command to the README.md (English and Korean have been written, but Chinese needs additional work)

Choices

  • There is a backup command, but not a restore command, so this is for users who are not familiar with CLI commands.

Test instructions

  1. set the path to the actual save file locally and tested it.
  2. added the command to the docker file and tested whether it runs normally after building.
  3. For safe recovery, we shut down the server, temporarily saved the save file, and created a recovery process in case of an error.

We've done our own error handling and testing for all cases, but if additional testing is needed, we can do it.

Checklist before requesting a review

  • [v] I have performed a self-review of my code
  • [v] I've added documentation about this change to the README.
  • [v] I've not introduced breaking changes.

@seoseonyu
Copy link
Contributor Author

We'll work on a fix for that error.

@seoseonyu
Copy link
Contributor Author

seoseonyu commented Jan 31, 2024

Resolved all Shell - Lint warnings except for functions for error handling.

In order to give you the option to recover your backed up files before proceeding with the recovery by executing a handling function to detect errors if they occur during the recovery, the Error handling functions are required.

The code I'm doing that generates the warning is executed when an execution error occurs via a trap.
The Lint process doesn't seem to detect this.

What do you think about this?

Copy link
Contributor

@Dashboy1998 Dashboy1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be mixing 2x and 4x spacing for indentation. I would recommend using 4x spacing indentation as init.sh, start.sh, dockerfile, and update.sh, use 4x.

Backup.sh is the only one that uses 8x and docker-compose.yml is the only one that uses 3x. Don't think anyone wants to be git blamed to fix docker-compose.yml

scripts/restore.sh Outdated Show resolved Hide resolved
@seoseonyu
Copy link
Contributor Author

Thanks for the suggestion.

We'll take your suggestion into consideration and make changes.

scripts/restore.sh Outdated Show resolved Hide resolved
scripts/restore.sh Outdated Show resolved Hide resolved
@Dashboy1998
Copy link
Contributor

The container stops when su steam -c ./start.sh stops. su steam -c ./start.sh stops when the PalServer.sh stops.

I believe running the restore command leads to a race since the restore must finish during limited window in which the server has shutdown but su steam -c ./start.sh has not.

@seoseonyu
Copy link
Contributor Author

The container stops when su steam -c ./start.sh stops. su steam -c ./start.sh stops when the PalServer.sh stops.

I believe running the restore command leads to a race since the restore must finish during limited window in which the server has shutdown but su steam -c ./start.sh has not.

Hmm... When I ran several tests, the restore went fine.

rcon-cli -c /home/steam/server/rcon.yaml save
rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1"

Are you saying that if this command is executed, the recovery should happen within the container's shutdown latency?

The docker in our test environment uses
--stop-timeout 30
option was enabled.

i will test this separately by removing the --stop-timeout option.

scripts/restore.sh Outdated Show resolved Hide resolved
@Dashboy1998
Copy link
Contributor

The container stops when su steam -c ./start.sh stops. su steam -c ./start.sh stops when the PalServer.sh stops.
I believe running the restore command leads to a race since the restore must finish during limited window in which the server has shutdown but su steam -c ./start.sh has not.

Hmm... When I ran several tests, the restore went fine.

rcon-cli -c /home/steam/server/rcon.yaml save
rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1"

Are you saying that if this command is executed, the recovery should happen within the container's shutdown latency?

The docker in our test environment uses --stop-timeout 30 option was enabled.

i will test this separately by removing the --stop-timeout option.

Yes it must happen within the shutdown period however there's no way to know the size of the shutdown period. --stop-timeout 30 is the limit.

There is a non-zero amount of time between rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1" and when the container stops. In that amount of time you must complete the restore.

Also looking at the script again it doesn't look like it actual waits for the server to stop. rcon-cli exits almost instantly. To prevent this you'd want to add this after the rcon wait $(pidof PalServer-Linux-Test).

@seoseonyu
Copy link
Contributor Author

seoseonyu commented Feb 1, 2024

The container stops when su steam -c ./start.sh stops. su steam -c ./start.sh stops when the PalServer.sh stops.
I believe running the restore command leads to a race since the restore must finish during limited window in which the server has shutdown but su steam -c ./start.sh has not.

Hmm... When I ran several tests, the restore went fine.

rcon-cli -c /home/steam/server/rcon.yaml save
rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1"

Are you saying that if this command is executed, the recovery should happen within the container's shutdown latency?
The docker in our test environment uses --stop-timeout 30 option was enabled.
i will test this separately by removing the --stop-timeout option.

Yes it must happen within the shutdown period however there's no way to know the size of the shutdown period. --stop-timeout 30 is the limit.

There is a non-zero amount of time between rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1" and when the container stops. In that amount of time you must complete the restore.

Also looking at the script again it doesn't look like it actual waits for the server to stop. rcon-cli exits almost instantly. To prevent this you'd want to add this after the rcon wait $(pidof PalServer-Linux-Test).

Thanks for the feedback.

I will do some more testing on this and think of a workaround.

I can't fix it right now because I'm at work. 😭

@Dashboy1998
Copy link
Contributor

The container stops when su steam -c ./start.sh stops. su steam -c ./start.sh stops when the PalServer.sh stops.
I believe running the restore command leads to a race since the restore must finish during limited window in which the server has shutdown but su steam -c ./start.sh has not.

Hmm... When I ran several tests, the restore went fine.

rcon-cli -c /home/steam/server/rcon.yaml save
rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1"

Are you saying that if this command is executed, the recovery should happen within the container's shutdown latency?
The docker in our test environment uses --stop-timeout 30 option was enabled.
i will test this separately by removing the --stop-timeout option.

Yes it must happen within the shutdown period however there's no way to know the size of the shutdown period. --stop-timeout 30 is the limit.
There is a non-zero amount of time between rcon-cli -c /home/steam/server/rcon.yaml "shutdown 1" and when the container stops. In that amount of time you must complete the restore.
Also looking at the script again it doesn't look like it actual waits for the server to stop. rcon-cli exits almost instantly. To prevent this you'd want to add this after the rcon wait $(pidof PalServer-Linux-Test).

Thanks for the feedback.

I will do some more testing on this and think of a workaround.

I can't fix it right now because I'm at work. 😭

Realizing the issue also exists for backups I made this PR #248.

Also wait $(pidof PalServer-Linux-Test) won't work in the update script since PalServer-Linux-Test is not a child of your script. You'd have to do a pgrep and tail.

@seoseonyu
Copy link
Contributor Author

As suggested by @Dashboy1998, I removed the --stop-timeout and --restart unless-stopped options and forced a wait using the sleep command in the middle of the restore process and found that the container was actually terminated before the restore was complete.

As per his suggestion, I added code to init.sh to suspend container termination until the restore is complete, referring to PR #248.

  1. Force the restore script to be long-lived via the sleep command and test if it waits for the container to terminate while the script is in progress.
  2. test if the game is restored to the point in the backup file as normal if the container is waiting for termination.

We ran the above tests and confirmed that the container waits for termination and the game is restored to the point of view of the backup file.

We fixed the creation location of the temporary directory used for restoration to /palworld/backups.

@seoseonyu seoseonyu requested a review from Dashboy1998 February 1, 2024 14:24
@seoseonyu
Copy link
Contributor Author

Fixed for suggestions

README.md Outdated Show resolved Hide resolved
docs/kr/README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
docs/kr/README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@Dashboy1998 Dashboy1998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests

Ran store with: docker exec -it palworld-server restore, note the -it

Created a large file for testing fallocate -l 10G palworld/Pal/Saved/deleteme
Created a backup I used to restore

  1. With rcon=true, restored a backup from backups/. Restored as expected

  2. With rcon=true, restored a backup from backups/subfolder/. Restored as expected

  3. Tried to restore with RCON=false, provided message and exited as expected RCON is not enabled. Please enable RCON to use this feature.

  4. With rcon=true I deleted rcon.yaml to test term_error_handler, it triggered as expected

    *****STARTING PROCESS*****
    *****SHUTDOWN SERVER*****
    cli: config: parse file: read file: open /home/steam/server/rcon.yaml: no such file or directory
    An error occurred during server shutdown.
    
  5. Tried to restore an invalid backup to caught restore_error_handler, it triggered as expected

    *****START RESTORE*****
    Saves the current state before the restore proceeds.
    /palworld/backups/restore-2024-02-02_00-32-50
    Save complete.
    
    gzip: stdin: not in gzip format
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    An error occurred during restore.
    I have a backup before recovery can proceed. Do you want to recovery it? (y/n): y
    Recovery complete.
    Clean up the temporary directory.
    
    *****START RESTORE*****
    Saves the current state before the restore proceeds.
    /palworld/backups/restore-2024-02-02_00-34-40
    Save complete.
    
    gzip: stdin: not in gzip format
    tar: Child returned status 1
    tar: Error is not recoverable: exiting now
    An error occurred during restore.
    I have a backup before recovery can proceed. Do you want to recovery it? (y/n): n
    Clean up the temporary directory.
    
  6. Issued stop to the container during restore and verified it waits until restore is finished

  7. I confirmed the container does wait for the restore process to complete before stopping

  8. Other noteworthly results. The server must be running in order to restore. If you stop the server while in the restore process but before the shutdown command is issued then then term_error_handler is caught.

    *****STARTING PROCESS*****
    *****SHUTDOWN SERVER*****
    cli: execute: auth: rcon: dial tcp 127.0.0.1:25575: connect: connection refused
    An error occurred during server shutdown.
    

Proposed solutions for issues

Proposed solution for 5

Use file and grep to verify it is a gzip file. file "${BACKUP_FILE}" | grep -q 'gzip'

While this won't fix all the restore errors it will validate that we only attempt to restore gzip files.

Mind you file is not installed in the container.

Possible solution for 8 (Just adds more issues)

Use pgrep to see if the server is running before running the rcon command

Having this solution would introduce a new issue where if someone attempts to restore while the server is updating then starts resulting in who knows what.

Conclusion

While I did propose solutions for 5, I think what we have here is great and works as intended. I don't see a huge need to add an additional validation check for the gzip file as the output given by the error was fairly identifiable.

You have put a lot a work into this. It is amazing. I approve these changes.

@seoseonyu
Copy link
Contributor Author

Thank you very much for your help

Copy link
Owner

@thijsvanloef thijsvanloef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seoseonyu I really appreciate the work you've put into this. And thanks @Dashboy1998 for testing it so thoroughly.

I do have 1 single request, when you start the script, please show a disclaimer with the following information:

This script has been designed to help you restore; however, I am not responsible for any data loss. It is recommended that you create a backup beforehand, and in the event of script failure, be prepared to restore it manually.

I do not want to be responsible for people losing their data due to unforeseen circumstances.

@seoseonyu
Copy link
Contributor Author

I added this disclaimer to the process of selecting the backup file to restore and confirming user intentions just before running the command.

The output looks like the picture below.

image

Copy link
Owner

@thijsvanloef thijsvanloef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seoseonyu Amazing work!

@thijsvanloef thijsvanloef merged commit 56c938e into thijsvanloef:main Feb 2, 2024
4 checks passed
@rairulyle
Copy link

I haven't executed this script yet but just wanna ask, will this let me select a backup file or will it only restore the latest backup?

@Dashboy1998
Copy link
Contributor

I haven't executed this script yet but just wanna ask, will this let me select a backup file or will it only restore the latest backup?

It let's you select a backup.

MusclePr pushed a commit to MusclePr/palworld-server-docker that referenced this pull request Jun 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants