Out-of-memory corrupts catalog #4

Open
lyonel opened this issue Nov 22, 2014 · 8 comments

lyonel (Contributor) commented Nov 22, 2014

When Snebu hits an out-of-memory condition (after exhausting all system memory plus swap) while running snebu expire or during a backup, it crashes and leaves a corrupted catalog DB. Subsequent attempts (whether through various snebu commands or sqlite3 directly) often fail with SQLITE_CORRUPT (11) errors.

Recovery with `sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db` is (usually) possible.
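In full, the recovery is roughly the following (a sketch; the integrity check and the final rename are just one way to confirm the damage and swap the recovered copy in):

```sh
# Confirm the corruption, then rebuild the catalog from a SQL dump.
sqlite3 snebu-catalog.db "PRAGMA integrity_check;"
sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db

# Swap the recovered copy in, keeping the corrupt one around just in case.
mv snebu-catalog.db snebu-catalog.db.corrupt
mv recovered-snebu-catalog.db snebu-catalog.db
```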

PS: After a few months of backing up 4 Linux boxes, my catalog DB is now 27GiB (mostly due to the various indices Snebu creates), which means that its DB journal often gets bigger than the available memory+swap (8GiB).

derekp7 (Owner) commented Nov 23, 2014

I had originally put in journal_mode=memory hoping to speed up the expire process, but ended up taking it out in one of the patches in the "cleanup" branch. Can you take a look at that branch and see what its performance characteristics are on your setup?

Also, what is the performance effect of journal_mode=truncate (in your patch) vs. journal_mode=delete (the default when nothing is specified)? If it looks like it should help, or at least shouldn't hurt, let me know. Thanks.
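For reference, these are the journal modes under discussion. Note that the non-WAL modes are per-connection, so snebu has to issue the PRAGMA itself when it opens the catalog; the sqlite3-shell session below just illustrates the statements:

```sh
sqlite3 snebu-catalog.db <<'EOF'
PRAGMA journal_mode;             -- show the current mode (DELETE by default)
PRAGMA journal_mode = TRUNCATE;  -- truncate the journal to zero bytes instead of unlinking it
PRAGMA journal_mode = MEMORY;    -- keep the journal in RAM; fast, but a crash
                                 -- mid-transaction can corrupt the database
EOF
```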

lyonel (Contributor, Author) commented Nov 23, 2014

I've prefixed all calls to snebu with `time` on all my clients and the backup server to get a baseline; tomorrow I will switch to a build from the cleanup branch to measure the performance effects... stay tuned.

w.r.t. TRUNCATE journal mode, it's supposed to be a bit faster than DELETE but I've never measured any significant difference (it may have a sizable effect if the number of transactions is very high, though). It probably also depends greatly on the underlying filesystem.
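(For the record, the timing wrapper is nothing fancy; roughly the following, assuming GNU time. The log path is just an example, and no snebu options are assumed here.)

```sh
# Wrapper that appends wall-clock/CPU times of each snebu run to a log file.
# Arguments are passed straight through to snebu.
snebu_timed() {
    /usr/bin/time -a -o /var/log/snebu-timing.log snebu "$@"
}
```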

derekp7 (Owner) commented Nov 24, 2014

The big performance improvement that should be there is during the newbackup stage -- this is where the client sends a manifest of files to be included in the backup set, and snebu replies with which files are needed (vs. what it already has). Previously, it looked at all the backups for the given client. I've changed it to only look at the past 3 backups, so this part should go much faster. The only potential drawback is if the last 3 backups were partial ones, then it will return a bigger list of needed files (these still get deduplicated on the back end, but the backup set file transfer time will be impacted). In practice, this shouldn't be an issue. [Note, I'm planning on making this a tuneable parameter, stored in a "settings" table in the database, and possibly make it look at the past X backups of the same retention class].
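(To make the "only look at the past 3 backups" idea concrete, the restriction amounts to something like the query below. The table and column names are invented for the sketch and are not snebu's actual schema.)

```sh
sqlite3 snebu-catalog.db <<'EOF'
-- Hypothetical schema: backupsets(backupset_id, hostname, serial) and
-- files(backupset_id, sha1). Limit the "which files do we already have?"
-- lookup to the client's 3 most recent backup sets instead of all of them.
SELECT f.sha1
  FROM files f
  JOIN (SELECT backupset_id
          FROM backupsets
         WHERE hostname = 'machine2'
         ORDER BY serial DESC
         LIMIT 3) AS recent
    ON f.backupset_id = recent.backupset_id;
EOF
```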

Edit: looks like I didn't merge this performance improvement into the cleanup branch. I found a branch where I have a slightly different performance improvement for this, but not the one I described above (i.e., not the one that considers only the most recent X backups when looking for existing files). But my notes say that this still made a big difference. Will push this version through after giving it a quick test.

Edit2: Found a bug in the expire section -- an extra parenthesis in one section and a missing one on several other lines. Just fixed and pushed to the cleanup branch. I'm also running a simulation of 1 year's worth of backups for 20 systems (keeping 10 daily, 6 weekly, and 12 monthly backups for each). Stay tuned, I'll let you know how this test goes and whether it is safe to use the cleanup branch yet.

Edit3: So far the simulated backup of 20 hosts is looking good. With about 240 backup sets in the DB, an expire takes about 80 seconds if there are items to expire, and about 1 second if there is nothing to expire. Backing up a system incrementally takes about 40 seconds. Note, this is all on a system with 16GB of memory and a warm disk cache. But at least the main DB slowness issues seem to be a bit better. Hopefully your tests will show the same.

Next step is to finish formalizing an automated test suite for this, work up a new release, and update the web site.

lyonel (Contributor, Author) commented Nov 25, 2014

It looks OK so far; I've replaced the snebu binary with one built from the cleanup branch. I had to manually create the log table, though (snebu expire was complaining).
I'll post the results after the next full run.

derekp7 (Owner) commented Nov 25, 2014

The log table is created by the initdb function, which gets called from newbackup. The initdb function both initializes a new DB and updates an existing one when new tables have been added. But it doesn't get called by any of the other targets, just newbackup.

What I should probably do is add a checkdb function that gets called regardless of the target; if the DB is out of date, it would exit the program with instructions to run snebu updatedb or something like that. This should work out better in the long run, since some of the upcoming enhancements will require migrating data from one table to another, which will take a chunk of time (specifically, the enhancement I'm working on now, which allows for more than one vault location along with shadow database copies for incremental backups of laptops to a thumb drive).
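(One common SQLite way to implement that kind of gate is a schema version counter. A sketch only, assuming the built-in user_version pragma; snebu doesn't necessarily track its schema this way, and the expected value below is hypothetical.)

```sh
# Script fragment: refuse to run if the catalog schema is older than expected.
# PRAGMA user_version defaults to 0 on a fresh database.
current=$(sqlite3 snebu-catalog.db "PRAGMA user_version;")
expected=2
if [ "$current" -lt "$expected" ]; then
    echo "catalog schema is out of date; run 'snebu updatedb' first" >&2
    exit 1
fi
```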

lyonel (Contributor, Author) commented Nov 26, 2014

The new version indeed shows an improvement in speed (backup times):

machine2:

|      | New version | Old version |
|------|-------------|-------------|
| real | 13m44.315s  | 21m12.750s  |
| user | 3m23.057s   | 3m54.784s   |
| sys  | 0m57.645s   | 1m0.077s    |

machine3:

|      | New version | Old version |
|------|-------------|-------------|
| real | 62m48.616s  | 86m27.384s  |
| user | 15m8.254s   | 5m45.131s   |
| sys  | 6m58.699s   | 2m51.405s   |

If you want, I can send you a URL where you can see more details (including the content of the log table).

derekp7 (Owner) commented Nov 27, 2014

Sure -- a couple of questions. For these backups, were they existing hosts with incremental backups consisting of a handful of changed files, or either new hosts or hosts with a large number of changed files? Secondly, for the sections that were taking the most time, was it the initial section (when the client script is running the "find" command and getting the list of files that need to be backed up), or was the delay at the end, after it hit 100% (or nearly 100%) -- did it sit there for a while? Those are the two sections where it does a lot of DB activity.

I found out that sqlite doesn't handle multiple table joins as well as I hoped, so I started doing selects into temp tables and then joining from there -- that seems to help a bit. If the majority of the backup time was just moving the backup file data over, then I'm not too worried, since that is limited by the speed of tar and the disk/network. But I do want to make sure that the time spent on DB-related activities is minimal.
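(The temp-table trick is generic SQLite, roughly as below; the table and column names are made up for the illustration and are not snebu's schema.)

```sh
sqlite3 snebu-catalog.db <<'EOF'
-- Materialize a narrow intermediate result first, then join against it,
-- rather than asking SQLite to plan one big multi-table join.
CREATE TEMP TABLE candidate_files AS
  SELECT file_id, sha1 FROM files WHERE backupset_id = 42;

SELECT c.sha1, v.location
  FROM candidate_files c
  JOIN vault v ON v.sha1 = c.sha1;
EOF
```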

A "select * from log" should have most of what I need. I may add a few more log entries too, if needed. Thanks.

lyonel (Contributor, Author) commented Nov 27, 2014

These are incremental backups of existing hosts; I've e-mailed you the link to my dashboard (where you can see the log entries).
