Out-of-memory corrupts catalog #4

Open
lyonel opened this issue Nov 22, 2014 · 8 comments

lyonel (Contributor) commented Nov 22, 2014

When Snebu hits an out-of-memory condition (after exhausting all system memory plus swap) while running snebu expire or during a backup, it crashes and leaves a corrupted catalog DB. Subsequent attempts (whether through various snebu commands or sqlite3 directly) often fail with SQLITE_CORRUPT (11) errors.

Recovery with `sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db` is (usually) possible.
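In full, the recovery is roughly the following (a sketch; the integrity check and the final rename are just one way to confirm the damage and swap the recovered copy in):

```sh
# Confirm the corruption, then rebuild the catalog from a SQL dump.
sqlite3 snebu-catalog.db "PRAGMA integrity_check;"
sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db

# Swap the recovered copy in, keeping the corrupt one around just in case.
mv snebu-catalog.db snebu-catalog.db.corrupt
mv recovered-snebu-catalog.db snebu-catalog.db
```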

PS: After a few months of backing up 4 Linux boxes, my catalog DB is now 27GiB (mostly due to the various indices Snebu creates), which means that its DB journal often gets bigger than the available memory+swap (8GiB).

derekp7 (Owner) commented Nov 23, 2014

I had originally put in journal_mode=memory hoping to speed up the expire process, but ended up taking it out in one of the patches in the "cleanup" branch. Can you take a look at that branch and see what its performance characteristics are on your setup?

Also, what is the performance effect of journal_mode=truncate (in your patch) vs. journal_mode=delete (the default when nothing is specified)? If it looks like it should help, or at least shouldn't hurt, let me know. Thanks.
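For reference, these are the journal modes under discussion. Note that the non-WAL modes are per-connection, so snebu has to issue the PRAGMA itself when it opens the catalog; the sqlite3-shell session below just illustrates the statements:

```sh
sqlite3 snebu-catalog.db <<'EOF'
PRAGMA journal_mode;             -- show the current mode (DELETE by default)
PRAGMA journal_mode = TRUNCATE;  -- truncate the journal to zero bytes instead of unlinking it
PRAGMA journal_mode = MEMORY;    -- keep the journal in RAM; fast, but a crash
                                 -- mid-transaction can corrupt the database
EOF
```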

lyonel (Contributor, Author) commented Nov 23, 2014

I've prefixed all calls to snebu with `time` on all my clients and the backup server to get a baseline; tomorrow I will switch to a build from the cleanup branch to measure the performance effects... stay tuned.

w.r.t. TRUNCATE journal mode, it's supposed to be a bit faster than DELETE but I've never measured any significant difference (it may have a sizable effect if the number of transactions is very high, though). It probably also depends greatly on the underlying filesystem.
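(For the record, the timing wrapper is nothing fancy; roughly the following, assuming GNU time. The log path is just an example, and no snebu options are assumed here.)

```sh
# Wrapper that appends wall-clock/CPU times of each snebu run to a log file.
# Arguments are passed straight through to snebu.
snebu_timed() {
    /usr/bin/time -a -o /var/log/snebu-timing.log snebu "$@"
}
```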

derekp7 (Owner) commented Nov 24, 2014

The big performance improvement that should be there is during the newbackup stage -- this is where the client sends a manifest of files to be included in the backup set, and snebu replies with which files are needed (vs. what it already has). Previously, it looked at all the backups for the given client. I've changed it to only look at the past 3 backups, so this part should go much faster. The only potential drawback is if the last 3 backups were partial ones, then it will return a bigger list of needed files (these still get deduplicated on the back end, but the backup set file transfer time will be impacted). In practice, this shouldn't be an issue. [Note, I'm planning on making this a tuneable parameter, stored in a "settings" table in the database, and possibly make it look at the past X backups of the same retention class].
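(To make the "only look at the past 3 backups" idea concrete, the restriction amounts to something like the query below. The table and column names are invented for the sketch and are not snebu's actual schema.)

```sh
sqlite3 snebu-catalog.db <<'EOF'
-- Hypothetical schema: backupsets(backupset_id, hostname, serial) and
-- files(backupset_id, sha1). Limit the "which files do we already have?"
-- lookup to the client's 3 most recent backup sets instead of all of them.
SELECT f.sha1
  FROM files f
  JOIN (SELECT backupset_id
          FROM backupsets
         WHERE hostname = 'machine2'
         ORDER BY serial DESC
         LIMIT 3) AS recent
    ON f.backupset_id = recent.backupset_id;
EOF
```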

Edit: looks like I didn't merge this performance improvement into the cleanup branch. I found a branch where I have a slightly different performance improvement for this, but not the one I described above (i.e., not the one that considers only the most recent X backups when looking for existing files). But my notes say that this still made a big difference. Will push this version through after giving it a quick test.

Edit2: Found a bug in the expire section -- an extra parenthesis in one section and a missing one on several other lines. Just fixed and pushed to the cleanup branch. I'm also running a simulation of 1 year's worth of backups for 20 systems (keeping 10 daily, 6 weekly, and 12 monthly backups for each). Stay tuned, I'll let you know how this test goes and whether it is safe to use the cleanup branch yet.

Edit3: So far the simulated backup of 20 hosts is looking good. With about 240 backup sets in the DB, an expire takes about 80 seconds if there are items to expire, and about 1 second if there is nothing to expire. Backing up a system incrementally takes about 40 seconds. Note, this is all on a system with 16GB of memory and a warm disk cache. But at least the main DB slowness issues seem to be a bit better. Hopefully your tests will show the same.

Next step is to finish formalizing an automated test suite for this, work up a new release, and update the web site.

lyonel (Contributor, Author) commented Nov 25, 2014

It looks OK so far; I've replaced the snebu binary with one built from the cleanup branch. I had to manually create the log table, though (snebu expire was complaining).
I'll post the results after the next full run.

derekp7 (Owner) commented Nov 25, 2014

The log table is created by the initdb function, which gets called from newbackup. The initdb function both initializes a new DB and updates an existing one when new tables have been added. But it doesn't get called by any of the other targets, just newbackup.

What I should probably do is add a checkdb function that gets called regardless of the target; if the DB is out of date, it would exit the program with instructions to run snebu updatedb or something like that. This should work out better in the long run, since some of the upcoming enhancements will require migrating data from one table to another, which will take a chunk of time (specifically, the enhancement I'm working on now, which allows for more than one vault location along with shadow database copies for incremental backups of laptops to a thumb drive).
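(One common SQLite way to implement that kind of gate is a schema version counter. A sketch only, assuming the built-in user_version pragma; snebu doesn't necessarily track its schema this way, and the expected value below is hypothetical.)

```sh
# Script fragment: refuse to run if the catalog schema is older than expected.
# PRAGMA user_version defaults to 0 on a fresh database.
current=$(sqlite3 snebu-catalog.db "PRAGMA user_version;")
expected=2
if [ "$current" -lt "$expected" ]; then
    echo "catalog schema is out of date; run 'snebu updatedb' first" >&2
    exit 1
fi
```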

lyonel (Contributor, Author) commented Nov 26, 2014

The new version indeed shows an improvement in speed (backup times):

machine2:

|      | New version | Old version |
|------|-------------|-------------|
| real | 13m44.315s  | 21m12.750s  |
| user | 3m23.057s   | 3m54.784s   |
| sys  | 0m57.645s   | 1m0.077s    |

machine3:

|      | New version | Old version |
|------|-------------|-------------|
| real | 62m48.616s  | 86m27.384s  |
| user | 15m8.254s   | 5m45.131s   |
| sys  | 6m58.699s   | 2m51.405s   |

If you want, I can send you a URL where you can see more details (including the content of the log table).

derekp7 (Owner) commented Nov 27, 2014

Sure -- a couple of questions. For these backups, were they existing hosts with incremental backups consisting of a handful of changed files, or either new hosts or hosts with a large number of changed files? Secondly, for the sections that were taking the most time, was it the initial section (when the client script is running the "find" command and getting the list of files that need to be backed up), or was the delay at the end, after it hit 100% (or nearly 100%) -- did it sit there for a while? Those are the two sections where it does a lot of DB activity.

I found out that sqlite doesn't handle multiple table joins as well as I hoped, so I started doing selects into temp tables and then joining from there -- that seems to help a bit. If the majority of the backup time was just moving the backup file data over, then I'm not too worried, since that is limited by the speed of tar and the disk/network. But I do want to make sure that the time spent on DB-related activities is minimal.
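(The temp-table trick is generic SQLite, roughly as below; the table and column names are made up for the illustration and are not snebu's schema.)

```sh
sqlite3 snebu-catalog.db <<'EOF'
-- Materialize a narrow intermediate result first, then join against it,
-- rather than asking SQLite to plan one big multi-table join.
CREATE TEMP TABLE candidate_files AS
  SELECT file_id, sha1 FROM files WHERE backupset_id = 42;

SELECT c.sha1, v.location
  FROM candidate_files c
  JOIN vault v ON v.sha1 = c.sha1;
EOF
```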

A "select * from log" should have most of what I need. I may add a few more log entries too, if needed. Thanks.

lyonel (Contributor, Author) commented Nov 27, 2014

These are incremental backups of existing hosts; I've e-mailed you the link to my dashboard (where you can see the log entries).
