Out-of-memory corrupts catalog #4
Comments
I had originally put in `journal_mode=memory` hoping to speed up the expire process, but ended up taking it out in one of the patches in the "cleanup" branch. Can you take a look at that branch and see what the performance characteristics of it are on your setup? Also, what is the performance effect of `journal_mode=truncate` (in your patch) vs. the default of `journal_mode=delete` (used when nothing is specified)? Let me know if it looks like it should help, or at least shouldn't hurt. Thanks.
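For reference, SQLite's journal mode can be switched with a single pragma right after opening the catalog. A minimal standalone sketch (not snebu's actual code, and assuming the catalog file is named snebu-catalog.db) would look like this:

```c
/* Minimal sketch (not snebu's actual code): set the journal mode right
 * after opening the catalog, so delete vs. truncate vs. memory can be
 * compared in isolation. */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    char *errmsg = NULL;

    if (sqlite3_open("snebu-catalog.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    /* DELETE (the default) removes the rollback journal after each commit,
     * TRUNCATE zeroes it but keeps the file, and MEMORY keeps it in RAM --
     * fastest, but a crash mid-transaction can corrupt the database. */
    if (sqlite3_exec(db, "PRAGMA journal_mode = TRUNCATE;", NULL, NULL, &errmsg) != SQLITE_OK) {
        fprintf(stderr, "pragma failed: %s\n", errmsg);
        sqlite3_free(errmsg);
    }

    sqlite3_close(db);
    return 0;
}
```

The trade-off between those modes is directly relevant to the out-of-memory crash described further down, since a memory-backed journal that is lost mid-transaction is one way the catalog can end up corrupted.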
I've prefixed all calls to w.r.t.
The big performance improvement that should be there is during the newbackup stage -- this is where the client sends a manifest of files to be included in the backup set, and snebu replies with which files are needed (vs. what it already has). Previously, it looked at all the backups for the given client. I've changed it to only look at the past 3 backups, so this part should go much faster. The only potential drawback is that if the last 3 backups were partial ones, it will return a bigger list of needed files (these still get deduplicated on the back end, but the backup set file transfer time will be impacted). In practice, this shouldn't be an issue. [Note: I'm planning on making this a tuneable parameter, stored in a "settings" table in the database, and possibly making it look at the past X backups of the same retention class.]

Edit: Looks like I didn't merge this performance improvement into the cleanup branch. I found a branch where I have a slightly different performance improvement for this, but not what I described above (i.e., considering only the most recent X backups when looking for existing files). My notes say that this still made a big difference, though. Will push this version through after giving it a quick test.

Edit2: Found a bug in the expire section -- extra parentheses in one section, and one missing in several other lines. Just fixed and pushed to the cleanup branch. I'm also running a simulation of 1 year's worth of backups for 20 systems (keeping 10 daily, 6 weekly, and 12 monthly backups for each). Stay tuned -- I'll let you know how this test goes and whether it is safe to use the cleanup branch yet.

Edit3: So far the simulated backup of 20 hosts is looking good. With about 240 backup sets in the DB, an expire takes about 80 seconds if there are items to expire, and about 1 second if there is nothing to expire. Backing up a system incrementally takes about 40 seconds. Note, this is all on a system with 16GB of memory and a warm disk cache, but at least the main DB slowness issues seem to be a bit better. Hopefully your tests will show the same. The next step is to finish formalizing an automated test suite for this, work up a new release, and update the web site.
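As a rough illustration of the change described above -- restricting the needed-files lookup to the most recent backups -- the query might look something like the following. The table and column names (manifest_tmp, backup_entries, backupsets) are invented for illustration and are not snebu's actual schema:

```c
/* Hypothetical query: report which manifest entries are not already
 * present in any of the client's 3 most recent backupsets.  Table and
 * column names are illustrative only, not snebu's real schema. */
static const char *needed_files_sql =
    "SELECT m.file_path, m.file_hash "
    "  FROM manifest_tmp m "
    " WHERE NOT EXISTS ("
    "   SELECT 1 FROM backup_entries e "
    "    WHERE e.file_hash = m.file_hash "
    "      AND e.backupset_id IN ("
    "        SELECT id FROM backupsets "
    "         WHERE client = ?1 "
    "         ORDER BY backup_date DESC "
    "         LIMIT 3));";  /* the '3' is the tuneable mentioned above */
```

Storing the LIMIT value as a row in a settings table would then let the cutoff (or a per-retention-class variant) be changed without recompiling.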
It looks OK so far; I've replaced the snebu binary with one built from the cleanup branch.
The What I should probably do is add a
The new version indeed shows an improvement in speed (backup times):

machine2:
machine3:
If you want, I can send you a URL where you can see more details (including the content of the
Sure -- a couple of questions. For these backups, were they existing hosts, with incremental backups consisting of a handful of changed files, or were they either new hosts or hosts with a large number of changed files?

Secondly, which section was taking the most time: the initial section (when the client script is running the "find" command and getting the list of files that need to be backed up), or the end -- after it hit 100% (or nearly 100%), did it sit there for a while? Those are the two sections where it does a lot of DB activity. I found out that sqlite doesn't handle multi-table joins as well as I had hoped, so I started doing selects into temp tables and then joining from there -- that seems to help a bit.

If the majority of the backup time was just moving the backup file data over, then I'm not too worried, as that is limited by the speed of tar and the disk / network. But I do want to make sure that the DB-related activities take minimal time. A "select * from log" should have most of what I need. I may add a few more log entries too, if needed. Thanks.
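To make the temp-table idea above concrete, here is a hedged sketch of the same kind of lookup restructured as a two-step staging query. The table names are again invented for illustration, not snebu's real schema:

```c
#include <sqlite3.h>

/* Sketch of the "select into a temp table first, then join" pattern:
 * materialize the small set of rows of interest so the final join only
 * touches the temp table, instead of asking SQLite to plan one large
 * multi-table join. */
static int stage_and_join(sqlite3 *db)
{
    char *err = NULL;
    int rc;

    /* Step 1: stage the candidate rows into a TEMP table. */
    rc = sqlite3_exec(db,
        "CREATE TEMP TABLE recent_hashes AS "
        "  SELECT file_hash FROM backup_entries "
        "   WHERE backupset_id IN (SELECT id FROM backupsets "
        "                           ORDER BY backup_date DESC LIMIT 3);",
        NULL, NULL, &err);
    if (rc != SQLITE_OK) { sqlite3_free(err); return rc; }

    /* Step 2: join the client's manifest against the staged rows only.
     * (A real caller would read the result rows with sqlite3_prepare_v2
     * and sqlite3_step rather than discarding them via sqlite3_exec.) */
    rc = sqlite3_exec(db,
        "SELECT m.file_path FROM manifest_tmp m "
        " LEFT JOIN recent_hashes r ON r.file_hash = m.file_hash "
        " WHERE r.file_hash IS NULL;",
        NULL, NULL, &err);
    if (rc != SQLITE_OK) sqlite3_free(err);
    return rc;
}
```

Whether this beats the single large join depends on how well SQLite's planner handles the original query, so it is worth comparing both forms against the timings recorded in the log table.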
These are incremental backups of existing hosts; I've e-mailed you the link to my dashboard (where you can see the
When Snebu reaches an out-of-memory condition (after eating all system memory + swap) while running `snebu expire` or during a backup, it crashes, leaving a corrupted catalog DB. Further tries (either using various `snebu` commands or `sqlite3` directly) often lead to `SQLITE_CORRUPT` (11) errors.

Recovery with `sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db` is (usually) possible.

PS: After a few months of backing up 4 Linux boxes, my catalog DB is now 27GiB (mostly due to the various indices Snebu creates), which means that its DB journal often gets bigger than the available memory+swap (8GiB).
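One way to surface this earlier -- a hedged sketch, not something snebu is shown to do in this thread -- would be to run `PRAGMA integrity_check` on the catalog before any destructive operation and refuse to continue if it does not come back "ok":

```c
#include <stdio.h>
#include <string.h>
#include <sqlite3.h>

/* Sketch: refuse to operate on a catalog that fails integrity_check,
 * pointing the user at the dump-and-reload recovery instead. */
static int catalog_is_healthy(const char *path)
{
    sqlite3 *db = NULL;
    sqlite3_stmt *stmt = NULL;
    int healthy = 0;

    if (sqlite3_open(path, &db) != SQLITE_OK) {
        sqlite3_close(db);
        return 0;
    }
    if (sqlite3_prepare_v2(db, "PRAGMA integrity_check;", -1, &stmt, NULL) == SQLITE_OK) {
        if (sqlite3_step(stmt) == SQLITE_ROW &&
            strcmp((const char *) sqlite3_column_text(stmt, 0), "ok") == 0)
            healthy = 1;
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
    return healthy;
}

int main(void)
{
    if (!catalog_is_healthy("snebu-catalog.db")) {
        fprintf(stderr, "catalog failed integrity_check; try:\n"
                "  sqlite3 snebu-catalog.db .dump | sqlite3 recovered-snebu-catalog.db\n");
        return 1;
    }
    printf("catalog looks OK\n");
    return 0;
}
```

This doesn't address the memory growth during expire itself, but it would turn later, mysterious `SQLITE_CORRUPT` errors into an immediate, explicit failure with a recovery hint.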