-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathindex.html
266 lines (214 loc) · 24.5 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<link href='https://fonts.googleapis.com/css?family=Architects+Daughter' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="stylesheets/stylesheet.css" media="screen" />
<link rel="stylesheet" type="text/css" href="stylesheets/pygment_trac.css" media="screen" />
<link rel="stylesheet" type="text/css" href="stylesheets/print.css" media="print" />
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<title>ZBackup</title>
</head>
<body>
<header>
<div class="inner">
<h1>ZBackup</h1>
<h2>A versatile deduplicating backup tool</h2>
<!-- <a href="https://github.com/zbackup" class="button"><small>Follow me on</small>GitHub</a>-->
</div>
</header>
<div id="content-wrapper">
<div class="inner clearfix">
<section id="main-content">
<div class="release">Latest release: <span class="version">1.3</span> (<a href="https://github.com/zbackup/zbackup/releases/tag/1.3">release notes</a>, <a href="https://github.com/zbackup/zbackup/archive/1.3.tar.gz">tar.gz</a>, <a href="https://github.com/zbackup/zbackup/archive/1.3.zip">zip</a>)</div>
<h1>
<a name="introduction" class="anchor" href="#introduction"><span class="octicon octicon-link"></span></a>Introduction</h1>
<p><strong>zbackup</strong> is a globally-deduplicating backup tool, based on the ideas found in <a href="http://rsync.samba.org/">rsync</a>. Feed a large <code>.tar</code> into it, and it will store duplicate regions of it only once, then compress and optionally encrypt the result. Feed another <code>.tar</code> file, and it will also re-use any data found in any previous backups. This way only new changes are stored, and as long as the files are not very different, the amount of storage required is very low. Any of the backup files stored previously can be read back in full at any time. The program is format-agnostic, so you can feed virtually any files to it (any types of archives, proprietary formats, even raw disk images -- but see <a href="#caveats">Caveats</a>).</p>
<p>This is achieved by sliding a window with a rolling hash over the input at a byte granularity and checking whether the block in focus was ever met already. If a rolling hash matches, an additional full cryptographic hash is calculated to ensure the block is indeed the same. The deduplication happens then.</p>
<h1>
<a name="features" class="anchor" href="#features"><span class="octicon octicon-link"></span></a>Features</h1>
<p>The program has the following features:</p>
<ul>
<li>Parallel LZMA compression of the stored data</li>
<li>Built-in AES encryption of the stored data</li>
<li>Possibility to delete old backup data in the future</li>
<li>Use of a 64-bit rolling hash, keeping the amount of soft collisions to zero</li>
<li>Repository consists of immutable files. No existing files are ever modified</li>
<li>Written in C++ only with only modest library dependencies</li>
<li>Safe to use in production (see <a href="#safety">below</a>)</li>
</ul><h1>
<a name="build-dependencies" class="anchor" href="#build-dependencies"><span class="octicon octicon-link"></span></a>Build dependencies</h1>
<ul>
<li>
<code>cmake</code> >= 2.8.2 (though it should not be too hard to compile the sources by hand if needed)</li>
<li>
<code>libssl-dev</code> for all encryption, hashing and random numbers</li>
<li>
<code>libprotobuf-dev</code> for data serialization</li>
<li>
<code>liblzma-dev</code> for compression</li>
<li>
<code>zlib1g-dev</code> for adler32 calculation</li>
</ul><h1>
<a name="quickstart" class="anchor" href="#quickstart"><span class="octicon octicon-link"></span></a>Quickstart</h1>
<p>Download version 1.2 source code: <a href="https://github.com/zbackup/zbackup/archive/1.2.tar.gz">.tar.gz</a> or <a href="https://github.com/zbackup/zbackup/archive/1.2.zip">.zip</a>.</p>
<p>To build:</p>
<div class="highlight"><pre><span class="nb">cd </span>zbackup
cmake .
make
sudo make install
<span class="c"># or just run as ./zbackup</span>
</pre></div>
<p>To use:</p>
<div class="highlight"><pre>zbackup init --non-encrypted /my/backup/repo
tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-<span class="sb">`</span>date <span class="s1">'+%Y-%m-%d'</span><span class="sb">`</span>
zbackup restore /my/backup/repo/backups/backup-<span class="sb">`</span>date <span class="s1">'+%Y-%m-%d'</span><span class="sb">`</span> > /my/precious/backup-restored.tar
</pre></div>
<p>If you have a lot of RAM to spare, you can use it to speed-up the restore process -- to use 512 MB more, pass <code>--cache-size 512mb</code> when restoring.</p>
<p>If encryption is wanted, create a file with your password:</p>
<div class="highlight"><pre><span class="c"># more secure to use an editor</span>
<span class="nb">echo </span>mypassword > ~/.my_backup_password
chmod 600 ~/.my_backup_password
</pre></div>
<p>Then init the repo the following way:</p>
<div class="highlight"><pre>zbackup init --password-file ~/.my_backup_password /my/backup/repo
</pre></div>
<p>And always pass the same argument afterwards:</p>
<div class="highlight"><pre>tar c /my/precious/data | zbackup --password-file ~/.my_backup_password backup /my/backup/repo/backups/backup-<span class="sb">`</span>date <span class="s1">'+%Y-%m-%d'</span><span class="sb">`</span>
zbackup --password-file ~/.my_backup_password restore /my/backup/repo/backups/backup-<span class="sb">`</span>date <span class="s1">'+%Y-%m-%d'</span><span class="sb">`</span> > /my/precious/backup-restored.tar
</pre></div>
<p>If you have a 32-bit system and a lot of cores, consider lowering the number of compression threads by passing <code>--threads 4</code> or <code>--threads 2</code> if the program runs out of address space when backing up (see why <a href="#caveats">below</a>, item 2). There should be no problem on a 64-bit system.</p>
<h1>
<a name="caveats" class="anchor" href="#caveats"><span class="octicon octicon-link"></span></a>Caveats</h1>
<ul>
<li>While you can pipe any data into the program, the data should be uncompressed and unencrypted -- otherwise no deduplication could be performed on it. <code>zbackup</code> would compress and encrypt the data itself, so there's no need to do that yourself. So just run <code>tar c</code> and pipe it into <code>zbackup</code> directly. If backing up disk images employing encryption, pipe the unencrypted version (the one you normally mount). If you create <code>.zip</code> or <code>.rar</code> files, use no compression (<code>-0</code> or <code>-m0</code>) and no encryption.</li>
<li>Parallel LZMA compression uses a lot of RAM (several hundreds of megabytes, depending on the number of threads used), and ten times more virtual address space. The latter is only relevant on 32-bit architectures where it's limited to 2 or 3 GB. If you hit the ceiling, lower the number of threads with <code>--threads</code>.</li>
<li>Since the data is deduplicated, there's naturally no redundancy in it. A loss of a single file can lead to a loss of virtually all data. Make sure you store it on a redundant storage (RAID1, a cloud provider etc).</li>
<li>The encryption key, if used, is stored in the <code>info</code> file in the root of the repo. It is encrypted with your password. Technically thus you can change your password without re-encrypting any data, and as long as no one possesses the old <code>info</code> file and knows your old password, you would be safe (even though the actual option to change password is not implemented yet -- someone who needs this is welcome to create a pull request -- the possibility is all there). Also note that it is crucial you don't lose your <code>info</code> file, as otherwise the whole backup would be lost. </li>
</ul><h1>
<a name="limitations" class="anchor" href="#limitations"><span class="octicon octicon-link"></span></a>Limitations</h1>
<ul>
<li>Right now the only modes supported are reading from standard input and writing to standard output. FUSE mounts and NBD servers may be added later if someone contributes the code.</li>
<li>The program keeps all known blocks in an in-RAM hash table, which may create scalability problems for very large repos (see <a href="#scalability">below</a>).</li>
<li>The only encryption mode currently implemented is <code>AES-128</code> in <code>CBC</code> mode with <code>PKCS#7</code> padding. If you believe that this is not secure enough, patches are welcome. Before you jump to conclusions however, read <a href="http://www.schneier.com/blog/archives/2009/07/another_new_aes.html">this article</a>.</li>
<li>The only compression mode supported is LZMA, which suits backups very nicely.</li>
<li>It's only possible to fully restore the backup in order to get to a required file, without any option to quickly pick it out. <code>tar</code> would not allow to do it anyway, but e.g. for <code>zip</code> files it could have been possible. This is possible to implement though, e.g. by exposing the data over a FUSE filesystem.</li>
<li>There's no option to delete old backup data yet. The possibility is all there, though. Someone needs to implement it (see <a href="#improvements">below</a>).</li>
<li>There's no option to specify block and bundle sizes other than the default ones (currently <code>64k</code> and <code>2MB</code> respectively), though it's trivial to add command-line switches for those.</li>
</ul><p>Most of those limitations can be lifted by implementing the respective features.</p>
<h1>
<a name="safety" class="anchor" href="#safety"><span class="octicon octicon-link"></span></a>Safety</h1>
<p>Is it safe to use <code>zbackup</code> for production data? Being free software, the program comes with no warranty of any kind. That said, it's perfectly safe for production, and here's why. When performing a backup, the program never modifies or deletes any existing files -- only new ones are created. It specifically checks for that, and the code paths involved are short and easy to inspect. Furthermore, each backup is protected by its <code>SHA256</code> sum, which is calculated before piping the data into the deduplication logic. The code path doing that is also short and easy to inspect. When a backup is being restored, its <code>SHA256</code> is calculated again and compared against the stored one. The program would fail on a mismatch. Therefore, to ensure safety it is enough to restore each backup to <code>/dev/null</code> immediately after creating it. If it restores fine, it will restore fine ever after.<br>
To add some statistics, the author of the program has been using an older version of <code>zbackup</code> internally for over a year. The <code>SHA256</code> check never ever failed. Again, even if it does, you would know immediately, so no work would be lost. Therefore you are welcome to try the program in production, and if you like it, stick with it.</p>
<h1>
<a name="usage-notes" class="anchor" href="#usage-notes"><span class="octicon octicon-link"></span></a>Usage notes</h1>
<p>The repository has the following directory structure:</p>
<pre><code>/repo
backups/
bundles/
00/
01/
02/
...
index/
info
</code></pre>
<ul>
<li>The <code>backups</code> directory contain your backups. Those are very small files which are needed for restoration. They are encrypted if encryption is enabled. The names can be arbitrary. It is possible to arrange files in subdirectories, too. Free renaming is also allowed.</li>
<li>The <code>bundles</code> directory contains the bulk of data. Each bundle internally contains multiple small chunks, compressed together and encrypted. Together all those chunks account for all deduplicated data stored.</li>
<li>The <code>index</code> directory contains the full index of all chunks in the repository, together with their bundle names. A separate index file is created for each backup session. Technically those files are redundant, all information is contained in the bundles themselves. However, having a separate <code>index</code> is nice for two reasons: 1) it's faster to read as it incurs less seeks, and 2) it allows making backups while storing bundles elsewhere. Bundles are only needed when restoring -- otherwise it's sufficient to only have <code>index</code>. One could then move all newly created bundles into another machine after each backup.</li>
<li>
<code>info</code> is a very important file which contains all global repository metadata, such as chunk and bundle sizes, and an encryption key encrypted with the user password. It is paramount not to lose it, so backing it up separately somewhere might be a good idea. On the other hand, if you absolutely don't trust your remote storage provider, you might consider not storing it with the rest of the data. It would then be impossible to decrypt it at all, even if your password gets known later.</li>
</ul><p>The program does not have any facilities for sending your backup over the network. You can <code>rsync</code> the repo to another computer or use any kind of cloud storage capable of storing files. Since <code>zbackup</code> never modifies any existing files, the latter is especially easy -- just tell the upload tool you use not to upload any files which already exist on the remote side (e.g. with <code>gsutil</code> it's <code>gsutil cp -R -n /my/backup gs:/mybackup/</code>).</p>
<p>To aid with creating backups, there's an utility called <code>tartool</code> included with <code>zbackup</code>. The idea is the following: one sprinkles empty files called <code>.backup</code> and <code>.no-backup</code> across the entire filesystem. Directories where <code>.backup</code> files are placed are marked for backing up. Similarly, directories with <code>.no-backup</code> files are marked not to be backed up. Additionally, it is possible to place <code>.backup-XYZ</code> in the same directory where <code>XYZ</code> is to mark <code>XYZ</code> for backing up, or place <code>.no-backup-XYZ</code> to mark it not to be backed up. Then <code>tartool</code> can be run with three arguments -- the root directory to start from (can be <code>/</code>), the output <code>includes</code> file, and the output <code>excludes</code> file. The tool traverses over the given directory noting the <code>.backup*</code> and <code>.no-backup*</code> files and creating include and exclude lists for the <code>tar</code> utility. The <code>tar</code> utility could then be run as <code>tar c --files-from includes --exclude-from excludes</code> to store all chosen data.</p>
<h1>
<a name="scalability" class="anchor" href="#scalability"><span class="octicon octicon-link"></span></a>Scalability</h1>
<p>This section tries do address the question on the maximum amount of data which can be held in a backup repository. What is meant here is the deduplicated data. The number of bytes in all source files ever fed into the repository doesn't matter, but the total size of the resulting repository does.
Internally all input data is split into small blocks called chunks (up to <code>64k</code> each by default). Chunks are collected into bundles (up to <code>2MB</code> each by default), and those bundles are then compressed and encrypted.</p>
<p>There are then two problems with the total number of chunks in the repository:</p>
<ul>
<li>Hashes of all existing chunks are needed to be kept in RAM while the backup is ongoing. Since the sliding window performs checking with a single-byte granularity, lookups would otherwise be too slow. The amount of data needed to be stored is technically only 24 bytes for each chunk, where the size of the chunk is up to <code>64k</code>. In an example real-life <code>18GB</code> repo, only <code>18MB</code> are taken by in its hash index. Multiply this roughly by two to have an estimate of RAM needed to store this index as an in-RAM hash table. However, as this size is proportional to the total size of the repo, for <code>2TB</code> repo you could already require <code>2GB</code> of RAM. Most repos are much smaller though, and as long as the deduplication works properly, in many cases you can store terabytes of highly-redundant backup files in a <code>20GB</code> repo easily.</li>
<li>We use a 64-bit rolling hash, which allows to have an <code>O(1)</code> lookup cost at each byte we process. Due to <a href="https://en.wikipedia.org/wiki/Birthday_paradox">birthday paradox</a>, we would start having collisions when we approach <code>2^32</code> hashes. If each chunk we have is <code>32k</code> on average, we would get there when our repo grows to <code>128TB</code>. We would still be able to continue, but as the number of collisions would grow, we would have to resort to calculating the full hash of a block at each byte more and more often, which would result in a considerable slowdown.</li>
</ul><p>All in all, as long as the amount of RAM permits, one can go up to several terabytes in deduplicated data, and start having some slowdown after having hundreds of terabytes, RAM-permitting.</p>
<h1>
<a name="design-choices" class="anchor" href="#design-choices"><span class="octicon octicon-link"></span></a>Design choices</h1>
<ul>
<li>We use a 64-bit modified Rabin-Karp rolling hash (see <code>rolling_hash.hh</code> for details), while most other programs use a 32-bit one. As noted previously, one problem with the hash size is its birthday bound, which with the 32-bit hash is met after having only <code>2^16</code> hashes. The choice of a 64-bit hash allows us to scale much better while having virtually the same calculation cost on a typical 64-bit machine.</li>
<li>
<code>rsync</code> uses <code>MD5</code> as its strong hash. While <code>MD5</code> is known to be fast, it is also known to be broken, allowing a malicious user to craft colliding inputs. <code>zbackup</code> uses <code>SHA1</code> instead. The cost of <code>SHA1</code> calculations on modern machines is actually less than that of <code>MD5</code> (run <code>openssl speed md5 sha1</code> on yours), so it's a win-win situation. We only keep the first 128 bits of the <code>SHA1</code> output, and therefore together with the rolling hash we have a 192-bit hash for each chunk. It's a multiple of 8 bytes which is a nice properly on 64-bit machines, and it is long enough not to worry about possible collisions.</li>
<li>
<code>AES-128</code> in <code>CBC</code> mode with <code>PKCS#7</code> padding is used for encryption. This seems to be a reasonbly safe classic solution. Each encrypted file has a random IV as its first 16 bytes.</li>
<li>We use Google's <a href="https://developers.google.com/protocol-buffers/">protocol buffers</a> to represent data structures in binary form. They are very efficient and relatively simple to use.</li>
</ul><h1>
<a name="improvements" class="anchor" href="#improvements"><span class="octicon octicon-link"></span></a>Improvements</h1>
<p>There's a lot to be improved in the program. It was released with the minimum amount of functionality to be useful. It is also stable. This should hopefully stimulate people to join the development and add all those other fancy features. Here's a list of ideas:</p>
<ul>
<li>Additional options, such as configurable chunk and bundle sizes etc.</li>
<li>A command to change password.</li>
<li>A command to perform garbage collection. The program should skim through all backups and note which chunks are used by all of them. Then it should skim through all bundles and see which chunks among the ones stored were never used by the backups. If a bundle has more than <em>X%</em> of unused chunks, the remaining chunks should be transferred into brand new bundles. The old bundles should be deleted then. Once the process finishes, a new single index file with all existing chunk ids should be written, replacing all previous index files. With this command, it would become possible to remove old backups.</li>
<li>A command to fsck the repo by doing something close to what garbage collection does, but also checking all hashes and so on.</li>
<li>Parallel decompression. Right now decompression is single-threaded, but it is possible to look ahead in the stream and perform prefetching.</li>
<li>Support for mounting the repo over FUSE. Random access to data would then be possible.</li>
<li>Support for exposing a backed up file over a userspace NBD server. It would then be possible to mount raw disk images without extracting them.</li>
<li>Support for other encryption types (preferably for everything <code>openssl</code> supports with its <code>evp</code>).</li>
<li>Support for other compression methods.</li>
<li>You name it!</li>
</ul><h1>
<a name="communication" class="anchor" href="#communication"><span class="octicon octicon-link"></span></a>Communication</h1>
<ul>
<li>
The program's website is at <a href="http://zbackup.org/">http://zbackup.org/</a>.</li>
<li>
Development happens at <a href="https://github.com/zbackup/zbackup">https://github.com/zbackup/zbackup</a>.</li>
<li>
Discussion forum is at <a href="https://groups.google.com/forum/#!forum/zbackup">https://groups.google.com/forum/#!forum/zbackup</a>. Please ask for help there!</li>
</ul>
<p>The author is reachable over email at <a href="mailto:[email protected]">[email protected]</a>. Please be constructive and don't ask for help using the program, though. In most cases it's best to stick to the forum, unless you have something to discuss with the author in private.</p>
<h1>
<a name="similar-projects" class="anchor" href="#similar-projects"><span class="octicon octicon-link"></span></a>Similar projects</h1>
<p><code>zbackup</code> is certainly not the first project to embrace the idea of using a rolling hash for deduplication. Here's a list of other projects the author found on the web:</p>
<ul>
<li>
<a href="https://github.com/bup/bup">bup</a>, based on storing data in <code>git</code> packs. No possibility of removing old data. This program was the initial inspiration for <code>zbackup</code>.</li>
<li>
<a href="http://www.synctus.com/ddar/">ddar</a>, seems to be a little bit outdated. Contains a nice list of alternatives with comparisons.</li>
<li>
<a href="http://www.nongnu.org/rdiff-backup/">rdiff-backup</a>, based on the original <code>rsync</code> algorithm. Does not do global deduplication, only working over the files with the same file name.</li>
<li>
<a href="http://duplicity.nongnu.org/">duplicity</a>, which looks similar to <code>rdiff-backup</code> with regards to mode of operation.</li>
<li>Some filesystems (most notably <a href="http://en.wikipedia.org/wiki/ZFS">ZFS</a> and <a href="http://en.wikipedia.org/wiki/Btrfs">Btrfs</a>) provide deduplication features. They do so only at block level though, without a sliding window, so they can not accomodate to arbitrary byte insertion/deletion in the middle of data.</li>
</ul><h1>
<a name="credits" class="anchor" href="#credits"><span class="octicon octicon-link"></span></a>Credits</h1>
<p>Copyright (c) 2012-2014 Konstantin Isakov (<a href="mailto:[email protected]">[email protected]</a>). Licensed under GNU GPLv2 or later + OpenSSL, see LICENSE.</p>
<p>This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.</p>
<p>This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.</p>
<p>You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.</p>
<p id="theme-by">Architect theme by <a href="https://twitter.com/jasonlong">Jason Long</a>.</p>
</section>
<aside id="sidebar">
</aside>
</div>
</div>
<!-- Start of StatCounter Code for Default Guide -->
<script type="text/javascript">
var sc_project=9108239;
var sc_invisible=1;
var sc_security="e8adde97";
var scJsHost = (("https:" == document.location.protocol) ?
"https://secure." : "http://www.");
document.write("<sc"+"ript type='text/javascript' src='" +
scJsHost+
"statcounter.com/counter/counter.js'></"+"script>");
</script>
<noscript><div class="statcounter"><a title="free hit
counters" href="http://statcounter.com/"
target="_blank"><img class="statcounter"
src="http://c.statcounter.com/9108239/0/e8adde97/1/"
alt="free hit counters"></a></div></noscript>
<!-- End of StatCounter Code for Default Guide -->
</body>
</html>