RocksDB vs TuplDB
Test setup:
- RAM: 48GB
- Storage: Intel Optane SSD 900P 480GB
- CPU: Ryzen 7 1700
  - 8 physical cores, 16 logical cores
- Kernel: 5.4.0-169
- File system: ext4
- RocksDB JNI: 8.8.1
  - Compression type: NO_COMPRESSION
  - Compaction style: LEVEL
  - Write buffer size: 64MiB
  - Max write buffer number: 3
  - Target file size base: 64MiB
  - Level 0 file num compaction trigger: 8
  - Level 0 slowdown writes trigger: 17
  - Level 0 stop writes trigger: 24
  - Num levels: 4
  - Max bytes for level base: 512MiB
  - Max bytes for level multiplier: 8
  - Background threads: 16
  - Block cache: LRU cache 16GB (recommended to be ⅓ the size of total RAM)
  - Cache index and filter blocks: true
- TuplDB: 1.8.0
  - Cache size: 44GB
  - DurabilityMode.NO_FLUSH
  - G1GC, -Xmx3g -XX:+UseLargePages -XX:+UseTransparentHugePages
Note regarding the RocksDB cache size: I did notice that a larger cache size improved performance, but eventually the process would consume all the RAM and would be killed by the OS. With 16GB, the tests just barely passed before running out of RAM.
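For reference, the settings listed above could be expressed through the RocksDB JNI and TuplDB APIs roughly as follows. This is only a sketch of the configuration, not the actual benchmark harness; the class name and file paths are placeholders, and setIncreaseParallelism is just one way to request 16 background threads.

```java
import java.io.IOException;
import org.rocksdb.*;
import org.cojen.tupl.*;

// Illustrative configuration sketch; paths and class name are hypothetical.
class SetupSketch {
    static RocksDB openRocksDB() throws RocksDBException {
        RocksDB.loadLibrary();
        BlockBasedTableConfig table = new BlockBasedTableConfig()
            .setBlockCache(new LRUCache(16L * 1024 * 1024 * 1024)) // 16GB LRU block cache
            .setCacheIndexAndFilterBlocks(true);
        Options options = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.NO_COMPRESSION)
            .setCompactionStyle(CompactionStyle.LEVEL)
            .setWriteBufferSize(64L * 1024 * 1024)        // 64MiB
            .setMaxWriteBufferNumber(3)
            .setTargetFileSizeBase(64L * 1024 * 1024)     // 64MiB
            .setLevel0FileNumCompactionTrigger(8)
            .setLevel0SlowdownWritesTrigger(17)
            .setLevel0StopWritesTrigger(24)
            .setNumLevels(4)
            .setMaxBytesForLevelBase(512L * 1024 * 1024)  // 512MiB
            .setMaxBytesForLevelMultiplier(8)
            .setIncreaseParallelism(16)                   // 16 background threads
            .setTableFormatConfig(table);
        return RocksDB.open(options, "/data/rocksdb-test"); // hypothetical path
    }

    static Database openTuplDB() throws IOException {
        DatabaseConfig config = new DatabaseConfig()
            .baseFilePath("/data/tupldb-test")            // hypothetical path
            .cacheSize(44L * 1024 * 1024 * 1024)          // 44GB cache
            .durabilityMode(DurabilityMode.NO_FLUSH);
        return Database.open(config);
    }
}
```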
For this round of tests, I didn't bother measuring insert performance when starting from an empty database. Older tests examined this and are still valid.
To prepare for this test, a fresh database was created with 1 billion sequentially ordered entries. The key size is 8 bytes, and the value size is 100 bytes.
The test measures the performance of transactionally reading one value against a randomly selected key, within a specific range. Initially, the key range is restricted to the first 10 million entries, and the test runs over this range for 10 seconds. The key range is expanded by an additional 10 million entries, tested for 10 seconds, and so on, up to a maximum range of 1 billion entries. Each time the range is expanded, the newly added key range is pre-scanned to populate the cache. The pre-scan time is not included in the test result time.
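A rough sketch of what one iteration of the read test might look like with the TuplDB API is shown below. This is illustrative only; the class and method names are made up, and the actual harness code may differ. Keys are assumed to be 8-byte big-endian encoded longs, matching the prepared database.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.*;

// Hypothetical sketch of one random-read iteration over the active key range.
class ReadSketch {
    static byte[] randomRead(Database db, Index ix, long rangeSize) throws IOException {
        // Pick a key uniformly within the currently active range.
        long keyNum = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(keyNum).array();

        Transaction txn = db.newTransaction();
        try {
            return ix.load(txn, key); // transactional point read of one 100-byte value
        } finally {
            txn.reset(); // read-only transaction, so just release it
        }
    }
}
```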
The entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread. The charts for the runs with more threads have a similar shape, and so they don't need to be included.
TuplDB is much faster than RocksDB at reading, but this is to be expected, considering that log-structured designs like RocksDB are generally slower at reading than traditional B-Tree designs. TuplDB shows a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. A similar dip can be seen with RocksDB, but it occurs sooner because the block cache size needed to be restricted to 16GB.
These next charts show how TuplDB and RocksDB scale as more read threads are added to the test.
For the first chart, the performance over the 100 to 300 million key range was averaged together. This shows read performance when the entire range of requested entries fits entirely in memory. Both TuplDB and RocksDB scale well with more threads, but TuplDB is much faster overall. The test machine has 8 physical cores and 16 logical cores, which is why the chart has an asterisk with 16 threads.
The second chart averages over the 700 to 900 million key range, and so the range of entries doesn't fit in memory. Again, both TuplDB and RocksDB scale well with more threads, but TuplDB is still faster. It should be noted that the performance gap is narrower in this test, but that's because the bottleneck shifted more towards I/O.
This test ran with the same database which was prepared for the random read performance test. Like the earlier test, it randomly selects a key over a specific range. The difference is that it transactionally scans over 100 entries using a cursor instead of retrieving a single value.
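With the TuplDB API, such a scan might look roughly like the following sketch. It's illustrative only, reusing the key encoding from the earlier read sketch, and isn't the actual harness code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.*;

// Hypothetical sketch of a 100-entry transactional scan using a Tupl cursor.
class ScanSketch {
    static void randomScan(Database db, Index ix, long rangeSize) throws IOException {
        long start = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] startKey = ByteBuffer.allocate(8).putLong(start).array();

        Transaction txn = db.newTransaction();
        Cursor c = ix.newCursor(txn);
        try {
            c.findGe(startKey); // position at the first key >= startKey
            for (int i = 0; i < 100 && c.key() != null; i++) {
                byte[] value = c.value(); // consume the entry
                c.next();
            }
        } finally {
            c.reset();
            txn.reset();
        }
    }
}
```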
Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread, and charts with more threads have a similar shape, but they aren't shown.
TuplDB is much faster than RocksDB in this test. As before, TuplDB shows a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. There's no significant dip with RocksDB, suggesting that it has a CPU bottleneck instead.
These next charts show read scan scalability as more threads are added.
TuplDB and RocksDB both scale as threads are added, but TuplDB is always faster. Like the simple read test, the performance gap between the two databases is narrower when the bottleneck shifts more towards I/O.
The testing continues with the same database, except this time entries are randomly selected and transactionally updated with a new value of the same size (100 bytes). Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads.
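An unchecked update of this kind might look roughly like the following sketch with the TuplDB API. It's illustrative only: the new 100-byte value is simply stored over the old one, without reading it first.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.*;

// Hypothetical sketch of one unchecked (blind) update against a random key.
class UpdateSketch {
    static void randomUpdate(Database db, Index ix, long rangeSize, byte[] newValue)
        throws IOException
    {
        long keyNum = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(keyNum).array();

        Transaction txn = db.newTransaction();
        try {
            ix.store(txn, key, newValue); // replace the 100-byte value without reading it
            txn.commit();
        } finally {
            txn.reset();
        }
    }
}
```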
This test shows that RocksDB has a clear advantage over TuplDB, but only when the range of updates exceeds the cache size. This is one area in which a log-structured design can outperform a traditional B-Tree, but it's only true when performing unchecked value replacement. A later section examines the performance of read-modify-write operations.
These next charts show update scalability as more threads are added.
The first chart shows that both TuplDB and RocksDB scale with more threads, but TuplDB is faster when the active range fits in memory. The second chart shows that RocksDB doesn't scale well with more threads. With 4 threads, TuplDB performs better than RocksDB, and it keeps scaling with more threads. RocksDB performance peaks with 4 threads, and it gets slower with more threads.
It's possible that RocksDB needs to perform more background work to manage updates, and that eventually this work gets overwhelmed and can't keep up. These next charts suggest that this is likely the case.
With 1 or 2 threads, update performance is fairly smooth and consistent. The same is true with 4 threads, up to a point (600M). With 8 and 16 threads, update performance is completely chaotic. Compare this to TuplDB's update rate:
TuplDB does show some variation in performance with 8 and 16 threads, but it's much less severe than with RocksDB.
This test is similar to the previous update test except it performs read-modify-write style updates. For each transaction, an existing value is selected at random, the first and last bytes of the value are modified, and the value is stored back. With TuplDB, a cursor is used to perform the operation, but not with RocksDB because it doesn't support updatable cursors.
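A cursor-based read-modify-write with the TuplDB API might look roughly like this sketch. It's illustrative only, not the actual harness code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.*;

// Hypothetical sketch of a read-modify-write update using a Tupl cursor:
// load the value, modify its first and last bytes, and store it back.
class ReadModifyWriteSketch {
    static void readModifyWrite(Database db, Index ix, long rangeSize) throws IOException {
        long keyNum = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(keyNum).array();

        Transaction txn = db.newTransaction();
        Cursor c = ix.newCursor(txn);
        try {
            c.find(key);
            byte[] value = c.value();
            if (value != null) {
                value[0]++;                // modify the first byte
                value[value.length - 1]++; // modify the last byte
                c.store(value);            // store back through the same cursor
            }
            txn.commit();
        } finally {
            c.reset();
            txn.reset();
        }
    }
}
```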
The shape of this chart looks very similar to that of the random read test, because RocksDB performance is dominated by the read operation. Unlike with unchecked updates, log-structured designs aren't able to optimize read-modify-write operations.
These next charts show read-modify-write scalability as more threads are added.
Like with the random read test, TuplDB and RocksDB scale with more threads. In all cases, TuplDB outperforms RocksDB.
The random mix test combines scan performance with unchecked update performance. The threads randomly choose to perform a scan of 100 values (90% of the time) or else they do an update. Both operations are transactional.
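As a sketch, the per-iteration choice might look like this, reusing the scan and update sketches from the earlier sections (illustrative only).

```java
import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.*;

// Hypothetical sketch of the 90/10 mix, built on the earlier sketches.
class RandomMixSketch {
    static void randomMix(Database db, Index ix, long rangeSize, byte[] newValue)
        throws IOException
    {
        if (ThreadLocalRandom.current().nextInt(10) != 0) {
            ScanSketch.randomScan(db, ix, rangeSize);               // 90%: scan 100 entries
        } else {
            UpdateSketch.randomUpdate(db, ix, rangeSize, newValue); // 10%: unchecked update
        }
    }
}
```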
The presence of updates doesn't appear to affect the shape of the charts, and the overall performance is similar to the random scan test. For completeness, here are the scalability charts:
Again, TuplDB and RocksDB scale with more threads, and TuplDB outperforms RocksDB.
TuplDB implements a traditional B-Tree, but RocksDB implements a log-structured merge-tree (LSM). Compared to a B-Tree, it's generally assumed that an LSM design is faster at writing, but slower at reading. These tests confirm that, but only for a very specific case: unchecked updates, when the active set doesn't fit in memory, and when there are fewer than 4 writing threads.
Applications using a database typically perform more reads than writes, and so in general it doesn't make sense to choose an LSM design over a traditional B-Tree. A write-heavy application which works well with RocksDB would do just as well with TuplDB, as long as the work is being performed by many threads.
One thing this test did not examine is the effects of write amplification. The storage device used was an Optane 3D XPoint SSD, which has unmatched write performance when compared to NAND flash SSDs. Because write performance varies so much among the different NAND flash SSD models, these test results would also vary quite a bit. A write-heavy application which is storing to a low-end SSD might do much better with RocksDB than with TuplDB. However, low-end SSDs tend to be slow at reading too, and RocksDB is always slower than TuplDB when reading.