RocksDB vs TuplDB

Test setup:

  • RAM: 48GB
  • Storage: Intel Optane SSD 900P 480GB
  • CPU: Ryzen 7 1700
    • 8 physical cores, 16 logical cores
  • Kernel: 5.4.0-169
  • File system: ext4
  • RocksDB JNI: 8.8.1
    • Compression type: NO_COMPRESSION
    • Compaction style: LEVEL
    • Write buffer size: 64MiB
    • Max write buffer number: 3
    • Target file size base: 64MiB
    • Level 0 file num compaction trigger: 8
    • Level 0 slowdown writes trigger: 17
    • Level 0 stop writes trigger: 24
    • Num levels: 4
    • Max bytes for level base: 512MiB
    • Max bytes for level multiplier: 8
    • Background threads: 16
    • Block cache: LRU cache 16GB (recommended to be ⅓ the size of total RAM)
    • Cache index and filter blocks: true
  • TuplDB: 1.8.0
    • Cache size: 44GB
    • DurabilityMode.NO_FLUSH
    • G1GC, -Xmx3g -XX:+UseLargePages -XX:+UseTransparentHugePages

Note regarding the RocksDB cache size: I did notice that a larger cache size improved performance, but eventually the process would consume all the RAM and would be killed by the OS. With 16GB, the tests just barely passed before running out of RAM.
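For reference, the settings above correspond roughly to the configuration sketch below. It assumes the RocksDB 8.x Java API and the TuplDB 1.8 API; the file paths are placeholders, and the actual test harness may differ.

```java
import org.rocksdb.*;
import org.cojen.tupl.Database;
import org.cojen.tupl.DatabaseConfig;
import org.cojen.tupl.DurabilityMode;

public class Setup {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        // Block-based table settings: 16GB LRU block cache, cache index/filter blocks.
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
            .setBlockCache(new LRUCache(16L << 30))
            .setCacheIndexAndFilterBlocks(true);

        Options rocksOptions = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.NO_COMPRESSION)
            .setCompactionStyle(CompactionStyle.LEVEL)
            .setWriteBufferSize(64L << 20)            // 64MiB
            .setMaxWriteBufferNumber(3)
            .setTargetFileSizeBase(64L << 20)         // 64MiB
            .setLevel0FileNumCompactionTrigger(8)
            .setLevel0SlowdownWritesTrigger(17)
            .setLevel0StopWritesTrigger(24)
            .setNumLevels(4)
            .setMaxBytesForLevelBase(512L << 20)      // 512MiB
            .setMaxBytesForLevelMultiplier(8)
            .setIncreaseParallelism(16)               // 16 background threads
            .setTableFormatConfig(tableConfig);

        RocksDB rocksDb = RocksDB.open(rocksOptions, "/data/rocksdb-test");

        Database tuplDb = Database.open(new DatabaseConfig()
            .baseFilePath("/data/tupldb-test")
            .cacheSize(44L << 30)                     // 44GB
            .durabilityMode(DurabilityMode.NO_FLUSH));
    }
}
```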

Random insert performance

For this round of tests, I didn't bother measuring insert performance when starting from an empty database. Older tests examined this and are still valid.

Random read performance

To prepare for this test, a fresh database was created with 1 billion sequentially ordered entries. The key size is 8 bytes, and the value size is 100 bytes.

The test measures the performance of transactionally reading one value against a randomly selected key, within a specific range. Initially, the key range is restricted to the first 10 million entries, and the test runs over this range for 10 seconds. The key range is expanded by an additional 10 million entries, tested for 10 seconds, and so on, up to a maximum range of 1 billion entries. Each time the range is expanded, the newly added key range is pre-scanned to populate the cache. The pre-scan time is not included in the test result time.
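For illustration, here is a minimal sketch of one thread's read operation against the TuplDB API. The index handle, the 8-byte big-endian key encoding, and the range handling are assumptions, not the actual test harness.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Database;
import org.cojen.tupl.Index;
import org.cojen.tupl.Transaction;

public class RandomRead {
    // Transactionally reads one value against a randomly selected key within the
    // current range. Keys are assumed to be 8-byte big-endian entry numbers.
    static byte[] readOne(Database db, Index ix, long rangeSize) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(k).array();
        Transaction txn = db.newTransaction();
        try {
            return ix.load(txn, key);
        } finally {
            txn.reset(); // nothing was modified; just release the lock
        }
    }
}
```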

The entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread. Charts with more threads have a similar shape, and so they aren't included.

rr_1

TuplDB is much faster than RocksDB at reading, but this is to be expected, considering that log-structured designs like RocksDB are generally slower at reading than traditional B-Tree designs. TuplDB shows a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. A similar dip can be seen with RocksDB, but it occurs sooner because the block cache size had to be restricted to 16GB.

These next charts show how TuplDB and RocksDB scale as more read threads are added to the test.

rr_scalability_1 rr_scalability_2

For the first chart, the performance over the 100 to 300 million key range was averaged together. This shows read performance when the entire range of requested entries fits in memory. Both TuplDB and RocksDB scale well with more threads, but TuplDB is much faster overall. The test machine has 8 physical cores and 16 logical cores, which is why the 16-thread result is marked with an asterisk.

The second chart averages over the 700 to 900 million key range, where the range of entries doesn't fit in memory. Again, both TuplDB and RocksDB scale well with more threads, and TuplDB is still faster. The performance gap is narrower in this test, but that's because the bottleneck shifted more towards I/O.

Random scan performance

This test ran with the same database which was prepared for the random read performance test. Like the earlier test, it randomly selects a key over a specific range. The difference is that it transactionally scans over 100 entries using a cursor instead of retrieving a single value.
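A hedged sketch of the cursor-based scan, again against the TuplDB API (the key encoding and range handling are assumptions):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Cursor;
import org.cojen.tupl.Database;
import org.cojen.tupl.Index;
import org.cojen.tupl.Transaction;

public class RandomScan {
    // Transactionally scans 100 entries starting at a randomly selected key.
    static void scan100(Database db, Index ix, long rangeSize) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] start = ByteBuffer.allocate(8).putLong(k).array();
        Transaction txn = db.newTransaction();
        try {
            Cursor c = ix.newCursor(txn);
            try {
                c.findGe(start); // position at the first key >= start
                for (int i = 0; i < 100 && c.key() != null; i++) {
                    // c.value() is the current entry; a real test would consume it
                    c.next();
                }
            } finally {
                c.reset();
            }
        } finally {
            txn.reset();
        }
    }
}
```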

Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads. This first chart shows the results for 1 thread, and charts with more threads have a similar shape, but they aren't shown.

rs2_1

TuplDB is much faster than RocksDB in this test. As before, TuplDB shows a dip in performance when the key range reaches about 380 million, because the database no longer fits in the cache. There's no significant dip with RocksDB, suggesting that it's bottlenecked by the CPU rather than by I/O.

These next charts show read scan scalability as more threads are added.

rs2_scalability_1 rs2_scalability_2

TuplDB and RocksDB both scale as threads are added, but TuplDB is always faster. Like the simple read test, the performance gap between the two databases is narrower when the bottleneck shifts more towards I/O.

Random update performance

The testing continues with the same database, except this time entries are randomly selected and transactionally updated with a new value of the same size (100 bytes). Like before, the entire test was run several times, with 1, 2, 4, 8, and 16 threads.
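An unchecked update is a blind store of a new value against the key; the old value isn't read first. A minimal TuplDB sketch, with the same assumptions as the earlier read sketch:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Database;
import org.cojen.tupl.Index;
import org.cojen.tupl.Transaction;

public class RandomUpdate {
    // Transactionally replaces one randomly selected value with a new 100-byte value.
    static void updateOne(Database db, Index ix, long rangeSize) throws Exception {
        byte[] newValue = new byte[100];
        ThreadLocalRandom.current().nextBytes(newValue);
        long k = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(k).array();
        Transaction txn = db.newTransaction();
        try {
            ix.store(txn, key, newValue); // blind replacement; the old value isn't read
            txn.commit();
        } finally {
            txn.reset();
        }
    }
}
```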

ru_1

This test shows that RocksDB has a clear advantage over TuplDB, but only when the range of updates exceeds the cache size. This is one area in which a log-structured design can outperform a traditional B-Tree, but it's only true when performing unchecked value replacement. A later section examines the performance of read-modify-write operations.

These next charts show update scalability as more threads are added.

ru_scalability_1 ru_scalability_2

The first chart shows that both TuplDB and RocksDB scale with more threads, but TuplDB is faster when the active range fits in memory. The second chart shows that RocksDB doesn't scale well with more threads. With 4 threads, TuplDB performs better than RocksDB, and it keeps scaling with more threads. RocksDB performance peaks with 4 threads, and it gets slower with more threads.

It's possible that RocksDB needs to perform more background work to manage updates, and that with more threads this work gets overwhelmed and can't keep up. These next charts suggest that this is likely the case.

ru_rocks

With 1 or 2 threads, update performance is fairly smooth and consistent. The same is true with 4 threads, up to a point (600M). With 8 and 16 threads, update performance is completely chaotic. Compare this to TuplDB's update rate:

ru_tupl

TuplDB does show some variation in performance with 8 and 16 threads, but it's much less severe than with RocksDB.

Random read-modify-write performance

This test is similar to the previous update test except it performs read-modify-write style updates. For each transaction, an existing value is selected at random, the first and last bytes of the value are modified, and the value is stored back. With TuplDB, a cursor is used to perform the operation, but not with RocksDB because it doesn't support updatable cursors.
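A hedged sketch of the TuplDB cursor-based read-modify-write (the key encoding is an assumption):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Cursor;
import org.cojen.tupl.Database;
import org.cojen.tupl.Index;
import org.cojen.tupl.Transaction;

public class ReadModifyWrite {
    // Transactionally loads a value via a cursor, modifies its first and last bytes,
    // and stores it back through the same cursor.
    static void updateOne(Database db, Index ix, long rangeSize) throws Exception {
        long k = ThreadLocalRandom.current().nextLong(rangeSize);
        byte[] key = ByteBuffer.allocate(8).putLong(k).array();
        Transaction txn = db.newTransaction();
        try {
            Cursor c = ix.newCursor(txn);
            try {
                c.find(key);
                byte[] value = c.value();
                if (value != null) {
                    value[0]++;
                    value[value.length - 1]++;
                    c.store(value);
                }
            } finally {
                c.reset();
            }
            txn.commit();
        } finally {
            txn.reset();
        }
    }
}
```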

ru2_1

The shape of this chart looks very similar to that of the random read test, and RocksDB performance is dominated by the read operation. Unlike with unchecked updates, log-structured designs aren't able to optimize read-modify-write operations.

These next charts show read-modify-write scalability as more threads are added.

ru2_scalability_1 ru2_scalability_2

Like with the random read test, TuplDB and RocksDB scale with more threads. In all cases, TuplDB outperforms RocksDB.

Random mix performance

The random mix test combines scan performance with unchecked update performance. The threads randomly choose to perform a scan of 100 values (90% of the time) or else they do an update. Both operations are transactional.
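A minimal sketch of the 90/10 mix, assuming the hypothetical scan and update helpers from the earlier sketches:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Database;
import org.cojen.tupl.Index;

public class RandomMix {
    // 90% of the time perform a 100-entry scan; otherwise perform an unchecked update.
    static void oneOperation(Database db, Index ix, long rangeSize) throws Exception {
        if (ThreadLocalRandom.current().nextInt(10) < 9) {
            RandomScan.scan100(db, ix, rangeSize);
        } else {
            RandomUpdate.updateOne(db, ix, rangeSize);
        }
    }
}
```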

rm2_1

The presence of updates doesn't appear to affect the shape of the charts, and the overall performance is similar to the random scan test. For completeness, here are the scalability charts:

rm2_scalability_1 rm2_scalability_2

Again, TuplDB and RocksDB scale with more threads, and TuplDB outperforms RocksDB.

Conclusions

TuplDB implements a traditional B-Tree, but RocksDB implements a log-structured merge-tree (LSM). Compared to a B-Tree, it's generally assumed that an LSM design is faster at writing, but slower at reading. These tests confirm that, but only for a very specific case: unchecked updates, when the active set doesn't fit in memory, and when there are fewer than 4 writing threads.

Applications using a database typically perform more reads than writes, and so in general it doesn't make sense to choose an LSM design over a traditional B-Tree. A write-heavy application which works well with RocksDB would do just as well with TuplDB, as long as the work is being performed by many threads.

One thing this test did not examine is the effects of write amplification. The storage device used was an Optane 3D XPoint SSD, which has unmatched write performance when compared to NAND flash SSDs. Because write performance varies so much among the different NAND flash SSD models, these test results would also vary quite a bit. A write-heavy application which is storing to a low-end SSD might do much better with RocksDB than with TuplDB. However, low-end SSDs tend to be slow at reading too, and RocksDB is always slower than TuplDB when reading.
