Latch coupling

To support concurrent B-tree operations, TuplDB uses the typical latch coupling (or "crabbing") technique. A search starts at the root node, on which it acquires a shared latch. When the search finds the appropriate child node, it acquires a shared latch on the child and then releases the parent latch. This continues until the leaf node is reached, which by that point is the only node on which the current thread holds a shared latch. Operations which update the tree acquire exclusive latches as necessary.
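The descent can be sketched roughly as follows, using the standard ReentrantReadWriteLock in place of TuplDB's own latch class. The Node, Tree, and childFor names are hypothetical stand-ins for illustration, not the actual implementation.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Minimal latch-coupling sketch. Node, childFor, and isLeaf are hypothetical
// stand-ins; TuplDB's real Node and Latch classes are more involved.
final class Node {
    final ReentrantReadWriteLock latch = new ReentrantReadWriteLock();
    Node[] children; // null for leaf nodes

    boolean isLeaf() {
        return children == null;
    }

    // Select the child which covers the given key (search details elided).
    Node childFor(byte[] key) {
        return children[0];
    }
}

final class Tree {
    Node root;

    // Descend from the root to a leaf, holding at most two shared latches at
    // any time: the parent's and the child's ("crabbing"). The leaf is
    // returned with its shared latch still held; the caller releases it.
    Node findLeaf(byte[] key) {
        Node node = root;
        node.latch.readLock().lock();
        while (!node.isLeaf()) {
            Node child = node.childFor(key);
            child.latch.readLock().lock();  // latch the child first...
            node.latch.readLock().unlock(); // ...then release the parent
            node = child;
        }
        return node;
    }
}
```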

A bottleneck in a typical latch implementation is the field which tracks the number of threads which have shared access. It becomes a point of contention as multiple threads attempt to update it concurrently. The root node in particular experiences the most contention, and deeper nodes experience less contention, unless all threads are searching for the same thing.

One strategy to reduce the contention is to stripe the state among multiple slots, each in a separate cache line. This is extremely expensive from a memory overhead standpoint, considering that every in-memory node has its own latch. To support contended latches without high memory overhead, TuplDB has a special Clutch class, which stripes the state among sharable "packs". Each pack has a fixed number of integer slots, determined by the number of available CPU cores.

The clutch ordinarily acts like a regular latch, but it's adaptive. When contention is detected, it switches to a special contended mode. Later, when exclusive access is requested, the clutch reverts to the regular non-contended mode. When in contended mode, each thread acquiring shared access randomly selects a slot and stores its state there. The thread remembers which slot was selected by using a thread-local variable. If any contention for the slot is detected (due to a CAS failure), the thread randomly selects another slot. The Java LongAdder class uses a similar technique.
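Below is a rough sketch of only the contended-mode bookkeeping, modeled on the description above and on LongAdder. The class and method names are hypothetical; the real Clutch additionally implements the latch protocol itself, the switch between modes, and the sharing of packs among many clutch instances.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical sketch of striped shared-access counting in contended mode.
final class StripedSharedCount {
    // One logical slot per 16 ints (64 bytes), keeping slots on separate
    // cache lines. The slot count is sized from the number of CPU cores.
    private static final int SPACING = 16;
    private final AtomicIntegerArray slots;
    private final int slotCount;

    // Remembers which slot this thread last used successfully.
    private final ThreadLocal<Integer> localSlot =
        ThreadLocal.withInitial(() -> ThreadLocalRandom.current().nextInt());

    StripedSharedCount() {
        slotCount = Math.max(1, Runtime.getRuntime().availableProcessors());
        slots = new AtomicIntegerArray(slotCount * SPACING);
    }

    // Register one shared acquisition. On a CAS failure (another thread hit
    // the same slot), move to a randomly chosen slot and try again.
    void acquireShared() {
        int slot = Math.floorMod(localSlot.get(), slotCount);
        while (true) {
            int index = slot * SPACING;
            int count = slots.get(index);
            if (slots.compareAndSet(index, count, count + 1)) {
                localSlot.set(slot);
                return;
            }
            slot = ThreadLocalRandom.current().nextInt(slotCount);
        }
    }

    // Release one shared acquisition from the thread's remembered slot. An
    // individual slot can go negative if the thread moved slots in between,
    // but the sum over all slots remains correct.
    void releaseShared() {
        int index = Math.floorMod(localSlot.get(), slotCount) * SPACING;
        slots.getAndDecrement(index);
    }

    // An exclusive acquisition must observe that no shared holders remain,
    // which requires summing all the slots.
    boolean noSharedHolders() {
        int sum = 0;
        for (int i = 0; i < slotCount; i++) {
            sum += slots.get(i * SPACING);
        }
        return sum == 0;
    }
}
```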

Performance

To test the performance of the clutch compared to an ordinary latch, an in-memory TuplDB instance is filled with 1 billion 8-byte keys which map to empty values. The test machine has eight physical CPU cores, and eight threads are launched which randomly read over a range of values, centered around the middle. The range of values is a percentage of the total 1 billion range, and it starts small and ramps up to 25%. Small ranges have high cache locality, but they also have high latch contention.
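A simplified, single-threaded sketch of this kind of workload against the Tupl API is shown below, assuming the org.cojen.tupl package, a scaled-down key count, and default cache sizing. The actual benchmark harness, configuration, and thread setup are not shown.

```java
import java.util.concurrent.ThreadLocalRandom;
import org.cojen.tupl.Database;
import org.cojen.tupl.DatabaseConfig;
import org.cojen.tupl.Index;

// Illustrative workload only, not the actual benchmark. The key count is
// scaled down from 1 billion, and the test ran eight reader threads.
public class RangedReadWorkload {
    public static void main(String[] args) throws Exception {
        // With no base file configured, the database is non-durable (in-memory).
        Database db = Database.open(new DatabaseConfig());
        Index ix = db.openIndex("test");

        final long total = 10_000_000L; // the original test used 1 billion
        byte[] empty = new byte[0];
        for (long i = 0; i < total; i++) {
            ix.store(null, keyFor(i), empty);
        }

        // Read randomly over a range centered on the middle of the key space.
        double rangePct = 0.25;
        long rangeSize = (long) (total * rangePct);
        long start = (total - rangeSize) / 2;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < 1_000_000; i++) {
            long k = start + (long) (rnd.nextDouble() * rangeSize);
            ix.load(null, keyFor(k));
        }

        db.close();
    }

    // Encode a long as an 8-byte big-endian key, preserving sort order.
    static byte[] keyFor(long v) {
        byte[] key = new byte[8];
        for (int i = 7; i >= 0; i--) {
            key[i] = (byte) v;
            v >>>= 8;
        }
        return key;
    }
}
```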

[Chart: Clutch vs Latch]

As can be seen in the chart, the latch performance actually increases as the range percentage increases. This is because as the search range gets larger, the threads spread out over many different internal B-tree nodes and experience less contention. As the range gets even larger, performance dips due to reduced CPU cache locality, since data must be fetched from main memory more often.

The clutch is much less affected by contention, and the chart shows substantially higher performance when cache locality is highest. Even as performance dips due to reduced cache locality, the clutch still outperforms the latch. Although not shown on the chart, the clutch maintains its performance advantage even when the range is at 100%. With this range, reads are completely random, and so they experience the worst cache locality. But the internal nodes remain in the CPU cache, and they're still contended. It's for this reason that the clutch always outperforms a regular latch.
