From 5b604e61078a777195e7394587f3d2b9af433b79 Mon Sep 17 00:00:00 2001
From: Al Cutter <al@google.com>
Date: Mon, 22 Jul 2024 17:26:37 +0100
Subject: [PATCH] Add POSIX design overview

---
 storage/posix/DESIGN.md | 43 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 storage/posix/DESIGN.md

diff --git a/storage/posix/DESIGN.md b/storage/posix/DESIGN.md
new file mode 100644
index 00000000..a41a3005
--- /dev/null
+++ b/storage/posix/DESIGN.md
@@ -0,0 +1,43 @@
+# POSIX Design
+
+This document describes how the storage implementation for running Tessera on a POSIX-compliant filesystem
+is intended to work.
+
+## Overview
+
+POSIX provides for a small number of atomic operations on compliant filesystems.
+
+This design leverages those to safely maintain a Merkle tree log on disk, in a format
+which can be exposed directly via a read-only endpoint to clients of the log (for example,
+using `nginx` or similar).
+
+In contrast with some of the other storage backends, sequencing and integration of entries into
+the tree is synchronous.
+
+## Life of a leaf
+
+In the description below, when we talk about writing to files, either appending or creating new ones,
+the _actual_ process used always follows this pattern:
+1. Create a temporary file on the same filesystem as the target location.
+1. If we're appending data, copy the existing contents of the target location (the prefix) into the temporary file.
+1. Write any new/additional data into the temporary file.
+1. Close the temporary file.
+1. Rename the temporary file into the target location.
+
+The final step in the dance above is atomic according to the POSIX spec, so in performing this sequence
+of actions we can avoid corrupt or partially written files being part of the tree.
+
+1. Leaves are submitted by the binary built using Tessera via a call to the storage's `Add` func.
+1. The storage library batches these entries up, and, after a configurable period of time has elapsed
+   or the batch reaches a configurable size threshold, the batch is sequenced and integrated into the tree:
+   1. An advisory lock is taken on a file adjacent to the `checkpoint` file.
+      This helps prevent multiple frontends from stepping on each other, but isn't necessary for safety.
+   1. Flushed entries are assigned contiguous sequence numbers, and written out into entry bundle files.
+   1. Newly added leaves are integrated into the Merkle tree, and tiles are written out as files.
+   1. The `checkpoint` file is updated with the new state.
+
+## Filesystems
+
+This implementation has been somewhat tested on both a local `ext4` filesystem and on a distributed
+[CephFS](https://docs.ceph.com/en/reef/cephfs/) instance on GCP, in both cases with multiple
+personality binaries attempting to add new entries concurrently.
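
The five-step write pattern described under "Life of a leaf" can be illustrated with a minimal Go sketch. This is not Tessera's actual code: `atomicAppend` and `demo` are hypothetical names introduced here for illustration, and the sketch assumes the target's directory is writable so that the temporary file lands on the same filesystem as the target.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// atomicAppend is a hypothetical helper (not part of Tessera's API) that
// appends extra to the file at target using the temp-file-and-rename pattern.
func atomicAppend(target string, extra []byte) error {
	// Step 1: create the temporary file next to the target, which keeps it
	// on the same filesystem so the final rename stays a single operation.
	tmp, err := os.CreateTemp(filepath.Dir(target), ".tmp-*")
	if err != nil {
		return err
	}
	// Clean up if we fail part-way; after a successful rename the temporary
	// name no longer exists and this Remove fails harmlessly.
	defer os.Remove(tmp.Name())

	// Step 2: copy the existing contents (the prefix), if the target exists.
	prefix, err := os.ReadFile(target)
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(prefix); err != nil {
		tmp.Close()
		return err
	}
	// Step 3: write the new data after the prefix.
	if _, err := tmp.Write(extra); err != nil {
		tmp.Close()
		return err
	}
	// Step 4: close the temporary file.
	if err := tmp.Close(); err != nil {
		return err
	}
	// Step 5: rename into place; readers see either the old or the new file.
	return os.Rename(tmp.Name(), target)
}

// demo appends two entries to a fresh file in a scratch directory and
// returns the resulting file contents.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "posix-design-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "bundle")
	for _, entry := range []string{"leaf-1\n", "leaf-2\n"} {
		if err := atomicAppend(target, []byte(entry)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	contents, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(contents)
}
```

Note that `os.CreateTemp` creates the file with mode `0600`; a log served by a separate read-only process would likely need the permissions widened before the rename, which this sketch leaves out.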