From 15e437a8aac01ddee1dac0be5192c4b4668fbadb Mon Sep 17 00:00:00 2001
From: Al Cutter
Date: Wed, 21 Aug 2024 10:48:45 +0100
Subject: [PATCH] Design docs (#155)

---
 docs/design/README.md                         |  16 +++
 .../DESIGN.md => docs/design/gcp_storage.md   |   9 +-
 .../DESIGN.md => docs/design/mysql_storage.md |   0
 .../DESIGN.md => docs/design/posix_storage.md |   0
 docs/philosophy.md                            | 111 ++++++++++++++++++
 5 files changed, 132 insertions(+), 4 deletions(-)
 create mode 100644 docs/design/README.md
 rename storage/gcp/DESIGN.md => docs/design/gcp_storage.md (86%)
 rename storage/mysql/DESIGN.md => docs/design/mysql_storage.md (100%)
 rename storage/posix/DESIGN.md => docs/design/posix_storage.md (100%)
 create mode 100644 docs/philosophy.md

diff --git a/docs/design/README.md b/docs/design/README.md
new file mode 100644
index 00000000..7afe00a3
--- /dev/null
+++ b/docs/design/README.md
@@ -0,0 +1,16 @@
+# Design docs
+
+This directory contains design documentation for Tessera.
+
+It's probably wise to start with the [philosophy](../philosophy.md) doc, as it provides
+the context for the approach and design trade-offs made herein.
+
+## Storage
+
+Tessera supports multiple backend storage implementations; each of these has an associated
+"one-pager" design doc:
+
+* [GCP](gcp_storage.md)
+* [MySQL](mysql_storage.md)
+* [POSIX filesystem](posix_storage.md)
+
diff --git a/storage/gcp/DESIGN.md b/docs/design/gcp_storage.md
similarity index 86%
rename from storage/gcp/DESIGN.md
rename to docs/design/gcp_storage.md
index 263a747c..bee47c1f 100644
--- a/storage/gcp/DESIGN.md
+++ b/docs/design/gcp_storage.md
@@ -34,7 +34,7 @@ This table is used to coordinate integration of sequenced batches in the `Seq` t
 
 ## Life of a leaf
 
 1. Leaves are submitted by the binary built using Tessera via a call the storage's `Add` func.
-1. Dupe squashing (TODO): look for existing `internal/seqByHash/` object, read assigned sequence number if present and return.
+1. Dupe squashing (TODO): look for existing `` object, read assigned sequence number if present and return.
 1. The storage library batches these entries up, and, after a configurable period of time has elapsed
    or the batch reaches a configurable size threshold, the batch is written to the `Seq` table which
    effectively assigns a sequence numbers to the entries using the following algorithm:
@@ -52,12 +52,13 @@ This table is used to coordinate integration of sequenced batches in the `Seq` t
 1. Delete consumed batches from `Seq`
 1. Update `IntCoord` with `seq+=num_entries_integrated`
 1. Dupe detection (TODO):
-    1. Writes out internal/seqByHash/ containing the leaf's sequence number
+    1. Writes out `` containing the leaf's sequence number
 
 ## Dedup
 
-This currently uses GCS to store the hash -> index mapping in individual files, but it may make sense to explore a
-paging scheme to reduce the number of objects, or store the index mapping elsewhere.
+An experimental implementation has been tested which uses Spanner to store the `` --> `sequence`
+mapping. This works well using "slack" Spanner CPU available in the smallest possible footprint, and is
+consequently comparatively cheap, requiring only the extra Spanner storage costs.
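+
+As a rough illustration only (this is *not* Tessera's actual schema or code), a Spanner-backed
+hash-to-sequence mapping could be read and written from Go along the following lines; the `Dedup`
+table name, its columns, and the helper names below are assumptions made purely for this sketch:
+
+```go
+// Package dedup sketches a hypothetical Spanner-backed identity-hash -> sequence mapping.
+package dedup
+
+import (
+	"context"
+	"fmt"
+
+	"cloud.google.com/go/spanner"
+	"google.golang.org/grpc/codes"
+)
+
+// store wraps a Spanner client used to persist the mapping.
+type store struct {
+	client *spanner.Client
+}
+
+// set records the sequence number assigned to the entry with the given identity hash.
+// InsertOrUpdate keeps the write idempotent if the same entry is re-submitted.
+func (s *store) set(ctx context.Context, identityHash []byte, seq int64) error {
+	_, err := s.client.Apply(ctx, []*spanner.Mutation{
+		spanner.InsertOrUpdate("Dedup",
+			[]string{"IdentityHash", "Seq"},
+			[]interface{}{identityHash, seq}),
+	})
+	return err
+}
+
+// get returns the previously assigned sequence number for the identity hash, if any.
+func (s *store) get(ctx context.Context, identityHash []byte) (int64, bool, error) {
+	row, err := s.client.Single().ReadRow(ctx, "Dedup", spanner.Key{identityHash}, []string{"Seq"})
+	if spanner.ErrCode(err) == codes.NotFound {
+		return 0, false, nil
+	}
+	if err != nil {
+		return 0, false, fmt.Errorf("ReadRow: %v", err)
+	}
+	var seq int64
+	if err := row.Columns(&seq); err != nil {
+		return 0, false, err
+	}
+	return seq, true, nil
+}
+```
+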
 ### Alternatives considered

diff --git a/storage/mysql/DESIGN.md b/docs/design/mysql_storage.md
similarity index 100%
rename from storage/mysql/DESIGN.md
rename to docs/design/mysql_storage.md
diff --git a/storage/posix/DESIGN.md b/docs/design/posix_storage.md
similarity index 100%
rename from storage/posix/DESIGN.md
rename to docs/design/posix_storage.md
diff --git a/docs/philosophy.md b/docs/philosophy.md
new file mode 100644
index 00000000..3e5728fe
--- /dev/null
+++ b/docs/philosophy.md
@@ -0,0 +1,111 @@
+
+
+## Objective
+
+This document explains the rationale behind some of the philosophy and design choices underpinning Trillian Tessera.
+
+
+## Simplicity
+
+Tessera is intended to be simple to use, adopt, and maintain, and cheaper/easier to operate than Trillian v1.
+
+There are many tensions and trade-offs here, and while there is no guarantee that a single "right answer"
+exists, we are shooting for an MVP, and must hold ourselves accountable whenever we're adding cost, complexity,
+or [speculative abstractions](https://100go.co/#interface-pollution-5) - _"is the driver for this something
+we *really need now*?"_ - or otherwise restricting our ability to make large internal changes in the future.
+
+
+## Multi-implementation storage
+
+Each storage implementation for Trillian Tessera is independently implemented, and takes the most "native"
+approach for the infrastructure it targets.
+
+Trillian v1 defined `LogStorage` and embedded `TreeStorage` interfaces which all storage implementations had
+to implement. These interfaces were created early, reflected implementation details of a small sampling of largely
+similar storage implementations, and consequently turned out not to be a clean abstraction of what was _actually_
+desired by higher levels in the stack. In turn, this made it hard to:
+
+1. Support non-single-domain/non-transactional storage implementations, and
+2. Refactor storage internals to improve performance.
+
+With Trillian Tessera, we are learning from these mistakes, and acknowledging that:
+
+1. The different storage implementations we are building now, and those which will come in the future, have their
+   own unique properties which stem from the infrastructure they're built upon - e.g. _some_ infrastructure offers
+   rich transactional semantics over multiple entities, while others offer only check-and-set semantics.
+2. We don't _necessarily_ need to use the more expensive transactional storage to serve reads.
+3. Prematurely binding together different storage implementations which _appear_ similar today (e.g. through
+   inappropriate code reuse, shared interfaces, structures, etc.) can lead to headaches later if we find we need
+   to make structural changes.
+
+For at least the early versions of Tessera, it is an explicit non-goal to try to reuse code between storage
+implementations. Attempting to do this so early in the project lifecycle opens us up to the same pitfalls described
+above, and any perceived benefits from this approach are unlikely to be worth the risk; storage implementations are
+expected to be relatively small in terms of LoC and complexity.
+
+
+## Asynchronous integration in storage implementations
+
+In Trillian v1, the only supported mechanism for adding entries to a log was via a fully decoupled queue: the
+caller requesting the addition of the entry was given nothing more than a timestamp and a promise that the entry
+would be integrated at some point (note that 24h is the CT _policy_, but there's no specific parameter or deadline
+in Trillian itself - it's _"as soon as possible"_).
+
+With Trillian Tessera, we're tightening the storage contract up so that calls to add entries to the log will
+return with a durably assigned sequence number, or an error.
+
+It's not a requirement that the Merkle tree has already been extended to cryptographically commit to the new leaf
+by the time the call to add returns, although it _is_ expected that this process will take place within a short
+window (e.g. seconds).
+
+This API represents a reasonable set of trade-offs:
+
+1. Keeping sequencing and integration separate enables:
+   1. Storage to be implemented in the way which works best given the particular constraints of that
+      infrastructure.
+   2. Higher write-throughput.
+      * E.g. bucket-type storage typically has round-trip read/write latency far higher than a DBMS.
+        From our experiments, typically, a transactional DBMS is used for coordination and sequencing, and the
+        slower, cheaper bucket storage is used for serving the tree.
+
+        Sequencing, which has typically been done within the DBMS only, is fast. Integration, however, must
+        update the buckets with new tree structure, leaves, checkpoint, etc., and is by far the slower
+        operation of the two.
+
+        Coupling <sequence>-<integrate> within a call to add entries to the log (even if batched via a
+        local pool in a single frontend) requires blocking updates to the tree for the long-pole integration
+        duration.
+
+        Allowing <sequence> operations to happen asynchronously to <integration> enables sequencing to
+        proceed from multiple frontend pools concurrently with any integration operation (at least until some
+        back-pressure is applied), which, in turn, allows the <integration> operation to potentially
+        amortise the long-pole cost over a larger number of sequenced entries.
+2. Limiting the intended window between <sequence> and <integrate> operations to a low
+   single-digit-seconds target enables a synchronous add API to be constructed at the layer above (i.e. within
+   the "personality" that is built using Trillian Tessera).
+
+   This approach enables:
+   1. Synchronous "personalities" to benefit from improved write-throughput (compared with a naive
+      <sequence>-<integrate> approach) at the cost of a small increase in latency.
+   2. Other "personalities" to avoid paying the cost of a synchronous operation they do not require.
+
+Back-pressure for `Writer` requests to the storage implementation will be important to maintain the very short
+window between sequence numbers being assigned to entries and those entries being committed to by the Merkle
+tree.
+
+
+## Resilience and availability
+
+Storage implementations should be designed such that multiple instances of a Tessera-backed personality can
+*safely* be run concurrently for any given log storage instance. The manner in which this is achieved is left
+for each storage implementation to decide, allowing for the simplest/most infrastructure-native mechanism to
+be used in each case; a hypothetical sketch of one such mechanism is shown below.
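+
+As a purely illustrative sketch (this is not Tessera's implementation), a GCS-based storage
+implementation might lean on object generation preconditions as its infrastructure-native
+check-and-set primitive, so that two concurrent instances cannot silently overwrite each
+other's work; the function and object names below are assumptions:
+
+```go
+// Package coord sketches a hypothetical check-and-set update of a small coordination
+// object (e.g. a checkpoint) held in GCS.
+package coord
+
+import (
+	"context"
+	"fmt"
+	"io"
+
+	"cloud.google.com/go/storage"
+)
+
+// casUpdate rewrites the named object only if its generation is unchanged since it was
+// read; if another instance updated it first, the write fails and the caller can retry.
+func casUpdate(ctx context.Context, bkt *storage.BucketHandle, name string, update func(old []byte) []byte) error {
+	obj := bkt.Object(name)
+	r, err := obj.NewReader(ctx)
+	if err != nil {
+		return fmt.Errorf("read %q: %w", name, err)
+	}
+	old, err := io.ReadAll(r)
+	r.Close()
+	if err != nil {
+		return fmt.Errorf("read %q: %w", name, err)
+	}
+	// Only write if the object generation still matches the one we just read.
+	w := obj.If(storage.Conditions{GenerationMatch: r.Attrs.Generation}).NewWriter(ctx)
+	if _, err := w.Write(update(old)); err != nil {
+		w.Close()
+		return fmt.Errorf("write %q: %w", name, err)
+	}
+	if err := w.Close(); err != nil {
+		// A precondition failure here means another instance got there first; re-read and retry.
+		return fmt.Errorf("concurrent update of %q: %w", name, err)
+	}
+	return nil
+}
+```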
+
+Having this property offers two benefits:
+
+* Tessera-backed logs can offer availability comparable to that of a similar log managed by Trillian v1 on
+  the same infrastructure.
+* Safety guard rails are in place against "silly" mistakes - such as "<up><up><enter>" and _copy-n-paste_
+  errors - which would otherwise result in accidentally launching multiple instances pointing at the same
+  storage configuration.
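+
+Finally, to make the asynchronous contract described above more concrete, the sketch below shows
+one way a "personality" might layer a synchronous add on top of a storage implementation which
+durably sequences entries immediately and integrates them shortly afterwards. None of the type or
+function names here are Tessera's actual API; they are assumptions made purely for illustration:
+
+```go
+// Package personality sketches a hypothetical synchronous add built on an asynchronous
+// sequence-then-integrate storage contract.
+package personality
+
+import (
+	"context"
+	"time"
+)
+
+// storage models the contract described above: Add durably assigns a sequence number (or
+// returns an error), while integration into the Merkle tree happens shortly afterwards.
+type storage interface {
+	Add(ctx context.Context, entry []byte) (uint64, error)
+	// IntegratedSize returns the tree size currently committed to by the latest checkpoint.
+	IntegratedSize(ctx context.Context) (uint64, error)
+}
+
+// syncAdd blocks until the entry has been both sequenced and integrated, trading a small
+// amount of latency for a synchronous API.
+func syncAdd(ctx context.Context, s storage, entry []byte) (uint64, error) {
+	idx, err := s.Add(ctx, entry)
+	if err != nil {
+		return 0, err
+	}
+	// Poll until the integrated tree size covers the assigned index. A production
+	// implementation would likely share a single poller across many concurrent waiters.
+	for {
+		size, err := s.IntegratedSize(ctx)
+		if err != nil {
+			return 0, err
+		}
+		if size > idx {
+			return idx, nil
+		}
+		select {
+		case <-ctx.Done():
+			return 0, ctx.Err()
+		case <-time.After(100 * time.Millisecond):
+		}
+	}
+}
+```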