Author: Martin Hutchinson
This document describes how the storage implementation for running Tessera on MySQL is intended to work.
- Cloud neutral deployments
- On-premise deployments
The DB layout has been designed such that serving any read request is a point lookup. This can be framed as each row in the database maps to a single file that would be created in the file-based implementations of Tessera.
A single row that records the current version of the Tessera schema and data compatibility.
A single row that records the current published checkpoint.
A single row that records the current state of the tree. Updated after every integration.
An internal tile consisting of hashes. There is one row for each internal tile, and this is updated until it is completed, at which point it is immutable.
The data committed to by the leaves of the tree. Follows the same evolution as Subtree.
Reads can scale horizontally with very little overhead or contention between frontends.
Writing is more complicated than reading for the following reasons:
- Each write request updates multiple rows across all tables.
- Write requests need to be globally coordinated to ensure every sequence number is allocated precisely once.
There can be an arbitrary number of frontends each receiving write traffic. Each of these has the following processes:
API Handlers:
- Handle the /add request, and extract the data to be added
- Add the data to a pool to be sequenced and block until this returns the index
Sequence pool:
- Accept requests from API handlers to add a new entry to the pool.
- If the pool is now of the maximum size, flush the current batch.
- If this is the first entry in the batch, then start a timer and flush the batch after a timeout.
- Flushing: starts a sequence & integrate operation.
Sequence & integrate (DB integration starts here):
- Takes a batch of entries to sequence and integrate
- Starts a transaction, which first takes a write lock on the
TreeState
row to ensure that:- No other processes will be competing with this work.
- That the next index to sequence is known (this is the same as the current tree size)
- Update the required TiledLeaves rows
- Perform an integration operation to update the Merkle tree, updating/adding Subtree rows as needed, and eventually updating the
TreeState
row - Commit the transaction
- Checkpoints representing the latest state of the tree are published at the configured interval.
Either all the money, or free. This could run as lightly as fitting inside a free-tier GCE VM, or scale up to a Cloud SQL instance that costs a hefty sum each month. These prices could be estimated based on QPS. It is a lot harder to estimate the price when physical machines are owned in an on-prem deployment.
For this implementation, MySQL 8 was picked as the RDBMS. Other options considered were PostgreSQL and CockroachDB. The choice was somewhat arbitrary, and any of these solutions could be reasonably justified. My justification for picking MySQL is that it is more ubiquitous than Cockroach, and writes scale better than PostgreSQL. The implementation uses pretty standard SQL, so it isn’t envisaged that switching implementation would be insurmountable. That said, testing has only been performed with MySQL 8.