From a5e7547859aadbcba823fb48c6b91caba4953d97 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Cain=C3=A3=20Costa?= Date: Thu, 25 Jul 2024 12:21:34 -0300 Subject: [PATCH] chore(doc): expand viewstamped replication concept --- docs/concepts/viewstamped-replication.md | 135 ++++++++++++++++++++--- 1 file changed, 121 insertions(+), 14 deletions(-) diff --git a/docs/concepts/viewstamped-replication.md b/docs/concepts/viewstamped-replication.md index d6f7bce..ed32e2e 100644 --- a/docs/concepts/viewstamped-replication.md +++ b/docs/concepts/viewstamped-replication.md @@ -18,6 +18,27 @@ VR comprises three main sub-protocols: 2. View changes to select a new primary 3. Recovery of failed replicas +The quorum size $Q$ for a replica group of size $N = 2f+1$ is defined as: + +$$ +Q = f + 1 = \left\lfloor\frac{N}{2}\right\rfloor + 1 +$$ + +This ensures that any two quorums have at least one replica in common, allowing the system to maintain consistency across view changes. + +### Replica State + +Each replica maintains the following state: + +- Configuration: A sorted array containing the IP addresses of each replica +- Replica number: The index of this replica in the configuration +- View-number: The current view number, initially 0 +- Status: Either normal, view-change, or recovering +- Op-number: The number assigned to the most recently received request +- Log: An array containing entries for all requests received +- Commit-number: The op-number of the most recently committed operation +- Client-table: Records for each client the number of its most recent request and the result + ## Normal Operation During normal operation, when the primary is not faulty and all participating replicas are in the same view, the protocol proceeds as follows: @@ -99,27 +120,62 @@ The recovery protocol allows failed replicas to rejoin the group with an up-to-d This process ensures that recovered replicas rejoin the group with a consistent and up-to-date state. -## Optimizations +## Handling Non-deterministic Operations -Several optimizations can significantly improve VR performance: +Some operations may be non-deterministic, such as reading the current time. To ensure consistency across replicas: -### Witnesses -Use $f$ witness replicas that do not store full state or execute operations. This reduces resource requirements while maintaining fault tolerance. +1. The primary predicts the result of the non-deterministic operation. +2. The predicted value is included in the PREPARE message sent to backups. +3. All replicas use the predicted value when executing the operation. + +This approach ensures that all replicas produce the same state changes for non-deterministic operations. + +## Client Recovery + +If a client crashes and recovers, it must ensure that its next request has a higher request number than any previous requests. The client recovery process works as follows: + +1. The recovering client contacts the replicas to fetch its latest request number. +2. The client adds 2 to this number to create its new request number. +3. This ensures uniqueness even if the client's last request before crashing was still in transit. + +## Optimizations ### Batching Process multiple client requests in a single protocol round. This improves throughput, especially under high load. The throughput improvement can be modeled as: $$ -\text{Throughput}_{\text{batched}} = \frac{\text{Batch Size}}{\text{Protocol Round Time}} \quad \text{requests/second} +\text{Throughput}_{\text{batched}} = \frac{B}{\tau_r + B \cdot \tau_p} \quad \text{requests/second} $$ +Where: +- $B$ is the batch size +- $\tau_r$ is the round-trip time for a single request +- $\tau_p$ is the per-request processing time + ### Fast Reads -Allow the primary to execute read-only operations without consulting other replicas. Use leases to ensure consistency, preventing stale reads after view changes. The lease duration $T$ should satisfy: +Allow the primary to execute read-only operations without consulting other replicas. Use leases to ensure consistency, preventing stale reads after view changes. The lease duration $T_l$ should satisfy: + +$$ +T_l < \frac{T_v}{2} - \delta +$$ + +Where: +- $T_v$ is the view change timeout +- $\delta$ is the maximum clock skew between replicas + +The view change timeout $T_v$ should be set to: $$ -T < \frac{\text{View Change Timeout}}{2} +T_v > 2 \cdot (RTT_{\text{max}} + \tau_{\text{proc}}) $$ +Where: +- $RTT_{\text{max}}$ is the maximum round-trip time between any two replicas +- $\tau_{\text{proc}}$ is the maximum processing time for a view change message + +### Witnesses +Use $f$ witness replicas that do not store full state or execute operations. This reduces resource requirements while maintaining fault tolerance. + ### Checkpoints Periodically create snapshots of application state. This speeds up recovery and allows for log truncation, reducing storage requirements. The storage savings can be estimated as: @@ -134,9 +190,9 @@ Keep a prefix of the log on disk and push updates to disk in the background. Thi When implementing VR, consider the following: -- Use efficient data structures for the operation log and client table. For example, implement the client table using an in-memory key-value store. +- Use efficient data structures for the operation log and client table. For example, implement the client table using an in-memory key-value store like go-cache. -- Implement proper concurrency control to handle simultaneous client requests and protocol messages. Use techniques like buffered channels or thread-safe queues to manage incoming requests. +- Implement proper concurrency control to handle simultaneous client requests and protocol messages. Use techniques like buffered channels or thread-safe queues to manage incoming requests. In Go, channels provide an excellent mechanism for communication between threads. - Design the system to gracefully handle network partitions and message reordering. Implement timeouts and retries for all network communications. @@ -144,19 +200,70 @@ When implementing VR, consider the following: - Implement state transfer protocols to efficiently synchronize replica state. Use techniques like Merkle trees to minimize the amount of data transferred during recovery. The efficiency of Merkle trees can be expressed as: + $$ + \text{Data Transferred} = O(\log N \cdot \text{Diff Size}) + $$ + + Where $N$ is the total number of state elements and Diff Size is the number of different elements between replicas. + +- Carefully manage view numbers and operation numbers to ensure uniqueness and proper ordering across view changes. The global order of operations can be expressed as: + + $$ + \text{Global Order} = V \cdot M + O + $$ + + Where: + - $V$ is the view number + - $M$ is the maximum number of operations allowed per view + - $O$ is the operation number within the current view + + This ordering ensures that operations from newer views always have a higher global order than operations from older views, even if the operation numbers overlap. + +- Implement proper error handling and logging to facilitate debugging and system monitoring. + +- Consider the impact of various factors on system performance: + - Network latency between replicas + - Size of the replica group + - Frequency of client requests + - Size of client requests and responses + - Disk I/O performance (if used for logging or checkpoints) + +## Reconfiguration + +While not part of the core protocol, VR can be extended to support reconfiguration, allowing the membership of the replica group to change over time. This is useful for replacing failed nodes or adjusting the group size to handle changing failure rates. The reconfiguration process involves: + +1. A special client request to initiate reconfiguration +2. Processing this request through the normal case protocol +3. Transitioning to a new epoch with the updated configuration +4. Transferring state to new replicas before they become active + +Implementing reconfiguration adds complexity but is essential for long-running systems. + +## Performance Modeling + +The performance of a VR system can be modeled using queueing theory. Assuming a M/M/1 queue model for simplicity, the average response time $R$ for a client request can be estimated as: + $$ -\text{Data Transferred} = O(\log N \cdot \text{Diff Size}) +R = \frac{1}{\mu - \lambda} $$ - Where $N$ is the total number of state elements and Diff Size is the number of different elements between replicas. +Where: +- $\lambda$ is the average arrival rate of client requests +- $\mu$ is the service rate of the system (requests processed per second) -- Carefully manage view numbers and operation numbers to ensure uniqueness and proper ordering across view changes. The relationship between view numbers and operation numbers can be expressed as: +The service rate $\mu$ depends on various factors, including: $$ -\text{Global Order} = \text{View Number} \cdot \text{Max Operations Per View} + \text{Operation Number} +\mu = \min\left(\frac{1}{\tau_p}, \frac{1}{RTT + \tau_{\text{prep}}}, \frac{1}{\tau_{\text{disk}}}\right) $$ -- Implement proper error handling and logging to facilitate debugging and system monitoring. +Where: +- $\tau_p$ is the per-request processing time +- $RTT$ is the average round-trip time between replicas +- $\tau_{\text{prep}}$ is the time to prepare and send PREPARE messages +- $\tau_{\text{disk}}$ is the average disk I/O time (if applicable) + +This model can help in capacity planning and identifying bottlenecks in the system. ## Conclusion