Add new ParanoidPlus durability level, which doesn't require repair
To avoid repair, Durability::ParanoidPlus commits need to save the
allocator state somewhere.  We can't use the region headers, because
we'd be overwriting them in place; we might crash partway through the
overwrite, and then we'd need repair.  So we instead save the allocator
state to a new table in the system tree.  Writing to the table is
slightly tricky, because it needs to be done without allocating (see
below), but other than that it's a perfectly ordinary transactional
write with all the usual guarantees.
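
As a rough illustration (hypothetical names, with a std `BTreeMap` standing in for the real system-tree table), the saved state is just a handful of key/value entries, one per regional allocator plus one for the region tracker:

```rust
use std::collections::BTreeMap;

// Hypothetical key type; the real table lives in redb's system tree.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum AllocatorStateKey {
    Region(u32),
    RegionTracker,
}

// Stand-in for the allocator state table: a ParanoidPlus commit writes one
// entry per regional allocator plus one for the region tracker, as ordinary
// transactional key/value pairs.
fn save_allocator_state(
    table: &mut BTreeMap<AllocatorStateKey, Vec<u8>>,
    region_allocators: &[Vec<u8>], // serialized per-region allocator state
    region_tracker: &[u8],         // serialized region tracker
) {
    for (i, state) in region_allocators.iter().enumerate() {
        table.insert(AllocatorStateKey::Region(i as u32), state.clone());
    }
    table.insert(AllocatorStateKey::RegionTracker, region_tracker.to_vec());
}
```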

The other requirement to avoid repair is knowing whether the last
transaction used 2-phase commit.  For this, we add a new two_phase_commit
bit to the god byte, which is always updated atomically along with
swapping the primary bit.  Old redb versions will ignore the new flag
when reading and clear it when writing, which is exactly what we want.
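
Sketch of the god byte layout and the atomic update (bit values and constant names are illustrative, taken from the design doc's description rather than the actual source):

```rust
// God-byte layout as described in docs/design.md.
const PRIMARY_BIT: u8 = 0b0000_0001;       // which commit slot is primary
const RECOVERY_REQUIRED: u8 = 0b0000_0010; // repair needed on next open
const TWO_PHASE_COMMIT: u8 = 0b0000_0100;  // primary was written with 2-phase commit

// Swapping the primary and recording whether 2-phase commit was used happens
// in a single one-byte write, so the two can never be observed out of sync.
fn swap_primary(god_byte: u8, used_two_phase: bool) -> u8 {
    let mut new = god_byte ^ PRIMARY_BIT;
    if used_two_phase {
        new |= TWO_PHASE_COMMIT;
    } else {
        new &= !TWO_PHASE_COMMIT;
    }
    new
}
```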

This turns out to also fix a longstanding bug where Durability::Paranoid
hasn't been providing any security benefit at all.  The checksum forgery
attack described in the Durability::Immediate documentation actually
works equally well against Durability::Paranoid!  The problem is that even
though 2-phase commit guarantees the primary is valid, redb ignores the
primary flag when repairing.  It always picks whichever commit slot is
newer, as long as the checksum is valid.  So if you crash partway through
a commit, it'll try to recover using the partially-written secondary
rather than the fully-written primary, regardless of the durability mode.

The fix for this is exactly the two_phase_commit bit described above.
After a crash, we check whether the last transaction used 2-phase commit;
if so, we only look at the primary (which is guaranteed to be valid)
and ignore the secondary.  Durability::ParanoidPlus needs this check
anyway for safety, so we get the Durability::Paranoid bug fix for free.
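
An illustrative sketch of the recovery decision, not the actual redb code:

```rust
// One of the two commit slots in the super header.
struct Slot {
    transaction_id: u64,
    checksum_valid: bool,
}

// Returns the index (0 = primary, 1 = secondary) of the slot to recover from.
fn choose_recovery_slot(primary: &Slot, secondary: &Slot, two_phase_commit: bool) -> usize {
    if two_phase_commit {
        // The last commit was 2-phase, so the primary is guaranteed valid;
        // never fall back to a possibly half-written secondary.
        return 0;
    }
    // Behavior for 1-phase commits: prefer the newest slot whose checksum verifies.
    match (primary.checksum_valid, secondary.checksum_valid) {
        (true, false) => 0,
        (false, true) => 1,
        _ => {
            if secondary.transaction_id > primary.transaction_id {
                1
            } else {
                0
            }
        }
    }
}
```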

To write to the allocator state table without allocating, I've introduced
a new insert_inplace() function.  It's similar to insert_reserve(),
but more general and maybe simpler.  To use it, you have to first do an
ordinary insert() with your desired key and a value of the appropriate
length; then later in the same transaction you can call insert_inplace()
to replace the value with a new one.  Unlike insert_reserve(), this works
with values that don't implement MutInPlaceValue, and it lets you hold
multiple reservations simultaneously.
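
A toy sketch of the reserve-then-overwrite pattern, using a flat buffer instead of redb's real btree; the insert_inplace() shown here only mimics the private function's contract:

```rust
use std::collections::HashMap;

// Toy stand-in for a table: values live in one flat buffer, so an in-place
// update must not change a value's length or position.
struct ToyTable {
    buf: Vec<u8>,
    index: HashMap<u64, (usize, usize)>, // key -> (offset, len)
}

impl ToyTable {
    fn insert(&mut self, key: u64, value: &[u8]) {
        let offset = self.buf.len();
        self.buf.extend_from_slice(value);
        self.index.insert(key, (offset, value.len()));
    }

    // Overwrite an existing value without allocating; panics if the key is
    // missing or the new value has a different length, mirroring the
    // precondition-panic behavior described above.
    fn insert_inplace(&mut self, key: u64, value: &[u8]) {
        let (offset, len) = *self.index.get(&key).expect("insert() the key first");
        assert_eq!(len, value.len(), "value length must match the reservation");
        self.buf[offset..offset + len].copy_from_slice(value);
    }
}

fn main() {
    let mut table = ToyTable { buf: Vec::new(), index: HashMap::new() };
    // Step 1: ordinary insert with a placeholder of the final length.
    table.insert(7, &[0u8; 8]);
    // Step 2: later in the same "transaction", fill in the real value in place.
    table.insert_inplace(7, &42u64.to_le_bytes());
}
```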

insert_inplace() could be safely exposed to users, but I don't think
there's any reason to.  Since it doesn't give you a mutable reference,
there's no benefit over insert() unless you're storing data that cares
about its own position in the database.  So for now it's private, and I
haven't bothered making a new error type for it; it just panics if you
don't satisfy the preconditions.

The fuzzer is perfect for testing Durability::ParanoidPlus, because it
can simulate a crash, reopen the database (skipping repair if possible),
and then verify that the resulting allocator state exactly matches
what would happen if it ran a full repair.  I've updated the fuzzer
to generate Durability::ParanoidPlus commits along with the existing
Durability::None and Durability::Immediate.
mconst committed Nov 10, 2024
1 parent 25b55b6 commit f2e1eeb
Showing 11 changed files with 602 additions and 113 deletions.
29 changes: 22 additions & 7 deletions docs/design.md
@@ -88,12 +88,16 @@ controls which transaction pointer is the primary.
`magic number` must be set to the ASCII letters 'redb' followed by 0x1A, 0x0A, 0xA9, 0x0D, 0x0A. This sequence is
inspired by the PNG magic number.

`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing two flags:
`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing three flags:
* first bit: `primary_bit` flag which indicates whether transaction slot 0 or transaction slot 1 contains the latest commit.
redb relies on the fact that this is a single bit to perform atomic commits.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database.
During the recovery process, the region tracker and regional allocator states -- described below -- are reconstructed
by walking the btree from all active roots.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database. This can be
a full repair, in which the region tracker and regional allocator states -- described below -- are reconstructed by walking
the btree from all active roots, or a quick-repair, in which the state is simply loaded from the allocator state table.
* third bit: `two_phase_commit` flag, which indicates whether the transaction in the primary slot was written using 2-phase
commit. If so, the primary slot is guaranteed to be valid, and repair won't look at the secondary slot. This flag is always
updated atomically along with the primary bit.

redb relies on the fact that this is a single byte to perform atomic commits.

`page size` is the size of a redb page in bytes

@@ -155,7 +159,9 @@ changed during an upgrade.

### Region tracker
The region tracker is an array of `BtreeBitmap`s that tracks the page orders which are free in each region.
It is stored in a page in the data section of a region:
There are two different places it can be stored: on shutdown, it's written to a page in the data section of
a region, and when making a commit with `Durability::ParanoidPlus`, it's written to an entry in the allocator
state table. The former is valid only after a clean shutdown; the latter is usable even after a crash.
```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -216,6 +222,11 @@ range has been allocated
* n bytes: free index data
* n bytes: allocated data

Like the region tracker, there are two different places where the regional allocator state can be
stored. On shutdown, it's written to the region header as described above, and when making a commit
with `Durability::ParanoidPlus`, it's written to an entry in the allocator state table. The former
is valid only after a clean shutdown; the latter is usable even after a crash.

```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -456,7 +467,7 @@ exists. Then, (2) will be accomplished by moving all allocations from transactio
savepoint into the pending free state.

#### Database repair
To repair the database after an unclean shutdown we must:
To do a full repair after an unclean shutdown we must:
1) Update the super header to reference the last fully committed transaction
2) Update the allocator state, so that it is consistent with all the database roots in the above
transaction
@@ -472,6 +483,10 @@ All pages referenced by a savepoint must be contained in the above, because it i
a) referenced directly by the data, system, or freed tree -- i.e. it's a committed page
b) it is not referenced, in which case it is in the pending free state and is contained in the freed tree

Alternatively, we might be able to do a quick-repair. This is only possible if the last transaction
used 2-phase commit (so we know the primary slot is valid, without needing to walk the trees to verify
their checksums) and also saved its allocator state to the allocator state tree.

# Assumptions about underlying media
redb is designed to be safe even in the event of power failure or on poorly behaved media.
Therefore, we make only a few assumptions about the guarantees provided by the underlying filesystem:
12 changes: 11 additions & 1 deletion fuzz/fuzz_targets/common.rs
@@ -107,6 +107,16 @@ impl<const N: usize> Arbitrary<'_> for BoundedUSize<N> {
}
}

// We don't simulate fsync(), so it's not interesting to fuzz with Durability::Eventual
// or Durability::Paranoid (they're mostly equivalent to Durability::Immediate). But the
// other three levels are all worth testing
#[derive(Arbitrary, Debug, Clone, PartialEq)]
pub(crate) enum FuzzDurability {
None,
Immediate,
ParanoidPlus,
}

#[derive(Arbitrary, Debug, Clone)]
pub(crate) enum FuzzOperation {
Get {
@@ -163,7 +173,7 @@ pub(crate) enum FuzzOperation {
#[derive(Arbitrary, Debug, Clone)]
pub(crate) struct FuzzTransaction {
pub ops: Vec<FuzzOperation>,
pub durable: bool,
pub durability: FuzzDurability,
pub commit: bool,
pub create_ephemeral_savepoint: bool,
pub create_persistent_savepoint: bool,
16 changes: 9 additions & 7 deletions fuzz/fuzz_targets/fuzz_redb.rs
@@ -580,9 +580,11 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
// Disable IO error simulation while we update the transaction counter table
let old_countdown = countdown.swap(u64::MAX, Ordering::SeqCst);
let mut txn = db.begin_write().unwrap();
if !transaction.durable {
txn.set_durability(Durability::None);
}
txn.set_durability(match transaction.durability {
FuzzDurability::None => Durability::None,
FuzzDurability::Immediate => Durability::Immediate,
FuzzDurability::ParanoidPlus => Durability::ParanoidPlus,
});
let mut counter_table = txn.open_table(COUNTER_TABLE).unwrap();
let uncommitted_id = txn_id as u64 + 1;
counter_table.insert((), uncommitted_id)?;
@@ -627,9 +629,9 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
let commit_succeeded = last_committed == uncommitted_id;
if commit_succeeded {
assert!(transaction.commit);
savepoint_manager.commit(transaction.durable);
savepoint_manager.commit(transaction.durability != FuzzDurability::None);
non_durable_reference = uncommitted_reference;
if transaction.durable {
if transaction.durability != FuzzDurability::None {
reference = non_durable_reference.clone();
}
} else {
@@ -747,7 +749,7 @@ fn apply_crashable_transaction_multimap(txn: WriteTransaction, uncommitted_refer
}

if transaction.commit {
if transaction.durable {
if transaction.durability != FuzzDurability::None {
savepoints.gc_persistent_savepoints(&txn)?;
}
txn.commit()?;
@@ -767,7 +769,7 @@ fn apply_crashable_transaction(txn: WriteTransaction, uncommitted_reference: &mu
}

if transaction.commit {
if transaction.durable {
if transaction.durability != FuzzDurability::None {
savepoints.gc_persistent_savepoints(&txn)?;
}
txn.commit()?;
87 changes: 69 additions & 18 deletions src/db.rs
@@ -21,7 +21,9 @@ use std::sync::{Arc, Mutex};

use crate::error::TransactionError;
use crate::sealed::Sealed;
use crate::transactions::SAVEPOINT_TABLE;
use crate::transactions::{
AllocatorStateKey, AllocatorStateTree, ALLOCATOR_STATE_TABLE_NAME, SAVEPOINT_TABLE,
};
use crate::tree_store::file_backend::FileBackend;
#[cfg(feature = "logging")]
use log::{debug, info, warn};
@@ -431,7 +433,9 @@ impl Database {
return Err(CompactionError::TransactionInProgress);
}
// Commit to free up any pending free pages
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter.
// Once https://github.com/cberner/redb/issues/829 is fixed, we should upgrade this to use Durability::ParanoidPlus instead --
// that way the user can cancel the compaction without requiring repair afterwards
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
if txn.list_persistent_savepoints()?.next().is_some() {
return Err(CompactionError::PersistentSavepointExists);
@@ -611,6 +615,12 @@ impl Database {
repair_callback: &(dyn Fn(&mut RepairSession) + 'static),
) -> Result<[Option<BtreeHeader>; 3], DatabaseError> {
if !Self::verify_primary_checksums(mem.clone())? {
if mem.used_two_phase_commit() {
return Err(DatabaseError::Storage(StorageError::Corrupted(
"Primary is corrupted despite 2-phase commit".to_string(),
)));
}

// 0.3 because the repair takes 3 full scans and the first is done now
let mut handle = RepairSession::new(0.3);
repair_callback(&mut handle);
@@ -703,23 +713,28 @@ impl Database {
)?;
let mut mem = Arc::new(mem);
if mem.needs_repair()? {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
// If the last transaction used 2-phase commit and updated the allocator state table, then
// we can just load the allocator state from there. Otherwise, we need a full repair
if !Self::try_quick_repair(mem.clone())? {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
}
let [data_root, system_root, freed_root] =
Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}
let [data_root, system_root, freed_root] = Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}

mem.begin_writable()?;
@@ -754,6 +769,42 @@ impl Database {
Ok(db)
}

// Returns true if quick-repair was successful, or false if a full repair is needed
fn try_quick_repair(mem: Arc<TransactionalMemory>) -> Result<bool> {
// Quick-repair is only possible if the primary was written using 2-phase commit
if !mem.used_two_phase_commit() {
return Ok(false);
}

// See if the allocator state table is present in the system table tree
let fake_freed_pages = Arc::new(Mutex::new(vec![]));
let system_table_tree = TableTreeMut::new(
mem.get_system_root(),
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages.clone(),
);
let Some(allocator_state_table) = system_table_tree
.get_table::<AllocatorStateKey, &[u8]>(ALLOCATOR_STATE_TABLE_NAME, TableType::Normal)
.map_err(|e| e.into_storage_error_or_corrupted("Unexpected TableError"))?
else {
return Ok(false);
};

// Load the allocator state from the table
let InternalTableDefinition::Normal { table_root, .. } = allocator_state_table else {
unreachable!();
};
let tree = AllocatorStateTree::new(
table_root,
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages,
);

mem.try_load_allocator_state(&tree)
}

fn allocate_read_transaction(&self) -> Result<TransactionGuard> {
let id = self
.transaction_tracker