Add new ParanoidPlus durability level, which doesn't require repair
To avoid repair, Durability::ParanoidPlus commits need to save the
allocator state somewhere.  We can't use the region headers, because
we'd be overwriting them in place; we might crash partway through the
overwrite, and then we'd need repair.  So we instead save the allocator
state to a new table in the system tree.  Writing to the table is
slightly tricky, because it needs to be done without allocating (see
below), but other than that it's a perfectly ordinary transactional
write with all the usual guarantees.
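
As a rough illustration (hypothetical names, with a std `BTreeMap` standing in for the real system-tree table), the saved state is just a handful of key/value entries, one per regional allocator plus one for the region tracker:

```rust
use std::collections::BTreeMap;

// Hypothetical key type; the real table lives in redb's system tree.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum AllocatorStateKey {
    Region(u32),
    RegionTracker,
}

// Stand-in for the allocator state table: a ParanoidPlus commit writes one
// entry per regional allocator plus one for the region tracker, as ordinary
// transactional key/value pairs.
fn save_allocator_state(
    table: &mut BTreeMap<AllocatorStateKey, Vec<u8>>,
    region_allocators: &[Vec<u8>], // serialized per-region allocator state
    region_tracker: &[u8],         // serialized region tracker
) {
    for (i, state) in region_allocators.iter().enumerate() {
        table.insert(AllocatorStateKey::Region(i as u32), state.clone());
    }
    table.insert(AllocatorStateKey::RegionTracker, region_tracker.to_vec());
}
```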

The other requirement to avoid repair is knowing whether the last
transaction used 2-phase commit.  For this, we add a new two_phase_commit
bit to the god byte, which is always updated atomically along with
swapping the primary bit.  Old redb versions will ignore the new flag
when reading and clear it when writing, which is exactly what we want.
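
Sketch of the god byte layout and the atomic update (bit values and constant names are illustrative, taken from the design doc's description rather than the actual source):

```rust
// God-byte layout as described in docs/design.md.
const PRIMARY_BIT: u8 = 0b0000_0001;       // which commit slot is primary
const RECOVERY_REQUIRED: u8 = 0b0000_0010; // repair needed on next open
const TWO_PHASE_COMMIT: u8 = 0b0000_0100;  // primary was written with 2-phase commit

// Swapping the primary and recording whether 2-phase commit was used happens
// in a single one-byte write, so the two can never be observed out of sync.
fn swap_primary(god_byte: u8, used_two_phase: bool) -> u8 {
    let mut new = god_byte ^ PRIMARY_BIT;
    if used_two_phase {
        new |= TWO_PHASE_COMMIT;
    } else {
        new &= !TWO_PHASE_COMMIT;
    }
    new
}
```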

This turns out to also fix a longstanding bug where Durability::Paranoid
hasn't been providing any security benefit at all.  The checksum forgery
attack described in the Durability::Immediate documentation actually
works equally well against Durability::Paranoid!  The problem is that even
though 2-phase commit guarantees the primary is valid, redb ignores the
primary flag when repairing.  It always picks whichever commit slot is
newer, as long as the checksum is valid.  So if you crash partway through
a commit, it'll try to recover using the partially-written secondary
rather than the fully-written primary, regardless of the durability mode.

The fix for this is exactly the two_phase_commit bit described above.
After a crash, we check whether the last transaction used 2-phase commit;
if so, we only look at the primary (which is guaranteed to be valid)
and ignore the secondary.  Durability::ParanoidPlus needs this check
anyway for safety, so we get the Durability::Paranoid bug fix for free.
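
An illustrative sketch of the recovery decision, not the actual redb code:

```rust
// One of the two commit slots in the super header.
struct Slot {
    transaction_id: u64,
    checksum_valid: bool,
}

// Returns the index (0 = primary, 1 = secondary) of the slot to recover from.
fn choose_recovery_slot(primary: &Slot, secondary: &Slot, two_phase_commit: bool) -> usize {
    if two_phase_commit {
        // The last commit was 2-phase, so the primary is guaranteed valid;
        // never fall back to a possibly half-written secondary.
        return 0;
    }
    // Behavior for 1-phase commits: prefer the newest slot whose checksum verifies.
    match (primary.checksum_valid, secondary.checksum_valid) {
        (true, false) => 0,
        (false, true) => 1,
        _ => {
            if secondary.transaction_id > primary.transaction_id {
                1
            } else {
                0
            }
        }
    }
}
```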

To write to the allocator state table without allocating, I've introduced
a new insert_inplace() function.  It's similar to insert_reserve(),
but more general and maybe simpler.  To use it, you have to first do an
ordinary insert() with your desired key and a value of the appropriate
length; then later in the same transaction you can call insert_inplace()
to replace the value with a new one.  Unlike insert_reserve(), this works
with values that don't implement MutInPlaceValue, and it lets you hold
multiple reservations simultaneously.
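
A toy sketch of the reserve-then-overwrite pattern, using a flat buffer instead of redb's real btree; the insert_inplace() shown here only mimics the private function's contract:

```rust
use std::collections::HashMap;

// Toy stand-in for a table: values live in one flat buffer, so an in-place
// update must not change a value's length or position.
struct ToyTable {
    buf: Vec<u8>,
    index: HashMap<u64, (usize, usize)>, // key -> (offset, len)
}

impl ToyTable {
    fn insert(&mut self, key: u64, value: &[u8]) {
        let offset = self.buf.len();
        self.buf.extend_from_slice(value);
        self.index.insert(key, (offset, value.len()));
    }

    // Overwrite an existing value without allocating; panics if the key is
    // missing or the new value has a different length, mirroring the
    // precondition-panic behavior described above.
    fn insert_inplace(&mut self, key: u64, value: &[u8]) {
        let (offset, len) = *self.index.get(&key).expect("insert() the key first");
        assert_eq!(len, value.len(), "value length must match the reservation");
        self.buf[offset..offset + len].copy_from_slice(value);
    }
}

fn main() {
    let mut table = ToyTable { buf: Vec::new(), index: HashMap::new() };
    // Step 1: ordinary insert with a placeholder of the final length.
    table.insert(7, &[0u8; 8]);
    // Step 2: later in the same "transaction", fill in the real value in place.
    table.insert_inplace(7, &42u64.to_le_bytes());
}
```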

insert_inplace() could be safely exposed to users, but I don't think
there's any reason to.  Since it doesn't give you a mutable reference,
there's no benefit over insert() unless you're storing data that cares
about its own position in the database.  So for now it's private, and I
haven't bothered making a new error type for it; it just panics if you
don't satisfy the preconditions.

The fuzzer is perfect for testing Durability::ParanoidPlus, because it
can simulate a crash, reopen the database (skipping repair if possible),
and then verify that the resulting allocator state exactly matches
what would happen if it ran a full repair.  I've updated the fuzzer
to generate Durability::ParanoidPlus commits along with the existing
Durability::None and Durability::Immediate.
mconst committed Nov 10, 2024
1 parent 25b55b6 commit f2e1eeb
Showing 11 changed files with 602 additions and 113 deletions.
29 changes: 22 additions & 7 deletions docs/design.md
@@ -88,12 +88,16 @@ controls which transaction pointer is the primary.
`magic number` must be set to the ASCII letters 'redb' followed by 0x1A, 0x0A, 0xA9, 0x0D, 0x0A. This sequence is
inspired by the PNG magic number.

`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing two flags:
`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing three flags:
* first bit: `primary_bit` flag which indicates whether transaction slot 0 or transaction slot 1 contains the latest commit.
redb relies on the fact that this is a single bit to perform atomic commits.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database.
During the recovery process, the region tracker and regional allocator states -- described below -- are reconstructed
by walking the btree from all active roots.
* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database. This can be
a full repair, in which the region tracker and regional allocator states -- described below -- are reconstructed by walking
the btree from all active roots, or a quick-repair, in which the state is simply loaded from the allocator state table.
* third bit: `two_phase_commit` flag, which indicates whether the transaction in the primary slot was written using 2-phase
commit. If so, the primary slot is guaranteed to be valid, and repair won't look at the secondary slot. This flag is always
updated atomically along with the primary bit.

redb relies on the fact that this is a single byte to perform atomic commits.

`page size` is the size of a redb page in bytes

@@ -155,7 +159,9 @@ changed during an upgrade.

### Region tracker
The region tracker is an array of `BtreeBitmap`s that tracks the page orders which are free in each region.
It is stored in a page in the data section of a region:
There are two different places it can be stored: on shutdown, it's written to a page in the data section of
a region, and when making a commit with `Durability::ParanoidPlus`, it's written to an entry in the allocator
state table. The former is valid only after a clean shutdown; the latter is usable even after a crash.
```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -216,6 +222,11 @@ range has been allocated
* n bytes: free index data
* n bytes: allocated data

Like the region tracker, there are two different places where the regional allocator state can be
stored. On shutdown, it's written to the region header as described above, and when making a commit
with `Durability::ParanoidPlus`, it's written to an entry in the allocator state table. The former
is valid only after a clean shutdown; the latter is usable even after a crash.

```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -456,7 +467,7 @@ exists. Then, (2) will be accomplished by moving all allocations from transactio
savepoint into the pending free state.

#### Database repair
To repair the database after an unclean shutdown we must:
To do a full repair after an unclean shutdown we must:
1) Update the super header to reference the last fully committed transaction
2) Update the allocator state, so that it is consistent with all the database roots in the above
transaction
@@ -472,6 +483,10 @@ All pages referenced by a savepoint must be contained in the above, because it i
a) referenced directly by the data, system, or freed tree -- i.e. it's a committed page
b) it is not referenced, in which case it is in the pending free state and is contained in the freed tree

Alternatively, we might be able to do a quick-repair. This is only possible if the last transaction
used 2-phase commit (so we know the primary slot is valid, without needing to walk the trees to verify
their checksums) and also saved its allocator state to the allocator state tree.

# Assumptions about underlying media
redb is designed to be safe even in the event of power failure or on poorly behaved media.
Therefore, we make only a few assumptions about the guarantees provided by the underlying filesystem:
12 changes: 11 additions & 1 deletion fuzz/fuzz_targets/common.rs
@@ -107,6 +107,16 @@ impl<const N: usize> Arbitrary<'_> for BoundedUSize<N> {
}
}

// We don't simulate fsync(), so it's not interesting to fuzz with Durability::Eventual
// or Durability::Paranoid (they're mostly equivalent to Durability::Immediate). But the
// other three levels are all worth testing
#[derive(Arbitrary, Debug, Clone, PartialEq)]
pub(crate) enum FuzzDurability {
None,
Immediate,
ParanoidPlus,
}

#[derive(Arbitrary, Debug, Clone)]
pub(crate) enum FuzzOperation {
Get {
@@ -163,7 +173,7 @@ pub(crate) enum FuzzOperation {
#[derive(Arbitrary, Debug, Clone)]
pub(crate) struct FuzzTransaction {
pub ops: Vec<FuzzOperation>,
pub durable: bool,
pub durability: FuzzDurability,
pub commit: bool,
pub create_ephemeral_savepoint: bool,
pub create_persistent_savepoint: bool,
16 changes: 9 additions & 7 deletions fuzz/fuzz_targets/fuzz_redb.rs
@@ -580,9 +580,11 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
// Disable IO error simulation while we update the transaction counter table
let old_countdown = countdown.swap(u64::MAX, Ordering::SeqCst);
let mut txn = db.begin_write().unwrap();
if !transaction.durable {
txn.set_durability(Durability::None);
}
txn.set_durability(match transaction.durability {
FuzzDurability::None => Durability::None,
FuzzDurability::Immediate => Durability::Immediate,
FuzzDurability::ParanoidPlus => Durability::ParanoidPlus,
});
let mut counter_table = txn.open_table(COUNTER_TABLE).unwrap();
let uncommitted_id = txn_id as u64 + 1;
counter_table.insert((), uncommitted_id)?;
@@ -627,9 +629,9 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
let commit_succeeded = last_committed == uncommitted_id;
if commit_succeeded {
assert!(transaction.commit);
savepoint_manager.commit(transaction.durable);
savepoint_manager.commit(transaction.durability != FuzzDurability::None);
non_durable_reference = uncommitted_reference;
if transaction.durable {
if transaction.durability != FuzzDurability::None {
reference = non_durable_reference.clone();
}
} else {
@@ -747,7 +749,7 @@ fn apply_crashable_transaction_multimap(txn: WriteTransaction, uncommitted_refer
}

if transaction.commit {
if transaction.durable {
if transaction.durability != FuzzDurability::None {
savepoints.gc_persistent_savepoints(&txn)?;
}
txn.commit()?;
@@ -767,7 +769,7 @@ fn apply_crashable_transaction(txn: WriteTransaction, uncommitted_reference: &mu
}

if transaction.commit {
if transaction.durable {
if transaction.durability != FuzzDurability::None {
savepoints.gc_persistent_savepoints(&txn)?;
}
txn.commit()?;
87 changes: 69 additions & 18 deletions src/db.rs
@@ -21,7 +21,9 @@ use std::sync::{Arc, Mutex};

use crate::error::TransactionError;
use crate::sealed::Sealed;
use crate::transactions::SAVEPOINT_TABLE;
use crate::transactions::{
AllocatorStateKey, AllocatorStateTree, ALLOCATOR_STATE_TABLE_NAME, SAVEPOINT_TABLE,
};
use crate::tree_store::file_backend::FileBackend;
#[cfg(feature = "logging")]
use log::{debug, info, warn};
@@ -431,7 +433,9 @@ impl Database {
return Err(CompactionError::TransactionInProgress);
}
// Commit to free up any pending free pages
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter
// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter.
// Once https://github.com/cberner/redb/issues/829 is fixed, we should upgrade this to use Durability::ParanoidPlus instead --
// that way the user can cancel the compaction without requiring repair afterwards
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
if txn.list_persistent_savepoints()?.next().is_some() {
return Err(CompactionError::PersistentSavepointExists);
@@ -611,6 +615,12 @@ impl Database {
repair_callback: &(dyn Fn(&mut RepairSession) + 'static),
) -> Result<[Option<BtreeHeader>; 3], DatabaseError> {
if !Self::verify_primary_checksums(mem.clone())? {
if mem.used_two_phase_commit() {
return Err(DatabaseError::Storage(StorageError::Corrupted(
"Primary is corrupted despite 2-phase commit".to_string(),
)));
}

// 0.3 because the repair takes 3 full scans and the first is done now
let mut handle = RepairSession::new(0.3);
repair_callback(&mut handle);
@@ -703,23 +713,28 @@ impl Database {
)?;
let mut mem = Arc::new(mem);
if mem.needs_repair()? {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
// If the last transaction used 2-phase commit and updated the allocator state table, then
// we can just load the allocator state from there. Otherwise, we need a full repair
if !Self::try_quick_repair(mem.clone())? {
#[cfg(feature = "logging")]
warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
let mut handle = RepairSession::new(0.0);
repair_callback(&mut handle);
if handle.aborted() {
return Err(DatabaseError::RepairAborted);
}
let [data_root, system_root, freed_root] =
Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}
let [data_root, system_root, freed_root] = Self::do_repair(&mut mem, repair_callback)?;
let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
mem.commit(
data_root,
system_root,
freed_root,
next_transaction_id,
false,
true,
)?;
}

mem.begin_writable()?;
@@ -754,6 +769,42 @@ impl Database {
Ok(db)
}

// Returns true if quick-repair was successful, or false if a full repair is needed
fn try_quick_repair(mem: Arc<TransactionalMemory>) -> Result<bool> {
// Quick-repair is only possible if the primary was written using 2-phase commit
if !mem.used_two_phase_commit() {
return Ok(false);
}

// See if the allocator state table is present in the system table tree
let fake_freed_pages = Arc::new(Mutex::new(vec![]));
let system_table_tree = TableTreeMut::new(
mem.get_system_root(),
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages.clone(),
);
let Some(allocator_state_table) = system_table_tree
.get_table::<AllocatorStateKey, &[u8]>(ALLOCATOR_STATE_TABLE_NAME, TableType::Normal)
.map_err(|e| e.into_storage_error_or_corrupted("Unexpected TableError"))?
else {
return Ok(false);
};

// Load the allocator state from the table
let InternalTableDefinition::Normal { table_root, .. } = allocator_state_table else {
unreachable!();
};
let tree = AllocatorStateTree::new(
table_root,
Arc::new(TransactionGuard::fake()),
mem.clone(),
fake_freed_pages,
);

mem.try_load_allocator_state(&tree)
}

fn allocate_read_transaction(&self) -> Result<TransactionGuard> {
let id = self
.transaction_tracker