Add set_quick_repair() #893

Merged · 6 commits · Nov 17, 2024
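
This PR adds `WriteTransaction::set_quick_repair()`. A minimal usage sketch (the table name and values are illustrative; errors are funneled through redb's catch-all `Error`):

```rust
use redb::{Database, Error, TableDefinition};

const TABLE: TableDefinition<&str, u64> = TableDefinition::new("example");

fn main() -> Result<(), Error> {
    let db = Database::create("example.redb")?;
    let mut txn = db.begin_write()?;
    // Opt in to quick-repair: the commit is made with 2-phase commit and the
    // allocator state is persisted, so a crash after this commit can be
    // recovered on the next open without a full btree scan.
    txn.set_quick_repair(true);
    {
        let mut table = txn.open_table(TABLE)?;
        table.insert("hello", &1)?;
    }
    txn.commit()?;
    Ok(())
}
```
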
31 changes: 24 additions & 7 deletions docs/design.md
@@ -31,8 +31,8 @@ database file.
| magic number |
| magic con.| god byte | padding | page size |
| region header pages | region max data pages |
-| region tracker page number |
| number of full regions | data pages in trailing region |
+| region tracker page number |
| padding |
| padding |
| padding |
@@ -88,12 +88,16 @@ controls which transaction pointer is the primary.
`magic number` must be set to the ASCII letters 'redb' followed by 0x1A, 0x0A, 0xA9, 0x0D, 0x0A. This sequence is
inspired by the PNG magic number.
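
Transcribed as bytes (a sketch; the constant name is illustrative):

```rust
// The 9-byte magic number described above: ASCII "redb" plus the
// PNG-inspired trailer bytes.
const MAGIC_NUMBER: [u8; 9] = [b'r', b'e', b'd', b'b', 0x1A, 0x0A, 0xA9, 0x0D, 0x0A];
```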

-`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing two flags:
+`god byte`, so named because this byte controls the state of the entire database, is a bitfield containing three flags:
* first bit: `primary_bit` flag which indicates whether transaction slot 0 or transaction slot 1 contains the latest commit.
redb relies on the fact that this is a single bit to perform atomic commits.
-* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database.
-During the recovery process, the region tracker and regional allocator states -- described below -- are reconstructed
-by walking the btree from all active roots.
+* second bit: `recovery_required` flag, if set then the recovery process must be run when opening the database. This can be
+a full repair, in which the region tracker and regional allocator states -- described below -- are reconstructed by walking
+the btree from all active roots, or a quick-repair, in which the state is simply loaded from the allocator state table.
+* third bit: `two_phase_commit` flag, which indicates whether the transaction in the primary slot was written using 2-phase
+commit. If so, the primary slot is guaranteed to be valid, and repair won't look at the secondary slot. This flag is always
+updated atomically along with the primary bit.

redb relies on the fact that this is a single byte to perform atomic commits.
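
A sketch of the bitfield as described above (flag names follow the prose; the actual constants in redb's source may differ):

```rust
// Sketch of the god byte's three flags, per the description above.
// Actual constant names/values in redb's source may differ.
const PRIMARY_BIT: u8 = 0b0000_0001;       // which slot (0 or 1) is primary
const RECOVERY_REQUIRED: u8 = 0b0000_0010; // repair must run on next open
const TWO_PHASE_COMMIT: u8 = 0b0000_0100;  // primary was written via 2-phase commit

// Both the primary bit and the 2-phase-commit flag change in one single-byte
// write, which is what makes the commit atomic.
fn set_primary(god_byte: u8, slot: u8, two_phase: bool) -> u8 {
    let mut b = god_byte & !(PRIMARY_BIT | TWO_PHASE_COMMIT);
    if slot == 1 {
        b |= PRIMARY_BIT;
    }
    if two_phase {
        b |= TWO_PHASE_COMMIT;
    }
    b
}
```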

`page size` is the size of a redb page in bytes

@@ -155,7 +159,9 @@ changed during an upgrade.

### Region tracker
The region tracker is an array of `BtreeBitmap`s that tracks the page orders which are free in each region.
-It is stored in a page in the data section of a region:
+There are two different places it can be stored: on shutdown, it's written to a page in the data section of
+a region, and when making a commit with quick-repair enabled, it's written to an entry in the allocator state
+table. The former is valid only after a clean shutdown; the latter is usable even after a crash.
```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -216,6 +222,11 @@ range has been allocated
* n bytes: free index data
* n bytes: allocated data

+Like the region tracker, there are two different places where the regional allocator state can be
+stored. On shutdown, it's written to the region header as described above, and when making a commit
+with quick-repair enabled, it's written to an entry in the allocator state table. The former is valid
+only after a clean shutdown; the latter is usable even after a crash.
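
For intuition, the allocator state table behaves like a small keyed map from state kind to serialized bytes; a hypothetical sketch of its key (the real `AllocatorStateKey` in this PR's src/transactions.rs may have different variants):

```rust
// Hypothetical shape of the allocator state table's key; see
// AllocatorStateKey in src/transactions.rs for the real definition.
enum AllocatorStateKey {
    Region(u32),   // regional allocator state for region N
    RegionTracker, // the serialized region tracker
}
```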

```
<-------------------------------------------- 8 bytes ------------------------------------------->
==================================================================================================
@@ -461,6 +472,12 @@ To repair the database after an unclean shutdown we must:
2) Update the allocator state, so that it is consistent with all the database roots in the above
transaction

+If the last commit before the crash had quick-repair enabled, then these are both trivial. The
+primary commit slot is guaranteed to be valid, because it was written using 2-phase commit, and
+the corresponding allocator state is stored in the allocator state table.
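
Condensed as a sketch (this mirrors the `Database::new` changes later in this PR; the types are redb internals and `commit_repaired_roots` is a hypothetical helper standing in for the inline `mem.commit(...)` call):

```rust
// Sketch of the open-time repair decision, under the assumptions above.
fn open_time_repair(mem: &mut Arc<TransactionalMemory>) -> Result<(), DatabaseError> {
    if mem.needs_repair()? {
        if Database::try_quick_repair(mem.clone())? {
            // Primary slot is valid (2-phase commit) and the allocator state
            // was loaded from the allocator state table: nothing else to do.
        } else {
            // Full repair: fall back to a valid slot if needed, then rebuild
            // allocator state by walking every active root.
            let roots = Database::do_repair(mem, &|_| {})?;
            commit_repaired_roots(mem, roots)?;
        }
    }
    Ok(())
}
```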

+Otherwise, we need to perform a full repair:

For (1), if the primary commit slot is invalid we switch to the secondary slot.

For (2), we rebuild the allocator state by walking the following trees and marking all referenced
1 change: 1 addition & 0 deletions fuzz/fuzz_targets/common.rs
@@ -164,6 +164,7 @@ pub(crate) enum FuzzOperation {
pub(crate) struct FuzzTransaction {
pub ops: Vec<FuzzOperation>,
pub durable: bool,
+pub quick_repair: bool,
pub commit: bool,
pub create_ephemeral_savepoint: bool,
pub create_persistent_savepoint: bool,
1 change: 1 addition & 0 deletions fuzz/fuzz_targets/fuzz_redb.rs
@@ -583,6 +583,7 @@ fn exec_table_crash_support<T: Clone>(config: &FuzzConfig, apply: fn(WriteTransa
if !transaction.durable {
txn.set_durability(Durability::None);
}
+txn.set_quick_repair(transaction.quick_repair);
let mut counter_table = txn.open_table(COUNTER_TABLE).unwrap();
let uncommitted_id = txn_id as u64 + 1;
counter_table.insert((), uncommitted_id)?;
150 changes: 108 additions & 42 deletions src/db.rs
@@ -5,9 +5,7 @@ use crate::tree_store::{
TableType, TransactionalMemory, PAGE_SIZE,
};
use crate::types::{Key, Value};
-use crate::{
-CompactionError, DatabaseError, Durability, ReadOnlyTable, SavepointError, StorageError,
-};
+use crate::{CompactionError, DatabaseError, ReadOnlyTable, SavepointError, StorageError};
use crate::{ReadTransaction, Result, WriteTransaction};
use std::fmt::{Debug, Display, Formatter};

@@ -21,7 +19,9 @@ use std::sync::{Arc, Mutex};

use crate::error::TransactionError;
use crate::sealed::Sealed;
-use crate::transactions::SAVEPOINT_TABLE;
+use crate::transactions::{
+AllocatorStateKey, AllocatorStateTree, ALLOCATOR_STATE_TABLE_NAME, SAVEPOINT_TABLE,
+};
use crate::tree_store::file_backend::FileBackend;
#[cfg(feature = "logging")]
use log::{debug, info, warn};
@@ -386,17 +386,34 @@ impl Database {
.unwrap()
.clear_cache_and_reload()?;

-if !Self::verify_primary_checksums(self.mem.clone())? {
-was_clean = false;
-}
+let old_roots = [
+self.mem.get_data_root(),
+self.mem.get_system_root(),
+self.mem.get_freed_root(),
+];

-Self::do_repair(&mut self.mem, &|_| {}).map_err(|err| match err {
+let new_roots = Self::do_repair(&mut self.mem, &|_| {}).map_err(|err| match err {
DatabaseError::Storage(storage_err) => storage_err,
_ => unreachable!(),
})?;
-if allocator_hash != self.mem.allocator_hash() {
+
+if old_roots != new_roots || allocator_hash != self.mem.allocator_hash() {
was_clean = false;
}
+
+if !was_clean {
+let next_transaction_id = self.mem.get_last_committed_transaction_id()?.next();
+let [data_root, system_root, freed_root] = new_roots;
+self.mem.commit(
+data_root,
+system_root,
+freed_root,
+next_transaction_id,
+false,
+true,
+)?;
+}

self.mem.begin_writable()?;

Ok(was_clean)
@@ -414,19 +431,21 @@ impl Database {
return Err(CompactionError::TransactionInProgress);
}
// Commit to free up any pending free pages
-// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter
+// Use 2-phase commit to avoid any possible security issues. Plus this compaction is going to be so slow that it doesn't matter.
+// Once https://github.com/cberner/redb/issues/829 is fixed, we should upgrade this to use quick-repair -- that way the user
+// can cancel the compaction without requiring a full repair afterwards
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
if txn.list_persistent_savepoints()?.next().is_some() {
return Err(CompactionError::PersistentSavepointExists);
}
if self.transaction_tracker.any_savepoint_exists() {
return Err(CompactionError::EphemeralSavepointExists);
}
-txn.set_durability(Durability::Paranoid);
+txn.set_two_phase_commit(true);
txn.commit().map_err(|e| e.into_storage_error())?;
// Repeat, just in case executing list_persistent_savepoints() created a new table
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
-txn.set_durability(Durability::Paranoid);
+txn.set_two_phase_commit(true);
txn.commit().map_err(|e| e.into_storage_error())?;
// There can't be any outstanding transactions because we have a `&mut self`, so all pending free pages
// should have been cleared out by the above commit()
@@ -447,7 +466,7 @@

// Double commit to free up the relocated pages for reuse
let mut txn = self.begin_write().map_err(|e| e.into_storage_error())?;
-txn.set_durability(Durability::Paranoid);
+txn.set_two_phase_commit(true);
txn.commit().map_err(|e| e.into_storage_error())?;
assert!(self.mem.get_freed_root().is_none());

@@ -592,8 +611,14 @@ impl Database {
fn do_repair(
mem: &mut Arc<TransactionalMemory>, // Only &mut to ensure exclusivity
repair_callback: &(dyn Fn(&mut RepairSession) + 'static),
-) -> Result<(), DatabaseError> {
+) -> Result<[Option<BtreeHeader>; 3], DatabaseError> {
if !Self::verify_primary_checksums(mem.clone())? {
+if mem.used_two_phase_commit() {
+return Err(DatabaseError::Storage(StorageError::Corrupted(
+"Primary is corrupted despite 2-phase commit".to_string(),
+)));
+}
+
// 0.3 because the repair takes 3 full scans and the first is done now
let mut handle = RepairSession::new(0.3);
repair_callback(&mut handle);
@@ -662,19 +687,7 @@
// by storing an empty root during the below commit()
mem.clear_read_cache();

-let transaction_id = mem.get_last_committed_transaction_id()?.next();
-mem.commit(
-data_root,
-system_root,
-freed_root,
-transaction_id,
-false,
-true,
-// don't trim the database file, because we want the allocator hash to match exactly
-false,
-)?;
-
-Ok(())
+Ok([data_root, system_root, freed_root])
}

fn new(
@@ -698,14 +711,31 @@
)?;
let mut mem = Arc::new(mem);
if mem.needs_repair()? {
-#[cfg(feature = "logging")]
-warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
-let mut handle = RepairSession::new(0.0);
-repair_callback(&mut handle);
-if handle.aborted() {
-return Err(DatabaseError::RepairAborted);
+// If the last transaction used 2-phase commit and updated the allocator state table, then
+// we can just load the allocator state from there. Otherwise, we need a full repair
+if Self::try_quick_repair(mem.clone())? {
+#[cfg(feature = "logging")]
+info!("Quick-repair successful, full repair not needed");
+} else {
+#[cfg(feature = "logging")]
+warn!("Database {:?} not shutdown cleanly. Repairing", &file_path);
+let mut handle = RepairSession::new(0.0);
+repair_callback(&mut handle);
+if handle.aborted() {
+return Err(DatabaseError::RepairAborted);
+}
+let [data_root, system_root, freed_root] =
+Self::do_repair(&mut mem, repair_callback)?;
+let next_transaction_id = mem.get_last_committed_transaction_id()?.next();
+mem.commit(
+data_root,
+system_root,
+freed_root,
+next_transaction_id,
+false,
+true,
+)?;
+}
-Self::do_repair(&mut mem, repair_callback)?;
}

mem.begin_writable()?;
@@ -740,6 +770,42 @@
Ok(db)
}

+// Returns true if quick-repair was successful, or false if a full repair is needed
+fn try_quick_repair(mem: Arc<TransactionalMemory>) -> Result<bool> {
+// Quick-repair is only possible if the primary was written using 2-phase commit
+if !mem.used_two_phase_commit() {
+return Ok(false);
+}
+
+// See if the allocator state table is present in the system table tree
+let fake_freed_pages = Arc::new(Mutex::new(vec![]));
+let system_table_tree = TableTreeMut::new(
+mem.get_system_root(),
+Arc::new(TransactionGuard::fake()),
+mem.clone(),
+fake_freed_pages.clone(),
+);
+let Some(allocator_state_table) = system_table_tree
+.get_table::<AllocatorStateKey, &[u8]>(ALLOCATOR_STATE_TABLE_NAME, TableType::Normal)
+.map_err(|e| e.into_storage_error_or_corrupted("Unexpected TableError"))?
+else {
+return Ok(false);
+};
+
+// Load the allocator state from the table
+let InternalTableDefinition::Normal { table_root, .. } = allocator_state_table else {
+unreachable!();
+};
+let tree = AllocatorStateTree::new(
+table_root,
+Arc::new(TransactionGuard::fake()),
+mem.clone(),
+fake_freed_pages,
+);
+
+mem.try_load_allocator_state(&tree)
+}

fn allocate_read_transaction(&self) -> Result<TransactionGuard> {
let id = self
.transaction_tracker
@@ -1162,15 +1228,15 @@ mod test {
let table_def: TableDefinition<u64, &[u8]> = TableDefinition::new("x");

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint0 = tx.ephemeral_savepoint().unwrap();
{
tx.open_table(table_def).unwrap();
}
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint1 = tx.ephemeral_savepoint().unwrap();
tx.restore_savepoint(&savepoint0).unwrap();
tx.set_durability(Durability::None);
Expand All @@ -1182,15 +1248,15 @@ mod test {
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
tx.restore_savepoint(&savepoint0).unwrap();
{
tx.open_table(table_def).unwrap();
}
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint2 = tx.ephemeral_savepoint().unwrap();
drop(savepoint0);
tx.restore_savepoint(&savepoint2).unwrap();
Expand All @@ -1203,7 +1269,7 @@ mod test {
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint3 = tx.ephemeral_savepoint().unwrap();
drop(savepoint1);
tx.restore_savepoint(&savepoint3).unwrap();
Expand All @@ -1213,7 +1279,7 @@ mod test {
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint4 = tx.ephemeral_savepoint().unwrap();
drop(savepoint2);
tx.restore_savepoint(&savepoint3).unwrap();
Expand All @@ -1225,7 +1291,7 @@ mod test {
tx.abort().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
let savepoint5 = tx.ephemeral_savepoint().unwrap();
drop(savepoint3);
assert!(tx.restore_savepoint(&savepoint4).is_err());
Expand All @@ -1235,7 +1301,7 @@ mod test {
tx.commit().unwrap();

let mut tx = db.begin_write().unwrap();
-tx.set_durability(Durability::Paranoid);
+tx.set_two_phase_commit(true);
tx.restore_savepoint(&savepoint5).unwrap();
tx.set_durability(Durability::None);
{
8 changes: 4 additions & 4 deletions src/multimap_table.rs
@@ -308,8 +308,8 @@ pub(crate) fn finalize_tree_and_subtree_checksums(
value_size,
<()>::fixed_width(),
);
-subtree.finalize_dirty_checksums()?;
-sub_root_updates.push((i, entry.key().to_vec(), subtree.get_root().unwrap()));
+let subtree_root = subtree.finalize_dirty_checksums()?.unwrap();
+sub_root_updates.push((i, entry.key().to_vec(), subtree_root));
}
}
}
@@ -327,10 +327,10 @@
Ok(())
})?;

-tree.finalize_dirty_checksums()?;
+let root = tree.finalize_dirty_checksums()?;
// No pages should have been freed by this operation
assert!(freed_pages.lock().unwrap().is_empty());
-Ok(tree.get_root())
+Ok(root)
}

fn parse_subtree_roots<T: Page>(