Add raw file backend #991
Conversation
I've pushed a much faster migration procedure in dafe220, which is detailed in a long block comment. This brings migration time down to 57 seconds for a 128 GiB disk, with nearly all that time being spent in …
...and one more optimization in 553704a, which brings migration for the 128 GiB disk down to 9 seconds. That seems fast enough to me.
// by the `dirty` bit, but we're being _somewhat_ paranoid and manually
// forcing a rehash if any blocks have multiple contexts stored.
let ctxs = self.get_block_contexts(0, self.extent_size.value)?;
let needs_rehash = ctxs.iter().any(|c| c.len() > 1);
Feedback requested: is this paranoia justified, or should I just call `self.fully_rehash_and_clean_all_stale_contexts(false)` and trust the dirty flag?
I'd say this is justified - sqlite could have a bug that creates this edge case.
Just really some minor things. Have you run any of the tests in the ./tools directory on this? I'm going to take the nightly-build from the CI, put it on a bench gimlet, and let things run, but I'm not expecting to find any issues; this all seems really solid.
Another win for these changes can be seen in the region creation test. Stock bits: …
These changes: …
Force-pushed from 740357a to db9c654.
Here's a first pass for the review. I have unanswered questions about how ZFS behaves that I'll follow up on out of band.
downstairs/src/extent_inner_raw.rs
* We define a block being written to or not has if that block has
* `Some(...)` with a matching checksum serialized into a context slot
* or not. So it is required that a written block has a checksum.
s/has/as/g, I think
"So it is required that a written block as a checksum."?
(this doesn't make sense to me)
Maybe this line was updated after this comment was made haha, I was highlighting the first "has" here:
"We define a block being written to or not has if that block has `Some(...)` with a matching checksum ..."
downstairs/src/extent_inner_raw.rs
let next_slot =
    self.set_block_context(&DownstairsBlockContext {
        block_context: write.block_context,
        block: write.offset.value,
        on_disk_hash,
    })?;
If any of the extent data writes below fail, then the downstairs will re-execute this whole function for the same `writes` argument. It looks like `set_block_context` is destructive for the current contexts, which means that if this function is retried two times (failing each time during the extent data write) then it will overwrite the correct context for all N blocks in `writes`.
This is a little subtle, but I think it's correct.
- Let's say that block 0 is using context slot A. `set_block_context` will write to context slot B, but will not update `self.active_context`.
- If we then fail the extent data write, we will actively read the data back from the file and check which context slot is valid:
  - If the new data was written, according to reading back from the file (but then the write returned an error code?), we will update the context slot to slot B.
  - If the new data wasn't written, we leave the context slot at slot A. This means that we're in the same place that we started, where the data in the file corresponds to context slot A.
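To make that recovery step concrete, here is a minimal sketch of slot selection on retry. The helpers (`integrity_hash`, `read_context_slots_contiguous`) and fields follow the excerpts elsewhere in this thread, but the exact signatures are my assumptions, not the PR's actual code:

// Sketch: after a failed extent-data write, re-read the block and pick
// whichever context slot matches the bytes actually on disk. Helper
// names follow the excerpts in this thread; signatures are assumptions.
fn recover_active_slot(
    &mut self,
    block: u64,
    block_size: usize,
) -> Result<(), CrucibleError> {
    let mut buf = vec![0u8; block_size];
    nix::sys::uio::pread(
        self.file.as_raw_fd(),
        &mut buf,
        (block * block_size as u64) as i64,
    )
    .map_err(|e| CrucibleError::IoError(format!("read failed: {e}")))?;
    let hash = integrity_hash(&[&buf]);

    // Whichever slot's stored hash matches the data that actually landed
    // in the file is the valid one. (The all-zero / `None` slot case is
    // omitted here for brevity.)
    for slot in [ContextSlot::A, ContextSlot::B] {
        let ctx = self
            .layout
            .read_context_slots_contiguous(&self.file, block, 1, slot)?;
        if ctx[0].map(|c| c.on_disk_hash) == Some(hash) {
            self.active_context[block as usize] = slot;
            return Ok(());
        }
    }
    Err(CrucibleError::IoError(format!(
        "block {block}: no context slot matches on-disk data"
    )))
}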
Force-pushed from 06badc8 to a4dceeb.
I did a migration test for a fully populated 128 GiB volume (4 KiB block size, 16384 blocks per extent). To populate the volume, I ran … Using … When starting the …
Migrating a dense 128 GiB region is now down to 42 seconds after 14c8eb7.
Force-pushed from addb3e8 to 622ef9b.
I forgot to post this at the time, but reorganizing context slots into AAAAAABBBBBBB (instead of ABABABABAB) + writing them in bulk fixed our …
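To illustrate why the grouping matters: under the grouped layout, a run of blocks that all target the same slot occupies one contiguous byte range, so it can be written with a single pwrite; under the interleaved layout the same run is strided, needing one syscall per block. A minimal sketch of the offset math (the constant and the zero base offset are illustrative, not the PR's actual values):

#[derive(Copy, Clone)]
enum ContextSlot {
    A,
    B,
}

// Illustrative slot size; the real value is defined in the PR.
const BLOCK_CONTEXT_SLOT_SIZE_BYTES: u64 = 48;

/// Byte offset of a context slot under the grouped layout
/// (AAAA...BBBB...), relative to the start of the context region.
fn context_slot_offset(block: u64, slot: ContextSlot, block_count: u64) -> u64 {
    // All of slot A's entries come first, then all of slot B's, so
    // consecutive blocks targeting the same slot are adjacent on disk.
    let slot_base = match slot {
        ContextSlot::A => 0,
        ContextSlot::B => block_count,
    };
    (slot_base + block) * BLOCK_CONTEXT_SLOT_SIZE_BYTES
}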
I'm concerned that the on-disk format of the raw file metadata slots is implemented manually in several places, rather than through, for example, some shared struct. I'd prefer replacing all the calls to pread/pwrite and the bincode serialize_into/deserialize calls with shared functions instead.
Something like this would be similar to what is written to disk:
struct SerializableOnDiskMetadata {
    a_slots: Vec<Option<OnDiskDownstairsBlockContext>>,
    b_slots: Vec<Option<OnDiskDownstairsBlockContext>>,
    active_contexts: Vec<ContextSlot>,
    metadata: OnDiskMeta,
}
Then you could imagine:
pub fn export_meta_and_context(
    &mut self,
) -> Result<Vec<u8>, CrucibleError> {
    let ctxs = self.get_block_contexts(0, self.extent_size.value)?;
    let needs_rehash = ctxs.iter().any(|c| c.len() > 1);
    let ctxs = if needs_rehash {
        self.fully_rehash_and_clean_all_stale_contexts(true)?;
        self.get_block_contexts(0, self.extent_size.value)?
    } else {
        ctxs
    };
    let ctxs: Vec<Option<OnDiskDownstairsBlockContext>> = ctxs
        .into_iter()
        .map(|c| match c.len() {
            0 => None,
            1 => Some(OnDiskDownstairsBlockContext {
                block_context: c[0].block_context,
                on_disk_hash: c[0].on_disk_hash,
            }),
            i => panic!("invalid context count: {i}"),
        })
        .collect();
    SerializableOnDiskMetadata::buf_from_contexts(
        ctxs,
        self.dirty()?,
        self.flush_number()?,
        self.gen_number()?,
    )
}
where
impl SerializableOnDiskMetadata {
    pub fn buf_from_contexts(
        ctxs: Vec<Option<OnDiskDownstairsBlockContext>>,
        dirty: bool,
        flush_number: u64,
        gen_number: u64,
    ) -> Result<Vec<u8>, CrucibleError> {
        let block_count = ctxs.len();
        let meta = SerializableOnDiskMetadata {
            a_slots: ctxs,
            b_slots: vec![None; block_count],
            active_contexts: vec![ContextSlot::A; block_count],
            metadata: OnDiskMeta {
                dirty,
                flush_number,
                gen_number,
                ext_version: EXTENT_META_RAW, // new extent version for raw files
            },
        };
        meta.write_to_buf()
    }
}
Here, `write_to_buf` wraps creating the Vec at the right size and writing in the correct order using the correct methods.
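For concreteness, a sketch of what that `write_to_buf` might look like, reusing the fixed-size-slot encoding from elsewhere in this comment; the error mapping and the exact ordering are assumptions:

impl SerializableOnDiskMetadata {
    // Sketch: serialize everything into one buffer in a fixed order
    // (A slots, B slots, active-context array, metadata), with each
    // context slot occupying a fixed-size region so that any slot's
    // position is computable from its block index.
    pub fn write_to_buf(&self) -> Result<Vec<u8>, CrucibleError> {
        let mut buf = Vec::new();
        for slot in self.a_slots.iter().chain(self.b_slots.iter()) {
            let n = buf.len();
            buf.extend([0u8; BLOCK_CONTEXT_SLOT_SIZE_BYTES as usize]);
            bincode::serialize_into(&mut buf[n..], slot)
                .map_err(|e| CrucibleError::IoError(e.to_string()))?;
        }
        bincode::serialize_into(&mut buf, &self.active_contexts)
            .map_err(|e| CrucibleError::IoError(e.to_string()))?;
        bincode::serialize_into(&mut buf, &self.metadata)
            .map_err(|e| CrucibleError::IoError(e.to_string()))?;
        Ok(buf)
    }
}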
We'd also want to support grabbing only parts of the metadata file, and updating only a subset of the slots (aka a run of blocks that are being read or written to). This would require using the struct's functions but not the struct itself. Taking a block of code from `open`:
// Read the active context bitpacked array and metadata. The former is
// only valid if `dirty` is false in the metadata, but that's the most
// common case, so we'll optimize for it.
let (active_contexts, metadata) =
    SerializableOnDiskMetadata::meta_from_file(file.as_raw_fd())?;
let active_contexts = if !metadata.dirty {
    active_contexts
} else {
    let (a_slots, b_slots) =
        SerializableOnDiskMetadata::slots_from_file(file.as_raw_fd())?;
    // ... the code that compares the on-disk extent data with slots
};
and another example:
fn set_block_contexts_contiguous(
    &mut self,
    block_contexts: &[DownstairsBlockContext],
) -> Result<u64> {
    for (a, b) in block_contexts.iter().zip(block_contexts.iter().skip(1)) {
        assert_eq!(a.block + 1, b.block, "blocks must be contiguous");
    }

    let mut writes = 0u64;
    for (slot, group) in block_contexts
        .iter()
        .group_by(|block_context| {
            // We'll be writing to the inactive slot
            !self.active_context[block_context.block as usize]
        })
        .into_iter()
    {
        let mut group = group.peekable();
        let block_start = group.peek().unwrap().block;
        let ctxs: Vec<OnDiskDownstairsBlockContext> = group
            .map(|block_context| OnDiskDownstairsBlockContext {
                block_context: block_context.block_context,
                on_disk_hash: block_context.on_disk_hash,
            })
            .collect();
        writes += SerializableOnDiskMetadata::partial_write_slot_contexts_contiguous(
            self.file.as_raw_fd(),
            slot,
            block_start,
            ctxs,
        )?;
    }

    if let Some(writes) = writes.checked_sub(1) {
        self.extra_syscall_count += writes;
        self.extra_syscall_denominator += 1;
    }
    Ok(writes)
}
Another question - would we want to version `SerializableOnDiskMetadata` as well? I know it already stores the extent version in `OnDiskMeta`, but this is an opportunity to add a version now where we may want one later.
What are your thoughts?
I agree with the goal of having a single struct that isolates details of the raw file format. My goal was for that to be … Previously, the SQLite export also had to know about the file format. I changed that in 634dae2, making the export use normal types and adding a new … At the end of the day, someone has to be calling …
    slot: ContextSlot,
) -> Result<(), CrucibleError>
where
    I: Iterator<Item = Option<&'a DownstairsBlockContext>>,
Outside of unit tests, I don't think it should be possible to write a `None` slot: even in the case that the Upstairs is writing an all-zero block, we'd want that to be an authenticated block of all zeroes instead of clearing the block back to unauthenticated.
I experimented with changing this signature, with mixed results. It turns out that we use the possibility of writing `None` in a few places:
- When migrating instances, to write the context slots in bulk
- When defragmenting context slots, for the same reason
It's possible to work around this, but it makes this code much trickier (see the diff below).
Because this is a private function – no one outside this module can call it – I'd vote to leave it as is, but would reconsider if you feel strongly about it.
For reference, here's the diff:
diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index b8459d8..6d8a624 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -545,18 +545,22 @@ impl RawInner {
assert_eq!(block_count, ctxs.len());
file.set_len(layout.file_size())?;
- layout.write_context_slots_contiguous(
- file,
- 0,
- ctxs.iter().map(Option::as_ref),
- ContextSlot::A,
- )?;
- layout.write_context_slots_contiguous(
- file,
- 0,
- std::iter::repeat(None).take(block_count),
- ContextSlot::B,
- )?;
+ for chunk in ctxs
+ .iter()
+ .group_by(|b| b.is_some())
+ .into_iter()
+ .filter(|(some, _group)| *some)
+ .map(|(_some, group)| group)
+ {
+ layout.write_context_slots_contiguous(
+ file,
+ 0,
+ chunk.map(|b| b.as_ref().unwrap()),
+ ContextSlot::A,
+ )?;
+ }
+ // `ContextSlot::B` is already initialized to all zeros, which
+ // deserializes to `None`.
layout.write_active_context_and_metadata(
file,
@@ -914,10 +918,7 @@ impl RawInner {
let mut group = group.peekable();
let start = group.peek().unwrap().block;
self.layout.write_context_slots_contiguous(
- &self.file,
- start,
- group.map(Option::Some),
- slot,
+ &self.file, start, group, slot,
)?;
writes += 1;
}
@@ -1113,12 +1114,25 @@ impl RawInner {
self.sync_index + 1;
}
}
- let r = self.layout.write_context_slots_contiguous(
- &self.file,
- counter.min_block,
- dest_slots.iter().map(|v| v.as_ref()),
- !copy_from,
- );
+
+ let mut r = Ok(());
+ for chunk in dest_slots
+ .iter()
+ .group_by(|b| b.is_some())
+ .into_iter()
+ .filter(|(some, _group)| *some)
+ .map(|(_some, group)| group)
+ {
+ if let Err(e) = self.layout.write_context_slots_contiguous(
+ &self.file,
+ 0,
+ chunk.map(|b| b.as_ref().unwrap()),
+ !copy_from,
+ ) {
+ r = Err(e);
+ break;
+ }
+ }
// If this write failed, then try recomputing every context slot
// within the unknown range
@@ -1251,7 +1265,7 @@ impl RawLayout {
slot: ContextSlot,
) -> Result<(), CrucibleError>
where
- I: Iterator<Item = Option<&'a DownstairsBlockContext>>,
+ I: Iterator<Item = &'a DownstairsBlockContext>,
{
let mut buf = self.buf.take();
buf.clear();
@@ -1259,11 +1273,11 @@ impl RawLayout {
for block_context in iter {
let n = buf.len();
buf.extend([0u8; BLOCK_CONTEXT_SLOT_SIZE_BYTES as usize]);
- let d = block_context.map(|b| OnDiskDownstairsBlockContext {
- block_context: b.block_context,
- on_disk_hash: b.on_disk_hash,
- });
- bincode::serialize_into(&mut buf[n..], &d).unwrap();
+ let d = OnDiskDownstairsBlockContext {
+ block_context: block_context.block_context,
+ on_disk_hash: block_context.on_disk_hash,
+ };
+ bincode::serialize_into(&mut buf[n..], &Some(d)).unwrap();
}
let offset = self.context_slot_offset(block_start, slot);
nix::sys::uio::pwrite(file.as_raw_fd(), &buf, offset as i64).map_err(
Circling back here, are those two cases still true?
- When migrating, the context slots start as all zeroes, which deserialize as None.
- After defragmenting, both context slots contain the same data - the current code doesn't assign None to overwrite one of them.
Force-pushed from fa37c53 to e584aa0.
    for block in counter.min_block..=counter.max_block {
        self.active_context[block as usize] = !copy_from;
    }
}
At this point in the function (if there was no error in writing the context slots), the in-memory active context array is different from the active contexts that were previously serialized to the file. The second of the following asserts will fire during the defragment tests:
diff --git a/downstairs/src/extent.rs b/downstairs/src/extent.rs
index 3457512..bbe4cd9 100644
--- a/downstairs/src/extent.rs
+++ b/downstairs/src/extent.rs
@@ -77,7 +77,7 @@ pub(crate) trait ExtentInner: Send + Debug {
}
/// BlockContext, with the addition of block index and on_disk_hash
-#[derive(Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq)]
pub struct DownstairsBlockContext {
pub block_context: BlockContext,
diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index 0e4f8ea..a35b1ca 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -1131,6 +1140,10 @@ impl RawInner {
self.active_context[block as usize] = !copy_from;
}
}
+
+ assert!(!self.dirty);
+ assert_eq!(self.active_context, self.layout.get_active_contexts(&self.file)?);
+
r.map(|_| ())
}
Also, the dirty bit is not set (as we just did a flush): if there is a crash now, `Extent::open` will see that the dirty bit is not set and deserialize the active contexts that were previously written to the file, without validating those against the block data on disk.
However, I think this is ok: we only defragment after a flush, so we know that the block data contents are on durable storage and that the active context slot is correct and matches them. Defragmentation only copies the contents of the active context slot to the inactive context slot, so after defragmentation both slots A and B contain the same (valid) context. The context will be correct for the block data no matter which one is picked.
The following proves that:
diff --git a/downstairs/src/extent.rs b/downstairs/src/extent.rs
index 3457512..bbe4cd9 100644
--- a/downstairs/src/extent.rs
+++ b/downstairs/src/extent.rs
@@ -77,7 +77,7 @@ pub(crate) trait ExtentInner: Send + Debug {
}
/// BlockContext, with the addition of block index and on_disk_hash
-#[derive(Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq)]
pub struct DownstairsBlockContext {
pub block_context: BlockContext,
diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index 0e4f8ea..585e18d 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -1065,6 +1065,13 @@ impl RawInner {
max_block: 0,
};
let mut b_count = a_count;
+
+ // PARANOID: make sure these match
+ assert_eq!(
+ self.active_context,
+ self.layout.get_active_contexts(&self.file)?,
+ );
+
for (i, s) in self.active_context.iter().enumerate() {
let count = match s {
ContextSlot::A => &mut a_count,
@@ -1086,6 +1093,7 @@ impl RawInner {
assert!(counter.count > 0);
assert!(counter.max_block >= counter.min_block);
let num_slots = counter.max_block + 1 - counter.min_block;
+ assert!(num_slots > 0);
// Read the source context slots from the file
let source_slots = self.layout.read_context_slots_contiguous(
@@ -1113,6 +1121,7 @@ impl RawInner {
self.sync_index + 1;
}
}
+
let r = self.layout.write_context_slots_contiguous(
&self.file,
counter.min_block,
@@ -1131,6 +1140,94 @@ impl RawInner {
self.active_context[block as usize] = !copy_from;
}
}
+
+ // Here, self.active_context does not match what is on disk, and dirty
+ // is not set. If there is a crash now, Extent::open will be called, and
+ // active context will be read from the disk and _not_ validated against
+ // the file data. This is ok if we defragment only after a flush has
+ // successfully occurred: due to how defragmentation works, when there
+ // is a mismatch between what is in memory now and what is on disk, the
+ // two context slots on disk are duplicates of each other, and match the
+ // file data on disk (because we just did a successful flush).
+ //
+ // Go through _every_ block, and debug assert that when there is a
+ // mismatch the two context slots on disk are duplicates of each other,
+ // and match the file data that is on disk.
+ assert!(!self.dirty);
+ assert_ne!(self.active_context, self.layout.get_active_contexts(&self.file)?);
+
+ let block_size = self.extent_size.block_size_in_bytes();
+ for (i, (in_memory_slot, on_disk_slot)) in std::iter::zip(self.active_context.iter(), self.layout.get_active_contexts(&self.file)?.iter()).enumerate() {
+ let block = i as u64;
+
+ let (buf, hash) = {
+ let mut buf = vec![0; block_size as usize];
+
+ nix::sys::uio::pread(
+ self.file.as_raw_fd(),
+ &mut buf,
+ (block_size as u64 * block) as i64,
+ )
+ .map_err(|e| {
+ CrucibleError::IoError(format!(
+ "extent {}: reading block {} data failed: {e}",
+ self.extent_number,
+ block,
+ ))
+ })?;
+
+ let hash = integrity_hash(&[&buf]);
+
+ (buf, hash)
+ };
+
+ if in_memory_slot != on_disk_slot {
+ // The in memory active slot does not match the on-disk active
+ // slot. Make sure that the on disk contexts are duplicates, and
+ // that they both match the on disk file data.
+ let a_context = self.layout.read_context_slots_contiguous(
+ &self.file,
+ block,
+ 1,
+ ContextSlot::A,
+ )?;
+ let b_context = self.layout.read_context_slots_contiguous(
+ &self.file,
+ block,
+ 1,
+ ContextSlot::B,
+ )?;
+ debug_assert_eq!(a_context, b_context);
+
+ if let Some(a_context) = a_context[0] {
+ // If the slot is Some, then there is encrypted data there.
+ debug_assert_eq!(a_context.block, block);
+ debug_assert_eq!(a_context.on_disk_hash, hash);
+ } else {
+ // If the slot is None, then the data should be all zeros
+ debug_assert_eq!(buf, vec![0u8; block_size as usize]);
+ }
+ } else {
+ // The in memory active slot matches the on disk active slot.
+ // Validate the on disk active slot while we're here.
+ let on_disk_context = self.layout.read_context_slots_contiguous(
+ &self.file,
+ block,
+ 1,
+ *on_disk_slot,
+ )?;
+
+ if let Some(on_disk_context) = on_disk_context[0] {
+ // If the slot is Some, then there is encrypted data there.
+ debug_assert_eq!(on_disk_context.block, block);
+ debug_assert_eq!(on_disk_context.on_disk_hash, hash);
+ } else {
+ // If the slot is None, then the data should be all zeros
+ debug_assert_eq!(buf, vec![0u8; block_size as usize]);
+ }
+ }
+ }
+
r.map(|_| ())
}
}
What do you think?
Excellent analysis, I agree with both parts (the cached context slots on disk disagree with active context slots in-memory, despite dirty being false; and this is still safe, because the cached context slots contain valid contexts)
I also added another assertion in 5ed99f7 to make sure no one tries to call `defragment` on a dirty extent.
Would it be worth adding a comment to describe the scenario and why that scenario is covered here, unless there is a specific test for it? I can imagine future me coming here and thinking I found a problem; a comment might help me understand that everything is okay.
Yup, added a comment in 971ed1e
Force-pushed from 06fa0d9 to 57a52b6.
Let's say for whatever reason the Upstairs is talking to one downstairs that's a sqlite backend, and two others that are both raw backends. Both are message version 5. After an unclean shutdown, or if a write lands on one but not the other before shutdown, the Upstairs may decide to reconcile when reconnecting to the set based on a mismatch here (Lines 6678 to 6707 in 6df8a67).
If that's the case, it'll start the repair process, which cannot succeed due to backend differences. I'm thinking that the downstairs needs to report which backend it's using to the Upstairs, but if this is the case then the Upstairs has to panic - repair is required but can't happen. What do you think? Is this too contrived a scenario?
@jmpesp I think the initial situation ("the Upstairs is talking to one downstairs that's a sqlite backend, and two others that are both raw backends") is too contrived. The downstairs should always be running the raw file backend, unless it's a previously-existing SQLite-backed volume and … In other words, to arrive in that initial situation would require the three Downstairs to disagree about whether a volume is …
My testing setup: …
Observations of interest: …
Numbers: …
@mkeeter another contrived scenario: one sqlite backend, two raw backends, both read-write this time. I was concerned that the Upstairs would panic if it received N block contexts from the sqlite backend but only one from the raw backends, if there was a mismatch in the length of … (Lines 4439 to 4467 in 6df8a67).
But this isn't the case. I reproduced this with an (admittedly contrived!) integration test, and the Upstairs filters out all but the correct hash here and here. However, I agree with what you said about the situation being too contrived. If we were not automatically upgrading every read-write downstairs this way, then it could happen, but every sqlite backend should be read-only after this commit lands. We'll have: …
The only way we would end up with a read-write downstairs using the sqlite backend after this commit lands is if a failure to migrate a read-write downstairs to the raw backend were ignored by the Downstairs, and it started anyway. This isn't what we do though!
I did some more poking at RAM, and generally feel good about merging this. Running the same volume shape as @faithanalog (128 GiB, 64 MiB extents, 512-byte blocks, unencrypted), I did get …
The vast majority of RAM is in this line: … This makes sense, because we've got a 512-byte block size. Note that the … The second largest RAM hog is …
Each extent contains 2x one-byte-per-block … We could shrink this further by bitpacking … Running the same tests on … The panic, just for reference: …
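As a sketch of the bitpacking suggestion (one bit per block instead of one byte; the packing helpers are hypothetical, not code from the PR):

// Pack the per-block active-context array into one bit per block
// (A = 0, B = 1). These helpers are illustrative, not from the PR.
fn pack_active_contexts(slots: &[ContextSlot]) -> Vec<u8> {
    let mut packed = vec![0u8; (slots.len() + 7) / 8];
    for (i, slot) in slots.iter().enumerate() {
        if matches!(slot, ContextSlot::B) {
            packed[i / 8] |= 1 << (i % 8);
        }
    }
    packed
}

fn unpack_active_contexts(packed: &[u8], block_count: usize) -> Vec<ContextSlot> {
    (0..block_count)
        .map(|i| {
            if packed[i / 8] & (1 << (i % 8)) != 0 {
                ContextSlot::B
            } else {
                ContextSlot::A
            }
        })
        .collect()
}

For a 64 MiB extent of 512-byte blocks (131072 blocks), that's 16 KiB instead of 128 KiB per array.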
I like the state of this change right now. I've been following the meta-discussion around this; I've also read through the new `extent_inner_raw` implementation and it makes sense to me (love the strategy for defragmentation!). I don't think I can fully comb through every part of this change (particularly version migration) right now, and I don't want to hold up the PR over that; if james and alan like it, that's good by me.
I have gotten into a couple Weird states while hammering the storage stack with fio jobs, but most of the time I don't hit those states, which is a good sign.
Because I can't deterministically reproduce them, I currently have no way of knowing whether those weird states are a consequence of changes in this PR, or whether they're caused by bugs that are already in main. Or both: this PR could be making bugs in other codepaths easier to hit simply by speeding things up. But from my understanding of this code, this PR does not really strike me as a likely source of deadlocks.
I'm going to keep trying to reproduce those ghosts because I do not like them. I don't feel compelled to hold this PR back because of them, unless someone else has a particular theory about them.
This PR adds a new backend for Crucible Downstairs extent files: a single on-disk file. The on-disk file includes the raw block contents, a small metadata block, and two slots per block for storing encryption context.
By alternating between slots, we get the same crash consistency behavior as the SQLite database, but with many fewer syncs – we only need to add an `fsync` call if we're about to overwrite a context slot that hasn't yet been synced to disk, i.e. someone has written the same block 3x in a row without a flush.

The new backend implements the same `trait ExtentInner` as the SQLite backend, introduced in #982. This means that it's mostly seamless, although I had to tweak a few small things around the edges.

Though this is a change to the on-disk format, we maintain support for existing (SQLite-based) volumes. Existing volumes fall into two categories, which we handle differently:
- Read-only volumes: we keep the `SqliteInner` implementation around and use it in this case.
- Read-write volumes: migrated to the raw file format at startup.

The migration uses the same strategy as live repair:
- Creating a separate staging directory and staging new files in there
- Renaming that directory (to `.migrate` in this case) once everything is staged
- Copying from the `.migrate` directory if it's present on startup

To simplify the implementation, this change requires a whole-rack update. Specifically, live repair between SQLite extents and raw file extents is not possible! We have bumped the `CRUCIBLE_MESSAGE_VERSION` to prevent the system from trying such a thing (though the message format remains identical).

One subtle point: we still have to support `.replace` directories with SQLite files, because a live repair of an older extent could have been interrupted before the system update!
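A sketch of that sync rule, tracking per-slot write generations against a flush counter. The field names here are illustrative; the PR's actual bookkeeping uses a similar `sync_index`, visible in the diffs above:

// Sketch of the "fsync before overwriting an unsynced slot" rule.
struct SlotSyncState {
    // Bumped every time the extent file is successfully fsync'd.
    sync_index: u64,
    // For each block, the sync_index at which slot A / slot B was
    // last written (0 = never written since open).
    last_write: Vec<[u64; 2]>,
}

impl SlotSyncState {
    /// True if writing `slot` for `block` must be preceded by an fsync:
    /// the slot was written after the last fsync, so overwriting it now
    /// could leave no durable valid context if we crash mid-write.
    fn needs_sync(&self, block: usize, slot: ContextSlot) -> bool {
        let s = match slot {
            ContextSlot::A => 0,
            ContextSlot::B => 1,
        };
        self.last_write[block][s] > self.sync_index
    }

    fn record_write(&mut self, block: usize, slot: ContextSlot) {
        let s = match slot {
            ContextSlot::A => 0,
            ContextSlot::B => 1,
        };
        self.last_write[block][s] = self.sync_index + 1;
    }

    fn record_sync(&mut self) {
        self.sync_index += 1;
    }
}

With alternating slots, a block's third write without an intervening flush targets a slot that was written since the last fsync, which is exactly the "3x in a row" case described above.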