
Add raw file backend #991

Merged: 62 commits merged into oxidecomputer:main from raw-file-backend on Nov 9, 2023

Conversation

@mkeeter (Contributor) commented Oct 6, 2023

This PR adds a new backend for Crucible Downstairs extent files: a single on-disk file. The on-disk file includes the raw block contents, a small metadata block, and two slots per block for storing encryption context.

By alternating between slots, we get the same crash consistency behavior as the SQLite database, but with many fewer syncs – we only need to add an fsync call if we're about to overwrite a context slot that hasn't yet been synced to disk, i.e. someone has written the same block 3x in a row without a flush.
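
To make the fsync-avoidance rule concrete, here's a minimal sketch of the bookkeeping, assuming a simple per-block write counter (the names and the counter-based approach are illustrative, not the PR's actual fields):

// Sketch only: count context-slot writes since the last fsync. With two
// slots, a third write in a row would overwrite a slot that is not yet
// durable, so we sync the file first.
struct BlockSyncState {
    writes_since_sync: u8,
}

impl BlockSyncState {
    fn must_sync_before_next_write(&self) -> bool {
        self.writes_since_sync >= 2
    }
    fn record_write(&mut self) {
        self.writes_since_sync += 1;
    }
    fn record_sync(&mut self) {
        self.writes_since_sync = 0;
    }
}

fn write_block(state: &mut BlockSyncState, sync_file: impl Fn()) {
    if state.must_sync_before_next_write() {
        sync_file(); // both context slots become durable again
        state.record_sync();
    }
    // ... write the block data and the new context into the inactive slot ...
    state.record_write();
}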

The new backend implements the same trait ExtentInner as the SQLite backend, introduced in #982 . This means that it's mostly seamless, although I had to tweak a few small things around the edges.
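
Roughly, the backend selection has this shape (an abbreviated, illustrative sketch only; the real ExtentInner trait from #982 has more methods and uses the crate's own error types):

// Illustrative sketch, not the actual trait definition.
trait ExtentInner: Send + std::fmt::Debug {
    fn dirty(&self) -> std::io::Result<bool>;
    fn flush_number(&self) -> std::io::Result<u64>;
    // ... read / write / flush / context-handling methods ...
}

// An extent owns whichever backend was selected when it was opened:
// the raw backend for new or migrated volumes, the SQLite backend for
// read-only snapshots.
struct Extent {
    inner: Box<dyn ExtentInner>,
}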

Though this is a change to the on-disk format, we maintain support for existing (SQLite-based) volumes. Existing volumes fall into two categories, which we handle differently:

  • Snapshots are stored on read-only filesystems, so we can't mess with them. We keep the old SqliteInner implementation around and use it in this case.
  • Volumes that aren't read-only are migrated from SQLite to raw extent files when they are opened. The migration uses the same strategy as live repair:
    • Creating a separate staging directory and staging new files in there
    • Renaming that directory (to .migrate in this case) once everything is staged
    • Copying from the .migrate directory if it's present on startup
    • EDIT: the migration uses a new strategy; see this comment

To simplify the implementation, this change requires a whole-rack update. Specifically, live repair between SQLite extents and raw file extents is not possible! We have bumped the CRUCIBLE_MESSAGE_VERSION to prevent the system from trying such a thing (though the message format remains identical).

One subtle point: we still have to support .replace directories with SQLite files, because a live repair of an older extent could have been interrupted before the system update!

@mkeeter (Contributor, Author) commented Oct 6, 2023

Performance notes: this is a ~30% speedup running pgbench on Alpine Linux, with an encrypted 128 GiB volume (4K blocks, 16384 blocks/extent).

[screenshot: pgbench results]

On the same system, it's roughly a 67% speedup for 4K random writes using fio. However, it's slower for 1M and 4M writes, for reasons I haven't yet ascertained.

main:
  WRITE: bw=15.2MiB/s (15.9MB/s), 15.2MiB/s-15.2MiB/s (15.9MB/s-15.9MB/s), io=910MiB (954MB), run=60009-60009msec
  WRITE: bw=371MiB/s (389MB/s), 371MiB/s-371MiB/s (389MB/s-389MB/s), io=21.8GiB (23.4GB), run=60064-60064msec
  WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=20.5GiB (22.0GB), run=60263-60263msec

raw-file-extents:
  WRITE: bw=25.5MiB/s (26.7MB/s), 25.5MiB/s-25.5MiB/s (26.7MB/s-26.7MB/s), io=1529MiB (1603MB), run=60005-60005msec
  WRITE: bw=311MiB/s (326MB/s), 311MiB/s-311MiB/s (326MB/s-326MB/s), io=18.2GiB (19.6GB), run=60076-60076msec
  WRITE: bw=342MiB/s (358MB/s), 342MiB/s-342MiB/s (358MB/s-358MB/s), io=20.1GiB (21.6GB), run=60301-60301msec

Here's the fio file I used for these tests:

[global]
filename=/dev/nvme0n1
iodepth=25
ioengine=aio
time_based
runtime=60
numjobs=1
direct=1
stonewall=1

[randwrite-4K]
bs=4K
rw=randwrite

[randwrite-1M]
bs=1M
rw=randwrite

[randwrite-4M]
bs=4M
rw=randwrite

@mkeeter (Contributor, Author) commented Oct 6, 2023

There are a few unit tests for automatic migration, but I also did an end-to-end test:

  • Created a 128 GiB SQLite-flavored volume using crucible-downstairs / dsc on main (same specs as above)
  • In a propolis-standalone VM, initialized that disk as ext4 and wrote 4GiB of random data to /mnt/lol.dat
  • Recorded the checksum of that random file
  • Exited the VM and propolis-standalone
  • Restarted 3x crucible-downstairs on this branch (raw-file-extents). This triggers automatic migration!
  • Restarted the VM
  • Mounted /dev/nvme0n1 to /mnt
  • Calculated the checksum of /mnt/lol.dat

Sure enough, the checksum matched after the migration.

However, the migration was a bit slow: it took 8 minutes for the crucible-downstairs to come up. I'm not sure if this is a problem!

Looking at the flamegraph, it appears to be mostly Good Honest Disk I/O in std::fs::copy:

[screenshot: flamegraph, dominated by std::fs::copy]

Right now, the procedure is to

  1. Compute the new raw extent file in a .copy directory, then rename it to .migrate
  2. Copy that file into place from a .migrate directory
  3. Rename the .migrate directory to .complete, then delete it

(with syncs in various places)

This makes sense for the SQLite repair process, which has multiple files to keep track of. However, we only care about a single file here. I'm wondering whether we could simply rename the raw file (instead of copying it), which I think should be atomic.
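
For what it's worth, here's a rough sketch of the rename-based idea (paths, file names, and the conversion helper are hypothetical; rename within a filesystem is atomic, and the surrounding fsyncs are what make the sequence crash-safe):

use std::fs::{self, File};
use std::path::Path;

// Illustrative only: stage the new raw extent file next to the old one,
// make it durable, then atomically rename it into place.
fn migrate_extent(dir: &Path, extent: u32) -> std::io::Result<()> {
    let old = dir.join(format!("{extent:03X}.db")); // hypothetical names
    let new = dir.join(format!("{extent:03X}"));
    let tmp = dir.join(format!("{extent:03X}.tmp"));

    // 1. Build the raw file in a temporary location.
    build_raw_file_from_sqlite(&old, &tmp)?;

    // 2. Make the new file's contents durable before the rename.
    File::open(&tmp)?.sync_all()?;

    // 3. Atomic rename: after a crash we see either the old layout or the
    //    complete new file, never a partial copy.
    fs::rename(&tmp, &new)?;

    // 4. Sync the directory so the rename itself is durable, then clean up.
    File::open(dir)?.sync_all()?;
    fs::remove_file(&old)?;
    Ok(())
}

fn build_raw_file_from_sqlite(_old: &Path, _new: &Path) -> std::io::Result<()> {
    unimplemented!("placeholder for the actual SQLite-to-raw conversion")
}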

@mkeeter (Contributor, Author) commented Oct 9, 2023

I've pushed a much faster migration procedure in dafe220, which is detailed in a long block comment. This brings migration time down to 57 seconds for a 128 GiB disk, with nearly all that time being spent in fully_rehash_and_clean_all_stale_contexts:

[screenshot: migration flamegraph]

@mkeeter (Contributor, Author) commented Oct 9, 2023

...and one more optimization in 553704a, which brings migration for the 128 GiB disk down to 9 seconds. That seems fast enough to me.

@leftwo added this to the 3 milestone on Oct 9, 2023
// by the `dirty` bit, but we're being _somewhat_ paranoid and manually
// forcing a rehash if any blocks have multiple contexts stored.
let ctxs = self.get_block_contexts(0, self.extent_size.value)?;
let needs_rehash = ctxs.iter().any(|c| c.len() > 1);
@mkeeter (Contributor, Author):

Feedback requested: is this paranoia justified, or should I just call self.fully_rehash_and_clean_all_stale_contexts(false)? and trust the dirty flag?

Contributor:

I'd say this is justified - sqlite could have a bug that creates this edge case.

@leftwo (Contributor) left a comment

Just really some minor things.

Have you run any of the tests in the ./tools directory on this? I'm going to take the nightly build from CI, put it on a bench gimlet, and let things run, but I'm not expecting to find any issues; this all seems really solid.

@leftwo (Contributor) commented Oct 10, 2023

Another win for these changes can be seen in the region creation test.
We can create larger regions much faster now:

Stock bits:

  SECONDS REGION_SIZE  EXTENT_SIZE     ES     EC   BS
  171.751  750.00 GiB    16.00 MiB   4096  48000 4096
   54.902  750.00 GiB    32.00 MiB   8192  24000 4096
   19.971  750.00 GiB    64.00 MiB  16384  12000 4096
    7.886  750.00 GiB   128.00 MiB  32768   6000 4096
  301.727 1024.00 GiB    16.00 MiB   4096  65536 4096
   90.760 1024.00 GiB    32.00 MiB   8192  32768 4096
   30.534 1024.00 GiB    64.00 MiB  16384  16384 4096
   11.440 1024.00 GiB   128.00 MiB  32768   8192 4096

These changes:

  SECONDS REGION_SIZE  EXTENT_SIZE     ES     EC   BS
   53.728  750.00 GiB    16.00 MiB   4096  48000 4096
   28.935  750.00 GiB    32.00 MiB   8192  24000 4096
   17.791  750.00 GiB    64.00 MiB  16384  12000 4096
   10.442  750.00 GiB   128.00 MiB  32768   6000 4096
   74.311 1024.00 GiB    16.00 MiB   4096  65536 4096
   38.683 1024.00 GiB    32.00 MiB   8192  32768 4096
   23.020 1024.00 GiB    64.00 MiB  16384  16384 4096
   14.461 1024.00 GiB   128.00 MiB  32768   8192 4096

@jmpesp (Contributor) left a comment

Here's a first pass for the review. I have unanswered questions about how ZFS behaves that I'll follow up on out of band.

Comment on lines 171 to 200
* We define a block being written to or not has if that block has
* `Some(...)` with a matching checksum serialized into a context slot
* or not. So it is required that a written block has a checksum.
Contributor:

s/has/as/g, I think

@mkeeter (Contributor, Author):

"So it is required that a written block as a checksum."?

(this doesn't make sense to me)

Contributor:

Maybe this line was updated after this comment was made haha, I was highlighting the first "has" here:

We define a block being written to or not has if that block has Some(...) with a matching checksum ...

Comment on lines 247 to 252
let next_slot =
self.set_block_context(&DownstairsBlockContext {
block_context: write.block_context,
block: write.offset.value,
on_disk_hash,
})?;
Contributor:

If any of the extent data writes below fail, then the downstairs will re-execute this whole function for the same writes argument. It looks like set_block_context is destructive for the current contexts, which means if this function is retried two times (failing each time during the extent data write) then it will overwrite the correct context for all N blocks in writes.

@mkeeter (Contributor, Author):

This is a little subtle, but I think it's correct.

  • Let's say that block 0 is using context slot A.
  • set_block_context will write to context slot B, but will not update self.active_context.
  • If we then fail the extent data write, we read the data back from the file and check which context slot is valid (a rough sketch of this recovery step follows the list below):
    • If the new data was written, according to reading back from the file (but then the write returned an error code?), we will update the context slot to slot B
    • If the new data wasn't written, we leave the context slot at slot A. This means that we're in the same place that we started, where the data in the file corresponds to context slot A
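
A rough sketch of that recovery step (the hash types and function shape are simplified, not the actual code in the PR):

#[derive(Copy, Clone, Debug, PartialEq)]
enum ContextSlot {
    A,
    B,
}

// After a failed data write, read the block back, hash it, and pick whichever
// slot describes what actually landed on disk.
fn recover_active_slot(
    on_disk_data_hash: u64,
    slot_a_hash: Option<u64>,
    slot_b_hash: Option<u64>,
    previously_active: ContextSlot,
) -> ContextSlot {
    // The new context went to the inactive slot; if it matches the data now
    // in the file, the write landed and that slot becomes active.
    let (candidate, candidate_hash) = match previously_active {
        ContextSlot::A => (ContextSlot::B, slot_b_hash),
        ContextSlot::B => (ContextSlot::A, slot_a_hash),
    };
    if candidate_hash == Some(on_disk_data_hash) {
        candidate
    } else {
        // The old data is still in the file; keep the old slot active.
        previously_active
    }
}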

@askfongjojo modified the milestones: 3, 4 on Oct 17, 2023
@mkeeter force-pushed the raw-file-backend branch 2 times, most recently from 06badc8 to a4dceeb on October 20, 2023 12:34
@mkeeter (Contributor, Author) commented Oct 23, 2023

I did a migration test for a fully populated 128 GiB volume (4KiB block size, 16384 blocks per extent).

To populate the volume, I ran dd if=/dev/urandom of=/dev/nvme0n1 bs=4194304 count=32768

Using kill -USR1 $PID to track the progress of dd, I saw about 259 MiB/sec.

When starting the raw-file-backend Crucible downstairs, it took 57 seconds to migrate the data over – basically all in SQLite!

@mkeeter (Contributor, Author) commented Oct 23, 2023

Migrating a dense 128 GiB region is now down to 42 seconds after 14c8eb7

@mkeeter (Contributor, Author) commented Oct 25, 2023

I forgot to post this at the time, but reorganizing context slots into AAAAAABBBBBBB (instead of ABABABABAB) + writing them in bulk fixed our fio performance numbers for larger writes:

main:
  WRITE: bw=15.2MiB/s (15.9MB/s), 15.2MiB/s-15.2MiB/s (15.9MB/s-15.9MB/s), io=910MiB (954MB),
  WRITE: bw=371MiB/s (389MB/s), 371MiB/s-371MiB/s (389MB/s-389MB/s), io=21.8GiB (23.4GB),
  WRITE: bw=348MiB/s (365MB/s), 348MiB/s-348MiB/s (365MB/s-365MB/s), io=20.5GiB (22.0GB),

raw-file-backend:
  WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=1698MiB (1781MB), 
  WRITE: bw=370MiB/s (388MB/s), 370MiB/s-370MiB/s (388MB/s-388MB/s), io=21.7GiB (23.3GB),
  WRITE: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s), io=21.9GiB (23.5GB),
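
For intuition about why the grouped layout helps, here's an illustrative offset calculation for the two arrangements (the slot size and function names are placeholders, not the real RawLayout helpers, and the base offset of the context region is ignored):

const BLOCK_CONTEXT_SLOT_SIZE_BYTES: u64 = 48; // placeholder value

#[derive(Copy, Clone)]
enum ContextSlot {
    A,
    B,
}

// Interleaved layout (ABABAB...): same-slot entries for consecutive blocks
// are two slot-sizes apart, so a bulk update of slot A for blocks 0..n
// touches n scattered ranges.
fn interleaved_offset(block: u64, slot: ContextSlot) -> u64 {
    let s = match slot {
        ContextSlot::A => 0,
        ContextSlot::B => 1,
    };
    (block * 2 + s) * BLOCK_CONTEXT_SLOT_SIZE_BYTES
}

// Grouped layout (AAAA...BBBB...): all A slots are contiguous, so the same
// bulk update is a single contiguous range (one pwrite).
fn grouped_offset(block: u64, slot: ContextSlot, block_count: u64) -> u64 {
    let s = match slot {
        ContextSlot::A => 0,
        ContextSlot::B => 1,
    };
    (s * block_count + block) * BLOCK_CONTEXT_SLOT_SIZE_BYTES
}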

@jmpesp (Contributor) left a comment

I'm concerned that the on-disk format of the raw file metadata slots is implemented manually in several places and not through, for example, some shared struct. I'd prefer replacing all the calls to pread/pwrite and bincode serialize_into/deserialize with some functions instead.

Something like this would be similar to what is written to disk:

struct SerializableOnDiskMetadata {
  a_slots: Vec<Option<OnDiskDownstairsBlockContext>>,
  b_slots: Vec<Option<OnDiskDownstairsBlockContext>>,
  active_contexts: Vec<ContextSlot>,
  metadata: OnDiskMeta,
};

Then you could imagine:

pub fn export_meta_and_context(
  &mut self
) -> Result<Vec<u8>, CrucibleError> {
  let ctxs = self.get_block_contexts(0, self.extent_size.value)?;
  let needs_rehash = ctxs.iter().any(|c| c.len() > 1);

  let ctxs = if needs_rehash {
    self.fully_rehash_and_clean_all_stale_contexts(true)?;
    self.get_block_contexts(0, self.extent_size.value)?
  } else {
    ctxs
  };

  let ctxs: Vec<Option<OnDiskDownstairsBlockContext>> = 
    ctxs.into_iter().map(|c| {
      match c.len() {
        0 => None,
        1 => Some(OnDiskDownstairsBlockContext {
          block_context: c[0].block_context,
          on_disk_hash: c[0].on_disk_hash,
        }),
        i => panic!("invalid context count: {i}"),
      }
    })
    .collect();

  SerializableOnDiskMetadata::buf_from_contexts(
    ctxs,
    self.dirty()?,
    self.flush_number()?,
    self.gen_number()?,
  )
}

where

impl SerializableOnDiskMetadata {
  pub fn buf_from_contexts(ctxs: Vec<Option<OnDiskDownstairsBlockContext>>, dirty: bool, flush_number: u64, gen_number: u64) -> Result<Vec<u8>, CrucibleError> {
    let block_count = ctxs.len();

    let meta = SerializableOnDiskMetadata {
      a_slots: ctxs,
      b_slots: vec![None; block_count],
      active_contexts: vec![ContextSlot::A; block_count],
      metadata: OnDiskMeta {
        dirty,
        flush_number,
        gen_number,
        ext_version: EXTENT_META_RAW, // new extent version for raw files
      }
    };

    meta.write_to_buf()
  }
}

Here, write_to_buf wraps creating the Vec at the right size and writing in the correct order using the correct methods.

We'd also want to support grabbing only parts of the metadata file, and updating only a subset of the slots (aka a run of blocks that are being read or written to). This would require using the struct's functions but not the struct itself. Taking a block of code from open:

        // Read the active context bitpacked array and metadata.  The former is
        // only valid if `dirty` is false in the metadata, but that's the most
        // common case, so we'll optimize for it.
        let (active_contexts, metadata) = SerializableOnDiskMetadata::meta_from_file(file.as_raw_fd())?;

        let active_contexts = if !metadata.dirty {
          active_contexts
        } else {
          let (a_slots, b_slots) = SerializableOnDiskMetadata::slots_from_file(file.as_raw_fd())?;

          // the code that compares the on-disk extent data with slots
        };

and another example:

    fn set_block_contexts_contiguous(
        &mut self,
        block_contexts: &[DownstairsBlockContext],
    ) -> Result<u64> {
        for (a, b) in block_contexts.iter().zip(block_contexts.iter().skip(1)) {
            assert_eq!(a.block + 1, b.block, "blocks must be contiguous");
        }

        let mut writes = 0u64;

        for (slot, group) in block_contexts
            .iter()
            .group_by(|block_context| {
                // We'll be writing to the inactive slot
                !self.active_context[block_context.block as usize]
            })
            .into_iter()
        {
            let mut group = group.peekable();
            let block_start = group.peek().unwrap().block;

            let ctxs: Vec<_> = group.map(|block_context| OnDiskDownstairsBlockContext {
              block_context: block_context.block_context,
              on_disk_hash: block_context.on_disk_hash,
            }).collect();

            writes += SerializableOnDiskMetadata::partial_write_slot_contexts_contiguous(
              self.file.as_raw_fd(),
              slot,
              block_start,
              ctxs,
            )?;
        }

        if let Some(writes) = writes.checked_sub(1) {
            self.extra_syscall_count += writes;
            self.extra_syscall_denominator += 1;
        }

        Ok(writes)
    }

Another question - would we want to version SerializableOnDiskMetadata as well? I know it already stores the extent version in OnDiskMeta, but this is an opportunity to add a version field now, in case we want one later.

What are your thoughts?

@mkeeter (Contributor, Author) commented Oct 26, 2023

I'm concerned that the on-disk format of the raw file metadata slots is implemented manually in several places and not through, for example, some shared struct. I'd prefer replacing all the calls to pread/pwrite and bincode serialize_into/deserialize with some functions instead.

I agree with the goal of having a single struct that isolates details of the raw file format. My goal was for that to be struct RawInner, rather than a new SerializableOnDiskMetadata. This is why RawInner has helper functions like meta_offset and context_slot_offset – it's already the thing that knows how the file is laid out on disk, so I wanted to keep all that info within the type.

Previously, the SQLite export also had to know about the file format. I changed that in 634dae2, making the export use normal types, and adding a new RawInner::import to generate the meta + context data.

At the end of the day, someone has to be calling pread / pwrite 😄. Let me do an audit of RawInner and see if there are other places where I could isolate and deduplicate those bits, and then we can talk through remaining concerns.

@mkeeter (Contributor, Author) commented Oct 26, 2023

@jmpesp After some poking, I did end up making a new class (RawLayout) and isolated all of the syscalls into there. Take a look at 2c41131 and see if that lines up with what you had in mind!

@jmpesp (Contributor) commented Oct 30, 2023

@jmpesp After some poking, I did end up making a new class (RawLayout) and isolated all of the syscalls into there. Take a look at 2c41131 and see if that lines up with what you had in mind!

💯 , this is what I was thinking yeah!

slot: ContextSlot,
) -> Result<(), CrucibleError>
where
I: Iterator<Item = Option<&'a DownstairsBlockContext>>,
Contributor:

Outside of unit tests, I don't think it should be possible to write a None slot in: even in the case that the Upstairs is writing an all-zero block, we'd want that to be an authenticated block of all zeroes instead of clearing the block back to unauthenticated.

@mkeeter (Contributor, Author):

I experimented with changing this signature, with mixed results. It turns out that we use the possibility of writing None in a few places:

  • When migrating instances, to write the context slots in bulk
  • When defragmenting context slots, for the same reason

It's possible to work around this, but it makes this code much trickier (see the diff below).

Because this is a private function – no one outside this module can call it – I'd vote to leave it as is, but would reconsider if you feel strongly about it.

For reference, here's the diff:

diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index b8459d8..6d8a624 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -545,18 +545,22 @@ impl RawInner {
         assert_eq!(block_count, ctxs.len());
 
         file.set_len(layout.file_size())?;
-        layout.write_context_slots_contiguous(
-            file,
-            0,
-            ctxs.iter().map(Option::as_ref),
-            ContextSlot::A,
-        )?;
-        layout.write_context_slots_contiguous(
-            file,
-            0,
-            std::iter::repeat(None).take(block_count),
-            ContextSlot::B,
-        )?;
+        for chunk in ctxs
+            .iter()
+            .group_by(|b| b.is_some())
+            .into_iter()
+            .filter(|(some, _group)| *some)
+            .map(|(_some, group)| group)
+        {
+            layout.write_context_slots_contiguous(
+                file,
+                0,
+                chunk.map(|b| b.as_ref().unwrap()),
+                ContextSlot::A,
+            )?;
+        }
+        // `ContextSlot::B` is already initialized to all zeros, which
+        // deserializes to `None`.
 
         layout.write_active_context_and_metadata(
             file,
@@ -914,10 +918,7 @@ impl RawInner {
             let mut group = group.peekable();
             let start = group.peek().unwrap().block;
             self.layout.write_context_slots_contiguous(
-                &self.file,
-                start,
-                group.map(Option::Some),
-                slot,
+                &self.file, start, group, slot,
             )?;
             writes += 1;
         }
@@ -1113,12 +1114,25 @@ impl RawInner {
                     self.sync_index + 1;
             }
         }
-        let r = self.layout.write_context_slots_contiguous(
-            &self.file,
-            counter.min_block,
-            dest_slots.iter().map(|v| v.as_ref()),
-            !copy_from,
-        );
+
+        let mut r = Ok(());
+        for chunk in dest_slots
+            .iter()
+            .group_by(|b| b.is_some())
+            .into_iter()
+            .filter(|(some, _group)| *some)
+            .map(|(_some, group)| group)
+        {
+            if let Err(e) = self.layout.write_context_slots_contiguous(
+                &self.file,
+                0,
+                chunk.map(|b| b.as_ref().unwrap()),
+                !copy_from,
+            ) {
+                r = Err(e);
+                break;
+            }
+        }
 
         // If this write failed, then try recomputing every context slot
         // within the unknown range
@@ -1251,7 +1265,7 @@ impl RawLayout {
         slot: ContextSlot,
     ) -> Result<(), CrucibleError>
     where
-        I: Iterator<Item = Option<&'a DownstairsBlockContext>>,
+        I: Iterator<Item = &'a DownstairsBlockContext>,
     {
         let mut buf = self.buf.take();
         buf.clear();
@@ -1259,11 +1273,11 @@ impl RawLayout {
         for block_context in iter {
             let n = buf.len();
             buf.extend([0u8; BLOCK_CONTEXT_SLOT_SIZE_BYTES as usize]);
-            let d = block_context.map(|b| OnDiskDownstairsBlockContext {
-                block_context: b.block_context,
-                on_disk_hash: b.on_disk_hash,
-            });
-            bincode::serialize_into(&mut buf[n..], &d).unwrap();
+            let d = OnDiskDownstairsBlockContext {
+                block_context: block_context.block_context,
+                on_disk_hash: block_context.on_disk_hash,
+            };
+            bincode::serialize_into(&mut buf[n..], &Some(d)).unwrap();
         }
         let offset = self.context_slot_offset(block_start, slot);
         nix::sys::uio::pwrite(file.as_raw_fd(), &buf, offset as i64).map_err(

Contributor:

Circling back here, are those two cases still true?

  • When migrating, the context slots start as all zeroes, which deserialize as None.
  • After defragmenting, both context slots contain the same data - the current code doesn't assign None to overwrite one of them.

for block in counter.min_block..=counter.max_block {
self.active_context[block as usize] = !copy_from;
}
}
Contributor:

At this point in the function (if there was no error in writing the context slots), the in-memory active context array is different from the active contexts that were previously serialized to the file. The second of the following asserts will fire during the defragment tests:

diff --git a/downstairs/src/extent.rs b/downstairs/src/extent.rs
index 3457512..bbe4cd9 100644
--- a/downstairs/src/extent.rs
+++ b/downstairs/src/extent.rs
@@ -77,7 +77,7 @@ pub(crate) trait ExtentInner: Send + Debug {
 }
 
 /// BlockContext, with the addition of block index and on_disk_hash
-#[derive(Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq)]
 pub struct DownstairsBlockContext {
     pub block_context: BlockContext,

diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index 0e4f8ea..a35b1ca 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -1131,6 +1140,10 @@ impl RawInner {
                 self.active_context[block as usize] = !copy_from;
             }
         }
+
+        assert!(!self.dirty);
+        assert_eq!(self.active_context, self.layout.get_active_contexts(&self.file)?);
+
         r.map(|_| ())
     }

Also the dirty bit is not set (as we just did a flush): if there is a crash now, Extent::open will see that the dirty bit is not set and deserialize the active contexts that were previously written to the file, and not validate those against the block's data on disk.

However, I think this is ok: we only defragment after a flush, so we know that the block data contents are on durable storage and that the active context slot is correct and matches them. Defragmentation only copies the contents of the active context slot to the inactive context slot, so after defragmentation both slots A and B contain the same (valid) context. The context will be correct for the block data no matter which one is picked.

The following proves that:

diff --git a/downstairs/src/extent.rs b/downstairs/src/extent.rs
index 3457512..bbe4cd9 100644
--- a/downstairs/src/extent.rs
+++ b/downstairs/src/extent.rs
@@ -77,7 +77,7 @@ pub(crate) trait ExtentInner: Send + Debug {
 }
 
 /// BlockContext, with the addition of block index and on_disk_hash
-#[derive(Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq)]
 pub struct DownstairsBlockContext {
     pub block_context: BlockContext,

diff --git a/downstairs/src/extent_inner_raw.rs b/downstairs/src/extent_inner_raw.rs
index 0e4f8ea..585e18d 100644
--- a/downstairs/src/extent_inner_raw.rs
+++ b/downstairs/src/extent_inner_raw.rs
@@ -1065,6 +1065,13 @@ impl RawInner {
             max_block: 0,
         };
         let mut b_count = a_count;
+
+        // PARANOID: make sure these match
+        assert_eq!(
+            self.active_context,
+            self.layout.get_active_contexts(&self.file)?,
+        );
+
         for (i, s) in self.active_context.iter().enumerate() {
             let count = match s {
                 ContextSlot::A => &mut a_count,
@@ -1086,6 +1093,7 @@ impl RawInner {
         assert!(counter.count > 0);
         assert!(counter.max_block >= counter.min_block);
         let num_slots = counter.max_block + 1 - counter.min_block;
+        assert!(num_slots > 0);
 
         // Read the source context slots from the file
         let source_slots = self.layout.read_context_slots_contiguous(
@@ -1113,6 +1121,7 @@ impl RawInner {
                     self.sync_index + 1;
             }
         }
+
         let r = self.layout.write_context_slots_contiguous(
             &self.file,
             counter.min_block,
@@ -1131,6 +1140,94 @@ impl RawInner {
                 self.active_context[block as usize] = !copy_from;
             }
         }
+
+        // Here, self.active_context does not match what is on disk, and dirty
+        // is not set. If there is a crash now, Extent::open will be called, and
+        // active context will be read from the disk and _not_ validated against
+        // the file data. This is ok if we defragment only after a flush has
+        // successfully occurred: due to how defragmentation works, when there
+        // is a mismatch between what is in memory now and what is on disk, the
+        // two context slots on disk are duplicates of each other, and match the
+        // file data on disk (because we just did a successful flush).
+        //
+        // Go through _every_ block, and debug assert that when there is a
+        // mismatch the two context slots on disk are duplicates of each other,
+        // and match the file data that is on disk.
+        assert!(!self.dirty);
+        assert_ne!(self.active_context, self.layout.get_active_contexts(&self.file)?);
+
+        let block_size = self.extent_size.block_size_in_bytes();
+        for (i, (in_memory_slot, on_disk_slot)) in std::iter::zip(self.active_context.iter(), self.layout.get_active_contexts(&self.file)?.iter()).enumerate() {
+            let block = i as u64;
+
+            let (buf, hash) = {
+                let mut buf = vec![0; block_size as usize];
+
+                nix::sys::uio::pread(
+                    self.file.as_raw_fd(),
+                    &mut buf,
+                    (block_size as u64 * block) as i64,
+                )
+                .map_err(|e| {
+                    CrucibleError::IoError(format!(
+                        "extent {}: reading block {} data failed: {e}",
+                        self.extent_number,
+                        block,
+                    ))
+                })?;
+
+                let hash = integrity_hash(&[&buf]);
+
+                (buf, hash)
+            };
+
+            if in_memory_slot != on_disk_slot {
+                // The in memory active slot does not match the on-disk active
+                // slot. Make sure that the on disk contexts are duplicates, and
+                // that they both match the on disk file data.
+                let a_context = self.layout.read_context_slots_contiguous(
+                    &self.file,
+                    block,
+                    1,
+                    ContextSlot::A,
+                )?;
+                let b_context = self.layout.read_context_slots_contiguous(
+                    &self.file,
+                    block,
+                    1,
+                    ContextSlot::B,
+                )?;
+                debug_assert_eq!(a_context, b_context);
+
+                if let Some(a_context) = a_context[0] {
+                    // If the slot is Some, then there is encrypted data there.
+                    debug_assert_eq!(a_context.block, block);
+                    debug_assert_eq!(a_context.on_disk_hash, hash);
+                } else {
+                    // If the slot is None, then the data should be all zeros
+                    debug_assert_eq!(buf, vec![0u8; block_size as usize]);
+                }
+            } else {
+                // The in memory active slot matches the on disk active slot.
+                // Validate the on disk active slot while we're here.
+                let on_disk_context = self.layout.read_context_slots_contiguous(
+                    &self.file,
+                    block,
+                    1,
+                    *on_disk_slot,
+                )?;
+
+                if let Some(on_disk_context) = on_disk_context[0] {
+                    // If the slot is Some, then there is encrypted data there.
+                    debug_assert_eq!(on_disk_context.block, block);
+                    debug_assert_eq!(on_disk_context.on_disk_hash, hash);
+                } else {
+                    // If the slot is None, then the data should be all zeros
+                    debug_assert_eq!(buf, vec![0u8; block_size as usize]);
+                }
+            }
+        }
+
         r.map(|_| ())
     }
 }

What do you think?

@mkeeter (Contributor, Author):

Excellent analysis, I agree with both parts (the cached context slots on disk disagree with active context slots in-memory, despite dirty being false; and this is still safe, because the cached context slots contain valid contexts)

@mkeeter (Contributor, Author):

I also added another assertion in 5ed99f7 to make sure no one tries to call defragment on a dirty extent.

Contributor:

Would it be worth adding a comment to describe the scenario and why it is covered here, unless there is a specific test for it? I can imagine future me coming here, thinking I've found a problem, and a comment might help me understand that everything is okay.

@mkeeter (Contributor, Author):

Yup, added a comment in 971ed1e

@mkeeter (Contributor, Author) commented Nov 7, 2023

After the last two commits (868bca4 and 0d4f573), we're using significantly less RAM than main.

                       main    raw-file-backend
  startup              394M    136M
  after running tests  2708M   448M

(measured using the RES column in htop)

The tests are my usual suite of fio + pgbench.

@jmpesp (Contributor) commented Nov 7, 2023

Let's say for whatever reason the Upstairs is talking to one downstairs that's a sqlite backend, and two others that are both raw backends. Both are message version 5. After an unclean shutdown, or if a write lands on one but not the other before shutdown, the Upstairs may decide to reconcile when reconnecting to the set based on a mismatch here:

crucible/upstairs/src/lib.rs

Lines 6678 to 6707 in 6df8a67

/*
 * Determine what extents don't match and what to do
 * about that
 */
let reconcile_list = self.mismatch_list(ds);
if let Some(reconcile_list) = reconcile_list {
    /*
     * We transition all the downstairs to needing repair here
     * while we have the downstairs lock. This will insure that
     * all downstairs enter the repair path.
     */
    ds.ds_state.iter_mut().for_each(|ds_state| {
        info!(self.log, "Transition from {} to Repair", *ds_state);
        /*
         * This is a panic and not an error because we should
         * not call this method without already verifying the
         * downstairs are in the proper state.
         */
        assert_eq!(*ds_state, DsState::WaitQuorum);
        *ds_state = DsState::Repair;
    });
    info!(
        self.log,
        "Found {:?} extents that need repair",
        reconcile_list.mend.len()
    );
    ds.convert_rc_to_messages(reconcile_list.mend, max_flush, max_gen);
    ds.reconcile_repair_needed = ds.reconcile_task_list.len();
    Ok(true)

If that's the case, it'll start the repair process, which cannot succeed due to backend differences. I'm thinking that the downstairs needs to report which backend it's using to the Upstairs, but if this is the case then the Upstairs has to panic - repair is required but can't happen.

What do you think? Is this too contrived a scenario?

@mkeeter (Contributor, Author) commented Nov 8, 2023

@jmpesp I think the initial situation ("the Upstairs is talking to one downstairs that's a sqlite backend, and two others that are both raw backends") is too contrived. The downstairs should always be running the raw file backend, unless it's a previously-existing SQLite-backed volume and read_only is true when we open it (since we automatically migrate existing SQLite-backed extents if they aren't read-only).

In other words, to arrive in that initial situation would require the three Downstairs to disagree about whether a volume is read_only; in that situation, I suspect many other things will also go wrong!

@faithanalog (Contributor) commented Nov 8, 2023

My testing setup

  • 2048 extents
  • 131072 blocks per extent
  • 512-byte block
  • 128GiB disk
  • no disk encryption (so no encryption overhead)
  • all fio tests
    • 120 seconds
    • direct IO
    • io_uring
    • io depth 48
    • sync between fio invocations

observations of interest

  • CPU usage is high on small writes

    • randwrite-16k: each downstairs is using 10-14 cores worth of CPU, 30% of total gimlet when you combine all 3
    • We are getting significant downstairs parallelization!!!
    • I think we can limit this by configuring the number of tokio workers if we want, but do we want to?
      • limiting this way might reduce burst random IOPS though
      • consider that most people are not doing random writes across all extents on the disk as fast as possible, especially not for long periods.
  • ram

    • measured by RES column in htop (NOT VIRT)
    • startup: 450MiB / downstairs
    • during initial long sustained sequential write, grows to 2.7GiB and holds steady
    • By the end of running all FIO tests, ram usage was different on 3 downstairs
      • 6.6 GiB / 4.4 GiB / 4.8 GiB
      • Why is it different? I don't know yet.
      • Not sure where the memory usage is coming from. Cached metadata in ram should be a small fraction of this size.
    • unclear whether RAM usage grows unbounded or eventually caps out.
    • (TODO: is it only on write ops?)
  • high backpressure under 4M randrw

    • backpressure capped at 76,000
    • one of the downstairs was moved to offline for 2 seconds
      • reconnected, did not go into live repair when it came back
    • took 35 seconds to drain
    • pure-write patterns do not end up in this backpressure situation
    • smaller randrws (4k, 16k) do not cause this
    • Unless you can invent a scenario where a customer will hit this, maybe not an immediate concern.
  • I managed to cause an upstairs lockup

    • the error was NOT on the latest commit; it was on 622ef9b
      • all other test results in this comment are on latest though.
    • I have had it happen twice but it doesn't happen every time.
    • TODO: can i reproduce on latest?
      • going to re-run the benches a bunch to see if it triggers
    • Previously I was only running pure-read and pure-write tests. With the move to iodriver I started including rw which does interleaved reads and writes
    • UPW stuck at 93, not changing
    • DSW stuck at 6281, not changing
    • BAKPR stuck at 40836, not changing
    • Propolis had to be force-killed
    • Sequence:
      • longwrite - special 300 second 4M write followed by brief 4k read
      • randread-16k
      • randread-4M
      • randread-4k
      • randrw-16k
      • randrw-4M
      • randrw-4k <-- error

numbers

  • Long sequential 4M write:

    • 509MiB/s
    • only 5ms latency cost incurred on read after the write finished!
  • rand read:

    • 4K: 8292 IOPS
    • 16K: 4546 IOPS
    • 4M: 25 IOPS (106MiB/s)
  • rand write:

    • 4K: 5194 IOPS
    • 16K: 3700 IOPS
    • 4M: 56 IOPS (224MiB/s)
  • rand rw:

    • 4K: 2751/2751 rw IOPS
      • notice the falloff vs pure read or pure write 4K
    • 16K: 2469/2469 rw IOPS
    • 4M: 10/10 rw IOPS (40MiB/s duplex)
  • seq read:

    • 4K: 8035 IOPS
    • 16K: 4199 IOPS
    • 4M: 24 IOPS (99.4MiB/s)
  • seq write:

    • 4K: 10,600 IOPS
    • 16K: 9006 IOPS
    • 4M: 104 IOPS (417MiB/s)
      • lower than initial longwrite - TODO: overhead from region with data already written to it, or overhead of long-running downstairs process?
  • seq rw:

    • 4K: 2795/2795 rw IOPS
    • 16K: 4321/4321 rw IOPS
    • 4M: 22/22 rw IOPS (much better than rand rw)
  • pgbench

    • unsure how to interpret these numbers but 520 tps on stock pgbench settings/test

@jmpesp (Contributor) commented Nov 8, 2023

@mkeeter another contrived scenario: one sqlite backend, two raw backends, both read-write this time. I was concerned that the Upstairs would panic if it received N block contexts from the sqlite backend but only one from the raw backends, if there was a mismatch in length of read_response_hashes here:

crucible/upstairs/src/lib.rs

Lines 4439 to 4467 in 6df8a67

/*
 * For a read, make sure the data from a previous read
 * has the same hash
 */
let read_data: Vec<ReadResponse> = read_data.unwrap();
assert!(!read_data.is_empty());
if job.read_response_hashes != read_response_hashes {
    // XXX This error needs to go to Nexus
    // XXX This will become the "force all downstairs
    // to stop and refuse to restart" mode.
    let msg = format!(
        "[{}] read hash mismatch on id {}\n\
         Expected {:x?}\n\
         Computed {:x?}\n\
         guest_id:{} request:{:?}\n\
         job state:{:?}",
        client_id,
        ds_id,
        job.read_response_hashes,
        read_response_hashes,
        job.guest_id,
        requests,
        job.state,
    );
    if job.replay {
        info!(self.log, "REPLAY {}", msg);
    } else {
        panic!("{}", msg);
    }

But this isn't the case. I reproduced this with an (admittedly contrived!) integration test, and the Upstairs filters out all but the correct hash here and here.

However, I agree with what you said about the situation being too contrived. If we were not automatically upgrading every read-write downstairs this way, then it could happen, but every sqlite backend should be read-only after this commit lands. We'll have:

  • old read-only downstairs that use the sqlite backend
  • new read-write downstairs that will use the raw backend
  • new read-only downstairs that are created from those, which use the raw backend

The only way we would end up with a read-write downstairs using the sqlite backend after this commit lands is if a failure to migrate a read-write downstairs to the raw backend is ignored by the Downstairs, and it starts anyway. This isn't what we do though!

@mkeeter (Contributor, Author) commented Nov 8, 2023

I did some more poking at RAM, and generally feel good about merging this.

Running the same volume shape as @faithanalog (128 GiB, 64 MiB extents, 512-byte blocks, unencrypted), I did get RES usage up to 2.7 GiB running some fio tests. Grabbing a core with gcore and looking at it with ::umastat in mdb, here's the relevant stuff:

cache                        buf     buf     buf     buf  memory     alloc alloc
name                        size  in use  in ptc   total  in use   succeed  fail
------------------------- ------ ------- ------- ------- ------- --------- -----
umem_magazine_1               16    3848       -    4032     64K      4014     0
umem_magazine_3               32    1039       -    2142     68K      3474     0
umem_magazine_7               64     219       -     378     24K      2533     0
umem_magazine_15             128     184       -    1054    136K      1325     0
umem_magazine_31             256     135       -     690    184K       961     0
umem_magazine_47             384     128       -     350    140K       730     0
umem_magazine_63             512     128       -     455    260K       459     0
umem_magazine_95             768   23738       -   23745   18.6M     23743     0
umem_magazine_143           1152      10       -      12     16K        10     0
umem_slab_cache               56  374363       -  374409   23.2M    374885     0
umem_bufctl_cache             24 2235372       - 2235744   69.3M   2238899     0
umem_bufctl_audit_cache      192       0       -       0       0         0     0
umem_alloc_8                   8       0       0       0       0         0     0
umem_alloc_16                 16   78754   78726   78876   1.22M     78754     0
umem_alloc_32                 32     599     452    3780    120K       599     0
umem_alloc_48                 48  368795  368505  368844   17.2M    368795     0
umem_alloc_64                 64     748     222     756     48K       748     0
umem_alloc_80                 80    2746     428    2750    220K      2746     0
umem_alloc_96                 96     248     115     252     24K       248     0
umem_alloc_112               112    2529     196    2556    284K      2529     0
umem_alloc_128               128   44016   41838   44020   5.55M     44016     0
umem_alloc_160               160     262     257     275     44K       262     0
umem_alloc_192               192     298     297     315     60K       298     0
umem_alloc_224               224    7025    7021    7038   1.53M      7025     0
umem_alloc_256               256     258     258     270     72K       258     0
umem_alloc_320               320     317     315     324    108K       317     0
umem_alloc_384               384     210     171     210     84K       210     0
umem_alloc_448               448     279     279     279    124K       279     0
umem_alloc_512               512       1       -     272    136K    446022     0
umem_alloc_640               640       2       - 2218494   1.41G 110102664     0
umem_alloc_768               768       2       -     405    324K      3761     0
umem_alloc_896               896       0       -   10710   9.30M    862946     0
umem_alloc_1152             1152       4       -     357    408K    336394     0
umem_alloc_1344             1344       1       -     198    264K       440     0
umem_alloc_1600             1600       0       -     270    432K     70220     0
umem_alloc_2048             2048       0       -     274    548K     59246     0
umem_alloc_2304             2304     129       -     434    992K    276587     0
umem_alloc_2688             2688       0       -      78    208K       175     0
umem_alloc_4096             4096       0       -     156    624K    128351     0
umem_alloc_4544             4544       6       -     944   4.15M    279220     0
umem_alloc_8192             8192       1       -     157   1.23M    128585     0
umem_alloc_9216             9216       1       -     188   1.65M    275021     0
umem_alloc_12288           12288       0       -      84   1008K       152     0
umem_alloc_16384           16384       3       -     189   2.95M    128901     0
umem_alloc_24576           24576       1       -     249   5.84M    271548     0
umem_alloc_32768           32768       0       -     254   7.94M    126536     0
umem_alloc_40960           40960       1       -     244   9.53M    300132     0
umem_alloc_49152           49152       0       -      78   3.66M     68793     0
umem_alloc_57344           57344       0       -     325   17.8M     74740     0
umem_alloc_65536           65536       0       -     130   8.12M     50137     0
umem_alloc_73728           73728       0       -      93   6.54M      2246     0
umem_alloc_81920           81920       0       -     178   13.9M      2404     0
umem_alloc_90112           90112       0       -      53   4.55M       120     0
umem_alloc_98304           98304       0       -     173   16.2M      2956     0
umem_alloc_106496         106496       0       -     190   19.3M     47781     0
umem_alloc_114688         114688       0       -       0       0         0     0
umem_alloc_122880         122880       0       -     195   22.9M     50837     0
umem_alloc_131072         131072       0       -       0       0         0     0
------------------------- ------ ------- ------- ------- ------- --------- -----
Total [umem_internal]                                       112M   2651033     0
Total [umem_default]                                       1.59G 114603999     0
------------------------- ------ ------- ------- ------- ------- --------- -----

vmem                         memory     memory    memory     alloc alloc
name                         in use      total    import   succeed  fail
------------------------- --------- ---------- --------- --------- -----
sbrk_top                      2.37G      3.12G         0    647807  6463
    sbrk_heap                 2.37G      2.37G     2.37G    647807     0
        vmem_internal          130M       130M      130M     31512     0
            vmem_seg           123M       123M      123M     31482     0
            vmem_hash         6.66M      6.66M     6.66M        27     0
            vmem_vmem         45.1K      54.0K       36K        15     0
        umem_internal          129M       129M      129M     28790     0
            umem_cache         958K      1.13M     1.13M        58     0
            umem_hash         16.1M      16.1M     16.1M        55     0
        umem_log                  0          0         0         0     0
        umem_firewall_va          0          0         0         0     0
            umem_firewall         0          0         0         0     0
        umem_oversize          518M       534M      534M    169418     0
        umem_memalign         96.3K       116K      116K    219360     0
        umem_default          1.59G      1.59G     1.59G    381666     0
------------------------- --------- ---------- --------- --------- -----

The vast majority of RAM is in this line:

cache                        buf     buf     buf     buf  memory     alloc alloc
name                        size  in use  in ptc   total  in use   succeed  fail
------------------------- ------ ------- ------- ------- ------- --------- -----
umem_alloc_640               640       2       - 2218494   1.41G 110102664     0

This makes sense, because we've got a 512-byte block size. Note that the buf in use column is very small; this is entirely the allocator deciding not to give 640-byte chunks back to the OS, in case the process needs them again. Allocator tuning is probably out of scope for this PR.

The second largest RAM hog is

vmem                         memory     memory    memory     alloc alloc
name                         in use      total    import   succeed  fail
------------------------- --------- ---------- --------- --------- -----
        umem_oversize          518M       534M      534M    169418     0

Each extent contains two one-byte-per-block Vecs of block metadata. There are 131072 blocks per extent, so it makes sense that these would fall into the umem_oversize bucket; they can't fit into umem_alloc_131072, since the allocator takes a little extra room for metadata. Anyway, multiplying 131072 bytes/Vec * 2 Vecs/extent * 2048 extents = 512 MiB, which neatly explains umem_oversize at 518M.

We could shrink this further by bitpacking active_context and context_slot_dirty into a single u8, but 512 MiB of metadata in RAM for 128 GiB of storage doesn't seem unreasonable. (This is also another case of 512-byte blocks being Bad, Actually)
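
A sketch of that bitpacking idea, one byte per block (the encoding is illustrative; the real context_slot_dirty bookkeeping may need more states than a single bit):

// Bit 0: which slot is active; bit 1: "a slot was written since the last sync".
#[derive(Copy, Clone, PartialEq, Eq, Debug)]
struct PackedBlockMeta(u8);

impl PackedBlockMeta {
    fn new(active_is_b: bool, dirty: bool) -> Self {
        Self((active_is_b as u8) | ((dirty as u8) << 1))
    }
    fn active_is_b(self) -> bool {
        self.0 & 0b01 != 0
    }
    fn dirty(self) -> bool {
        self.0 & 0b10 != 0
    }
}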


Running the same tests on main got up to 2.7 GiB of RAM used, then fio failed due to hitting the Downstairs IO limit (IO_OUTSTANDING_MAX) and propolis-standalone panicked. We probably need to tune backpressure on unencrypted volumes, but that's orthogonal to this PR; I mostly wanted to confirm that it doesn't make things worse.

The panic, just for reference:

Nov 08 19:16:03.596 INFO [0] Proc runs for 127.0.0.1:8810 in state Faulted
Nov 08 19:16:03.607 INFO [0] upstairs guest_io_ready=TRUE, promote! session 0b3ab39d-5a43-4eac-a6f4-01696513ce6e
thread 'tokio-runtime-worker' panicked at 'called `Option::unwrap()` on a `None` value', /home/matt/crucible/upstairs/src/lib.rs:8697:61
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Abort (core dumped)

@faithanalog (Contributor) left a comment

I like the state of this change right now. I've been following the meta-discussion around this; I've also read through the new extent_inner_raw implementation and it makes sense to me (love the strategy for defragmentation!). I don't think I can fully comb through every part of this change (particularly version migration) right now, and I don't want to hold up the PR over that; if James and Alan like it, that's good by me.

I have gotten into a couple Weird states while hammering the storage stack with fio jobs, but most of the time I don't hit those states, which is a good sign.

Because I can't deterministically reproduce them, I currently have no way of knowing if those weird states are a consequence of changes in this PR, or if they're caused by bugs that are already in main. Or both: this PR making bugs in other codepaths easier to hit simply by speeding things up, also an option. But from my understanding of this code, this PR does not really strike me as a likely source of deadlocks.

I'm going to keep chasing down trying to reproduce those ghosts because I do not like them. I don't feel compelled to hold this PR back because of them, unless someone else has a particular theory about them.

@mkeeter merged commit 3927d8d into oxidecomputer:main on Nov 9, 2023
18 checks passed
@mkeeter deleted the raw-file-backend branch on November 9, 2023 15:37