sled-agent: Setup dump dev on M.2, savecore to U.2 #3586

Merged (4 commits) on Jul 16, 2023

Conversation

@lifning (Contributor) commented on Jul 13, 2023

Each time the sled-agent upserts a new disk, it enumerates every disk it knows about. If an M.2 is found, it runs dumpadm(8) to set it as the dump device. If a U.2 is also found, it has dumpadm(8) invoke savecore(8) to save the previous dump on the M.2, if any, to an encrypted zvol on the U.2, and mark the dump slice as empty.

This is a bit of a kludge due to savecore(8) not yet exposing a clear way to save-and-clear a dump partition other than the one configured system-wide by dumpadm. While redundant, this is for our purposes idempotent - as long as any M.2 is configured, dumps have a place to go, and as long as any U.2 is present, we will attempt to save any yet-unsaved kernel core dumps on all known M.2s to it.

(see RFD 118 for details of partition layout) (#2450, #2478)
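
For readers unfamiliar with the sled-agent internals, here is a minimal sketch of the flow described above; the types and the `dumpadm` wrapper signature are illustrative stand-ins, not this PR's actual code:

```rust
use camino::Utf8PathBuf;

#[derive(PartialEq)]
enum DiskVariant {
    M2, // boot device carrying the dump slice
    U2, // larger device carrying the encrypted crash-dump zvol
}

struct Disk {
    variant: DiskVariant,
    dump_slice: Utf8PathBuf,    // partition to hand to dumpadm (on M.2s)
    crash_dataset: Utf8PathBuf, // savecore destination directory (on U.2s)
}

// Called each time a disk is upserted; re-walks every known disk.
fn update_dumpdev_setup(
    disks: &[Disk],
    dumpadm: impl Fn(&Utf8PathBuf, Option<&Utf8PathBuf>) -> std::io::Result<()>,
) {
    // If any U.2 is present, its crash dataset becomes the savecore target.
    let savecore_dir = disks
        .iter()
        .find(|d| d.variant == DiskVariant::U2)
        .map(|d| d.crash_dataset.clone());

    for m2 in disks.iter().filter(|d| d.variant == DiskVariant::M2) {
        // Configure this M.2's slice as the dump device; if a savecore
        // directory exists, also save-and-clear any prior dump.
        if let Err(e) = dumpadm(&m2.dump_slice, savecore_dir.as_ref()) {
            eprintln!("failed to configure dump device {}: {e}", m2.dump_slice);
        }
    }
}
```

The redundancy noted above comes from re-running this for every upserted disk; the dumpadm/savecore invocations are arranged so that repeating them is harmless.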

@jclulow (Collaborator) commented on Jul 14, 2023

FYI: I have turned off the site/compliance/dump service by default as of oxidecomputer/helios@811f89c, which should mean that nothing from the OS image is doing anything with dumpadm or savecore. There should be no need to deal with the crash dump directory pilot might have created in the past, etc., as that should never end up existing on production systems.

Let me know if you need something else there!

@lifning force-pushed the dumpadm-setup branch 6 times, most recently from b915ee6 to 5572094, on July 15, 2023 05:04
@lifning marked this pull request as ready for review on July 15, 2023 05:08
@lifning requested a review from smklein on July 15, 2023 05:26
@jclulow (Collaborator) commented on Jul 15, 2023

FYI: an adjacent, relevant follow-up fix I have made to Helios: oxidecomputer/helios@6cc29f6

@lifning force-pushed the dumpadm-setup branch 2 times, most recently from ff70527 to 6fc06e7, on July 16, 2023 02:12
@smklein (Collaborator) left a comment:

The integration pieces of this make sense to me, and look good!

I have some questions that are 90% related to me trying to understand the savecore and dumpadm utilities.

Comment on lines 129 to 148
// ...but do create and use a tmpfs path (rather than the default
// location under /var/crash, which is in the ramdisk pool), because
// dumpadm refuses to do what we ask otherwise.
let tmp_crash = "/tmp/crash";
std::fs::create_dir_all(tmp_crash).map_err(DumpAdmError::Mkdir)?;
Collaborator:

... weird, so it's not possible to run -n without -s?

Contributor Author:

dumpadm doesn't like it if the configured savecore directory doesn't exist, so we're making sure it does (even though we're not using it in this case)
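
For context, a rough sketch of the kind of invocation under discussion (not the PR's exact code): the savecore directory passed via -s must exist for dumpadm to accept the configuration, even when -n tells it not to run savecore automatically on reboot.

```rust
use std::io;
use std::process::Command;

fn configure_dump_device(dump_slice: &str) -> io::Result<()> {
    // dumpadm validates the savecore directory, so create a throwaway
    // tmpfs path to satisfy it (we don't actually save cores here).
    let tmp_crash = "/tmp/crash";
    std::fs::create_dir_all(tmp_crash)?;

    let status = Command::new("dumpadm")
        .arg("-n") // do not run savecore automatically on reboot
        .arg("-d")
        .arg(dump_slice) // the M.2 dump slice being configured
        .arg("-s")
        .arg(tmp_crash) // must exist, even though it goes unused in this path
        .status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "dumpadm failed"));
    }
    Ok(())
}
```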

Comment on lines +111 to +125
// Include memory from the current process if there is one for the panic
// context, in addition to kernel memory:
cmd.arg("-c").arg("curproc");
Collaborator:

Is this a superset of the all option?

Contributor:

curproc dumps more than kernel but less than all:

    kernel    Kernel memory pages only.

    all       All memory pages.

    curproc   Kernel memory pages, and the memory pages of the process whose
              thread was currently executing on the CPU on which the crash
              dump was initiated. If the thread executing on that CPU is a
              kernel thread not associated with any user process, only kernel
              pages will be dumped.

Comment on lines +33 to +41
pub fn dump_flag_is_valid(
dump_slice: &Utf8PathBuf,
) -> Result<bool, DumpHdrError> {
Collaborator:

Does "true" here imply that there is a dump we should extract via savecore? If so: could we add that as a doc comment?

Comment on lines 142 to 163
if savecore_dir.is_some() {
if let Ok(true) = dump_flag_is_valid(dump_slice) {
return savecore();
}
}
Collaborator:

Are these conditionals basically "have we just noticed that a dump already exists on this slice, ready for extraction via savecore?"


#[derive(Default)]
pub struct DumpSetup {
savecore_lock: Arc<std::sync::Mutex<()>>,
Collaborator:

Since this lock is particularly opt-in, can we document what it's doing? Looks like "preventing us from running savecore concurrently", right?
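
A possible wording for that documentation, assuming the reading above (serializing savecore runs) is indeed what the lock is for:

```rust
use std::sync::Arc;

#[derive(Default)]
pub struct DumpSetup {
    /// Taken while invoking savecore(8): every disk upsert may independently
    /// decide a dump needs saving, and we only want one savecore at a time.
    savecore_lock: Arc<std::sync::Mutex<()>>,
}
```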

disks: &mut MutexGuard<'_, HashMap<DiskIdentity, DiskWrapper>>,
log: Logger,
) {
let mut dump_slices = Vec::new();
Collaborator:

Not a major issue, but it feels a little wasteful to call this and re-calculate all state for all disks when we're only actually inserting one disk at a time. We do have the DumpSetup struct; we could keep a cache of all the information we need?

(Not a blocker for this PR though, the only difference here is a minor efficiency difference)

Contributor Author:

yeah, that's what i'm thinking for a cleaner follow-up; the thought behind what's here now is "get dumps configured at all, these machines can probably handle counting to 10 redundantly a few times at startup for now"
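
A hypothetical shape for that follow-up (the type and field names are illustrative, not part of this PR): cache what has already been configured so each upsert only has to look at the newly inserted disk.

```rust
use std::collections::HashSet;

use camino::Utf8PathBuf;

// Sketch only: would live alongside DumpSetup's existing state.
#[derive(Default)]
pub struct DumpSetupCache {
    known_dump_slices: HashSet<Utf8PathBuf>,   // M.2 slices already handed to dumpadm
    known_savecore_dirs: HashSet<Utf8PathBuf>, // U.2 crash datasets already seen
}
```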

Comment on lines 173 to 175
// in the event that savecore(8) terminates before it finishes copying the
// dump, the incomplete dump will remain in the target directory, but the next
// invocation will overwrite it.
Collaborator:

Gotcha, so bounds is like "the last successful savecore iteration." Seems like this overwrite behavior is kinda what we want then, eh?

Comment on lines +114 to +120
Ok(false) => {
info!(log, "Dump slice {dump_slice:?} appears to have already been saved");
}
Err(err) => {
debug!(log, "Dump slice {dump_slice:?} appears to be unused: {err:?}");
}
Collaborator:

Should we be breaking out early in these cases? Or is the goal to map "dump slice" to "dump dir" even if no dump exists?

Contributor Author:

on the contrary, in these cases we definitely want to configure the system to use the dump slice, as there's nothing important there to worry about overwriting

Comment on lines +59 to +61
// TODO: a more reasonable way of deduplicating the effort.
let _guard = savecore_lock.lock();
Collaborator:

It looks like this lock exists to prevent us from running savecore concurrently, but we're configuring dumpadm in such as way that it should be running savecore on reboot, if there was a crash.

Do we risk racing between the "system-initiated savecore" and the "sled-agent-initiated savecore"?

Contributor Author:

there's no system-initiated savecore at boot time, actually! i need to change the flags here to be more clear about that. at boot-time, our system has no configured dump slice until sled-agent sets one, and so won't ever invoke savecore by itself.

Comment on lines +104 to +119
pub fn dumpadm(
dump_slice: &Utf8PathBuf,
savecore_dir: Option<&Utf8PathBuf>,
) -> Result<Option<OsString>, DumpAdmError> {
Collaborator:

Can we document -- what's the returned string here?

Comment on lines +61 to +71
let version =
f.read_u32::<LittleEndian>().map_err(DumpHdrError::ReadVersion)?;
if version != DUMP_VERSION {
return Err(DumpHdrError::InvalidVersion(version));
}
Contributor:

Is the point of checking the version that the flag semantics might change in the future? I am curious how often dump_version changes and whether this check is asking for trouble in the future.

Contributor Author:

yeah. i'm operating under the assumption that the position and semantics of the magic number and the version are stable, and that everything else may change if the version changes. ideally (and i've mentioned as much to rm) we wouldn't have any such version-sensitivity and this function would be replaced by shelling out to some kind of savecore --tell-me-if-there-is-one=/on/this/slice
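
To make that assumption concrete, a sketch of the header check being described; the constants are assumed to match illumos's <sys/dumphdr.h>, and the real code also seeks to the header's offset within the slice, elided here:

```rust
use std::fs::File;
use std::io;

use byteorder::{LittleEndian, ReadBytesExt};

const DUMP_MAGIC: u32 = 0xdefec8ed; // assumed stable across versions
const DUMP_VERSION: u32 = 10; // assumed current value; verify against dumphdr.h
const DF_VALID: u32 = 0x0000_0001; // "a dump is present and not yet saved"

fn has_unsaved_dump(dump_slice: &str) -> io::Result<bool> {
    let mut f = File::open(dump_slice)?;
    let magic = f.read_u32::<LittleEndian>()?;
    if magic != DUMP_MAGIC {
        return Ok(false); // no dump header at all
    }
    let version = f.read_u32::<LittleEndian>()?;
    if version != DUMP_VERSION {
        // Everything past this point may have moved; refuse to guess.
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("unrecognized dumphdr version {version}"),
        ));
    }
    let flags = f.read_u32::<LittleEndian>()?;
    Ok(flags & DF_VALID != 0)
}
```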

Comment on lines 167 to 169
// invokes savecore(8) according to the system-wide config set by dumpadm.
// savecore(8) creates a file in the savecore directory called `vmdump.<n>`,
// where `<n>` is the number in the neighboring plaintext file called `bounds`.
Contributor:

probably mention that the default, if bounds is not present (or parseable), is 0
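
For instance, a minimal sketch of reading bounds with that default (a hypothetical helper, not from the PR):

```rust
use std::path::Path;

// savecore names its output vmdump.<n>; <n> comes from the `bounds` file in
// the savecore directory, defaulting to 0 if the file is missing/unparseable.
fn read_bounds(savecore_dir: &Path) -> u32 {
    std::fs::read_to_string(savecore_dir.join("bounds"))
        .ok()
        .and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(0)
}
```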

lifning pushed a commit to lifning/omicron that referenced this pull request Jul 16, 2023
@lifning force-pushed the dumpadm-setup branch 2 times, most recently from 54bdb28 to 69cb0f9, on July 16, 2023 07:24
@lifning enabled auto-merge (squash) on July 16, 2023 10:22
@lifning merged commit f1cc092 into oxidecomputer:main on Jul 16, 2023
@lifning deleted the dumpadm-setup branch on July 16, 2023 10:48