
automatic debug data collection without running the system out of space #2478

Closed · 7 tasks done
davepacheco opened this issue Mar 3, 2023 · 5 comments · Fixed by #3788

davepacheco (Collaborator) commented Mar 3, 2023

This ticket covers the minimum target work required to (1) make sure we have basic debug data collected on all systems, while (2) not putting system availability at risk (by filling up important ZFS datasets or the pool itself). This came out of a recorded discussion on 2023-03-03.

  • Make sure that on each sled we create a "debug" ZFS dataset, probably on one U.2 device, that will store crash dumps, core files, log files, and potentially other regularly-collected data. Determine (as a matter of policy) how large we're willing to let it get and assign that as its quota. (I assume this will be Sled Agent that does this but I'm not sure.) (Optional: create separate child datasets for core files, dumps, logs, etc. so that we can control their quotas separately [so that a flurry of crash dumps doesn't starve log files or vice versa].)
  • Set up a dump device (Sled Agent?)
  • Configure dumpadm to save crash dumps in that dataset. (Sled Agent?)
  • Configure coreadm to save core dumps in that dataset. (Sled Agent?)
  • Configure cron + logadm in all control plane zones to rotate all log files in the zone into some known location. (Probably part of the image build.)
  • Configure cron + logadm in the global zone to rotate all log files into the "debug" dataset. (Sled Agent? Host OS image?)
  • Update Sled Agent to manage the storage in the "debug" dataset according to some policy, which will presumably start very simply (e.g., delete the oldest files in the dataset until free space in the dataset reaches a watermark like 20%; see the sketch after this list).
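
A minimal sketch of that last item's policy, assuming the archived files sit in a flat directory and the dataset quota is known up front (the names and numbers here are illustrative, not the actual Sled Agent implementation):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Delete the oldest files under `debug_dir` until its usage drops to
/// `target_fraction` of `quota_bytes` (e.g. 0.8 leaves ~20% of the quota free).
fn enforce_debug_quota(
    debug_dir: &Path,
    quota_bytes: u64,
    target_fraction: f64,
) -> io::Result<()> {
    // Collect (mtime, size, path) for every regular file in the dataset.
    let mut files: Vec<_> = fs::read_dir(debug_dir)?
        .filter_map(|entry| {
            let entry = entry.ok()?;
            let meta = entry.metadata().ok()?;
            if meta.is_file() {
                Some((meta.modified().ok()?, meta.len(), entry.path()))
            } else {
                None
            }
        })
        .collect();

    // Oldest modification time first, so the least recent debug data goes first.
    files.sort_by_key(|(mtime, _, _)| *mtime);

    let mut used: u64 = files.iter().map(|(_, len, _)| *len).sum();
    let target = (quota_bytes as f64 * target_fraction) as u64;

    for (_, len, path) in files {
        if used <= target {
            break;
        }
        fs::remove_file(&path)?;
        used = used.saturating_sub(len);
    }
    Ok(())
}
```

Deleting until usage falls back to, say, 80% of the quota is the same thing as holding the 20% free-space watermark mentioned above.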

There are some known limitations of this: most notably that files are not replicated across multiple devices. Failure of the wrong U.2 (or removal from the system) means we lose the debug data from that system. This can be mitigated in future work (e.g., by copying them to a second dataset on another pool, or copying them into some intra-rack storage system, etc.).

It's conceivable that we ship an MVP without this, but it's a pretty big risk. Either we don't manage the storage (in which case we risk running out of disk space and disrupting service) or we turn off all these sources of data (in which case we'll have a pretty hard time fixing anything).

jclulow (Collaborator) commented Mar 3, 2023

Note that in addition to configuring the dump directory it will probably be necessary to run savecore with appropriate arguments, to check the dump device for an existing dump and save it out. There is also the question of which M.2 we use for dumping (is it the BSU from which we booted?) and then how and whether to check the other BSU for a prior dump we have not yet extracted.

wesolows commented Mar 7, 2023

Does this cover getting the boot-time logs (including fmd state) out of the ramdisk and into persistent storage, once sled agent has created and imported that pool? There is some dance to be done there that will involve restarting daemons, etc. Alternatively, service dependencies need to be created so that services writing to /var/log and /var/fm (but not /var/lock or other places we really do NOT want to persist state) can't start until sled agent has done that. There are any number of ways to solve this set of problems and I don't mean to constrain it. This isn't strictly about support, so it could also be part of a different bug, but it is related to how/where/when we run savecore.

davepacheco (Collaborator, Author) replied to the comment above:

Good question. Yes, I was assuming that Sled Agent would be responsible for periodically copying a bunch of GZ log files into the debug dataset, including various FMA error logs and things like /var/log/messages. I was assuming (perhaps naively?) that we could rely on the existing logadm configuration that rotates those logs and then copy the rotated logs into the debug dataset. (Note that the logadm configuration for things like the FMA error log uses fmadm rotate, which I assume takes care of any signaling of fmd that's necessary.)

@askfongjojo askfongjojo modified the milestones: MVP, FCS May 22, 2023
@lifning lifning self-assigned this Jun 15, 2023
lifning added a commit that referenced this issue Jul 16, 2023
Each time the sled-agent upserts a new disk, it enumerates every disk it
knows about. If an M.2 is found, it runs dumpadm(8) to set it as the
dump device. If a U.2 is *also* found, it invokes savecore(8) to
save the previous dump on the M.2, if any, to an encrypted zvol
on the U.2, and mark the dump slice as empty.

This is a bit of a kludge due to savecore(8) not yet exposing a clear
way to save-and-clear a dump *partition* other than the one configured
system-wide by dumpadm. While redundant, this is for our purposes
idempotent - as long as *any* M.2 is configured, dumps have a place to
go, and as long as any U.2 is present, we will attempt to save any
yet-unsaved kernel core dumps on all known M.2s to it.

(see RFD 118 for details of partition layout) (#2450, #2478)
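
A rough sketch of the flow that commit describes, shelling out to dumpadm(8) and savecore(8). The flags shown (dumpadm -d, savecore -v -f) exist, but the exact invocations, device paths, and error handling used by sled-agent are assumptions here:

```rust
use std::path::Path;
use std::process::Command;

/// Point the system's dump configuration at an M.2 dump slice and, if a U.2
/// savecore directory is available, extract any pending dump into it.
/// Illustrative only; not the actual sled-agent code.
fn configure_dumps(m2_dump_slice: &Path, u2_savecore_dir: Option<&Path>) -> std::io::Result<()> {
    // dumpadm -d <device>: use this slice as the system dump device.
    run(Command::new("dumpadm").arg("-d").arg(m2_dump_slice))?;

    if let Some(dir) = u2_savecore_dir {
        // savecore -f <dumpfile> <dir>: save any dump present on the given
        // device into the debug directory, rather than the default device.
        run(Command::new("savecore")
            .arg("-v")
            .arg("-f")
            .arg(m2_dump_slice)
            .arg(dir))?;
    }
    Ok(())
}

fn run(cmd: &mut Command) -> std::io::Result<()> {
    let status = cmd.status()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("{cmd:?} exited with {status}"),
        ))
    }
}
```
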
morlandi7 commented: see #3586

lifning pushed a commit to lifning/omicron that referenced this issue Jul 17–19, 2023
(Part of oxidecomputer#2478)

This configures coreadm to put all core dumps onto the M.2 'crash'
dataset, and creates a thread that rotates them all onto a U.2 'debug'
dataset every 5 minutes.

This also refactors the dumpadm/savecore code to be less redundant and
more flexible, and adds an amount of arbitrary logic for e.g. picking
the U.2 onto which to save cores.

Yet to do: Monitoring for datasets reaching capacity and choosing a
different one.
lifning (Contributor) commented Jul 22, 2023

awaiting review:

lifning added a commit that referenced this issue Jul 24, 2023
(Part of #2478, continued in #3713)

This configures coreadm to put all core dumps onto the M.2 'crash'
dataset, and creates a thread that moves them all onto a U.2 'debug'
dataset every 5 minutes.

This also refactors the dumpadm/savecore code to be less redundant and
more flexible, and adds an amount of arbitrary logic for e.g. picking
the U.2 onto which to save cores.
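
A rough sketch of the coreadm(8) configuration that commit describes; the crash-dataset mountpoint and the core-file pattern are assumptions, not the actual sled-agent values:

```rust
use std::process::Command;

/// Configure global core dumps to land in the M.2 'crash' dataset.
/// Illustrative only; mountpoint and pattern are assumptions.
fn configure_coreadm(crash_dir: &str) -> std::io::Result<()> {
    // coreadm -g <pattern>: set the global core file pattern (%f = executable
    // name, %p = pid); coreadm -e global: enable global core dumps.
    let status = Command::new("coreadm")
        .arg("-g")
        .arg(format!("{crash_dir}/core.%f.%p"))
        .arg("-e")
        .arg("global")
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "coreadm exited with failure",
        ));
    }
    Ok(())
}
```
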
lifning added a commit that referenced this issue Jul 24, 2023
…ebug dataset (#3713)

This periodically moves logs rotated by logadm in cron
(oxidecomputer/helios#107) into the crypt/debug
zfs dataset on the U.2 chosen by the logic in #3677. It replaces the
rotated number (*.log.0, *.log.1) with the unix epoch timestamp of the
rotated log's modification time such that they don't collide when
collected repeatedly (logadm will reset numbering when the previous ones
are moved away).

(for #2478)
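
A minimal sketch of the renaming scheme described there, as a hypothetical helper that archives a single rotated file (not the actual sled-agent code):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Archive one logadm-rotated file (e.g. "nexus.log.0") into `debug_dir`,
/// keyed by the rotated file's modification time so that repeated collection
/// passes don't collide or overwrite each other.
fn archive_rotated_log(rotated: &Path, debug_dir: &Path) -> io::Result<()> {
    let mtime_secs = fs::metadata(rotated)?
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?
        .as_secs();

    // "nexus.log.0" -> "nexus.log.1690240000"
    let base = rotated
        .file_name()
        .and_then(|n| n.to_str())
        .and_then(|n| n.rsplit_once('.'))
        .map(|(base, _rotation_number)| base.to_owned())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "unexpected log name"))?;

    // A rename only works within one filesystem; the real flow copies from the
    // source filesystem into the U.2 dataset and then removes the original.
    fs::rename(rotated, debug_dir.join(format!("{base}.{mtime_secs}")))?;
    Ok(())
}
```
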
lifning added a commit that referenced this issue Jul 24, 2023
…/debug datasets approaching quota (#3735)

In the event that the `crypt/debug` dataset currently in use for
archival fills past 80% of its 100G quota, sled-agent will switch to
using one on another U.2. But if *all* of those datasets fill up that
much, it will instead find whichever of them holds the oldest archived
files and delete the oldest files there until usage approaches 70% of
the quota.

(NB: it isn't yet doing the calculation of how many files to delete in
terms of on-disk size (after zfs's gzip-9))

(Part of #2478)
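
A minimal sketch of that selection logic, with each candidate's usage fraction supplied by the caller rather than queried from ZFS (hypothetical helper, not the actual sled-agent code):

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Choose which crypt/debug dataset to archive into next: prefer one under
/// the usage threshold; if all are over it, fall back to the dataset holding
/// the oldest archived file (the best candidate for pruning). Each candidate
/// is (mountpoint, used fraction of quota).
fn choose_debug_dataset(candidates: &[(PathBuf, f64)]) -> Option<PathBuf> {
    const THRESHOLD: f64 = 0.80; // 80% of the dataset's quota

    if let Some((path, _)) = candidates.iter().find(|(_, used)| *used < THRESHOLD) {
        return Some(path.clone());
    }

    // Every dataset is over the threshold: pick the one whose oldest file is oldest.
    candidates
        .iter()
        .filter_map(|(path, _)| Some((oldest_mtime(path)?, path.clone())))
        .min_by_key(|(mtime, _)| *mtime)
        .map(|(_, path)| path)
}

/// Modification time of the oldest entry directly under `dir`.
fn oldest_mtime(dir: &Path) -> Option<SystemTime> {
    fs::read_dir(dir)
        .ok()?
        .filter_map(|entry| entry.ok()?.metadata().ok()?.modified().ok())
        .min()
}
```
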
@morlandi7 morlandi7 modified the milestones: FCS, 1.0.1 Jul 28, 2023
@jordanhendricks jordanhendricks added the Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) label Aug 11, 2023
@morlandi7 morlandi7 modified the milestones: 1.0.1, 1.0.2 Aug 15, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.2, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 17, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5, 6 Nov 29, 2023
@morlandi7 morlandi7 linked a pull request Jan 26, 2024 that will close this issue
lifning added a commit that referenced this issue Feb 1, 2024
Verifies decision-making in different combinations of M.2/U.2 dataset
and dump slice availability and occupancy, and tests log/core-archiving.
(functionality that had been implemented for #2478)