
automatic debug data collection without running the system out of space #2478

Closed · 7 tasks done
davepacheco opened this issue Mar 3, 2023 · 5 comments · Fixed by #3788

davepacheco (Collaborator) commented Mar 3, 2023

This ticket covers the minimum target work required to (1) make sure we have basic debug data collected on all systems, while (2) not putting system availability at risk (by filling up important ZFS datasets or the pool itself). This came out of a recorded discussion on 2023-03-03.

  • Make sure that on each sled we create a "debug" ZFS dataset, probably on one U.2 device, that will store crash dumps, core files, log files, and potentially other regularly-collected data. Determine (as a matter of policy) how large we're willing to let it get and assign that as its quota. (I assume this will be Sled Agent that does this but I'm not sure.) (Optional: create separate child datasets for core files, dumps, logs, etc. so that we can control their quotas separately [so that a flurry of crash dumps doesn't starve log files or vice versa].)
  • Set up a dump device (Sled Agent?)
  • Configure dumpadm to save crash dumps in that dataset. (Sled Agent?)
  • Configure coreadm to save core dumps in that dataset. (Sled Agent?)
  • Configure cron + logadm in all control plane zones to rotate all log files in the zone into some known location. (Probably part of the image build.)
  • Configure cron + logadm in the global zone to rotate all log files into the "debug" dataset. (Sled Agent? Host OS image?)
  • Update Sled Agent to manage the storage in the "debug" dataset according to some policy, which will presumably start very simply (e.g., delete the oldest files in the dataset until free space in the dataset reaches a watermark like 20%; see the sketch after this list).
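
A minimal sketch of that last item's policy, assuming the archived files sit in a flat directory and the dataset quota is known up front (the names and numbers here are illustrative, not the actual Sled Agent implementation):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Delete the oldest files under `debug_dir` until its usage drops to
/// `target_fraction` of `quota_bytes` (e.g. 0.8 leaves ~20% of the quota free).
fn enforce_debug_quota(
    debug_dir: &Path,
    quota_bytes: u64,
    target_fraction: f64,
) -> io::Result<()> {
    // Collect (mtime, size, path) for every regular file in the dataset.
    let mut files: Vec<_> = fs::read_dir(debug_dir)?
        .filter_map(|entry| {
            let entry = entry.ok()?;
            let meta = entry.metadata().ok()?;
            if meta.is_file() {
                Some((meta.modified().ok()?, meta.len(), entry.path()))
            } else {
                None
            }
        })
        .collect();

    // Oldest modification time first, so the least recent debug data goes first.
    files.sort_by_key(|(mtime, _, _)| *mtime);

    let mut used: u64 = files.iter().map(|(_, len, _)| *len).sum();
    let target = (quota_bytes as f64 * target_fraction) as u64;

    for (_, len, path) in files {
        if used <= target {
            break;
        }
        fs::remove_file(&path)?;
        used = used.saturating_sub(len);
    }
    Ok(())
}
```

Deleting until usage falls back to, say, 80% of the quota is the same thing as holding the 20% free-space watermark mentioned above.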

There are some known limitations of this: most notably that files are not replicated across multiple devices. Failure of the wrong U.2 (or removal from the system) means we lose the debug data from that system. This can be mitigated in future work (e.g., by copying them to a second dataset on another pool, or copying them into some intra-rack storage system, etc.).

It's conceivable that we ship an MVP without this, but it's a pretty big risk. Either we don't manage the storage (in which case we risk running out of disk space and disrupting service) or we turn off all these sources of data (in which case we'll have a pretty hard time fixing anything).

jclulow (Collaborator) commented Mar 3, 2023

Note that in addition to configuring the dump directory it will probably be necessary to run savecore with appropriate arguments, to check the dump device for an existing dump and save it out. There is also the question of which M.2 we use for dumping (is it the BSU from which we booted?) and then how and whether to check the other BSU for a prior dump we have not yet extracted.

wesolows commented Mar 7, 2023

Does this cover getting the boot-time logs (including fmd state) out of the ramdisk and into persistent storage, once sled agent has created and imported that pool? There is some dance to be done there that will involve restarting daemons, etc. Alternatively, service dependencies need to be created so that services writing to /var/log and /var/fm (but not /var/lock or other places we really do NOT want to persist state) can't start until sled agent has done that. There are any number of ways to solve this set of problems and I don't mean to constrain it. This isn't strictly about support, so it could also be part of a different bug, but it is related to how/where/when we run savecore.

davepacheco (Collaborator, Author) replied to the comment above:

Good question. Yes, I was assuming that Sled Agent would be responsible for periodically copying a bunch of GZ log files into the debug dataset, including various FMA error logs and things like /var/log/messages. I was assuming (perhaps naively?) that we could rely on the existing logadm configuration that rotates those logs and then copy the rotated logs into the debug dataset. (Note that the logadm configuration for things like the FMA error log uses fmadm rotate, which I assume takes care of any signaling of fmd that's necessary.)

@askfongjojo askfongjojo modified the milestones: MVP, FCS May 22, 2023
@lifning lifning self-assigned this Jun 15, 2023
lifning added a commit that referenced this issue Jul 16, 2023
Each time the sled-agent upserts a new disk, it enumerates every disk it
knows about. If an M.2 is found, it runs dumpadm(8) to set it as the
dump device. If a U.2 is *also* found, it invokes savecore(8) to
save the previous dump on the M.2, if any, to an encrypted zvol
on the U.2, and mark the dump slice as empty.

This is a bit of a kludge due to savecore(8) not yet exposing a clear
way to save-and-clear a dump *partition* other than the one configured
system-wide by dumpadm. While redundant, this is for our purposes
idempotent - as long as *any* M.2 is configured, dumps have a place to
go, and as long as any U.2 is present, we will attempt to save any
yet-unsaved kernel core dumps on all known M.2s to it.

(see RFD 118 for details of partition layout) (#2450, #2478)
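
A rough sketch of the flow that commit describes, shelling out to dumpadm(8) and savecore(8). The flags shown (dumpadm -d, savecore -v -f) exist, but the exact invocations, device paths, and error handling used by sled-agent are assumptions here:

```rust
use std::path::Path;
use std::process::Command;

/// Point the system's dump configuration at an M.2 dump slice and, if a U.2
/// savecore directory is available, extract any pending dump into it.
/// Illustrative only; not the actual sled-agent code.
fn configure_dumps(m2_dump_slice: &Path, u2_savecore_dir: Option<&Path>) -> std::io::Result<()> {
    // dumpadm -d <device>: use this slice as the system dump device.
    run(Command::new("dumpadm").arg("-d").arg(m2_dump_slice))?;

    if let Some(dir) = u2_savecore_dir {
        // savecore -f <dumpfile> <dir>: save any dump present on the given
        // device into the debug directory, rather than the default device.
        run(Command::new("savecore")
            .arg("-v")
            .arg("-f")
            .arg(m2_dump_slice)
            .arg(dir))?;
    }
    Ok(())
}

fn run(cmd: &mut Command) -> std::io::Result<()> {
    let status = cmd.status()?;
    if status.success() {
        Ok(())
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("{cmd:?} exited with {status}"),
        ))
    }
}
```
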
morlandi7 commented: see #3586

lifning pushed a commit to lifning/omicron that referenced this issue Jul 17–19, 2023
(Part of oxidecomputer#2478)

This configures coreadm to put all core dumps onto the M.2 'crash'
dataset, and creates a thread that rotates them all onto a U.2 'debug'
dataset every 5 minutes.

This also refactors the dumpadm/savecore code to be less redundant and
more flexible, and adds an amount of arbitrary logic for e.g. picking
the U.2 onto which to save cores.

Yet to do: Monitoring for datasets reaching capacity and choosing a
different one.
lifning (Contributor) commented Jul 22, 2023

awaiting review:

lifning added a commit that referenced this issue Jul 24, 2023
(Part of #2478, continued in #3713)

This configures coreadm to put all core dumps onto the M.2 'crash'
dataset, and creates a thread that moves them all onto a U.2 'debug'
dataset every 5 minutes.

This also refactors the dumpadm/savecore code to be less redundant and
more flexible, and adds an amount of arbitrary logic for e.g. picking
the U.2 onto which to save cores.
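
A rough sketch of the coreadm(8) configuration that commit describes; the crash-dataset mountpoint and the core-file pattern are assumptions, not the actual sled-agent values:

```rust
use std::process::Command;

/// Configure global core dumps to land in the M.2 'crash' dataset.
/// Illustrative only; mountpoint and pattern are assumptions.
fn configure_coreadm(crash_dir: &str) -> std::io::Result<()> {
    // coreadm -g <pattern>: set the global core file pattern (%f = executable
    // name, %p = pid); coreadm -e global: enable global core dumps.
    let status = Command::new("coreadm")
        .arg("-g")
        .arg(format!("{crash_dir}/core.%f.%p"))
        .arg("-e")
        .arg("global")
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "coreadm exited with failure",
        ));
    }
    Ok(())
}
```
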
lifning added a commit that referenced this issue Jul 24, 2023
…ebug dataset (#3713)

This periodically moves logs rotated by logadm in cron
(oxidecomputer/helios#107) into the crypt/debug
zfs dataset on the U.2 chosen by the logic in #3677. It replaces the
rotated number (*.log.0, *.log.1) with the unix epoch timestamp of the
rotated log's modification time such that they don't collide when
collected repeatedly (logadm will reset numbering when the previous ones
are moved away).

(for #2478)
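
A minimal sketch of the renaming scheme described there, as a hypothetical helper that archives a single rotated file (not the actual sled-agent code):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Archive one logadm-rotated file (e.g. "nexus.log.0") into `debug_dir`,
/// keyed by the rotated file's modification time so that repeated collection
/// passes don't collide or overwrite each other.
fn archive_rotated_log(rotated: &Path, debug_dir: &Path) -> io::Result<()> {
    let mtime_secs = fs::metadata(rotated)?
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?
        .as_secs();

    // "nexus.log.0" -> "nexus.log.1690240000"
    let base = rotated
        .file_name()
        .and_then(|n| n.to_str())
        .and_then(|n| n.rsplit_once('.'))
        .map(|(base, _rotation_number)| base.to_owned())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "unexpected log name"))?;

    // A rename only works within one filesystem; the real flow copies from the
    // source filesystem into the U.2 dataset and then removes the original.
    fs::rename(rotated, debug_dir.join(format!("{base}.{mtime_secs}")))?;
    Ok(())
}
```
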
lifning added a commit that referenced this issue Jul 24, 2023
…/debug datasets approaching quota (#3735)

In the event that the `crypt/debug` dataset currently in use for
archival fills past 80% of its 100G quota, sled-agent will switch to
using one on another U.2. But if *all* of those datasets fill up that
much, it will instead find whichever of them holds the oldest archived
files and delete the oldest files there until usage approaches 70% of
the quota.

(NB: it isn't yet doing the calculation of how many files to delete in
terms of on-disk size (after zfs's gzip-9))

(Part of #2478)
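
A minimal sketch of that selection logic, with each candidate's usage fraction supplied by the caller rather than queried from ZFS (hypothetical helper, not the actual sled-agent code):

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Choose which crypt/debug dataset to archive into next: prefer one under
/// the usage threshold; if all are over it, fall back to the dataset holding
/// the oldest archived file (the best candidate for pruning). Each candidate
/// is (mountpoint, used fraction of quota).
fn choose_debug_dataset(candidates: &[(PathBuf, f64)]) -> Option<PathBuf> {
    const THRESHOLD: f64 = 0.80; // 80% of the dataset's quota

    if let Some((path, _)) = candidates.iter().find(|(_, used)| *used < THRESHOLD) {
        return Some(path.clone());
    }

    // Every dataset is over the threshold: pick the one whose oldest file is oldest.
    candidates
        .iter()
        .filter_map(|(path, _)| Some((oldest_mtime(path)?, path.clone())))
        .min_by_key(|(mtime, _)| *mtime)
        .map(|(_, path)| path)
}

/// Modification time of the oldest entry directly under `dir`.
fn oldest_mtime(dir: &Path) -> Option<SystemTime> {
    fs::read_dir(dir)
        .ok()?
        .filter_map(|entry| entry.ok()?.metadata().ok()?.modified().ok())
        .min()
}
```
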
@morlandi7 morlandi7 modified the milestones: FCS, 1.0.1 Jul 28, 2023
@jordanhendricks jordanhendricks added the Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) label Aug 11, 2023
@morlandi7 morlandi7 modified the milestones: 1.0.1, 1.0.2 Aug 15, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.2, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 17, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5, 6 Nov 29, 2023
@morlandi7 morlandi7 linked a pull request Jan 26, 2024 that will close this issue
lifning added a commit that referenced this issue Feb 1, 2024
Verifies decision-making in different combinations of M.2/U.2 dataset
and dump slice availability and occupancy, and tests log/core-archiving.
(functionality that had been implemented for #2478)