automatic debug data collection without running the system out of space #2478
Note that in addition to configuring the dump directory it will probably be necessary to run …
Does this cover getting the boot-time logs (including fmd state) out of the ramdisk and into persistent storage once sled agent has created and imported that pool? There is some dance to be done there that will involve restarting daemons, etc. Alternatively, service dependencies need to be created so that services writing to /var/log and /var/fm (but not /var/lock or other places we really do NOT want to persist state) can't start until sled agent has done that. There are any number of ways to solve this set of problems and I don't mean to constrain it. This isn't strictly about support, so it could also be part of a different bug -- but it is related to how/where/when we run savecore.
Good question. Yes, I was assuming that Sled Agent would be responsible for periodically copying a bunch of GZ log files into the debug dataset, including various FMA error logs and things like /var/log/messages. I was assuming (perhaps naively?) that we could rely on the existing logadm configuration that rotates those logs and then copy the rotated logs into the debug dataset. (Note that the logadm configuration for things like the FMA error log uses …
Each time the sled-agent upserts a new disk, it enumerates every disk it knows about. If an M.2 is found, it runs dumpadm(8) to set it as the dump device. If a U.2 is *also* found, it invokes savecore(8) to save the previous dump on the M.2, if any, to an encrypted zvol on the U.2, and marks the dump slice as empty. This is a bit of a kludge, because savecore(8) does not yet expose a clear way to save-and-clear a dump *partition* other than the one configured system-wide by dumpadm. While redundant, this is for our purposes idempotent: as long as *any* M.2 is configured, dumps have a place to go, and as long as any U.2 is present, we will attempt to save any yet-unsaved kernel core dumps on all known M.2s to it. (See RFD 118 for details of the partition layout.) (#2450, #2478)
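For illustration, here is a minimal sketch of that decision flow in Rust. The `KnownDisk` type, the helper names, and the way disks are enumerated are hypothetical stand-ins, not the actual sled-agent code; only the dumpadm(8)/savecore(8) sequence follows the description above.

```rust
use std::ffi::OsStr;
use std::path::PathBuf;
use std::process::Command;

/// Hypothetical disk descriptions; the real sled-agent types differ.
enum KnownDisk {
    M2 { dump_slice: PathBuf },
    U2 { debug_dir: PathBuf },
}

/// Re-evaluate dump configuration whenever the set of known disks changes.
/// (Illustrative only; multi-disk iteration and error recovery are elided.)
fn evaluate_dump_setup(disks: &[KnownDisk]) -> std::io::Result<()> {
    let m2_slice = disks.iter().find_map(|d| match d {
        KnownDisk::M2 { dump_slice } => Some(dump_slice.as_path()),
        _ => None,
    });
    let u2_dir = disks.iter().find_map(|d| match d {
        KnownDisk::U2 { debug_dir } => Some(debug_dir.as_path()),
        _ => None,
    });

    if let Some(slice) = m2_slice {
        // Point the system-wide dump device at the M.2 dump slice.
        run("dumpadm", &[OsStr::new("-d"), slice.as_os_str()])?;
        if let Some(dir) = u2_dir {
            // If the slice holds an unsaved dump, savecore(8) writes it
            // into the directory on the U.2 and marks the slice empty.
            run("savecore", &[dir.as_os_str()])?;
        }
    }
    Ok(())
}

fn run(cmd: &str, args: &[&OsStr]) -> std::io::Result<()> {
    let status = Command::new(cmd).args(args).status()?;
    if !status.success() {
        let msg = format!("{cmd} exited with {status}");
        return Err(std::io::Error::new(std::io::ErrorKind::Other, msg));
    }
    Ok(())
}
```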
see #3586
(Part of oxidecomputer#2478) This configures coreadm to put all core dumps onto the M.2 'crash' dataset, and creates a thread that rotates them all onto a U.2 'debug' dataset every 5 minutes. This also refactors the dumpadm/savecore code to be less redundant and more flexible, and adds an amount of arbitrary logic for e.g. picking the U.2 onto which to save cores. Yet to do: Monitoring for datasets reaching capacity and choosing a different one.
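A rough sketch of that mechanism is below. The dataset mountpoints, function names, and the use of a plain thread are assumptions for illustration, not the actual sled-agent implementation: configure coreadm(8) to drop cores into the crash dataset on the M.2, then sweep them onto a U.2 debug dataset every five minutes.

```rust
use std::fs;
use std::path::Path;
use std::process::Command;
use std::thread;
use std::time::Duration;

// Hypothetical mountpoints; the real datasets are chosen dynamically.
const CRASH_DIR: &str = "/pool/int/m2/crash";
const DEBUG_DIR: &str = "/pool/ext/u2/crypt/debug";

/// Called once at (hypothetical) agent startup.
fn start_core_collection() -> std::io::Result<()> {
    // Direct all core dumps into the crash dataset, using an informative
    // file-name pattern, and enable global core dumps.
    Command::new("coreadm")
        .arg("-g")
        .arg(format!("{CRASH_DIR}/core.%f.%t"))
        .arg("-e")
        .arg("global")
        .status()?;

    // Every five minutes, move accumulated cores onto the U.2 debug
    // dataset so the small M.2 dataset does not fill up.
    thread::spawn(|| loop {
        if let Err(e) = archive_cores(Path::new(CRASH_DIR), Path::new(DEBUG_DIR)) {
            eprintln!("core archival failed: {e}");
        }
        thread::sleep(Duration::from_secs(300));
    });
    Ok(())
}

fn archive_cores(crash: &Path, debug: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(crash)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue;
        }
        // Copy then remove rather than rename, because the two datasets
        // live in different pools (rename cannot cross filesystems).
        fs::copy(entry.path(), debug.join(entry.file_name()))?;
        fs::remove_file(entry.path())?;
    }
    Ok(())
}
```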
Awaiting review:
(Part of #2478, continued in #3713) This configures coreadm to put all core dumps onto the M.2 'crash' dataset, and creates a thread that moves them all onto a U.2 'debug' dataset every 5 minutes. This also refactors the dumpadm/savecore code to be less redundant and more flexible, and adds an amount of arbitrary logic for e.g. picking the U.2 onto which to save cores.
…ebug dataset (#3713) This periodically moves logs rotated by logadm in cron (oxidecomputer/helios#107) into the crypt/debug zfs dataset on the U.2 chosen by the logic in #3677. It replaces the rotated number (*.log.0, *.log.1) with the unix epoch timestamp of the rotated log's modification time such that they don't collide when collected repeatedly (logadm will reset numbering when the previous ones are moved away). (for #2478)
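A minimal sketch of that renaming scheme follows, with invented paths and helper names rather than the real sled-agent code: a rotated file such as messages.log.0 is copied into the debug dataset as messages.log.<mtime epoch>, then removed so logadm restarts its numbering.

```rust
use std::fs;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Copy logadm-rotated logs (e.g. "messages.log.0") into `debug`,
/// replacing the rotation number with the file's mtime as a unix
/// epoch timestamp (e.g. "messages.log.1697040000").
fn archive_rotated_logs(log_dir: &Path, debug: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(log_dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();

        // Only pick up rotated logs: "<base>.log.<number>".
        let Some((base, suffix)) = name.rsplit_once('.') else { continue };
        if !base.ends_with(".log") || suffix.parse::<u32>().is_err() {
            continue;
        }

        let mtime = entry
            .metadata()?
            .modified()?
            .duration_since(UNIX_EPOCH)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?
            .as_secs();

        // Copy, then remove the original so logadm restarts numbering.
        let dest = debug.join(format!("{base}.{mtime}"));
        fs::copy(entry.path(), &dest)?;
        fs::remove_file(entry.path())?;
    }
    Ok(())
}
```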
…/debug datasets approaching quota (#3735) In the event that the `crypt/debug` dataset currently in use for archival fills up past 80% of its 100G quota, sled-agent will switch to using one on another U.2. But if *all* of those datasets fill up that much, it will instead find whichever of these datasets has the oldest archived files to clear out in order to approach 70% usage of quota, and delete the oldest files. (NB: it isn't yet doing the calculation of how many files to delete in terms of on-disk size (after zfs's gzip-9)) (Part of #2478)
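And a sketch of that capacity policy: the 80%/70% thresholds come from the description above, while the dataset names, helper functions, and the choice to shell out to `zfs get` are assumptions for illustration only.

```rust
use std::process::Command;

/// Return (used, quota) in bytes for a ZFS dataset, via `zfs get -Hp`.
fn zfs_used_and_quota(dataset: &str) -> std::io::Result<(u64, u64)> {
    let out = Command::new("zfs")
        .args(["get", "-Hp", "-o", "value", "used,quota", dataset])
        .output()?;
    let text = String::from_utf8_lossy(&out.stdout);
    let mut vals = text.lines().map(|l| l.trim().parse::<u64>().unwrap_or(0));
    Ok((vals.next().unwrap_or(0), vals.next().unwrap_or(0)))
}

/// Pick a crypt/debug dataset that is below 80% of its quota, if any.
fn choose_debug_dataset(candidates: &[String]) -> std::io::Result<Option<String>> {
    for ds in candidates {
        let (used, quota) = zfs_used_and_quota(ds)?;
        if quota > 0 && used * 100 / quota < 80 {
            return Ok(Some(ds.clone()));
        }
    }
    // All candidates are over the threshold; per the description above,
    // the caller would instead pick the dataset with the oldest archived
    // files and delete the oldest of them until usage approaches 70%.
    Ok(None)
}
```

Parsing `zfs get -Hp` keeps the example self-contained; the real implementation may query ZFS through a library binding instead.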
Verifies decision-making in different combinations of M.2/U.2 dataset and dump slice availability and occupancy, and tests log/core-archiving. (functionality that had been implemented for #2478)
This ticket covers the minimum target work required to (1) make sure we have basic debug data collected on all systems, while (2) not putting system availability at risk (by filling up important ZFS datasets or the pool itself). This came out of a recorded discussion on 2023-03-03.
There are some known limitations of this: most notably that files are not replicated across multiple devices. Failure of the wrong U.2 (or removal from the system) means we lose the debug data from that system. This can be mitigated in future work (e.g., by copying them to a second dataset on another pool, or copying them into some intra-rack storage system, etc.).
It's conceivable that we ship an MVP without this, but it's a pretty big risk. Either we don't manage the storage (in which case we risk running out of disk space and disrupting service) or we turn off all these sources of data (in which case we'll have a pretty hard time fixing anything).