umbrella: notifications on crash dump savecore #4293
Labels: Debugging, mvp, Sled Agent
There are a number of umbrella and individual tickets covering collection and management of data used to debug problems with the machine and our software. A few examples of these are #2235, #2478, and #3860. The general premise of these tickets is that when a crash (of a sled, or of component user software) occurs, we will preserve data that may be useful in understanding the cause.
This ticket covers notifying Oxide (or in principle a third-party support provider) that such an event has occurred and that data is, or should be, available for retrieval. A simpler and more universal aspect of this is notifying operators. Notifications of events like these are discussed in RFDs 55 and 307. While the latter was clearly intended to cover functionality available at RR, I'm unaware of any current means by which an operator can be notified when a sled has crashed and rebooted with a dump saved. The operator should also be able to query via the API the state of debug data availability for a specific sled, for every sled, or for the entire machine, as well as some crash event history. There is probably also scope here for detecting sleds that are not functioning at all, tying in with #4287, and for reporting this as an event even if, as a matter of policy, no automated action is taken. Consider RFDs 82 and 302 here.
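To make the query requirement above more concrete, here is a minimal sketch of the kind of data an operator-facing API might expose. Every name in it (`SledId`, `DebugDataState`, `CrashEvent`, `CrashLog`, and the methods) is hypothetical and invented for illustration; none of it reflects existing code or an agreed design, and a real implementation would presumably sit behind the API and a database rather than an in-memory list.

```rust
use std::collections::BTreeMap;

/// Hypothetical sled identifier; a real system would likely use a UUID.
type SledId = u32;

/// Hypothetical summary of debug data availability for one sled.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DebugDataState {
    /// No crash has been recorded for this sled.
    NoCrashKnown,
    /// The most recent crash left a dump that is available for retrieval.
    DumpAvailable,
    /// The most recent crash should have left a dump, but none was saved.
    DumpMissing,
}

/// Hypothetical record of a single crash event.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CrashEvent {
    sled: SledId,
    /// Seconds since the Unix epoch (an assumed representation).
    time_secs: u64,
    /// Whether a dump was saved after the crash.
    dump_saved: bool,
}

/// Hypothetical in-memory store of crash events.
#[derive(Default)]
struct CrashLog {
    events: Vec<CrashEvent>,
}

impl CrashLog {
    /// Record a new crash event.
    fn record(&mut self, event: CrashEvent) {
        self.events.push(event);
    }

    /// State of a single sled, derived from its most recent crash event.
    fn sled_state(&self, sled: SledId) -> DebugDataState {
        let latest = self
            .events
            .iter()
            .filter(|e| e.sled == sled)
            .max_by_key(|e| e.time_secs);
        match latest {
            None => DebugDataState::NoCrashKnown,
            Some(e) if e.dump_saved => DebugDataState::DumpAvailable,
            Some(_) => DebugDataState::DumpMissing,
        }
    }

    /// State of every listed sled, i.e. a view of the entire machine.
    fn machine_state(&self, sleds: &[SledId]) -> BTreeMap<SledId, DebugDataState> {
        sleds.iter().map(|&s| (s, self.sled_state(s))).collect()
    }

    /// Crash event history for one sled, oldest first.
    fn history(&self, sled: SledId) -> Vec<&CrashEvent> {
        let mut events: Vec<&CrashEvent> =
            self.events.iter().filter(|e| e.sled == sled).collect();
        events.sort_by_key(|e| e.time_secs);
        events
    }
}
```

In this sketch, the three query shapes mentioned above map onto `sled_state` (a specific sled), `machine_state` (each sled, or the entire machine), and `history` (crash event history). Distinguishing `DumpAvailable` from `DumpMissing` is one way to capture the "is, or should be, available" distinction.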
This is another umbrella ticket that covers what is essentially project-scope work. Additional tickets for specific pieces are likely to be desirable.