Persist fault management data across reboots #4211

Closed
citrus-it opened this issue Oct 5, 2023 · 2 comments · Fixed by #4212

citrus-it commented Oct 5, 2023

While @wesolows was digging into stlouis#281/cs#39, he found breadcrumbs in the kernel memory of the crash dump indicating that the fault management system had done /something/ recently. However, the fault management logs and database live on non-persistent storage on the system ramdisk, so that data was lost and we can only guess at what occurred.

We should back the fault management data in /var/fm/fmd with a dataset on the current boot disk so that the fault management history is preserved. In the future, we should likely do the same for global zone (GZ) system log files and other data.

Note that we have previously seen a system fill up the root ramdisk (via /var/fm/fmd) due to a flood of ZFS errors, so an appropriate quota should also be applied to this dataset.
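
As a rough sketch of what this proposal could look like at the ZFS level, the helper below only prints the commands rather than running them. The function name, the example pool name, the 500M quota, and the use of a legacy mountpoint are all assumptions for illustration and not necessarily what the eventual fix does. The parent `<pool>/backing` dataset is assumed to exist already.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands that would back
# /var/fm/fmd with a quota-limited ZFS dataset on the given pool.
#   $1: pool name (assumption: the boot disk pool, e.g. oxi_<uuid>)
#   $2: quota (assumption: any value accepted by the zfs quota property)
plan_fmd_backing() {
    pool=$1
    quota=$2
    dataset="$pool/backing/fmd"

    # A legacy mountpoint means ZFS will not mount the dataset on its own;
    # the quota caps how much an error flood can consume.
    echo "zfs create -o mountpoint=legacy -o quota=$quota $dataset"

    # Mount the dataset over the ramdisk directory (illumos mount syntax).
    echo "mount -F zfs $dataset /var/fm/fmd"
}
```

Calling `plan_fmd_backing tank 500M` prints the two commands for a pool named `tank`, which makes it easy to review the plan before running anything on a real system.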

@citrus-it citrus-it added the enhancement New feature or request. label Oct 5, 2023
@citrus-it citrus-it added this to the 3 milestone Oct 5, 2023
@citrus-it citrus-it self-assigned this Oct 5, 2023
citrus-it added a commit that referenced this issue Oct 5, 2023
/var/fm/fmd is where the illumos fault management system records data.
We want to preserve this data across system reboots and in real time
rather than via periodic data copying, so that the information is
available should the system panic shortly thereafter.

Fixes: #4211
@morlandi7
Related to disk filling up: #2478

@citrus-it

A few notes from testing on a bench gimlet:

gimlet-sn06 # cat `svcs -L sled-agent` | looker -c 'r.component == "BackingFs"'
09:48:42.276Z INFO SledAgent (BackingFs): Processing fmd
09:48:42.277Z INFO SledAgent (BackingFs): Ensuring dataset oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/backing/fmd
09:48:42.430Z INFO SledAgent (BackingFs): Stopping service svc:/system/fmd:default
09:48:43.673Z INFO SledAgent (BackingFs): Mounting oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/backing/fmd on /var/fm/fmd
09:48:43.721Z INFO SledAgent (BackingFs): Starting service svc:/system/fmd:default
gimlet-sn06 # svcadm restart sled-agent
10:06:46.841Z INFO SledAgent (BackingFs): Processing fmd
10:06:46.841Z INFO SledAgent (BackingFs): Ensuring dataset oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/backing/fmd
10:06:46.932Z INFO SledAgent (BackingFs): /var/fm/fmd is already mounted
gimlet-sn06 # zfs get mounted  `zfs list -Honame | grep backing`
NAME                                                  PROPERTY  VALUE    SOURCE
oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/backing      mounted   yes      -
oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/backing/fmd  mounted   yes      -
oxi_b462a7f7-b628-40fe-80ff-4e4189e2d62b/backing      mounted   yes      -
oxi_b462a7f7-b628-40fe-80ff-4e4189e2d62b/backing/fmd  mounted   no       -
gimlet-sn06 # fmadm faulty
gimlet-sn06 # fmdump

gimlet-sn06 # /usr/lib/fm/fmd/fminject < /data/af/fminject
sending event er1 ... done

gimlet-sn06 # fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
Oct 05 09:52:01.5727 d696a4f2-e1d9-4051-b543-7e9a4b4d7a35 SUNOS-8000-J0 Diagnosed

gimlet-sn06 # fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Oct 05 09:52:01 d696a4f2-e1d9-4051-b543-7e9a4b4d7a35  SUNOS-8000-J0  Major     injected

Host        : gimlet-sn06
Platform    : oxide     Chassis_id  :
Product_sn  :

Fault class : defect.sunos.eft.unexpected_telemetry 50%
              fault.sunos.eft.unexpected_telemetry 50%
Problem in  : dev:////pci@0,0
                  faulted and taken out of service

Description : The diagnosis engine encountered telemetry from the listed
              devices for which it was unable to perform a diagnosis -
              Refer to http://illumos.org/msg/SUNOS-8000-J0 for more
              information.  Refer to http://illumos.org/msg/SUNOS-8000-J0 for
              more information.

Response    : Error reports have been logged for examination by your illumos
              distribution team.

Impact      : Automated diagnosis and response for these events will not occur.

Action      : Ensure that the latest illumos Kernel and Predictive Self-Healing
              (PSH) updates are installed.
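
The BackingFs log lines in the session above suggest a sequence of: ensure the dataset, stop fmd, mount, start fmd, with a restart of sled-agent leaving an existing mount alone. A minimal dry-run sketch of that flow follows. The function name and its arguments are hypothetical, the dataset is assumed to use a legacy mountpoint, and the reason for stopping fmd (presumably so it does not keep writing to the ramdisk copy underneath the new mount) is an inference rather than something stated in the logs.

```shell
#!/bin/sh
# Hypothetical dry-run of the mount sequence seen in the BackingFs logs.
#   $1: dataset name, e.g. <pool>/backing/fmd
#   $2: "yes" if something is already mounted on /var/fm/fmd
backing_steps() {
    dataset=$1
    already_mounted=$2
    fmri="svc:/system/fmd:default"

    if [ "$already_mounted" = "yes" ]; then
        # Second and later runs: leave the mount and the service untouched.
        echo "# /var/fm/fmd is already mounted"
        return 0
    fi

    # Temporarily stop fmd, mount the dataset, then start fmd again.
    echo "svcadm disable -st $fmri"
    echo "mount -F zfs $dataset /var/fm/fmd"
    echo "svcadm enable -s $fmri"
}
```

The early return is what keeps a sled-agent restart from bouncing fmd, matching the second set of log lines where only "already mounted" is reported.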

citrus-it added a commit that referenced this issue Oct 9, 2023
/var/fm/fmd is where the illumos fault management system records data.
We want to preserve this data across system reboots and in real time
rather than via periodic data copying, so that the information is
available should the system panic shortly thereafter.

Fixes: #4211