[sled agent] Consider setting uniform coreadm values to extract info from terminating processes? #1597

smklein · 2022-08-16T17:21:56Z

In the 8/16 control plane sync, we discussed the possibility of using https://illumos.org/man/8/coreadm to set a filter to extract core files from crashing non-global zones into the global zone.

Currently, when non-global zone services terminate, Sled Agent stops and deletes the underlying zone. This helps avoid leakage of that resource - we have no further execution-time usage for it - but limits visibility.

By dumping core files into the global zone, we'd be able to inspect errors, even after the zone is destroyed.

rmustacc · 2022-08-16T17:30:45Z

In particular, we want to enable global cores and use the %z token to include the zone name for disambiguation. In the past we've done things like /var/xxx/%z/core.%f.%p.
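To make the token behaviour concrete, here is a minimal sketch of how a pattern such as `/var/xxx/%z/core.%f.%p` expands. The kernel performs this substitution itself; the `CrashInfo` struct and `expand_core_pattern` function are hypothetical names used only for illustration, and only the `%z`, `%f`, `%p`, and `%%` tokens are handled.

```rust
/// Values that would be substituted for a crashing process.
/// Hypothetical struct, for illustration only.
pub struct CrashInfo<'a> {
    pub zone: &'a str, // %z: zone name
    pub exec: &'a str, // %f: executable file name
    pub pid: u32,      // %p: process ID
}

/// Expand the coreadm tokens %z, %f, %p and %% in `pattern`.
/// Any other token is passed through unchanged.
pub fn expand_core_pattern(pattern: &str, info: &CrashInfo<'_>) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '%' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('z') => out.push_str(info.zone),
            Some('f') => out.push_str(info.exec),
            Some('p') => out.push_str(&info.pid.to_string()),
            Some('%') => out.push('%'),
            Some(other) => {
                out.push('%');
                out.push(other);
            }
            // A trailing lone '%' is kept as-is.
            None => out.push('%'),
        }
    }
    out
}
```

With a zone named `zone_a` (a made-up name), a process `cockroach`, and pid 1234, the pattern above yields a path ending in `zone_a/core.cockroach.1234`, so cores from different zones cannot collide.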

jclulow · 2022-08-16T19:17:07Z

A request: probably please don't hard code /var paths (or use of rpool specifically) for any more large files. It seems fine as a default but in the ramdisk environment we're going to want to direct those files to specific tmpfs or other mounted pools, etc. The current use of /var/oxide for writing a bunch of larger files is something we'll probably have to unwind so it'll be good to avoid adding more things like that.
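One way to honour this request is to treat the base directory as configuration and only fall back to a default. The sketch below builds the arguments for a `coreadm -g <pattern> -e global` invocation from an optional base path. The function name, the constant, and the `/var/xxx` default (the placeholder used earlier in this thread) are assumptions for illustration, not sled-agent code.

```rust
/// Placeholder default, used only when no base directory is configured.
pub const DEFAULT_CORE_BASE: &str = "/var/xxx";

/// Build the arguments for `coreadm` that set the global core pattern
/// under `base` (or the default) and enable global core dumps.
pub fn coreadm_global_args(base: Option<&str>) -> Vec<String> {
    let base = base.unwrap_or(DEFAULT_CORE_BASE).trim_end_matches('/');
    vec![
        "-g".to_string(),
        // %z = zone name, %f = executable name, %p = process ID
        format!("{}/%z/core.%f.%p", base),
        "-e".to_string(),
        "global".to_string(),
    ]
}
```

A ramdisk environment could then pass a tmpfs or other pool mount point (for example `coreadm_global_args(Some("/some/mounted/pool"))`, where the path is made up) without any code change.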

leftwo · 2022-08-29T16:48:36Z

> In the past we've done things like /var/xxx/%z/core.%f.%p.

coreadm wants the directory to exist before it will create a core. If we use /%z/ as part of the path, then I believe something outside coreadm will need to create that directory. In previous use cases, was there another subsystem that created the %z directory, or did I just miss the coreadm option that would create on demand?

davepacheco · 2022-09-07T22:45:27Z

I believe that in the case being referenced (Joyent's SmartOS), the path was really /zones/%z/cores/core.%f.%p, and yes, that directory was created by the machinery that created the zone (vmadm(1M)).

leftwo · 2022-09-08T00:18:37Z

> I believe that in the case being referenced (Joyent's SmartOS), the path was really /zones/%z/cores/core.%f.%p, and yes, that directory was created by the machinery that created the zone (vmadm(1M)).

https://www.illumos.org/issues/2123
That vmadm?

jclulow · 2022-09-08T00:34:05Z

If we need to create a cores directory we can do that in the brand code. It has hooks for installing and for booting and so on.

leftwo · 2022-09-08T00:36:03Z

If we do want a .../%z/... in the path, then something will need to both create that directory on zone creation, and remove it (if empty) when the zone goes away. Otherwise we are left with a record of every zone ever created.
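The create-on-zone-creation and remove-if-empty lifecycle could look roughly like the sketch below. It assumes whatever manages zones (the brand hooks or sled agent; the thread has not settled which) calls these at the right times. Both function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Create `<base>/<zone>` so a pattern containing `%z` has a directory
/// to write into. Safe to call if the directory already exists.
pub fn create_zone_core_dir(base: &Path, zone: &str) -> io::Result<PathBuf> {
    let dir = base.join(zone);
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

/// Remove `<base>/<zone>` only if it is empty.
/// Returns Ok(true) if the directory was removed, Ok(false) if it was
/// kept because it still holds files or did not exist in the first place.
pub fn remove_zone_core_dir_if_empty(base: &Path, zone: &str) -> io::Result<bool> {
    let dir = base.join(zone);
    let is_empty = match fs::read_dir(&dir) {
        Ok(mut entries) => entries.next().is_none(),
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    if !is_empty {
        return Ok(false);
    }
    // remove_dir itself fails on a non-empty directory, so a core file
    // written between the check and this call is never deleted.
    fs::remove_dir(&dir)?;
    Ok(true)
}
```

Keeping non-empty directories means cores survive zone teardown, while zones that never crashed leave nothing behind.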

davepacheco · 2022-09-08T15:33:54Z

> > I believe that in the case being referenced (Joyent's SmartOS), the path was really /zones/%z/cores/core.%f.%p, and yes, that directory was created by the machinery that created the zone (vmadm(1M)).
>
> https://www.illumos.org/issues/2123
> That vmadm?

I expect so. I'm not sure where the path was actually managed, though. Maybe as Josh suggested it was done in the brand code.

davepacheco · 2022-12-22T17:35:57Z

In #2088 @smklein asked:

> Is this ultimately the sled agent's job?

I'm not sure. We've got to decide first where the core files will go. That'll presumably be some directory on a ZFS dataset on some zpool. Who creates the pool? The dataset? The directory? My first thought is that we put all of the core files into one directory per Sled (i.e., don't create a per-zone dataset or even directory). That's because I'm not sure what we'd gain from separate datasets or directories per zone, and this way we don't have to do anything here when zones come and go.

Still, I'm not sure what storage we want to put these on, so I don't know what pool or dataset we want to put these on, so I don't know who's responsible for it.

jclulow · 2022-12-24T01:29:36Z

Separate datasets per zone would allow us to have a separate quota for core files per zone, which I suspect would be valuable. It would be good to keep a runaway core generator in zone A from preventing a subsequent single core file from being generated by zone B. We'll also want an overall quota that inhibits cores from exhausting the space in the pool they're in.
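As a rough sketch of the two quota levels, the helpers below build arguments for `zfs create` (a per-zone dataset with its own quota) and `zfs set` (an overall quota on the parent, which also counts space used by descendants). The dataset names, quota values, and function names are all hypothetical.

```rust
/// Arguments for `zfs create` making a per-zone cores dataset with a quota.
/// `-p` creates any missing parent datasets.
pub fn zfs_create_cores_args(parent: &str, zone: &str, quota: &str) -> Vec<String> {
    vec![
        "create".to_string(),
        "-p".to_string(),
        "-o".to_string(),
        format!("quota={}", quota),
        format!("{}/{}", parent, zone),
    ]
}

/// Arguments for `zfs set` placing an overall quota on the parent dataset,
/// which bounds the total space used by all per-zone children.
pub fn zfs_set_parent_quota_args(parent: &str, quota: &str) -> Vec<String> {
    vec![
        "set".to_string(),
        format!("quota={}", quota),
        parent.to_string(),
    ]
}
```

For example, `zfs_create_cores_args("pool/cores", "zone_a", "1G")` corresponds to `zfs create -p -o quota=1G pool/cores/zone_a`, so zone A filling its 1G does not stop zone B from writing a core.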

I think we'll want to put this stuff on a dataset we create in some U.2 device or devices. A few thoughts:

- we'll need to account for the space we're setting aside for core files in the same way that we'll need to account for CockroachDB and ClickHouse data files and any other internal data files (RFD 118)
- we'll eventually want to hoover these files up and put them somewhere other than where they're generated; this could be a sled-agent responsibility, but it might also be valuable as a separate and simpler process that would then not be in the same fault domain as sled-agent itself (e.g., what if it's the sled agent that keeps dumping core)
- if we put all the core files on one U.2 device, that might ease management, but it would also mean that if that device fails we would lose all of the core files in the system
- similarly, if we put the core files dataset for a zone on the same U.2 device as the rest of the storage for the zone, then if the crashes occur because of some underlying fault that also upsets the U.2 device or its ZFS pool, we might also not be able to write those core files
- if we put the core files for a zone on another SSD, then we'd have to be careful responding to the removal of a device other than the device on which the zone is resident and repoint its cores dataset, etc.; if they go on the same device as the zone storage, then at least it can all be torn down at once on device failure or removal

There is not, I suspect, a single best answer to this problem.
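The "hoover these files up" step could be as small as the sketch below, which is part of the appeal of running it as a separate, simpler process. It moves regular files from the directory where cores land into an archive directory. The function name and both directories are assumptions, and this is not a description of how sled-agent archives anything.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Move every regular file in `src` into `dst`, creating `dst` if needed.
/// Subdirectories are left alone. Returns the number of files moved.
pub fn hoover_cores(src: &Path, dst: &Path) -> io::Result<usize> {
    fs::create_dir_all(dst)?;
    let mut moved = 0;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue;
        }
        let target = dst.join(entry.file_name());
        // rename is cheap within one filesystem; across datasets or
        // devices it fails, so fall back to copy followed by remove.
        if fs::rename(entry.path(), &target).is_err() {
            fs::copy(entry.path(), &target)?;
            fs::remove_file(entry.path())?;
        }
        moved += 1;
    }
    Ok(moved)
}
```

A real collector would also need to avoid picking up a core that is still being written, which this sketch does not attempt.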

leftwo · 2023-11-11T01:15:59Z

Many/much/all of the work described here was completed in other PRs/Issues:

- sled-agent performs archival of rotated logs for all zones onto U.2 debug dataset
- Put process core dumps onto the U.2 debug zvol

I think if there are follow-on issues, they should go here: #2478

leftwo · 2023-11-11T02:48:18Z

#2478

smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Aug 16, 2022

This was referenced Aug 16, 2022

[sled agent] Figure out how to store/manage Service Bundles #1599

Closed

[sled agent] Tracking issue for "better zone death" #1600

Open

leftwo self-assigned this Aug 16, 2022

smklein changed the title ~~[sled agent] Consider setting uniform coreadm files to extract info from terminating processes?~~ [sled agent] Consider setting uniform coreadm values to extract info from terminating processes? Aug 16, 2022

leftwo mentioned this issue Aug 29, 2022

Panic in the Upstairs leaves an instance in a zombie coma #1652

Open

smklein mentioned this issue Dec 22, 2022

set GOTRACEBACK=crash when running cockroach #2088

Merged

leftwo closed this as completed Nov 11, 2023
