Adds service bundles for zones #3388

Merged
merged 4 commits into from
Jun 27, 2023
Conversation

bnaecker
Collaborator

  • Adds a dataset to the M.2s for storing debugging data.
  • Adds basic mechanism for setting a ZFS quota on datasets.
  • Adds HTTP endpoints for listing, creating, and fetching zone service bundles from the sled agent.
  • Adds methods to ServiceManager for implementing the above. Zone bundles run a set of commands to get the zone-wide output and some key process-specific data for relevant processes from an Oxide service zone. These are packed into a tarball along with a simple metadata file describing the zone bundle.
  • Adds some helper methods in RunningZone and related for listing the expected SMF service names and processes associated with them based on the zone's manifest files.
  • Adds dev tool zb for talking to the sled agent to operate on zone bundles.

@bnaecker bnaecker requested review from davepacheco and smklein June 21, 2023 00:05
@bnaecker
Collaborator Author

This is a first cut at helping to resolve #1598. It provides the basic methods for creating, listing, and fetching "zone bundles" -- the state of an Oxide-managed zone at some point in time. In response to a client request (and notably, not in any automated way), it creates a tarball that contains:

  • The output of a bunch of zone wide commands, like svcs -p and ptree
  • The pfiles, pstack and pargs output for all the Oxide-managed services in the zone
  • The log files, including rotated ones, for all the Oxide-managed services in the zone

I've also added a rudimentary quota to the dataset storing this information, currently 100 GiB. That's a wild guess, and no special handling is done when this space fills up.
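
As a rough illustration of what a 100 GiB quota means in practice, here is a minimal sketch that builds the arguments for `zfs set quota=<bytes> <dataset>`. The function name and the dataset name in the test are made up for illustration; this is not the helper added in the PR.

```rust
/// Bytes in one gibibyte (2^30).
const GIB: u64 = 1 << 30;

/// Build the arguments for `zfs set quota=<bytes> <dataset>`.
///
/// `dataset` is whatever ZFS dataset should be capped; the caller supplies
/// it, and nothing here reflects the dataset names the sled agent uses.
fn zfs_quota_args(dataset: &str, quota_gib: u64) -> Vec<String> {
    // Saturate rather than overflow if an absurdly large quota is requested.
    let bytes = quota_gib.saturating_mul(GIB);
    vec![
        "set".to_string(),
        format!("quota={bytes}"),
        dataset.to_string(),
    ]
}
```

With `quota_gib = 100` the quota argument works out to `quota=107374182400`.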

There are a few things I like about this. I think the code for packaging up the zone state is useful, even if we exercise it through a different mechanism, like an automated collection system or when zones are destroyed. I believe there are folks who would currently like the ability to take a zone bundle on demand; @askfongjojo and @gjcolombo have specifically asked for ways to do this.

There are definitely drawbacks and next steps:

  • We do not currently collect data when zones are destroyed. It needs to be triggered by a developer. I think that can be added into the parts of the code that destroy zones without too much fuss.
  • We don't have a way to remove old bundles. They're on the filesystem, so someone on the system would need to delete them. I can add that fairly easily if this PR doesn't make a whole lot of sense without it.
  • This may not integrate super well with other kinds of automated debug data collection. I'm thinking here of the later steps outlined in "automatic debug data collection without running the system out of space" (#2478), which collect the logs on a cron job or similar. I think the pieces here can be used by that, but the zone bundle itself may be a distinct thing.
  • This puts everything on the M.2s. That's expedient, but ultimately we would like to store this on the U.2s across the rack for a variety of reasons. That requires Nexus to be aware of managing the bundles and the disks they live on, which feels like something we can build on top of this PR.

@bnaecker
Collaborator Author

There aren't a whole lot of ways I can test this code, so I'm including a bunch of testing notes. This is all from running the real sled agent on my developer machine, which is a non-Gimlet system.

First we can ask the sled-agent what zones are currently running:

bnaecker@shale : ~/omicron $ ./target/debug/zb --host fd00:1122:3344:101::1 list-zones
oxz_clickhouse_oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_cockroachdb_oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
oxz_crucible_oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_crucible_oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_crucible_pantry
oxz_external_dns
oxz_internal_dns
oxz_nexus
oxz_ntp
oxz_oximeter
oxz_switch

Then we can list zone bundles for a specific zone:

bnaecker@shale : ~/omicron $ ./target/debug/zb --host fd00:1122:3344:101::1 ls oxz_switch
oxz_switch/2b6c6b4d-bf8c-4e5e-b230-351e9183c631

I've taken one zone bundle already, which is what we see. We can fetch that bundle directly (which is a glorified scp, if I'm being honest):

bnaecker@shale : ~/omicron $ ./target/debug/zb --host fd00:1122:3344:101::1 get --bundle-id 2b6c6b4d-bf8c-4e5e-b230-351e9183c631 oxz_switch
bnaecker@shale : ~/omicron $ tar tzf 2b6c6b4d-bf8c-4e5e-b230-351e9183c631.tar.gz
metadata
ptree
uptime
last
who
svcs
netstat
pfiles.20255
pstack.20255
pargs.20255
system-illumos-dendrite:default.log
pfiles.20263
pstack.20263
pargs.20263
oxide-mgs:default.log
pfiles.20281
pstack.20281
pargs.20281
system-illumos-mg-ddm:default.log
pfiles.20273
pstack.20273
pargs.20273
oxide-wicketd:default.log
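
The per-process entries in that listing follow a `<tool>.<pid>` pattern, so they are easy to pick apart. A minimal sketch, assuming only the three tool names seen above; the function name is hypothetical and not part of the PR:

```rust
/// Split a tarball entry name such as `pstack.20255` into the tool that
/// produced it and the PID it was run against.
///
/// Returns `None` for anything else, including the zone-wide command files
/// (`ptree`, `svcs`, ...) and the `*.log` entries.
fn process_entry(name: &str) -> Option<(&str, u32)> {
    let (tool, pid) = name.split_once('.')?;
    if !matches!(tool, "pfiles" | "pstack" | "pargs") {
        return None;
    }
    Some((tool, pid.parse().ok()?))
}
```

Grouping the results by PID gives the per-service view: for example, everything for PID 20255 belongs to the dendrite `dpd` process shown in the `ptree` output.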

We've got one file per kind of data, with a metadata file describing the bundle itself:

bnaecker@shale : ~/omicron $ tar xzf 2b6c6b4d-bf8c-4e5e-b230-351e9183c631.tar.gz metadata
bnaecker@shale : ~/omicron $ cat metadata
time_created = "2023-06-20T23:24:00.986843053Z"

[id]
zone_name = "oxz_switch"
bundle_id = "2b6c6b4d-bf8c-4e5e-b230-351e9183c631"
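
For reference, here is a minimal sketch of types that would render to that layout. The struct and field names mirror the `ZoneBundleMetadata` value printed in the sled-agent log, but the plain `String` fields and the hand-rolled `to_toml` are simplifications for illustration; the real code presumably uses proper UUID and timestamp types with a TOML serializer.

```rust
/// Identifies one bundle: the zone it came from plus a unique ID.
struct ZoneBundleId {
    zone_name: String,
    bundle_id: String,
}

/// The contents of the `metadata` file in a bundle.
struct ZoneBundleMetadata {
    id: ZoneBundleId,
    time_created: String,
}

impl ZoneBundleMetadata {
    /// Render the metadata in the same layout as the file shown above.
    /// No string escaping is done, which is fine for zone names, UUIDs,
    /// and RFC 3339 timestamps but not for arbitrary input.
    fn to_toml(&self) -> String {
        format!(
            "time_created = \"{}\"\n\n[id]\nzone_name = \"{}\"\nbundle_id = \"{}\"\n",
            self.time_created, self.id.zone_name, self.id.bundle_id
        )
    }
}
```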

Each command file lists the command at the top, then the output itself:

bnaecker@shale : ~/omicron $ tar xzf 2b6c6b4d-bf8c-4e5e-b230-351e9183c631.tar.gz ptree
bnaecker@shale : ~/omicron $ cat ptree
Command: ["ptree"]
19387  /sbin/init
19394  /lib/svc/bin/svc.startd
  19993  /usr/lib/saf/ttymon -g -d /dev/console -l console -m ldterm,ttcompat -
19396  /lib/svc/bin/svc.configd
19436  /lib/inet/netcfgd
19439  /lib/inet/ipmgmtd
19698  /usr/lib/fm/fmd/fmd
19753  /usr/lib/pfexecd
19925  /usr/sbin/nscd
19978  /usr/lib/utmpd
19994  /usr/sbin/syslogd
20002  /usr/sbin/cron
20234  /usr/lib/inet/in.ripngd -s
20255  /opt/oxide/dendrite/bin/dpd run
20263  /opt/oxide/mgs/bin/mgs run --id-and-address-from-smf /var/svc/manifest/s
20272  ctrun -l child -o noorphan,regent /opt/oxide/wicketd/bin/wicketd run /va
  20273  /opt/oxide/wicketd/bin/wicketd run /var/svc/manifest/site/wicketd/conf
20280  ctrun -l child -o noorphan,regent /opt/oxide/mg-ddm/pkg/ddm_method_scrip
  20281  /opt/oxide/mg-ddm/bin/ddmd --admin-port 8000 --admin-addr :: --kind tr
20328  /sbin/dhcpagent
20622  /usr/lib/inet/in.ndpd
28242  ptree

Hopefully this lets us parse things with normal tools with the minimum of fuss.
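
If you do want to handle these files programmatically, separating the header from the output is a one-liner. A minimal sketch, assuming the `Command: ` prefix on the first line as shown in the `ptree` file above; the function name is made up:

```rust
/// Split a command file into the recorded command (the text after
/// `Command: ` on the first line) and the captured output that follows.
///
/// Returns `None` if the first line does not carry the expected prefix.
fn split_command_file(contents: &str) -> Option<(&str, &str)> {
    // A file with no newline at all is treated as a header with no output.
    let (first, rest) = contents.split_once('\n').unwrap_or((contents, ""));
    let command = first.strip_prefix("Command: ")?;
    Some((command, rest))
}
```

The command is returned as the raw bracketed list (for example `["ptree"]`) rather than parsed further, since the header format could still change.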

You can create a new bundle with:

bnaecker@shale : ~/omicron $ ./target/debug/zb --host fd00:1122:3344:101::1 create oxz_switch
Created zone bundle: oxz_switch/057cedd1-5fc4-422a-a110-9f3c37ef4dc5
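
Under the hood this is a `POST` to `/zones/oxz_switch/bundles` on the sled agent, per the request log. A minimal sketch of building that URL from the `--host` address follows; the function name is hypothetical, the port is taken from the log output (12345), and the zone name is not percent-encoded since zone names here are plain identifiers.

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

/// Build the URL for creating a bundle of `zone_name` on the sled agent
/// listening at `host:port`.
fn create_bundle_url(host: Ipv6Addr, port: u16, zone_name: &str) -> String {
    // SocketAddrV6's Display impl adds the brackets that an IPv6 literal
    // needs inside a URL, e.g. `[fd00:1122:3344:101::1]:12345`.
    let addr = SocketAddrV6::new(host, port, 0, 0);
    format!("http://{addr}/zones/{zone_name}/bundles")
}
```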

In the sled-agent logs, we can see what's going on:

00:31:39.481Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:101::1]:49607
00:31:39.482Z INFO SledAgent (BootstrapAgent): creating zone bundle
    directories = ["/pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone", "/pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"]
    zone = oxz_switch
00:31:39.482Z DEBG SledAgent (BootstrapAgent): creating bundle directory
    dir = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
00:31:39.482Z DEBG SledAgent (BootstrapAgent): creating bundle directory
    dir = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
00:31:39.482Z DEBG SledAgent (BootstrapAgent): created bundle tarball file
    path = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/057cedd1-5fc4-422a-a110-9f3c37ef4dc5.tar.gz
    zone = oxz_switch
00:31:39.482Z DEBG SledAgent (BootstrapAgent): wrote zone bundle metadata
    zone = oxz_switch
00:31:39.482Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["ptree"]
    zone = oxz_switch
00:31:39.492Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["uptime"]
    zone = oxz_switch
00:31:39.498Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["last"]
    zone = oxz_switch
00:31:39.504Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["who"]
    zone = oxz_switch
00:31:39.510Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["svcs", "-p"]
    zone = oxz_switch
00:31:39.558Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["netstat", "-an"]
    zone = oxz_switch
00:31:39.660Z DEBG SledAgent (BootstrapAgent): enumerated service processes
    procs = [ServiceProcess { service_name: "svc:/system/illumos/dendrite:default", binary: "/opt/oxide/dendrite/bin/dpd", pid: 20255, log_file: "/zone/oxz_switch/root/var/svc/log/system-illumos-dendrite:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/mgs:default", binary: "/opt/oxide/mgs/bin/mgs", pid: 20263, log_file: "/zone/oxz_switch/root/var/svc/log/oxide-mgs:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/system/illumos/mg-ddm:default", binary: "/opt/oxide/mg-ddm/bin/ddmd", pid: 20281, log_file: "/zone/oxz_switch/root/var/svc/log/system-illumos-mg-ddm:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/wicketd:default", binary: "/opt/oxide/wicketd/bin/wicketd", pid: 20273, log_file: "/zone/oxz_switch/root/var/svc/log/oxide-wicketd:default.log", rotated_log_files: [] }]
    zone = oxz_switch
00:31:39.661Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pfiles", "20255"]
    zone = oxz_switch
00:31:39.669Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pstack", "20255"]
    zone = oxz_switch
00:31:39.862Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pargs", "20255"]
    zone = oxz_switch
00:31:40.059Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
    log_file = /zone/oxz_switch/root/var/svc/log/system-illumos-dendrite:default.log
    zone = oxz_switch
00:31:40.065Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pfiles", "20263"]
    zone = oxz_switch
00:31:40.076Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pstack", "20263"]
    zone = oxz_switch
00:31:40.167Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pargs", "20263"]
    zone = oxz_switch
00:31:40.254Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
    log_file = /zone/oxz_switch/root/var/svc/log/oxide-mgs:default.log
    zone = oxz_switch
00:31:40.709Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pfiles", "20281"]
    zone = oxz_switch
00:31:40.719Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pstack", "20281"]
    zone = oxz_switch
00:31:40.830Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pargs", "20281"]
    zone = oxz_switch
00:31:40.925Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
    log_file = /zone/oxz_switch/root/var/svc/log/system-illumos-mg-ddm:default.log
    zone = oxz_switch
00:31:40.925Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pfiles", "20273"]
    zone = oxz_switch
00:31:40.937Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pstack", "20273"]
    zone = oxz_switch
00:31:41.060Z DEBG SledAgent (BootstrapAgent): running zone bundle command
    command = ["pargs", "20273"]
    zone = oxz_switch
00:31:41.176Z DEBG SledAgent (BootstrapAgent): appending current log file to zone bundle
    log_file = /zone/oxz_switch/root/var/svc/log/oxide-wicketd:default.log
    zone = oxz_switch
00:31:41.211Z DEBG SledAgent (BootstrapAgent): copying bundle
    from = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/057cedd1-5fc4-422a-a110-9f3c37ef4dc5.tar.gz
    to = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/057cedd1-5fc4-422a-a110-9f3c37ef4dc5.tar.gz
00:31:41.214Z INFO SledAgent (BootstrapAgent): finished zone bundle
    metadata = ZoneBundleMetadata { id: ZoneBundleId { zone_name: "oxz_switch", bundle_id: 057cedd1-5fc4-422a-a110-9f3c37ef4dc5 }, time_created: 2023-06-21T00:31:39.482415770Z }
00:31:41.214Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:101::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:101::1]:49607
    req_id = 6adca9a4-9a23-42dc-bcea-15e5b101031a
    response_code = 201
    uri = /zones/oxz_switch/bundles
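
The `log_file` paths in the `enumerated service processes` entry line up with the SMF service names in a predictable way: drop the `svc:/` scheme, turn the remaining slashes into dashes, and append `.log`. A minimal sketch of that mapping, inferred from the four services in this log rather than taken from the PR's code:

```rust
/// Derive the SMF log file name for a service FMRI, e.g.
/// `svc:/oxide/mgs:default` -> `oxide-mgs:default.log`.
///
/// Returns `None` if the FMRI does not start with `svc:/`.
fn smf_log_file_name(fmri: &str) -> Option<String> {
    let rest = fmri.strip_prefix("svc:/")?;
    Some(format!("{}.log", rest.replace('/', "-")))
}
```

Inside the zone these files live under `/var/svc/log`, which the sled agent reads from the global zone via `/zone/<zone_name>/root`.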

Collaborator

@smklein smklein left a comment


Truly superb work, this will be a boon to have.

I think the follow-ups you mentioned would be great next steps, with priority on:

  • Auto-collection for destroyed zones
  • Removal of old bundles

@bnaecker
Collaborator Author

Thanks for taking a look @smklein, and for the kind words! I'll move to the automated collection and automated cleanup parts, possibly at the same time, depending on how quickly my disks fill up :)

I've addressed everything in 3894ecd!

@bnaecker bnaecker enabled auto-merge (squash) June 23, 2023 05:09
bnaecker added 3 commits June 26, 2023 23:28
- Adds a dataset to the M.2s for storing debugging data.
- Adds basic mechanism for setting a ZFS quota on datasets.
- Adds HTTP endpoints for listing, creating, and fetching zone service
  bundles from the sled agent.
- Adds methods to `ServiceManager` for implementing the above. Zone
  bundles run a set of commands to get the zone-wide output and some key
  process-specific data for relevant processes from an Oxide service
  zone. These are packed into a tarball along with a simple metadata
  file, describing the zone bundle.
- Adds some helper methods in `RunningZone` and related for listing the
  expected SMF service names and processes associated with them based on
  the zone's manifest files.
- Adds dev tool `zb` for talking to the sled agent to operate on zone
  bundles.
- mv zb.rs -> zone-bundle.rs
- Add TOML extension to zone bundle metadata file
- Return 404 on bad zone name
- Typos, safety notes, and link to logadm(8)
@bnaecker bnaecker merged commit eab1cf5 into main Jun 27, 2023
@bnaecker bnaecker deleted the service-bundles branch June 27, 2023 06:51