Create zone bundles from ZFS snapshots #4225

Merged on Oct 16, 2023 (4 commits)

bnaecker
Collaborator

@bnaecker bnaecker commented Oct 6, 2023

  • Fixes Corrupt zone bundles #4010
  • Previously, we copied log files directly out of their original locations, which meant we contended with several other components: logadm rotating the log file; the log archiver moving the files to longer-term storage; and the program writing to the file itself. This commit changes the bundler to first create a ZFS snapshot of the filesystem(s) containing the log files, clone them, and then copy files out of the clones. We destroy those clones and snapshots after the bundle completes, and again when the sled-agent starts, to help with crash safety.
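
As a rough illustration of the sequence described above, here is a minimal sketch that only builds the `zfs` argument lists for setup and teardown. It is not the code from this PR: the helper names are hypothetical, and the clone parent dataset is simply the one that appears in the log output in this conversation.

```rust
/// Dataset under which clones are created. The name is copied from the log
/// output in this PR; treat it as an assumption, not the crate's constant.
const CLONE_PARENT: &str = "rpool/oxide-sled-agent-zone-bundle";

/// Arguments for one `zfs` invocation (everything after the program name).
type ZfsArgs = Vec<String>;

/// Hypothetical helper: turn string slices into an owned argument list.
fn args(parts: &[&str]) -> ZfsArgs {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Commands to run before copying log files: snapshot, then clone.
fn setup_commands(filesystem: &str, snap_name: &str) -> Vec<ZfsArgs> {
    let snapshot = format!("{}@{}", filesystem, snap_name);
    let clone = format!("{}/{}", CLONE_PARENT, snap_name);
    vec![
        args(&["snapshot", snapshot.as_str()]),
        args(&["clone", snapshot.as_str(), clone.as_str()]),
    ]
}

/// Commands to run after the bundle is written, and again at startup.
/// The clone goes first: a snapshot with a dependent clone cannot be
/// destroyed on its own.
fn teardown_commands(filesystem: &str, snap_name: &str) -> Vec<ZfsArgs> {
    let snapshot = format!("{}@{}", filesystem, snap_name);
    let clone = format!("{}/{}", CLONE_PARENT, snap_name);
    vec![
        args(&["destroy", clone.as_str()]),
        args(&["destroy", snapshot.as_str()]),
    ]
}
```

Each argument list could be handed to `std::process::Command::new("zfs").args(...)`. Keeping the plan as plain data makes the ordering easy to check without touching a real pool.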

@bnaecker bnaecker requested a review from jmpesp October 6, 2023 22:12
@bnaecker
Collaborator Author

bnaecker commented Oct 6, 2023

Here is a snippet of the sled-agent log during this operation:

18:49:43.954Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:769
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:101::1]:51867
18:49:43.956Z INFO SledAgent (StorageManager): creating zone bundle
    context = ZoneBundleContext { storage_dirs: ["/pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone", "/pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone"], cause: ExplicitRequest, extra_log_dirs: ["/pool/ext/31bd71cd-4736-4a12-a387-9b74b050396f/crypt/debug/oxz_switch", "/pool/ext/e4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/debug/oxz_switch", "/pool/ext/14b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/debug/oxz_switch", "/pool/ext/cd70d7f6-2354-4bf2-8012-55bf9eaf7930/crypt/debug/oxz_switch", "/pool/ext/d462a7f7-b628-40fe-80ff-4e4189e2d62b/crypt/debug/oxz_switch", "/pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/debug/oxz_switch", "/pool/ext/616b26df-e62a-4c68-b506-f4a923d8aaf7/crypt/debug/oxz_switch", "/pool/ext/ceb4461c-cf56-4719-ad3c-14430bfdfb60/crypt/debug/oxz_switch", "/pool/ext/f4b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/debug/oxz_switch"] }
    file = sled-agent/src/zone_bundle.rs:504
    zone_name = oxz_switch
18:49:43.956Z DEBG SledAgent (StorageManager): creating bundle directory
    dir = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
18:49:43.956Z DEBG SledAgent (StorageManager): creating bundle directory
    dir = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
18:49:43.956Z DEBG SledAgent (StorageManager): created bundle tarball file
    path = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01.tar.gz
    zone = oxz_switch
18:49:43.958Z DEBG SledAgent (StorageManager): wrote zone bundle metadata
    zone = oxz_switch
18:49:43.958Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["ptree"]
    zone = oxz_switch
18:49:43.969Z DEBG SledAgent (ContractReaper): Abandoned contract 63525
18:49:43.969Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["uptime"]
    zone = oxz_switch
18:49:43.974Z DEBG SledAgent (ContractReaper): Abandoned contract 63526
18:49:43.975Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["last"]
    zone = oxz_switch
18:49:43.980Z DEBG SledAgent (ContractReaper): Abandoned contract 63527
18:49:43.980Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["who"]
    zone = oxz_switch
18:49:43.987Z DEBG SledAgent (ContractReaper): Abandoned contract 63528
18:49:43.987Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["svcs", "-p"]
    zone = oxz_switch
18:49:44.063Z DEBG SledAgent (ContractReaper): Abandoned contract 63529
18:49:44.063Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["netstat", "-an"]
    zone = oxz_switch
18:49:44.071Z DEBG SledAgent (ContractReaper): Abandoned contract 63530
18:49:44.092Z DEBG SledAgent (ContractReaper): Abandoned contract 63531
18:49:44.099Z DEBG SledAgent (ContractReaper): Abandoned contract 63532
18:49:44.106Z DEBG SledAgent (ContractReaper): Abandoned contract 63533
18:49:44.113Z DEBG SledAgent (ContractReaper): Abandoned contract 63534
18:49:44.119Z DEBG SledAgent (ContractReaper): Abandoned contract 63535
18:49:44.126Z DEBG SledAgent (ContractReaper): Abandoned contract 63536
18:49:44.133Z DEBG SledAgent (ContractReaper): Abandoned contract 63537
18:49:44.139Z DEBG SledAgent (ContractReaper): Abandoned contract 63538
18:49:44.146Z DEBG SledAgent (ContractReaper): Abandoned contract 63539
18:49:44.152Z DEBG SledAgent (ContractReaper): Abandoned contract 63540
18:49:44.166Z DEBG SledAgent (ContractReaper): Abandoned contract 63541
18:49:44.178Z DEBG SledAgent (ContractReaper): Abandoned contract 63542
18:49:44.189Z DEBG SledAgent (ContractReaper): Abandoned contract 63543
18:49:44.198Z DEBG SledAgent (ContractReaper): Abandoned contract 63544
18:49:44.208Z DEBG SledAgent (ContractReaper): Abandoned contract 63545
18:49:44.215Z DEBG SledAgent (ContractReaper): Abandoned contract 63546
18:49:44.215Z DEBG SledAgent (StorageManager): enumerated service processes
    procs = [ServiceProcess { service_name: "svc:/oxide/sp-sim:default", binary: "/opt/oxide/sp-sim/bin/sp-sim", pid: 4243, log_file: "/var/svc/log/oxide-sp-sim:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/uplink:default", binary: "/opt/oxide/dendrite/bin/uplinkd", pid: 4189, log_file: "/var/svc/log/oxide-uplink:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/tfport:default", binary: "/opt/oxide/dendrite/bin/tfportd", pid: 4229, log_file: "/var/svc/log/oxide-tfport:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/dendrite:default", binary: "/opt/oxide/dendrite/bin/dpd", pid: 4176, log_file: "/var/svc/log/oxide-dendrite:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/system/illumos/mg-ddm:default", binary: "/opt/oxide/mg-ddm/bin/ddmd", pid: 4212, log_file: "/var/svc/log/system-illumos-mg-ddm:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/mgs:default", binary: "/opt/oxide/mgs/bin/mgs", pid: 4184, log_file: "/var/svc/log/oxide-mgs:default.log", rotated_log_files: [] }, ServiceProcess { service_name: "svc:/oxide/wicketd:default", binary: "/opt/oxide/wicketd/bin/wicketd", pid: 4202, log_file: "/var/svc/log/oxide-wicketd:default.log", rotated_log_files: [] }]
    zone = oxz_switch
18:49:44.284Z DEBG SledAgent (StorageManager): created snapshot
    filesystem = rpool/zone/oxz_switch
    snap_name = zone-root
18:49:44.348Z DEBG SledAgent (StorageManager): created clone from snapshot
    clone_name = rpool/oxide-sled-agent-zone-bundle/zone-root
    filesystem = rpool/zone/oxz_switch
    snap_name = zone-root
18:49:44.375Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4243"]
    zone = oxz_switch
18:49:44.386Z DEBG SledAgent (ContractReaper): Abandoned contract 63547
18:49:44.386Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4243"]
    zone = oxz_switch
18:49:44.517Z DEBG SledAgent (ContractReaper): Abandoned contract 63548
18:49:44.518Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4243"]
    zone = oxz_switch
18:49:44.602Z DEBG SledAgent (ContractReaper): Abandoned contract 63549
18:49:44.603Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-sp-sim:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-sp-sim:default.log.0"]
18:49:44.621Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-sp-sim:default.log
    zone = oxz_switch
18:49:44.624Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-sp-sim:default.log.0
    zone = oxz_switch
18:49:44.624Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4189"]
    zone = oxz_switch
18:49:44.637Z DEBG SledAgent (ContractReaper): Abandoned contract 63550
18:49:44.637Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4189"]
    zone = oxz_switch
18:49:44.694Z DEBG SledAgent (ContractReaper): Abandoned contract 63551
18:49:44.695Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4189"]
    zone = oxz_switch
18:49:44.733Z DEBG SledAgent (ContractReaper): Abandoned contract 63552
18:49:44.908Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-tfport:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-tfport:default.log.0"]
18:49:44.927Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-tfport:default.log
    zone = oxz_switch
18:49:44.930Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-tfport:default.log.0
    zone = oxz_switch
18:49:44.930Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4176"]
    zone = oxz_switch
18:49:44.941Z DEBG SledAgent (ContractReaper): Abandoned contract 63556
18:49:44.941Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4176"]
    zone = oxz_switch
18:49:45.078Z DEBG SledAgent (ContractReaper): Abandoned contract 63557
18:49:45.079Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4176"]
    zone = oxz_switch
18:49:45.192Z DEBG SledAgent (ContractReaper): Abandoned contract 63558
18:49:45.192Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-dendrite:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-dendrite:default.log.0"]
18:49:45.370Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-dendrite:default.log
    zone = oxz_switch
18:49:45.403Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-dendrite:default.log.0
    zone = oxz_switch
18:49:45.403Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4212"]
    zone = oxz_switch
18:49:45.416Z DEBG SledAgent (ContractReaper): Abandoned contract 63559
18:49:45.416Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4212"]
    zone = oxz_switch
18:49:45.557Z DEBG SledAgent (ContractReaper): Abandoned contract 63560
18:49:45.559Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4212"]
    zone = oxz_switch
18:49:45.658Z DEBG SledAgent (ContractReaper): Abandoned contract 63561
18:49:45.659Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/system-illumos-mg-ddm:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/system-illumos-mg-ddm:default.log.0"]
18:49:45.659Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/system-illumos-mg-ddm:default.log
    zone = oxz_switch
18:49:45.660Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/system-illumos-mg-ddm:default.log.0
    zone = oxz_switch
18:49:45.660Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4184"]
    zone = oxz_switch
18:49:45.671Z DEBG SledAgent (ContractReaper): Abandoned contract 63562
18:49:45.671Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4184"]
    zone = oxz_switch
18:49:45.810Z DEBG SledAgent (ContractReaper): Abandoned contract 63563
18:49:45.812Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4184"]
    zone = oxz_switch
18:49:45.914Z DEBG SledAgent (ContractReaper): Abandoned contract 63564
18:49:45.915Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-mgs:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-mgs:default.log.0"]
18:49:45.984Z DEBG SledAgent (BootstrapAgentStartup): client request
    DdmAdminClient = [::1]:8000
    body = None
    method = GET
    uri = http://[::1]:8000/prefixes
18:49:45.985Z DEBG SledAgent (BootstrapAgentStartup): client response
    DdmAdminClient = [::1]:8000
    result = Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(::1)), port: Some(8000), path: "/prefixes", query: None, fragment: None }, status: 200, headers: {"content-type": "application/json", "x-request-id": "1e08c05c-e254-4510-ae43-980f4bbd852f", "content-length": "99", "date": "Thu, 05 Oct 2023 18:49:45 GMT"} })
18:49:45.985Z INFO SledAgent (BootstrapAgentStartup): Received prefixes from ddmd
    DdmAdminClient = [::1]:8000
    file = ddm-admin-client/src/lib.rs:119
    prefixes = {"fe80::9ce0:47ff:fe3f:2a59": [PathVector { destination: Ipv6Prefix { addr: fd00:99::, len: 64 }, path: ["oxz_switch"] }]}
18:49:46.016Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-mgs:default.log
    zone = oxz_switch
18:49:46.035Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-mgs:default.log.0
    zone = oxz_switch
18:49:46.035Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pfiles", "4202"]
    zone = oxz_switch
18:49:46.050Z DEBG SledAgent (ContractReaper): Abandoned contract 63565
18:49:46.051Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pstack", "4202"]
    zone = oxz_switch
18:49:46.265Z DEBG SledAgent (ContractReaper): Abandoned contract 63566
18:49:46.266Z DEBG SledAgent (StorageManager): running zone bundle command
    command = ["pargs", "4202"]
    zone = oxz_switch
18:49:46.405Z DEBG SledAgent (ContractReaper): Abandoned contract 63567
18:49:46.406Z DEBG SledAgent (StorageManager): found log files
    log_files = ["/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-wicketd:default.log", "/rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-wicketd:default.log.0"]
18:49:46.407Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-wicketd:default.log
    zone = oxz_switch
18:49:46.408Z DEBG SledAgent (StorageManager): appended log file to zone bundle
    log_file = /rpool/oxide-sled-agent-zone-bundle/zone-root/root/var/svc/log/oxide-wicketd:default.log.0
    zone = oxz_switch
18:49:46.408Z DEBG SledAgent (StorageManager): copying bundle
    from = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01.tar.gz
    to = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01.tar.gz
18:49:46.480Z DEBG SledAgent (StorageManager): destroyed zone bundle ZFS clone
    clone_name = rpool/oxide-sled-agent-zone-bundle/zone-root
18:49:46.504Z DEBG SledAgent (StorageManager): destroyed zone bundle ZFS snapshot
    snapshot = rpool/zone/oxz_switch@zone-root
18:49:46.504Z INFO SledAgent (StorageManager): finished zone bundle
    file = sled-agent/src/zone_bundle.rs:1397
    metadata = ZoneBundleMetadata { id: ZoneBundleId { zone_name: "oxz_switch", bundle_id: 3843028f-ab76-4443-8f83-719c26044c01 }, time_created: 2023-10-05T18:49:43.956834493Z, version: 0, cause: ExplicitRequest }
18:49:46.504Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:853
    latency_us = 2549839
    local_addr = [fd00:1122:3344:101::1]:12345
    method = POST
    remote_addr = [fd00:1122:3344:101::1]:51867
    req_id = fb4302ff-73ff-4505-9285-a1f440f47ab4
    response_code = 201
    uri = /zones/bundles/oxz_switch
18:49:46.506Z DEBG SledAgent (StorageManager): searching directory for zone bundles
    directory = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
18:49:46.507Z DEBG SledAgent (StorageManager): checking path as zone bundle
    path = /pool/int/b462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01.tar.gz
18:49:46.508Z DEBG SledAgent (StorageManager): searching directory for zone bundles
    directory = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch
18:49:46.508Z DEBG SledAgent (StorageManager): checking path as zone bundle
    path = /pool/int/a462a7f7-b628-40fe-80ff-4e4189e2d62b/debug/bundle/zone/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01.tar.gz
18:49:46.508Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/bnaecker/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:853
    latency_us = 2863
    local_addr = [fd00:1122:3344:101::1]:12345
    method = GET
    remote_addr = [fd00:1122:3344:101::1]:51867
    req_id = e4d27fc6-ee3e-4155-960a-104ee39124be
    response_code = 200
    uri = /zones/bundles/oxz_switch/3843028f-ab76-4443-8f83-719c26044c01

I've tested this a bunch on my local machine, and the bundles look accurate. I would still like to test this against the zone which is most likely to create problems, ClickHouse. That one spews an absolute mountain of data into its log files, so it'll be a good check that this does actually prevent corruption.

Here's a snippet of a successful bundle:

bnaecker@shale : ~/omicron $ tar -tf 11064f9c-66d9-44b8-9efe-76d0635a56ad.tar.gz
metadata.toml
ptree
uptime
last
who
svcs
netstat
pfiles.2987
pstack.2987
pargs.2987
oxide-wicketd:default.log
oxide-wicketd:default.log.0
pfiles.2973
pstack.2973
pargs.2973
oxide-uplink:default.log
oxide-uplink:default.log.0
pfiles.3031
pstack.3031
pargs.3031
oxide-sp-sim:default.log
oxide-sp-sim:default.log.0
pfiles.3000
pstack.3000
pargs.3000
system-illumos-mg-ddm:default.log
system-illumos-mg-ddm:default.log.0
pfiles.2968
pstack.2968
pargs.2968
oxide-mgs:default.log
oxide-mgs:default.log.0
pfiles.2960
pstack.2960
pargs.2960
oxide-dendrite:default.log
oxide-dendrite:default.log.0
pfiles.3016
pstack.3016
pargs.3016
oxide-tfport:default.log
oxide-tfport:default.log.0

So we're still getting the rotated log files and all the command output. I'd also like to check that we get the archived bundles as well before merging this.

- Fixes #4010
- Previously, we copied log files directly out of their original
  locations, which meant we contended with several other components:
  `logadm` rotating the log file; the log archiver moving the files to
  longer-term storage; and the program writing to the file itself. This
  commit changes the bundler to first create a ZFS snapshot of the
  filesystem(s) containing the log files, clone them, and then copy
  files out of the clones. We destroy those clones and snapshots after
  the bundle completes, and again when the sled-agent starts, to help
  with crash safety.
@bnaecker bnaecker force-pushed the bundle-logs-via-zfs-snapshot branch from ada2189 to fe1fc69 Compare October 7, 2023 03:41
@citrus-it
Contributor

This is a really neat way to fix the consistency problem.
I just wanted to mention that you could skip the clone/destroy steps and access the read-only snapshot directly, if that helps in any way to make this faster or easier to parallelise:

BRM42220051 # zfs snapshot oxp_4a624324-003a-4255-98e8-546a90b5b7fa/crypt/zone/oxz_ntp_3ccea933-89f2-4ce5-8367-efb0afeffe97@bob
BRM42220051 # ls /pool/ext/4a624324-003a-4255-98e8-546a90b5b7fa/crypt/zone/oxz_ntp_3ccea933-89f2-4ce5-8367-efb0afeffe97/.zfs/snapshot/bob/root
bin     etc     home    mnt     proc    sbin    tmp     var
dev     export  lib     opt     root    system  usr
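
The path layout in that `ls` can be sketched as a small helper. This is only an illustration of the `.zfs/snapshot/<name>` convention shown above; the function name is hypothetical and it assumes the filesystem's mountpoint is already known.

```rust
use std::path::{Path, PathBuf};

/// Map a path inside a mounted ZFS filesystem to the location of the same
/// file inside one of that filesystem's snapshots, using the read-only
/// `.zfs/snapshot/<snap_name>` directory under the mountpoint.
///
/// `relative` is interpreted relative to the mountpoint. A leading `/` is
/// stripped so that `Path::join` does not discard the prefix.
fn path_in_snapshot(mountpoint: &Path, snap_name: &str, relative: &Path) -> PathBuf {
    let relative = relative.strip_prefix("/").unwrap_or(relative);
    mountpoint
        .join(".zfs")
        .join("snapshot")
        .join(snap_name)
        .join(relative)
}
```

With the mountpoint and snapshot name from the example above, passing `root` as the relative path yields the directory that was listed. No clone is created, so there is nothing extra to destroy afterwards besides the snapshot itself.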

@bnaecker
Collaborator Author

bnaecker commented Oct 7, 2023 via email

sled-agent/src/zone_bundle.rs (review thread resolved)
- Use snapshots directly, no need for clones
- Set properties on ZFS snapshots at creation time, avoiding possible
  crash inconsistency
@bnaecker
Collaborator Author

Thanks for the suggestion @jmpesp, I've added the properties directly at snapshot creation time. @citrus-it could you please take a look too, if you get a chance? I took your suggestion, reading from the snapshots directly without an intervening clone.
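
For readers following along, here is a minimal sketch of why setting the property at creation time helps. `zfs snapshot -o property=value` tags the snapshot atomically, so a crash can never leave an untagged snapshot behind, and startup cleanup can find leftovers by listing snapshots with that property. The property name and value below are hypothetical, not the ones used in this PR.

```rust
/// Hypothetical user property marking snapshots created for zone bundles.
/// ZFS user property names must contain a colon.
const BUNDLE_PROP: &str = "example:zone-bundle";
const BUNDLE_VALUE: &str = "true";

/// Arguments for `zfs snapshot -o <prop>=<value> <filesystem>@<snap_name>`.
fn snapshot_args(filesystem: &str, snap_name: &str) -> Vec<String> {
    vec![
        "snapshot".to_string(),
        "-o".to_string(),
        format!("{}={}", BUNDLE_PROP, BUNDLE_VALUE),
        format!("{}@{}", filesystem, snap_name),
    ]
}

/// Arguments for `zfs list -H -t snapshot -o name,<prop>`.
/// `-H` prints tab-separated fields with no header line.
fn list_args() -> Vec<String> {
    vec![
        "list".to_string(),
        "-H".to_string(),
        "-t".to_string(),
        "snapshot".to_string(),
        "-o".to_string(),
        format!("name,{}", BUNDLE_PROP),
    ]
}

/// Pick out the snapshots carrying our property from `zfs list -H` output.
/// Snapshots without the property report `-` in that column and are skipped.
fn owned_snapshots(list_output: &str) -> Vec<String> {
    list_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split('\t');
            let name = fields.next()?;
            let value = fields.next()?;
            (value == BUNDLE_VALUE).then(|| name.to_string())
        })
        .collect()
}
```

At startup, each name returned by `owned_snapshots` would be passed to `zfs destroy`, which leaves snapshots created by operators or other tools alone.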

@bnaecker bnaecker requested review from jmpesp and citrus-it October 10, 2023 19:44
@bnaecker
Collaborator Author

I've confirmed that ClickHouse bundles can be made without any problems:

bnaecker@shale : ~/omicron/sled-agent $ cargo r --bin zone-bundle -- get --create oxz_clickhouse_0766c203-71a2-47c9-9f03-4b48a7dc39d0
    Finished dev [unoptimized + debuginfo] target(s) in 0.52s
     Running `/home/bnaecker/omicron/target/debug/zone-bundle get --create oxz_clickhouse_0766c203-71a2-47c9-9f03-4b48a7dc39d0`
Created zone bundle: oxz_clickhouse_0766c203-71a2-47c9-9f03-4b48a7dc39d0/1a191668-c941-463a-8bb0-b5e43ba1aaf2
bnaecker@shale : ~/omicron/sled-agent $ tar -tf 1a191668-c941-463a-8bb0-b5e43ba1aaf2.tar.gz
metadata.toml
ptree
uptime
last
who
svcs
netstat
pfiles.21868
pstack.21868
pargs.21868
oxide-clickhouse:default.log
oxide-clickhouse:default.log.0
pfiles.21897
pstack.21897
pargs.21897
oxide-clickhouse:default.log
oxide-clickhouse:default.log.0

I'd still like to test this on a real Gimlet before merging. The log archival process ignores the file-backed vdevs we use on my developer machine, so I want to make sure we pick up the archived log files correctly before closing this. I'll do that tonight, if I can snag a machine like madrid for a test run.

Contributor

@citrus-it citrus-it left a comment

This turned out neat. I just spotted a couple of typos and one possible correctness thing.

illumos-utils/src/zfs.rs (outdated review thread, resolved)
sled-agent/src/zone_bundle.rs (outdated review thread, resolved)
sled-agent/src/zone_bundle.rs (outdated review thread, resolved)
@bnaecker
Collaborator Author

Ok, I've had the chance to test this a bunch on madrid. I was stalled by #4269, which I'll also include a fix for in this PR because it's vanishingly small. Here's the result of bundling the switch zone, which requires that fix:

BRM44220001 # /opt/oxide/sled-agent/zone-bundle get --create oxz_switch
Created zone bundle: oxz_switch/79f7cf0a-fa6e-4ac2-b9e5-bf483cb4fbd5
BRM44220001 # tar -tf 79f7cf0a-fa6e-4ac2-b9e5-bf483cb4fbd5.tar.gz
metadata.toml
ptree
uptime
last
who
svcs
netstat
pfiles.24187
pstack.24187
pargs.24187
oxide-mgs:default.log
oxide-mgs:default.log.1697165100
oxide-mgs:default.log.1697134496
oxide-mgs:default.log.1697148899
oxide-mgs:default.log.1697166900
pfiles.24203
pstack.24203
pargs.24203
oxide-uplink:default.log
oxide-uplink:default.log.1697165075
oxide-uplink:default.log.1697166426
oxide-uplink:default.log.1697134361
oxide-uplink:default.log.1697134500
pfiles.24236
pstack.24236
pargs.24236
system-illumos-mg-ddm:default.log
system-illumos-mg-ddm:default.log.1697165100
system-illumos-mg-ddm:default.log.1697166518
system-illumos-mg-ddm:default.log.1697134386
system-illumos-mg-ddm:default.log.1697134501
pfiles.24293
pstack.24293
pargs.24293
oxide-dendrite:default.log
oxide-dendrite:default.log.1697165100
oxide-dendrite:default.log.1697134501
oxide-dendrite:default.log.1697148901
oxide-dendrite:default.log.1697166900
pfiles.24198
pstack.24198
pargs.24198
oxide-tfport:default.log
oxide-tfport:default.log.1697165091
oxide-tfport:default.log.1697134471
oxide-tfport:default.log.1697148872
oxide-tfport:default.log.1697166879

And ClickHouse, which was the worst offender in terms of zone-bundle corruption caused by the process itself writing into the log file concurrently with the bundling:

BRM44220001 # tar -tf 3f30b9ac-908c-4caa-93dd-136d57131d0f.tar.gz
metadata.toml
ptree
uptime
last
who
svcs
netstat
pfiles.6563
pstack.6563
pargs.6563
oxide-clickhouse:default.log
oxide-clickhouse:default.log.1697113799
oxide-clickhouse:default.log.1697084997
oxide-clickhouse:default.log.1697099398
oxide-clickhouse:default.log.1697128199
oxide-clickhouse:default.log.1697070601
oxide-clickhouse:default.log.1697051699
oxide-clickhouse:default.log.1697068801
oxide-clickhouse:default.log.1697165995
oxide-clickhouse:default.log.1697134500
oxide-clickhouse:default.log.1697148901
oxide-clickhouse:default.log.1697166896
pfiles.6588
pstack.6588
pargs.6588
oxide-clickhouse:default.log
oxide-clickhouse:default.log.1697113799
oxide-clickhouse:default.log.1697084997
oxide-clickhouse:default.log.1697099398
oxide-clickhouse:default.log.1697128199
oxide-clickhouse:default.log.1697070601
oxide-clickhouse:default.log.1697051699
oxide-clickhouse:default.log.1697068801
oxide-clickhouse:default.log.1697165995
oxide-clickhouse:default.log.1697134500
oxide-clickhouse:default.log.1697148901
oxide-clickhouse:default.log.1697166896

Note that neither hits the tar directory checksum errors. Bundling all zones on the host also no longer hits that failure mode:

BRM44220001 # /opt/oxide/sled-agent/zone-bundle bundle-all
BRM44220001 # tar -tf BRM44220001-2023-10-13T03-25-09.tar.gz
oxz_clickhouse_f60e58fd-1af3-46b9-956e-1a73bdb86deb/298d8e0b-d71c-465f-861b-fc6be86e1898.tar.gz
oxz_cockroachdb_55487272-5c6c-49ea-be82-caa0ea0cdab7/50bde425-62a9-45c3-bc5c-a89ee5b75130.tar.gz
oxz_cockroachdb_c5713ad3-6db6-496e-960f-d3a5edeb26f6/4e9dd778-9bf3-464b-8f3d-6771aacd9ee0.tar.gz
oxz_crucible_01809e1b-db56-4dac-ad37-6165f389044f/4d3564f3-a1d7-4a01-933d-4e4526588dd2.tar.gz
oxz_crucible_1eafee08-505a-425d-ac68-1de1c5d97011/57232ec3-7573-4388-8008-133dc2ea8d92.tar.gz
oxz_crucible_1f5accd3-2ce2-4290-b6cf-55ad46f602c9/ad57f187-2296-4a2d-8f80-656507706384.tar.gz
oxz_crucible_35273637-042c-458a-b74d-a311e906c24e/eec2c8a6-f6d2-4e56-afec-6570a2db5bab.tar.gz
oxz_crucible_40892d6e-9e53-4a27-ac4c-3407e0511260/c6fa873e-5d0d-48a1-83dd-cce2607bf560.tar.gz
oxz_crucible_49822385-8ddf-4088-bcc7-146de95c4122/de56c1b4-3be9-49c1-bc3d-0e48a16bff65.tar.gz
oxz_crucible_94de9b80-ab34-4d85-8e86-280a50d72eb6/073c7bd8-3531-4e83-888d-82a72dea4dcc.tar.gz
oxz_crucible_abfc0428-0af6-4476-ae07-bb23f0e5b284/f8b86efa-5c60-486a-8ca9-79904c7fca16.tar.gz
oxz_crucible_e6448d71-ab57-42d0-9fb9-f337eb08cf93/f357f377-457c-41c2-a077-0b1b86c2afa4.tar.gz
oxz_crucible_pantry_ed2d62ac-1470-495e-b72b-1d5d049b89c8/3422fa18-f397-4b0d-a774-2f8d0e6d5af9.tar.gz
oxz_internal_dns_10cab8c8-23e3-4c39-934e-acaae139dabf/8cc005fa-25fd-44b8-afc5-becb1ea14753.tar.gz
oxz_nexus_745cdd28-d1e5-4e41-a9ca-26b3343af260/dc567c9a-27dd-47d2-a68e-d5da45985ca6.tar.gz
oxz_ntp_754425fb-796f-4f86-a4e6-b8bfa3f02043/4bc74218-e040-4338-bf19-8e5e07b5e1bb.tar.gz
oxz_switch/15203c0a-61ee-4932-aeab-62c17040ca87.tar.gz

@bnaecker
Collaborator Author

This also fixes #4269.

@bnaecker bnaecker requested a review from citrus-it October 13, 2023 03:28
- Fixes #4269, so the rpool/zone dataset can be mounted in the GZ
- Only fetch archived logs once
- Fix logic for matching archived log file names
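
The last item concerns matching archived log file names. The actual fix is not shown in this conversation, so the following is only a sketch of one plausible rule, inferred from the file names in the listings above: a rotated or archived file is the live log name followed by a dot and a run of digits (`.0` locally, a Unix timestamp on a Gimlet). The function name is hypothetical.

```rust
/// Return true if `candidate` looks like a rotated or archived copy of the
/// log file named `base`, i.e. `<base>.<digits>`.
///
/// Assumed rule, inferred from names such as `oxide-mgs:default.log.0` and
/// `oxide-mgs:default.log.1697165100`; the live log itself does not match.
fn is_rotated_log_of(base: &str, candidate: &str) -> bool {
    match candidate
        .strip_prefix(base)
        .and_then(|rest| rest.strip_prefix('.'))
    {
        Some(suffix) => !suffix.is_empty() && suffix.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}
```

Anchoring the match on the full base name plus a dot keeps a log such as `oxide-mgs:default.log` from also claiming files that merely share a prefix with it.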
@bnaecker bnaecker merged commit 7d88789 into main Oct 16, 2023
@bnaecker bnaecker deleted the bundle-logs-via-zfs-snapshot branch October 16, 2023 17:17