[sled-agent] Self assembling switch zone #5593

Merged
merged 60 commits into from
Jul 22, 2024

karencfv
Contributor

karencfv commented Apr 22, 2024

Overview

This PR migrates the switch zone to a self assembling format. There are a few bits of old code I'll be cleaning up, more logs I'll be adding and some documentation about how the switch zone flow works, but I'll do this in follow up PRs to keep this one as compact as possible.

Caveats

I've tested this in a local single node deployment and in the a4x2 testbed. Unfortunately, this is not enough testing to make sure all of the services play nice together on a real rack. We'll have to keep an eye on dogfood when this is deployed.

The only services that depend on the common networking service are dendrite and MGS. While this makes sense on the a4x2 testbed, I'd like to verify that these dependencies make sense when running on a real rack.

As several people have worked on different parts of this zone, I've tagged a whole bunch of people for review, sorry if this is overkill! Just want to make sure I've got the right eyes on each service of the zone.

Related: #1898
Closes: #2884

TODO:

@karencfv karencfv self-assigned this Apr 22, 2024
@karencfv
Contributor Author

karencfv commented May 3, 2024

Something that I wasn't expecting is going on here. The switch zone is booting before it's being given a chance to retrieve the data from the PropertyGroupBuilder.

I came to this conclusion from the failing svc:/oxide/zone-network-setup:default service. What surprised me is that it's failing because the properties defined through the PropertyGroupBuilder apparently aren't being propagated in time.

The logs show:

root@oxz_switch:~# cat /var/svc/log/oxide-zone-network-setup:default.log
[ May  3 01:50:30 Enabled. ]
[ May  3 01:50:30 Rereading configuration. ]
[ May  3 01:50:32 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d unknown -s unknown -g unknown"). ]
note: configured to log to "/dev/stderr"
error: invalid value 'unknown' for '--datalink <STRING>': ERROR: Missing data link

For more information, try '--help'.
[ May  3 01:50:32 Method "start" exited with status 2. ]
[ May  3 01:50:32 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d unknown -s unknown -g unknown"). ]
note: configured to log to "/dev/stderr"
error: invalid value 'unknown' for '--datalink <STRING>': ERROR: Missing data link

For more information, try '--help'.
[ May  3 01:50:32 Method "start" exited with status 2. ]
[ May  3 01:50:32 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d unknown -s unknown -g unknown"). ]
note: configured to log to "/dev/stderr"
error: invalid value 'unknown' for '--datalink <STRING>': ERROR: Missing data link

For more information, try '--help'.
[ May  3 01:50:32 Method "start" exited with status 2. ]

What's interesting here is this line -> [ May 3 01:50:32 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d unknown -s unknown -g unknown"). ]

The "unknown" arguments in that command suggest that the start method is running against a manifest that hasn't been populated by the PropertyGroupBuilder.

Weirdly, in the sled agent's logs the switch zone's profile has all the correct information:

01:34:29.965Z INFO SledAgent (ServiceManager): Profile for oxz_switch:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="oxide/zone-network-setup">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlService0"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
            <propval type="astring" name="static_addr" value="::1"/>
          </property_group>
        <instance enabled="true" name="default">
        </instance>
      </service>
      <service version="1" type="service" name="network/dns/client">
        <instance enabled="false" name="default">
        </instance>
      </service>
      <service version="1" type="service" name="oxide/mgs">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="address" value="[::1]:12225"/>
            <propval type="astring" name="id" value="73bce50a-3cc5-42da-bb07-2f70eca38852"/>
            <propval type="astring" name="rack_id" value="305d0504-786e-48c6-adba-d8ab5db6d0ed"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/wicketd">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="address" value="[::1]:12226"/>
            <propval type="astring" name="artifact-address" value="[fdb0:a8a1:59c7:4e85::2]:12227"/>
            <propval type="astring" name="mgs-address" value="[::1]:12225"/>
            <propval type="astring" name="nexus-proxy-address" value="[::]:12229"/>
            <propval type="astring" name="rack-subnet" value="fd00:1122:3344::"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/switch_zone_setup">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="zone_name" value="oxz_switch"/>
            <propval type="astring" name="bootstrap_addr" value="fdb0:a8a1:59c7:4e85::2"/>
            <propval type="astring" name="bootstrap_name" value="oxBootstrap0"/>
            <propval type="astring" name="bootstrap_vnic" value="oxBootstrap0"/>
            <propval type="astring" name="gz_local_link_addr" value="fe80::8:20ff:fed8:3afe"/>
            <propval type="astring" name="link_local_links" value="tfportrear0_0"/>
            <propval type="astring" name="config/baseboard_info" value="{
      "type": "pc",
      "identifier": "centzon",
      "model": "i86pc"
    }"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/dendrite">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="sled_id" value="7ba81148-31ce-4df8-99c3-79abd576b1e5"/>
            <propval type="astring" name="rack_id" value="305d0504-786e-48c6-adba-d8ab5db6d0ed"/>
            <propval type="astring" name="address" value="[::1]:12224"/>
            <propval type="astring" name="front_ports" value="1"/>
            <propval type="astring" name="rear_ports" value="1"/>
            <propval type="astring" name="port_config" value="/opt/oxide/dendrite/misc/softnpu_single_sled_config.toml"/>
            <propval type="astring" name="mgmt" value="uds"/>
            <propval type="astring" name="uds_path" value="/opt/softnpu/stuff"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/tfport">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="host" value="[::1]"/>
            <propval type="astring" name="port" value="12224"/>
            <propval type="astring" name="flags" value="--sync-only"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/lldpd">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="board_rev" value="softnpu_front_1_rear_1"/>
            <propval type="astring" name="scrimlet_id" value="centzon"/>
            <propval type="astring" name="scrimlet_model" value="i86pc"/>
            <propval type="astring" name="address" value="[::1]:12230"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/pumpkind">
      </service>
      <service version="1" type="service" name="oxide/mgd">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="sled_uuid" value="7ba81148-31ce-4df8-99c3-79abd576b1e5"/>
            <propval type="astring" name="rack_uuid" value="305d0504-786e-48c6-adba-d8ab5db6d0ed"/>
          </property_group>
        </instance>
      </service>
      <service version="1" type="service" name="oxide/mg-ddm">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="mode" value="transit"/>
            <propval type="astring" name="dendrite" value="true"/>
            <propval type="astring" name="sled_uuid" value="7ba81148-31ce-4df8-99c3-79abd576b1e5"/>
            <propval type="astring" name="rack_uuid" value="305d0504-786e-48c6-adba-d8ab5db6d0ed"/>
            <propval type="astring" name="interfaces" value="("tfportrear0_0/ll")"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
    file = sled-agent/src/profile.rs:34

I'm a bit unsure about what's going on here. @smklein perhaps you could point me in the right direction?

@karencfv
Contributor Author

karencfv commented May 3, 2024

Ok, so it looks like ensure_zone() https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/services.rs#L3848-L3856 initialises the switch zone. When the sled local zone is disabled, ensure_zone() calls start_zone(), which in turn calls initialize_zone() for the switch zone. At that point it should populate the switch zone profile with the PropertyGroupBuilder information, but it does not.

sus

@karencfv
Contributor Author

karencfv commented May 7, 2024

Update after a very helpful chat with @smklein :

The services appear to be using the unpopulated manifest files in /var/svc/manifest/site/{SERVICE}/manifest.xml rather than the populated profile in /var/svc/profile/site.xml, as can be seen below:

root@oxz_switch:~# cat /var/svc/profile/site.xml
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
  <service version="1" type="service" name="oxide/zone-network-setup">
      <property_group type="application" name="config">
        <propval type="astring" name="datalink" value="oxControlService0"/>
        <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
        <propval type="astring" name="static_addr" value="::1"/>
      </property_group>
    <instance enabled="true" name="default">
    </instance>
  </service>
  <service version="1" type="service" name="network/dns/client">
    <instance enabled="false" name="default">
    </instance>
  </service>
  <service version="1" type="service" name="oxide/mgs">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="address" value="[::1]:12225"/>
        <propval type="astring" name="id" value="cdde4d2b-9c04-4c06-9087-9a04980308ec"/>
        <propval type="astring" name="rack_id" value="efa8e75a-4172-4867-a290-b8b0a7635198"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/wicketd">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="address" value="[::1]:12226"/>
        <propval type="astring" name="artifact-address" value="[fdb0:a8a1:59c7:4e85::2]:12227"/>
        <propval type="astring" name="mgs-address" value="[::1]:12225"/>
        <propval type="astring" name="nexus-proxy-address" value="[::]:12229"/>
        <propval type="astring" name="rack-subnet" value="fd00:1122:3344::"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/switch_zone_setup">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="zone_name" value="oxz_switch"/>
        <propval type="astring" name="bootstrap_addr" value="fdb0:a8a1:59c7:4e85::2"/>
        <propval type="astring" name="bootstrap_name" value="oxBootstrap0"/>
        <propval type="astring" name="bootstrap_vnic" value="oxBootstrap0"/>
        <propval type="astring" name="gz_local_link_addr" value="fe80::8:20ff:fe53:10fb"/>
        <propval type="astring" name="link_local_links" value="tfportrear0_0"/>
        <propval type="astring" name="baseboard_info" value="{
  "type": "pc",
  "identifier": "centzon",
  "model": "i86pc"
}"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/dendrite">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="sled_id" value="1367174e-7e97-4358-aecb-f5d14f1fe98d"/>
        <propval type="astring" name="rack_id" value="efa8e75a-4172-4867-a290-b8b0a7635198"/>
        <propval type="astring" name="address" value="[::1]:12224"/>
        <propval type="astring" name="front_ports" value="1"/>
        <propval type="astring" name="rear_ports" value="1"/>
        <propval type="astring" name="port_config" value="/opt/oxide/dendrite/misc/softnpu_single_sled_config.toml"/>
        <propval type="astring" name="mgmt" value="uds"/>
        <propval type="astring" name="uds_path" value="/opt/softnpu/stuff"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/tfport">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="host" value="[::1]"/>
        <propval type="astring" name="port" value="12224"/>
        <propval type="astring" name="flags" value="--sync-only"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/lldpd">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="board_rev" value="softnpu_front_1_rear_1"/>
        <propval type="astring" name="scrimlet_id" value="centzon"/>
        <propval type="astring" name="scrimlet_model" value="i86pc"/>
        <propval type="astring" name="address" value="[::1]:12230"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/pumpkind">
  </service>
  <service version="1" type="service" name="oxide/mgd">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="sled_uuid" value="1367174e-7e97-4358-aecb-f5d14f1fe98d"/>
        <propval type="astring" name="rack_uuid" value="efa8e75a-4172-4867-a290-b8b0a7635198"/>
      </property_group>
    </instance>
  </service>
  <service version="1" type="service" name="oxide/mg-ddm">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="mode" value="transit"/>
        <propval type="astring" name="dendrite" value="true"/>
        <propval type="astring" name="sled_uuid" value="1367174e-7e97-4358-aecb-f5d14f1fe98d"/>
        <propval type="astring" name="rack_uuid" value="efa8e75a-4172-4867-a290-b8b0a7635198"/>
        <propval type="astring" name="interfaces" value="("tfportrear0_0/ll")"/>
      </property_group>
    </instance>
  </service>
</service_bundle>

root@oxz_switch:~# cat /var/svc/manifest/site/zone-network-setup/manifest.xml 
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">

<service_bundle type='manifest' name='zone-network-setup'>

<service name='oxide/zone-network-setup' type='service' version='1'>
  <create_default_instance enabled='true' />

  <!-- Run after the operating system's svc:/network/physical service is done. -->
  <dependency name='physical' grouping='require_all' restart_on='none'
    type='service'>
  <service_fmri value='svc:/network/physical:default' />
  </dependency>

  <dependency name='multi_user' grouping='require_all' restart_on='none'
    type='service'>
  <service_fmri value='svc:/milestone/multi-user:default' />
  </dependency>

  <exec_method type='method' name='start'
    exec='/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d %{config/datalink} -s %{config/static_addr} -g %{config/gateway}'
    timeout_seconds='0' />
  
  <property_group name='startd' type='framework'>
    <propval name='duration' type='astring' value='transient' />
  </property_group>

  <property_group name='config' type='application'>
    <propval name='datalink' type='astring' value='unknown' />
    <propval name='gateway' type='astring' value='unknown' />
    <propval name='static_addr' type='astring' value='unknown' />
  </property_group>

  <stability value='Unstable' />

  <template>
    <common_name>
      <loctext xml:lang='C'>Oxide Zone Network Setup</loctext>
    </common_name>
    <description>
      <loctext xml:lang='C'>Configures networking for control plane zones</loctext>
    </description>
  </template>
</service>

</service_bundle>
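
To make the symptom concrete, here is a minimal sketch of how the exec string in this manifest turns into the command seen in the service log when only the manifest defaults are in effect. The function `expand_method_tokens` and the two property maps are hypothetical; this is a simplified stand-in for SMF's method token expansion, not its real implementation. The exec string, the `unknown` defaults, and the profile values are copied from the files quoted above.

```rust
use std::collections::HashMap;

/// Replace each `%{group/prop}` token in `exec` with its value from `props`.
/// Simplified stand-in for SMF method-token expansion (illustration only).
fn expand_method_tokens(exec: &str, props: &HashMap<&str, &str>) -> String {
    let mut out = String::new();
    let mut rest = exec;
    while let Some(start) = rest.find("%{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let key = &after[..end];
                match props.get(key) {
                    Some(value) => out.push_str(value),
                    // Unknown keys are copied back unchanged so the gap stays visible.
                    None => {
                        out.push_str("%{");
                        out.push_str(key);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated token: copy the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

/// The exec string from the zone-network-setup manifest.
const EXEC: &str = "/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d %{config/datalink} -s %{config/static_addr} -g %{config/gateway}";

/// Defaults declared in the manifest's `config` property group.
fn manifest_defaults() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("config/datalink", "unknown"),
        ("config/static_addr", "unknown"),
        ("config/gateway", "unknown"),
    ])
}

/// Values the profile in site.xml is trying to supply.
fn profile_values() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("config/datalink", "oxControlService0"),
        ("config/static_addr", "::1"),
        ("config/gateway", "fd00:1122:3344:101::1"),
    ])
}

fn main() {
    // What the log shows: only the manifest defaults were applied.
    println!("{}", expand_method_tokens(EXEC, &manifest_defaults()));
    // What was expected had the profile values been applied.
    println!("{}", expand_method_tokens(EXEC, &profile_values()));
}
```

The first line printed matches the failing start method in the log, which is consistent with the profile values never being applied.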

Even after running svcadm refresh and svcadm clear, the service still reads from the wrong manifest and the logs remain the same:

root@oxz_switch:~# svcadm refresh svc:/oxide/zone-network-setup:default
root@oxz_switch:~# cat /var/svc/log/oxide-zone-network-setup:default.log
[ May  7 01:30:02 Rereading configuration. ]
[ May  7 01:30:09 Rereading configuration. ]
[ May  7 01:30:15 Rereading configuration. ]
root@oxz_switch:~# svcadm clear svc:/oxide/zone-network-setup:default 
root@oxz_switch:~# cat /var/svc/log/oxide-zone-network-setup:default.log
[ May  7 01:30:02 Rereading configuration. ]
[ May  7 01:30:09 Rereading configuration. ]
[ May  7 01:30:15 Rereading configuration. ]
[ May  7 01:32:43 Leaving maintenance because clear requested. ]
[ May  7 01:32:43 Enabled. ]
[ May  7 01:32:43 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d unknown -s unknown -g unknown"). ]
note: configured to log to "/dev/stderr"
error: invalid value 'unknown' for '--datalink <STRING>': ERROR: Missing data link

For more information, try '--help'.
[ May  7 01:32:43 Method "start" exited with status 2. ]

I think I'm out of my depth here. @jclulow or @citrus-it, might either of you have any idea what's going on, and what is different about the switch zone such that this behaviour is not present in any of the other zones?

@jclulow
Collaborator

jclulow commented May 7, 2024

If you run svccfg validate /var/svc/profile/site.xml, are there any errors?

@karencfv
Contributor Author

karencfv commented May 7, 2024

aahhhhh!!!! aha! yes there are

root@oxz_switch:~# svccfg validate /var/svc/profile/site.xml
/var/svc/profile/site.xml:46: parser error : attributes construct error
  "type": "pc",
   ^
/var/svc/profile/site.xml:46: parser error : Couldn't find end of Start Tag propval line 45
  "type": "pc",
   ^
/var/svc/profile/site.xml:103: parser error : attributes construct error
        <propval type="astring" name="interfaces" value="("tfportrear0_0/ll")"/>
                                                           ^
/var/svc/profile/site.xml:103: parser error : Couldn't find end of Start Tag propval line 103
        <propval type="astring" name="interfaces" value="("tfportrear0_0/ll")"/>
                                                           ^
svccfg: couldn't parse document

I'll escape those double quotes and see what happens
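
For illustration, a minimal sketch of the kind of escaping that would turn the flagged values into valid XML attribute values. The helper name `escape_xml_attr` is hypothetical and is not taken from the omicron code base; the fix that actually landed in this PR may look different.

```rust
/// Escape a string so it can be embedded in a double-quoted XML attribute.
/// Hypothetical helper, for illustration only.
fn escape_xml_attr(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    // The `interfaces` value that svccfg rejected on line 103 of site.xml.
    let interfaces = r#"("tfportrear0_0/ll")"#;
    println!(
        r#"<propval type="astring" name="interfaces" value="{}"/>"#,
        escape_xml_attr(interfaces)
    );
}
```

With the inner quotes written as `&quot;`, the attribute value no longer terminates early, which is what the "attributes construct error" messages were complaining about.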

@jgallagher
Contributor

I have one remaining question. How do I log in to the console? Apparently typing madrid.eng.oxide.computer on my browser is not the way 🙃 (yes, I'm on the VPN and I also checked the external DNS zone to make sure all services were up)

This should do it: https://madrid.sys.rack2.eng.oxide.computer/

Probably need to set up via the recovery silo first: https://recovery.sys.madrid.eng.oxide.computer

Comment on lines 3005 to 3008
// Part of the process to ensure bootstrap address is to set up
// an IPv6 address within the Global Zone.
// This means we cannot run bootstrap setup via a service running on
// the switch zone itself.
Collaborator

Just to make sure I'm understanding this: These steps involve:

  • calling zlogin to the switch zone, after it's booted
  • setting up a route in the global zone, after the zone has been booted

Although I do think these steps are important, don't they make the switch zone pretty intractably not self-assembling?

Would it be possible to have the switch zone create this bootstrap address itself, and for the global zone to create this route before booting the switch zone?

Contributor Author

I spent quite a bit of time trying to get this working the way you describe. Unfortunately, it's not precisely the route forwarding that's the problem. It's when you create the bootstrap address that you need to set up an IPv6 address in the Global Zone.

ensure_bootstrap_address() -> Zones::ensure_address -> Zones::create_address. This create_address method contains the following:

        // Do any prep work before allocating the address.
        //
        // Currently, this only happens when allocating IPv6 addresses in the
        // non-global zone - to access these addresses, we must first set up
        // an arbitrary IPv6 address within the Global Zone.
        if let Some(zone) = zone {
            match addrtype {
                AddressRequest::Dhcp => {}
                AddressRequest::Static(addr) => {
                    if addr.is_ipv6() {
                        // Finally, actually ensure that the v6 address we want
                        // exists within the zone.
                        let link_local_addrobj =
                            addrobj.link_local_on_same_interface()?;
                        Self::ensure_has_link_local_v6_address(
                            Some(zone),
                            &link_local_addrobj,
                        )?;
                    }
                }
            }
        };

Although I do think these steps are important, don't they make the switch zone pretty intractably not self-assembling?

Perhaps? :(

Collaborator

Ah, if that's the issue, couldn't we allocate all the "state that needs to exist in the global zone" before we boot the switch zone?

E.g.:

  1. Pre-Boot:
     • GZ ensures an arbitrary link-local v6 address exists
     • GZ creates the route necessary to access the switch zone bootstrap address
     • GZ edits the manifest of the switch zone, instructing it to create a specific bootstrap address
  2. Switch Zone Boots:
     • Switch zone creates the bootstrap address

This way, if the switch zone has booted, the sled agent can safely assume all this routing exists. If we don't keep this order strict, then "after an arbitrary reboot" the sled agent doesn't really know if the switch zone has been partially or fully initialized.
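
As a rough sketch of the ordering invariant being proposed, the steps can be tagged with the stage they run in and checked so that no global zone step happens after boot. All type and function names here (`Stage`, `Step`, `proposed_steps`, `strictly_ordered`) are hypothetical and exist only for this illustration; the sketch mirrors the proposal and says nothing about whether each step is actually feasible before boot.

```rust
/// Where a setup step runs relative to the switch zone booting.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    PreBoot,
    PostBoot,
}

/// One step of the proposed flow; `actor` is "GZ" or "switch zone".
#[derive(Debug)]
struct Step {
    stage: Stage,
    actor: &'static str,
    action: &'static str,
}

/// The steps from the proposal, in the order they are meant to run.
fn proposed_steps() -> Vec<Step> {
    vec![
        Step { stage: Stage::PreBoot, actor: "GZ", action: "ensure a link-local v6 address exists" },
        Step { stage: Stage::PreBoot, actor: "GZ", action: "create the route to the bootstrap address" },
        Step { stage: Stage::PreBoot, actor: "GZ", action: "edit the switch zone manifest" },
        Step { stage: Stage::PostBoot, actor: "switch zone", action: "create the bootstrap address" },
    ]
}

/// True when every pre-boot step comes before every post-boot step.
fn strictly_ordered(steps: &[Step]) -> bool {
    let first_post = steps
        .iter()
        .position(|s| s.stage == Stage::PostBoot)
        .unwrap_or(steps.len());
    steps[first_post..].iter().all(|s| s.stage == Stage::PostBoot)
}

fn main() {
    let steps = proposed_steps();
    assert!(strictly_ordered(&steps));
    for s in &steps {
        println!("{:?}: {} -> {}", s.stage, s.actor, s.action);
    }
}
```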

Contributor Author

I'll give it a go again, see if I can make this work

Contributor Author

Update: I have moved setting the link local address and bootstrap address to the switch zone setup service. Sadly, forwarding bootstrap traffic cannot be done before the zone boots :( There are a few reasons for this:

  • Forwarding traffic won't work until the interface is up.
  • The bootstrap interface is set up after the switch zone starts.
  • We don't know what the interface will be called beforehand. The switch zone tries to initialize on a loop until it's able to locate "gzonly.txt" in any baseline directory (this is a host OS thing). Each time the zone attempts to start, the bootstrap interface is given a different name (oxBootstrap0, oxBootstrap1, oxBootstrap2 etc). This means we can't even do something hacky like have the route add command running on a loop in the background before boot because we don't know which boot will succeed, and therefore what name the interface will have.

Contributor Author

karencfv commented Jun 27, 2024

And it effectively is using zlogin (technically, it's using zone_enter, but it's functionally doing the same thing)

Ha! I was under the impression the command to forward bootstrap traffic was running/had to run from the global zone. I added some temporary debug logging in commit 9499512 to see exactly what commands were being run. The commands that add the bootstrap address and the link-local address were prefixed with pfexec zlogin oxz_switch, while the traffic forwarding command was not:

root@g3:~# cat $(svcs -L sled-agent) | grep "DEBUG" | grep "cmd" | looker
02:39:22.169Z INFO SledAgent (ServiceManager): DEBUG Zones::ensure_has_link_local_v6_address: No link-local address was found,
                attempt to make one.
    cmd = Command {\n    program: "/usr/bin/pfexec",\n    args: [\n        "/usr/bin/pfexec",\n        "/usr/sbin/zlogin",\n        "oxz_switch",\n        "/usr/sbin/ipadm",\n        "create-addr",\n        "-t",\n        "-T",\n        "addrconf",\n        "oxBootstrap36/ll",\n    ],\n}
    file = illumos-utils/src/zone.rs:792
    zone = oxz_switch
02:39:26.196Z INFO SledAgent (ServiceManager): DEBUG Zones::create_address_internal: Attempt to create the requested address
    addrobj = AddrObject {\n    interface: "oxBootstrap36",\n    name: "bootstrap6",\n}
    addrtype = Static(\n    V6(\n        Ipv6Network {\n            addr: fdb0:a840:2500:7::2,\n            prefix: 64,\n        },\n    ),\n)
    cmd = Command {\n    program: "/usr/bin/pfexec",\n    args: [\n        "/usr/bin/pfexec",\n        "/usr/sbin/zlogin",\n        "oxz_switch",\n        "/usr/sbin/ipadm",\n        "create-addr",\n        "-t",\n        "-T",\n        "static",\n        "-a",\n        "fdb0:a840:2500:7::2/64",\n        "oxBootstrap36/bootstrap6",\n    ],\n}
    file = illumos-utils/src/zone.rs:713
    zone = Some(\n    "oxz_switch",\n)
02:39:28.719Z INFO SledAgent (ServiceManager): DEBUG RunningZone::add_bootstrap_route: Adding bootstrap route
    bootstrap_prefix = 64944
    cmd = [\n    "/usr/sbin/route",\n    "add",\n    "-inet6",\n    "fdb0::/16",\n    "fe80::8:20ff:fe6d:b0da",\n    "-ifp",\n    "oxBootstrap36",\n]
    file = illumos-utils/src/running_zone.rs:805
    gz_bootstrap_addr = fe80::8:20ff:fe6d:b0da
    zone_vnic_name = "oxBootstrap36"

So, I guess that's what confused me re: which commands were running where.
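
As a small worked example, the logged `bootstrap_prefix = 64944` is 0xfdb0 in hexadecimal, which is where the `fdb0::/16` destination in the route command comes from. The sketch below rebuilds the logged argv from those values. The function `bootstrap_route_args` is hypothetical; the real logic lives in illumos-utils and may be structured differently.

```rust
/// Build the argv for the bootstrap route command seen in the debug log.
/// Hypothetical helper; the real code in illumos-utils may differ.
fn bootstrap_route_args(bootstrap_prefix: u16, gz_addr: &str, vnic: &str) -> Vec<String> {
    vec![
        "/usr/sbin/route".to_string(),
        "add".to_string(),
        "-inet6".to_string(),
        // 64944 is 0xfdb0, so this renders as "fdb0::/16".
        format!("{bootstrap_prefix:x}::/16"),
        gz_addr.to_string(),
        "-ifp".to_string(),
        vnic.to_string(),
    ]
}

fn main() {
    // Values copied from the add_bootstrap_route log entry.
    let args = bootstrap_route_args(64944, "fe80::8:20ff:fe6d:b0da", "oxBootstrap36");
    println!("{}", args.join(" "));
}
```

Note that nothing in this argv enters the zone, unlike the two ipadm invocations, which are wrapped in pfexec zlogin oxz_switch.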

I have added the forwarding command to the switch_zone_setup service and so far it works fine on the a4x2 testbed. I will test on Madrid once the TUF repo is ready.

After a quick offline chat with Sean and Ry, none of us can remember why the switch zone has a bootstrap address at all. I can see wicket listening on it, but AFAIK we always talk to wicket using a link-local address. It's possible that its existence is vestigial. It may also have been added in anticipation of a need, which never came to fruition.

In any case, I think it's worth skipping this part of the switch zone setup to see what (if anything) breaks.

Good to know! After this PR is merged, there are a few bits of the switch zone startup flow I was planning to clean up in a follow up PR. Is it OK with you if I attempt to remove the bootstrap address in that follow up PR? This one has been open for quite a bit now, and I'd like to minimise the risk of things breaking 😅

Contributor

Sure. I was only offering this as a way to put this PR to bed more quickly. If it will actually slow you down, then it's totally fine to address it (heh) in a follow-on.

Contributor Author

Thanks!

Contributor

andrewjstone commented Jun 27, 2024

After a quick offline chat with Sean and Ry, none of us can remember why the switch zone has a bootstrap address at all. I can see wicket listening on it, but AFAIK we always talk to wicket using a link-local address. It's possible that its existence is vestigial. It may also have been added in anticipation of a need, which never came to fruition.

In any case, I think it's worth skipping this part of the switch zone setup to see what (if anything) breaks.

Sorry for the delay in reading this. Busy week :) I added this a long time ago. The original issue is here, with a somewhat lengthy discussion.

The bootstrap address in the switch zone is needed so that installinator can pull the real OS phase 2 and Control Plane images over the bootstrap network. Installinator uses ddmd to find wicketd on the bootstrap network so that it can pull artifacts from it.

Unfortunately, since this is a mupdate-related issue, it's possible, even likely, that if you removed this bootstrap network related code, CI would still pass and a mupdate of madrid to this new code would work. However, mupdating again likely would not work, because installinator couldn't retrieve artifacts from wicketd and I don't believe we actually serve artifacts from sled-agent yet. This would also cause any new installs after this code was merged to fail.

In short, I'm glad you didn't remove the bootstrap address, and please don't do it in a follow up :). This is yet another reason we really need automated tests for mupdate. Until then, Chesterton's fence y'all!

CC @jgallagher @sunshowers

Contributor Author

Oof 😅

@karencfv
Contributor Author

nice! Thanks @Nieuwejaar and @jgallagher

@karencfv
Contributor Author

Alright, tested latest commit 36c95ca on Madrid and everything is looking good.

Cubby 16:

BRM44220001 # cat $(svcs -L sled-agent) | grep Handoff | looker
03:53:13.706Z INFO SledAgent (RSS): Handoff to Nexus is complete
    file = sled-agent/src/rack_setup/service.rs:890
BRM44220001 # zoneadm list
global
oxz_switch
oxz_internal_dns_8f33e371-efef-4d7d-8a54-b113ce1c5c85
oxz_ntp_62c44754-514f-4c80-8f2c-550cd086cab0
oxz_cockroachdb_0af925ef-705d-42d1-98df-585b9a07740d
oxz_crucible_721a1310-4db6-4010-bff0-d4aa19aee4dc
oxz_crucible_ec650de6-a8e3-4e0d-a9bc-5149877081d8
oxz_crucible_df247ec3-b9cb-4519-8597-71ab8ca7769b
oxz_crucible_6b47bdb9-a1b8-42d9-8287-e317b1bc2b29
oxz_crucible_4b75a0c3-4b45-4230-88a1-67ed331b5da2
oxz_crucible_5e3ca0ce-1121-43b4-9b2f-9af2b6555e83
oxz_crucible_61cb20fc-9ebc-4bcc-9dd3-151502383391
oxz_crucible_9fef84e9-2532-4043-a607-4c7e8b99a08a
oxz_crucible_ef476ddb-ded2-4aab-9423-bc9be19c6409
oxz_nexus_a7dae1cc-ccc3-4918-b3b8-888f054f61b4
oxz_clickhouse_cfcc8ea5-1604-40ea-b41e-93d685e00796
BRM44220001 # ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
igb0/ll           addrconf ok           fe80::eaea:6aff:fe09:7f66%igb0/10
cxgbe0/ll         addrconf ok           fe80::aa40:25ff:fe04:355%cxgbe0/10
cxgbe1/ll         addrconf ok           fe80::aa40:25ff:fe04:35d%cxgbe1/10
bootstrap0/ll     addrconf ok           fe80::8:20ff:fe30:cfca%bootstrap0/10
bootstrap0/bootstrap6 static ok         fdb0:a840:2504:355::1/64
underlay0/ll      addrconf ok           fe80::8:20ff:fe37:220f%underlay0/10
underlay0/sled6   static   ok           fd00:1122:3344:102::1/64
underlay0/internaldns1 static ok        fd00:1122:3344:2::2/64
BRM44220001 # dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
tfpkt0      phys      1500   up       --         --
cxgbe0      phys      9000   up       --         --
cxgbe1      phys      9000   up       --         --
bootstrap_stub0 etherstub 9000 up     --         --
bootstrap0  vnic      1500   up       --         bootstrap_stub0
underlay_stub0 etherstub 9000 up      --         --
underlay0   vnic      9000   up       --         underlay_stub0
oxBootstrap0 vnic     1500   up       --         bootstrap_stub0
oxControlService0 vnic 9000  up       --         underlay_stub0
oxControlService1 vnic 9000  up       --         underlay_stub0
opte0       misc      1500   up       --         --
vopte0      vnic      1500   up       --         opte0
oxControlService2 vnic 9000  up       --         underlay_stub0
oxControlService3 vnic 9000  up       --         underlay_stub0
oxControlService4 vnic 9000  up       --         underlay_stub0
oxControlService5 vnic 9000  up       --         underlay_stub0
oxControlService6 vnic 9000  up       --         underlay_stub0
oxControlService7 vnic 9000  up       --         underlay_stub0
oxControlService8 vnic 9000  up       --         underlay_stub0
oxControlService9 vnic 9000  up       --         underlay_stub0
oxControlService10 vnic 9000 up       --         underlay_stub0
oxControlService11 vnic 9000 up       --         underlay_stub0
oxControlService12 vnic 9000 up       --         underlay_stub0
oxControlService13 vnic 9000 up       --         underlay_stub0
opte1       misc      1500   up       --         --
vopte1      vnic      1500   up       --         opte1
oxControlService14 vnic 9000 up       --         underlay_stub0
BRM44220001 # zlogin oxz_switch
[Connected to zone 'oxz_switch' pts/5]
The illumos Project     helios-2.0.22740        June 2024
root@oxz_switch1:~# svcs -x
root@oxz_switch1:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
tfportqsfp0_0/uplink1 static ok         172.20.15.37/29
lo0/v6            static   ok           ::1/128
oxControlService0/ll addrconf ok        fe80::8:20ff:fe03:91%oxControlService0/10
oxControlService0/omicron6 static ok    fd00:1122:3344:102::2/64
oxBootstrap0/ll   addrconf ok           fe80::8:20ff:fe17:557c%oxBootstrap0/10
oxBootstrap0/bootstrap6 static ok       fdb0:a840:2504:355::2/64
gimlet0/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet0/10
gimlet1/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet1/10
gimlet2/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet2/10
gimlet3/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet3/10
gimlet4/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet4/10
gimlet5/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet5/10
gimlet6/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet6/10
gimlet7/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet7/10
gimlet8/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet8/10
gimlet9/ll        addrconf ok           fe80::aa40:25ff:fe05:602%gimlet9/10
gimlet10/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet10/10
gimlet11/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet11/10
gimlet12/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet12/10
gimlet13/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet13/10
gimlet14/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet14/10
gimlet15/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet15/10
gimlet16/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet16/10
gimlet17/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet17/10
gimlet18/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet18/10
gimlet19/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet19/10
gimlet20/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet20/10
gimlet21/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet21/10
gimlet22/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet22/10
gimlet23/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet23/10
gimlet24/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet24/10
gimlet25/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet25/10
gimlet26/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet26/10
gimlet27/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet27/10
gimlet28/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet28/10
gimlet29/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet29/10
gimlet30/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet30/10
gimlet31/ll       addrconf ok           fe80::aa40:25ff:fe05:602%gimlet31/10
psc0/ll           addrconf ok           fe80::aa40:25ff:fe05:602%psc0/10
psc1/ll           addrconf ok           fe80::aa40:25ff:fe05:602%psc1/10
sidecar0/ll       addrconf ok           fe80::aa40:25ff:fe05:602%sidecar0/10
sidecar1/ll       addrconf ok           fe80::aa40:25ff:fe05:602%sidecar1/10
techport0/ll      addrconf ok           fe80::aa40:25ff:fe05:602%techport0/10
techport1/ll      addrconf ok           fe80::aa40:25ff:fe05:602%techport1/10
tofino0/ll        addrconf ok           fe80::aa40:25ff:fe05:602%tofino0/10
tfportint0_0/ll   addrconf ok           fe80::aa40:25ff:fe05:602%tfportint0_0/10
tfportrear0_0/ll  addrconf ok           fe80::aa40:25ff:fe05:603%tfportrear0_0/10
tfportrear10_0/ll addrconf ok           fe80::aa40:25ff:fe05:60d%tfportrear10_0/10
tfportrear11_0/ll addrconf ok           fe80::aa40:25ff:fe05:60e%tfportrear11_0/10
tfportrear12_0/ll addrconf ok           fe80::aa40:25ff:fe05:60f%tfportrear12_0/10
tfportrear13_0/ll addrconf ok           fe80::aa40:25ff:fe05:610%tfportrear13_0/10
tfportrear14_0/ll addrconf ok           fe80::aa40:25ff:fe05:611%tfportrear14_0/10
tfportrear15_0/ll addrconf ok           fe80::aa40:25ff:fe05:612%tfportrear15_0/10
tfportrear16_0/ll addrconf ok           fe80::aa40:25ff:fe05:613%tfportrear16_0/10
tfportrear17_0/ll addrconf ok           fe80::aa40:25ff:fe05:614%tfportrear17_0/10
ADDROBJ           TYPE     STATE        ADDR
tfportrear18_0/ll addrconf ok           fe80::aa40:25ff:fe05:615%tfportrear18_0/10
tfportrear19_0/ll addrconf ok           fe80::aa40:25ff:fe05:616%tfportrear19_0/10
tfportrear1_0/ll  addrconf ok           fe80::aa40:25ff:fe05:604%tfportrear1_0/10
tfportrear20_0/ll addrconf ok           fe80::aa40:25ff:fe05:617%tfportrear20_0/10
tfportrear21_0/ll addrconf ok           fe80::aa40:25ff:fe05:618%tfportrear21_0/10
tfportrear22_0/ll addrconf ok           fe80::aa40:25ff:fe05:619%tfportrear22_0/10
tfportrear23_0/ll addrconf ok           fe80::aa40:25ff:fe05:61a%tfportrear23_0/10
tfportrear24_0/ll addrconf ok           fe80::aa40:25ff:fe05:61b%tfportrear24_0/10
tfportrear25_0/ll addrconf ok           fe80::aa40:25ff:fe05:61c%tfportrear25_0/10
tfportrear26_0/ll addrconf ok           fe80::aa40:25ff:fe05:61d%tfportrear26_0/10
tfportrear27_0/ll addrconf ok           fe80::aa40:25ff:fe05:61e%tfportrear27_0/10
tfportrear28_0/ll addrconf ok           fe80::aa40:25ff:fe05:61f%tfportrear28_0/10
tfportrear29_0/ll addrconf ok           fe80::aa40:25ff:fe05:620%tfportrear29_0/10
tfportrear2_0/ll  addrconf ok           fe80::aa40:25ff:fe05:605%tfportrear2_0/10
tfportrear30_0/ll addrconf ok           fe80::aa40:25ff:fe05:621%tfportrear30_0/10
tfportrear31_0/ll addrconf ok           fe80::aa40:25ff:fe05:622%tfportrear31_0/10
tfportrear3_0/ll  addrconf ok           fe80::aa40:25ff:fe05:606%tfportrear3_0/10
tfportrear4_0/ll  addrconf ok           fe80::aa40:25ff:fe05:607%tfportrear4_0/10
tfportrear5_0/ll  addrconf ok           fe80::aa40:25ff:fe05:608%tfportrear5_0/10
tfportrear6_0/ll  addrconf ok           fe80::aa40:25ff:fe05:609%tfportrear6_0/10
tfportrear7_0/ll  addrconf ok           fe80::aa40:25ff:fe05:60a%tfportrear7_0/10
tfportrear8_0/ll  addrconf ok           fe80::aa40:25ff:fe05:60b%tfportrear8_0/10
tfportrear9_0/ll  addrconf ok           fe80::aa40:25ff:fe05:60c%tfportrear9_0/10

Cubby 14:

BRM42220081 # zoneadm list
global
oxz_switch
oxz_ntp_68d2abb5-3b4b-4bb7-a2e3-3d44d9ad4f0c
oxz_cockroachdb_32ba20c5-f738-4a47-8460-8470c73575b5
oxz_cockroachdb_5f7444ca-83f2-42ec-af13-61e8d4c5ea77
oxz_crucible_pantry_2ff2c56c-07d8-4f95-b617-aef088c474a1
oxz_crucible_8321dbc2-8917-48e3-9014-ed43580caf12
oxz_crucible_ba879730-ce1e-469f-a2fd-969d6582b162
oxz_crucible_2d9f088c-0c5a-4c00-843d-1775e528bc8e
oxz_crucible_ef25f86b-e785-4683-ae67-4bc6438142d5
oxz_crucible_553d6d3e-85dd-46d2-8f0d-89c7a3b0052d
oxz_crucible_830c60ba-fcf2-440f-a9d0-23831b307109
oxz_crucible_40f46c8d-3031-4760-ac1d-239adf45c557
oxz_crucible_66e35f57-598f-4fc4-b68c-d614b76e3031
oxz_crucible_48d7d15b-e16c-4f86-b95a-f3f0fb823058
oxz_crucible_cc1e99bb-fed0-4bf2-a7a8-d993833358fb
oxz_nexus_b46e8a39-2fa1-4bb0-9d2a-9a173e84fbea
BRM42220081 # ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
cxgbe0/ll         addrconf ok           fe80::aa40:25ff:fe04:3d2%cxgbe0/10
cxgbe1/ll         addrconf ok           fe80::aa40:25ff:fe04:3da%cxgbe1/10
bootstrap0/ll     addrconf ok           fe80::8:20ff:fe80:888%bootstrap0/10
bootstrap0/bootstrap6 static ok         fdb0:a840:2504:3d2::1/64
underlay0/ll      addrconf ok           fe80::8:20ff:fe94:dfcf%underlay0/10
underlay0/sled6   static   ok           fd00:1122:3344:104::1/64
BRM42220081 # dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
tfpkt0      phys      1500   up       --         --
cxgbe0      phys      9000   up       --         --
cxgbe1      phys      9000   up       --         --
bootstrap_stub0 etherstub 9000 up     --         --
bootstrap0  vnic      1500   up       --         bootstrap_stub0
underlay_stub0 etherstub 9000 up      --         --
underlay0   vnic      9000   up       --         underlay_stub0
oxBootstrap0 vnic     1500   up       --         bootstrap_stub0
oxControlService0 vnic 9000  up       --         underlay_stub0
oxControlService1 vnic 9000  up       --         underlay_stub0
oxControlService2 vnic 9000  up       --         underlay_stub0
oxControlService3 vnic 9000  up       --         underlay_stub0
oxControlService4 vnic 9000  up       --         underlay_stub0
oxControlService5 vnic 9000  up       --         underlay_stub0
oxControlService6 vnic 9000  up       --         underlay_stub0
oxControlService7 vnic 9000  up       --         underlay_stub0
oxControlService8 vnic 9000  up       --         underlay_stub0
oxControlService9 vnic 9000  up       --         underlay_stub0
oxControlService10 vnic 9000 up       --         underlay_stub0
oxControlService11 vnic 9000 up       --         underlay_stub0
oxControlService12 vnic 9000 up       --         underlay_stub0
oxControlService13 vnic 9000 up       --         underlay_stub0
oxControlService14 vnic 9000 up       --         underlay_stub0
LINK        CLASS     MTU    STATE    BRIDGE     OVER
opte0       misc      1500   up       --         --
vopte0      vnic      1500   up       --         opte0
oxControlService15 vnic 9000 up       --         underlay_stub0
BRM42220081 # zlogin oxz_switch
[Connected to zone 'oxz_switch' pts/1]
The illumos Project     helios-2.0.22740        June 2024
root@oxz_switch0:~# svcs -x
root@oxz_switch0:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
tfportqsfp0_0/uplink1 static ok         172.20.15.38/29
lo0/v6            static   ok           ::1/128
oxControlService0/ll addrconf ok        fe80::8:20ff:fe20:af7d%oxControlService0/10
oxControlService0/omicron6 static ok    fd00:1122:3344:104::2/64
oxBootstrap0/ll   addrconf ok           fe80::8:20ff:fe8f:d6cb%oxBootstrap0/10
oxBootstrap0/bootstrap6 static ok       fdb0:a840:2504:3d2::2/64
gimlet0/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet0/10
gimlet1/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet1/10
gimlet2/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet2/10
gimlet3/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet3/10
gimlet4/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet4/10
gimlet5/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet5/10
gimlet6/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet6/10
gimlet7/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet7/10
gimlet8/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet8/10
gimlet9/ll        addrconf ok           fe80::aa40:25ff:fe05:102%gimlet9/10
gimlet10/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet10/10
gimlet11/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet11/10
gimlet12/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet12/10
gimlet13/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet13/10
gimlet14/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet14/10
gimlet15/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet15/10
ADDROBJ           TYPE     STATE        ADDR
gimlet16/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet16/10
gimlet17/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet17/10
gimlet18/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet18/10
gimlet19/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet19/10
gimlet20/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet20/10
gimlet21/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet21/10
gimlet22/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet22/10
gimlet23/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet23/10
gimlet24/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet24/10
gimlet25/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet25/10
gimlet26/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet26/10
gimlet27/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet27/10
gimlet28/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet28/10
gimlet29/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet29/10
gimlet30/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet30/10
gimlet31/ll       addrconf ok           fe80::aa40:25ff:fe05:102%gimlet31/10
psc0/ll           addrconf ok           fe80::aa40:25ff:fe05:102%psc0/10
psc1/ll           addrconf ok           fe80::aa40:25ff:fe05:102%psc1/10
sidecar0/ll       addrconf ok           fe80::aa40:25ff:fe05:102%sidecar0/10
sidecar1/ll       addrconf ok           fe80::aa40:25ff:fe05:102%sidecar1/10
techport0/ll      addrconf ok           fe80::aa40:25ff:fe05:102%techport0/10
techport1/ll      addrconf ok           fe80::aa40:25ff:fe05:102%techport1/10
tofino0/ll        addrconf ok           fe80::aa40:25ff:fe05:102%tofino0/10
ADDROBJ           TYPE     STATE        ADDR
tfportint0_0/ll   addrconf ok           fe80::aa40:25ff:fe05:102%tfportint0_0/10
tfportrear0_0/ll  addrconf ok           fe80::aa40:25ff:fe05:103%tfportrear0_0/10
tfportrear10_0/ll addrconf ok           fe80::aa40:25ff:fe05:10d%tfportrear10_0/10
tfportrear11_0/ll addrconf ok           fe80::aa40:25ff:fe05:10e%tfportrear11_0/10
tfportrear12_0/ll addrconf ok           fe80::aa40:25ff:fe05:10f%tfportrear12_0/10
tfportrear13_0/ll addrconf ok           fe80::aa40:25ff:fe05:110%tfportrear13_0/10
tfportrear14_0/ll addrconf ok           fe80::aa40:25ff:fe05:111%tfportrear14_0/10
tfportrear15_0/ll addrconf ok           fe80::aa40:25ff:fe05:112%tfportrear15_0/10
tfportrear16_0/ll addrconf ok           fe80::aa40:25ff:fe05:113%tfportrear16_0/10
tfportrear17_0/ll addrconf ok           fe80::aa40:25ff:fe05:114%tfportrear17_0/10
tfportrear18_0/ll addrconf ok           fe80::aa40:25ff:fe05:115%tfportrear18_0/10
tfportrear19_0/ll addrconf ok           fe80::aa40:25ff:fe05:116%tfportrear19_0/10
tfportrear1_0/ll  addrconf ok           fe80::aa40:25ff:fe05:104%tfportrear1_0/10
tfportrear20_0/ll addrconf ok           fe80::aa40:25ff:fe05:117%tfportrear20_0/10
tfportrear21_0/ll addrconf ok           fe80::aa40:25ff:fe05:118%tfportrear21_0/10
tfportrear22_0/ll addrconf ok           fe80::aa40:25ff:fe05:119%tfportrear22_0/10
tfportrear23_0/ll addrconf ok           fe80::aa40:25ff:fe05:11a%tfportrear23_0/10
tfportrear24_0/ll addrconf ok           fe80::aa40:25ff:fe05:11b%tfportrear24_0/10
tfportrear25_0/ll addrconf ok           fe80::aa40:25ff:fe05:11c%tfportrear25_0/10
tfportrear26_0/ll addrconf ok           fe80::aa40:25ff:fe05:11d%tfportrear26_0/10
tfportrear27_0/ll addrconf ok           fe80::aa40:25ff:fe05:11e%tfportrear27_0/10
tfportrear28_0/ll addrconf ok           fe80::aa40:25ff:fe05:11f%tfportrear28_0/10
tfportrear29_0/ll addrconf ok           fe80::aa40:25ff:fe05:120%tfportrear29_0/10
ADDROBJ           TYPE     STATE        ADDR
tfportrear2_0/ll  addrconf ok           fe80::aa40:25ff:fe05:105%tfportrear2_0/10
tfportrear30_0/ll addrconf ok           fe80::aa40:25ff:fe05:121%tfportrear30_0/10
tfportrear31_0/ll addrconf ok           fe80::aa40:25ff:fe05:122%tfportrear31_0/10
tfportrear3_0/ll  addrconf ok           fe80::aa40:25ff:fe05:106%tfportrear3_0/10
tfportrear4_0/ll  addrconf ok           fe80::aa40:25ff:fe05:107%tfportrear4_0/10
tfportrear5_0/ll  addrconf ok           fe80::aa40:25ff:fe05:108%tfportrear5_0/10
tfportrear6_0/ll  addrconf ok           fe80::aa40:25ff:fe05:109%tfportrear6_0/10
tfportrear7_0/ll  addrconf ok           fe80::aa40:25ff:fe05:10a%tfportrear7_0/10
tfportrear8_0/ll  addrconf ok           fe80::aa40:25ff:fe05:10b%tfportrear8_0/10
tfportrear9_0/ll  addrconf ok           fe80::aa40:25ff:fe05:10c%tfportrear9_0/10

Cubby 15:

BRM42220046 # zoneadm list
global
oxz_internal_dns_0b4522ea-4678-4b83-99f3-1cfd06b5e1fa
oxz_ntp_509cf2d6-33dd-4abd-8e83-fe341bdce607
oxz_cockroachdb_569c5df4-3e31-44db-9f7f-4de7f0c355b8
oxz_crucible_da5f2c49-69b1-489d-9ff0-805cd8171d90
oxz_crucible_e663fa88-c6f0-40a1-855e-359fbf671d7c
oxz_crucible_1a5c52ed-89f8-4b04-83a0-322d477f287f
oxz_crucible_bb87a194-6f90-4fb7-974a-e4883f72aed6
oxz_crucible_e750464b-1ba8-47c8-89ab-7c91ed804780
oxz_crucible_30a2f7e2-76af-424e-ad66-b6f69819ea7c
oxz_crucible_bbda646e-8734-4fa5-959f-10daf4dead09
oxz_crucible_pantry_df12fb2b-9aa5-40ef-86e8-8117691bb246
oxz_crucible_8ef1cb7c-0a87-4a38-9621-63c53dd760bf
oxz_crucible_55ca2d30-8115-4872-82d6-0b67931afdb0
oxz_crucible_e687777e-62dc-48d7-9cd3-2b34bbd6777f
oxz_nexus_80e041cb-955e-4b9b-9c38-6278aa8ae3d8

Cubby 17:

BRM42220004 # zoneadm list
global
oxz_internal_dns_92544cea-b243-4901-a5b7-9c1af7ccbe5b
oxz_ntp_ee9491b6-f8d2-4680-9b97-cdbf35ed395a
oxz_cockroachdb_59e69a70-bcdd-4b29-a6b2-81a941a3dfc4
oxz_crucible_pantry_830ed07d-35da-4422-b465-74f9beb6f11d
oxz_external_dns_6c38a2e1-ce4b-46fa-b70f-f3ce56c2cc74
oxz_crucible_bb0da5e0-1962-4b87-9b0f-cc0df3ec731e
oxz_crucible_8efafd73-3b32-43ee-b1dd-81e619d8005b
oxz_crucible_94b6d8fd-4559-4e78-8051-6f1ec6f70c9c
oxz_crucible_42357855-a156-4676-9496-ce8fe3f6926f
oxz_crucible_641a3d43-bb55-426d-b902-93c2d3616abc
oxz_crucible_cfa05bd7-4f3a-4dd6-a247-f50ac150dd98
oxz_crucible_6b0b2993-e3c1-45c6-9464-600c4e7a59f0
oxz_crucible_d68cb673-3510-404e-8b10-9686434d7b4e
oxz_crucible_08330f65-0681-4fd1-9ca9-bf0b042cebc9
oxz_crucible_3b823ac1-cb04-4b85-862e-d522503fc3dd
oxz_oximeter_1ee7493c-5a92-435e-ad68-96de925426c5

And I created a project

Screenshot from 2024-06-27 16-17-30

I think I've addressed all comments. Please let me know if there's a loose end somewhere.

I'll need to coordinate merging the Dendrite PR because some of the services there depend on the common networking service and/or the switch zone setup service. https://github.com/oxidecomputer/dendrite/pull/990

The maghemite, lldp and pumpkind PRs don't matter as much, since they only set those services as enabled by default. The ServiceInstanceBuilder in sled-agent/src/services.rs sets them as enabled (or, in some instances, disabled for the pumpkind service), so those PRs can be merged and the hashes in package-manifest.toml updated whenever.

@karencfv karencfv requested a review from smklein June 27, 2024 04:30
@andrewjstone
Contributor

I think I've addressed all comments. Please let me know if there's a loose end somewhere.

Awesome work Karen! I know this one was a slog. Thanks for toughing it out.

I do have one more request. Sorry for adding more work! I know we left the bootstrap address in the switch zone, and I see it in the output. That's great. But it got me thinking that this is such a significant change, that we may want to test mupdating a sled running this code again, to ensure that it still works. I don't have any reason to think that it wouldn't, but I'd rather be on the safe side given the lack of automated testing here. I think you should be able to just upload the same tuf repo and mupdate a single sled again, even if it's the same version. If not, a new commit and repo build would work.
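
A minimal sketch of how the "bootstrap address is present" check could be automated against `ipadm` output like the transcripts above. It assumes the address object of interest ends in `/bootstrap6` (as `oxBootstrap0/bootstrap6` does in the pasted output) and that the address is the last column. The function name `bootstrap_addr` is hypothetical and not part of omicron.

```rust
use std::net::Ipv6Addr;

/// Returns the address on the first `*/bootstrap6` address object, if any.
/// Hypothetical helper for illustration; it only parses text shaped like
/// the `ipadm` output pasted in this thread.
fn bootstrap_addr(ipadm_output: &str) -> Option<Ipv6Addr> {
    ipadm_output.lines().find_map(|line| {
        let mut cols = line.split_whitespace();
        let addrobj = cols.next()?;
        if !addrobj.ends_with("/bootstrap6") {
            return None;
        }
        // The address is the last column, e.g. `fdb0:a840:2504:355::2/64`.
        let addr = cols.last()?;
        let (ip, _prefix_len) = addr.split_once('/')?;
        ip.parse().ok()
    })
}

fn main() {
    // A few lines copied from the switch zone output on cubby 16.
    let output = "\
ADDROBJ           TYPE     STATE        ADDR
lo0/v6            static   ok           ::1/128
oxBootstrap0/ll   addrconf ok           fe80::8:20ff:fe17:557c%oxBootstrap0/10
oxBootstrap0/bootstrap6 static ok       fdb0:a840:2504:355::2/64
";
    match bootstrap_addr(output) {
        Some(addr) => println!("bootstrap address present: {addr}"),
        None => println!("no bootstrap address found"),
    }
}
```

This only confirms that the address exists in the zone; it says nothing about whether installinator can actually reach wicketd, which is what the repeat mupdate is meant to exercise.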

@karencfv
Contributor Author

karencfv commented Jun 27, 2024

I do have one more request. Sorry for adding more work! I know we left the bootstrap address in the switch zone, and I see it in the output. That's great. But it got me thinking that this is such a significant change, that we may want to test mupdating a sled running this code again, to ensure that it still works. I don't have any reason to think that it wouldn't, but I'd rather be on the safe side given the lack of automated testing here. I think you should be able to just upload the same tuf repo and mupdate a single sled again, even if it's the same version. If not, a new commit and repo build would work.

Nw @andrewjstone , rather get this working right. I'll make a dummy commit and mupdate Madrid. So, basically just starting from here https://github.com/oxidecomputer/meta/blob/master/engineering/lab/env/madrid/index.adoc#442-updating-the-scrimlet-host-os without running a clean slate?

@andrewjstone
Contributor

I do have one more request. Sorry for adding more work! I know we left the bootstrap address in the switch zone, and I see it in the output. That's great. But it got me thinking that this is such a significant change, that we may want to test mupdating a sled running this code again, to ensure that it still works. I don't have any reason to think that it wouldn't, but I'd rather be on the safe side given the lack of automated testing here. I think you should be able to just upload the same tuf repo and mupdate a single sled again, even if it's the same version. If not, a new commit and repo build would work.

Nw @andrewjstone , rather get this working right. I'll make a dummy commit and mupdate Madrid. So, basically just starting from here https://github.com/oxidecomputer/meta/blob/master/engineering/lab/env/madrid/index.adoc#442-updating-the-scrimlet-host-os without running a clean slate?

Thank you so much! Nope, luckily it's even easier than that, I believe. You should only have to mupdate a single non-scrimlet sled to see if it works. So it's just a matter of uploading the tuf repo to wicket and then installing to one of the other sleds. No need for any manual shenanigans :) You can start from here

Then I'd just go ahead after mupdate and log into that sled to see if what is expected to run is running. It should be running the same zones as last time and look normal. You shouldn't have to clean slate or run RSS again or anything, as this is just an update.

@karencfv
Contributor Author

@andrewjstone I ran a mupdate with fae95de on a non-scrimlet sled and everything looks fine to me :)

BRM42220046 # ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
cxgbe0/ll         addrconf ok           fe80::aa40:25ff:fe04:396%cxgbe0/10
cxgbe1/ll         addrconf ok           fe80::aa40:25ff:fe04:39e%cxgbe1/10
bootstrap0/ll     addrconf ok           fe80::8:20ff:fefb:d7f4%bootstrap0/10
bootstrap0/bootstrap6 static ok         fdb0:a840:2504:396::1/64
underlay0/ll      addrconf ok           fe80::8:20ff:feee:2d9b%underlay0/10
underlay0/sled6   static   ok           fd00:1122:3344:103::1/64
underlay0/internaldns2 static ok        fd00:1122:3344:3::2/64
BRM42220046 # zoneadm list
global
oxz_ntp_509cf2d6-33dd-4abd-8e83-fe341bdce607
oxz_internal_dns_0b4522ea-4678-4b83-99f3-1cfd06b5e1fa
oxz_crucible_1a5c52ed-89f8-4b04-83a0-322d477f287f
oxz_crucible_55ca2d30-8115-4872-82d6-0b67931afdb0
oxz_crucible_e687777e-62dc-48d7-9cd3-2b34bbd6777f
oxz_crucible_pantry_df12fb2b-9aa5-40ef-86e8-8117691bb246
oxz_crucible_da5f2c49-69b1-489d-9ff0-805cd8171d90
oxz_crucible_30a2f7e2-76af-424e-ad66-b6f69819ea7c
oxz_crucible_e663fa88-c6f0-40a1-855e-359fbf671d7c
oxz_crucible_bbda646e-8734-4fa5-959f-10daf4dead09
oxz_crucible_e750464b-1ba8-47c8-89ab-7c91ed804780
oxz_crucible_8ef1cb7c-0a87-4a38-9621-63c53dd760bf
oxz_crucible_bb87a194-6f90-4fb7-974a-e4883f72aed6
oxz_nexus_80e041cb-955e-4b9b-9c38-6278aa8ae3d8
oxz_cockroachdb_569c5df4-3e31-44db-9f7f-4de7f0c355b8
BRM42220046 # svcs -x
BRM42220046 #
root@oxz_ntp_509cf2d6:~# svcs -x
root@oxz_ntp_509cf2d6:~#
root@oxz_internal_dns_0b4522ea:~# svcs -x
root@oxz_internal_dns_0b4522ea:~#
root@oxz_crucible_1a5c52ed:~# svcs -x
root@oxz_crucible_1a5c52ed:~#
root@oxz_crucible_pantry_df12fb2b:~# svcs -x
root@oxz_crucible_pantry_df12fb2b:~#
root@oxz_nexus_80e041cb:~# svcs -x
root@oxz_nexus_80e041cb:~# 
root@oxz_cockroachdb_569c5df4:~# svcs -x
root@oxz_cockroachdb_569c5df4:~#
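
The "same zones as last time" check can be done mechanically rather than by eye. The following is a minimal sketch that compares two `zoneadm list` outputs as sets, so that line order does not matter (the two BRM42220046 listings in this thread contain the same zones in a different order). The names `zone_set` and `zone_diff` are hypothetical helpers, not omicron APIs.

```rust
use std::collections::BTreeSet;

/// Collects the zone names from `zoneadm list` output into an ordered set,
/// so two listings can be compared regardless of line order.
fn zone_set(zoneadm_output: &str) -> BTreeSet<&str> {
    zoneadm_output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Returns `(missing, unexpected)`: zones only in `before`, and zones only
/// in `after`.
fn zone_diff<'a>(before: &'a str, after: &'a str) -> (Vec<&'a str>, Vec<&'a str>) {
    let (b, a) = (zone_set(before), zone_set(after));
    let missing = b.difference(&a).copied().collect();
    let unexpected = a.difference(&b).copied().collect();
    (missing, unexpected)
}

fn main() {
    // Abbreviated listings using zone names from the BRM42220046 output.
    let before = "\
global
oxz_internal_dns_0b4522ea-4678-4b83-99f3-1cfd06b5e1fa
oxz_ntp_509cf2d6-33dd-4abd-8e83-fe341bdce607
oxz_nexus_80e041cb-955e-4b9b-9c38-6278aa8ae3d8
";
    let after = "\
global
oxz_ntp_509cf2d6-33dd-4abd-8e83-fe341bdce607
oxz_internal_dns_0b4522ea-4678-4b83-99f3-1cfd06b5e1fa
oxz_nexus_80e041cb-955e-4b9b-9c38-6278aa8ae3d8
";
    let (missing, unexpected) = zone_diff(before, after);
    println!("missing after mupdate: {missing:?}");
    println!("unexpected after mupdate: {unexpected:?}");
}
```

An empty result for both lists only shows that the zones exist; the `svcs -x` checks above are still needed to confirm that the services inside them are healthy.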

Made a second project

Screenshot from 2024-06-27 18-35-10

Contributor

@andrewjstone andrewjstone left a comment

Based on testing I believe this is good to go. You may want to get a second approval from Sean, since he's the expert in self-assembling zones and has been the primary reviewer here.

Thanks for all the hard work here Karen.

Collaborator

@smklein smklein left a comment

Thanks for all the hard work on this! Congrats on getting it through!

            .map_err(|err| {
                Error::io("Failed to setup Switch zone profile", err)
            })?;
        return Ok(RunningZone::boot(installed_zone).await?);
Collaborator

Sweeeeeeeet, love to see this -- booting being the last step is perfect, this is absolutely what I was hoping for (and I wasn't even 100% sure if it was possible)!
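
To illustrate why boot-as-the-last-step is attractive, here is a minimal sketch of the ordering, assuming the point is that all configuration is written into the zone's profile before any service starts. The types below are simplified, synchronous stand-ins that only borrow the names `InstalledZone` and `RunningZone` from the snippet above. They are not omicron's real definitions, and `with_profile` is a hypothetical method.

```rust
/// Simplified stand-in for a zone that is installed but not yet booted.
struct InstalledZone {
    name: String,
    profile: Option<String>,
}

/// Simplified stand-in for a booted zone. It can only be built via `boot`.
struct RunningZone {
    name: String,
    profile: String,
}

impl InstalledZone {
    fn new(name: &str) -> Self {
        InstalledZone { name: name.to_string(), profile: None }
    }

    /// Records the profile contents while the zone is still halted
    /// (hypothetical method for this sketch).
    fn with_profile(mut self, profile: &str) -> Self {
        self.profile = Some(profile.to_string());
        self
    }
}

impl RunningZone {
    /// Booting consumes the installed zone, so nothing can be configured
    /// afterwards, and it refuses to boot a zone that has no profile.
    fn boot(zone: InstalledZone) -> Result<RunningZone, String> {
        match zone.profile {
            Some(profile) => Ok(RunningZone { name: zone.name, profile }),
            None => Err(format!("zone {} has no profile; refusing to boot", zone.name)),
        }
    }
}

fn main() {
    let installed = InstalledZone::new("oxz_switch").with_profile("<service_bundle/>");
    match RunningZone::boot(installed) {
        Ok(zone) => println!("booted {} with profile {}", zone.name, zone.profile),
        Err(err) => println!("boot failed: {err}"),
    }
}
```

Because `boot` takes the `InstalledZone` by value, the compiler rejects any attempt to keep configuring the zone after it has started, which is the property the review comment is celebrating.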

@karencfv
Contributor Author

karencfv commented Jul 3, 2024

Thanks everyone for taking the time to review!

As mentioned during the update meeting, I will be holding off on merging this until release 9 is out the door. It's a significant change to the switch zone flow, and it'll be good for it to be running for longer on the dogfood rack before installing it on customer racks.

@karencfv
Contributor Author

Ready to merge now. It just depends on https://github.com/oxidecomputer/dendrite/pull/990 being merged and the hashes being updated in this PR. These should be done in tandem, otherwise deployments will break.

@karencfv
Contributor Author

karencfv commented Jul 19, 2024

Hm, looks like 27fb3af, which is trying to update to this commit, is failing -> https://github.com/oxidecomputer/dendrite/pull/997

This means that if I update here, it will most likely fail as well because of that previous commit.

@bnaecker, do you mind taking a look? I believe the PR was yours.

We'll have to coordinate updating dendrite in omicron, as my dendrite commit requires this PR to work.

@karencfv karencfv merged commit 67d0cbd into oxidecomputer:main Jul 22, 2024
17 checks passed
@karencfv karencfv deleted the switch-zone-self-assembling branch July 22, 2024 01:31
Development

Successfully merging this pull request may close these issues.

Convert Switch Zone to be self-assembling
8 participants