Inserting a NetworkInterface should allow the caller to specify the slot #5056

Closed
jgallagher opened this issue Feb 13, 2024 · 0 comments · Fixed by #5080

When RSS is creating NetworkInterfaces to send to sled-agent, it chooses a slot number (and always chooses slot: 0; e.g., for Nexus instances). After RSS completes and hands off to Nexus, Nexus converts the RSS-created NetworkInterfaces to IncompleteNetworkInterfaces. IncompleteNetworkInterface allows the caller to specify the NIC ID, MAC, and IP, all of which we do; however, it does not allow specifying the slot. The slot is instead chosen by a NextItem query helper, which may choose any slot from 0-7; see #5055.

This leads to some inconsistency that may be a little painful as we work on the reconfiguration system. The inventory system collects NetworkInterfaces as a part of collecting each sled's service zone configuration; these are the values chosen by RSS. On dogfood, we see that each has a slot of 0, as expected:

root@[fd00:1122:3344:109::3]:32221/omicron  OPEN> select * from inv_omicron_zone_nic where inv_collection_id = '373b8b21-d878-4da0-8eb7-6f45fb04f49c';
           inv_collection_id           |                  id                  |                       name                        |     ip     |       mac       |    subnet     | vni | is_primary | slot
---------------------------------------+--------------------------------------+---------------------------------------------------+------------+-----------------+---------------+-----+------------+-------
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | 2e9a412e-c79a-48fe-8fa4-f5a6afed1040 | nexus-2898657e-4141-4c05-851b-147bffc6bbbd        | 172.30.2.7 | 184993468892761 | 172.30.2.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | 364b0ecd-bf08-4cac-a993-bbf4a70564c7 | nexus-20b100d0-84c3-4119-aa9b-0c632b0b6a3a        | 172.30.2.6 | 184993468888257 | 172.30.2.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | 4effd079-ed4e-4cf6-8545-bb9574f516d2 | ntp-6ea2684c-115e-48a6-8453-ab52d1cecd73          | 172.30.3.6 | 184993468883193 | 172.30.3.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | 99b759fc-8e2e-44b7-aca8-93c3b201974d | external-dns-edd99650-5df1-4241-815d-253e4ef2399c | 172.30.1.5 | 184993468887196 | 172.30.1.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | a3e13dde-a2bc-4170-ad84-aad8085b6034 | nexus-65a11c18-7f59-41ac-b9e7-680627f996e7        | 172.30.2.5 | 184993468884611 | 172.30.2.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | a4b9bacf-6c04-431a-81ad-9bf0302af96e | ntp-c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55          | 172.30.3.5 | 184993468887634 | 172.30.3.0/24 | 100 |    true    |    0
  373b8b21-d878-4da0-8eb7-6f45fb04f49c | b0b42776-3914-4a69-889f-4831dc72327c | external-dns-f500d564-c40a-4eca-ac8a-a26b435f2037 | 172.30.1.6 | 184993468895412 | 172.30.1.0/24 | 100 |    true    |    0
(7 rows)

However, the network_interfaces table that was populated by Nexus during the RSS handoff has one NIC with slot=0, and the rest have slot=1:

root@[fd00:1122:3344:109::3]:32221/omicron  OPEN> select * from service_network_interface;
                   id                  |                       name                        |        description        |         time_created          |         time_modified         | time_deleted |              service_id              |                vpc_id                |              subnet_id               |       mac       |     ip     | slot | is_primary
---------------------------------------+---------------------------------------------------+---------------------------+-------------------------------+-------------------------------+--------------+--------------------------------------+--------------------------------------+--------------------------------------+-----------------+------------+------+-------------
  2e9a412e-c79a-48fe-8fa4-f5a6afed1040 | nexus-2898657e-4141-4c05-851b-147bffc6bbbd        | nexus service vNIC        | 2023-08-30 18:59:11.487953+00 | 2023-08-30 18:59:11.487953+00 | NULL         | 2898657e-4141-4c05-851b-147bffc6bbbd | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000002 | 184993468892761 | 172.30.2.7 |    0 |    true
  364b0ecd-bf08-4cac-a993-bbf4a70564c7 | nexus-20b100d0-84c3-4119-aa9b-0c632b0b6a3a        | nexus service vNIC        | 2023-08-30 18:59:11.009175+00 | 2023-08-30 18:59:11.009175+00 | NULL         | 20b100d0-84c3-4119-aa9b-0c632b0b6a3a | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000002 | 184993468888257 | 172.30.2.6 |    1 |    true
  4effd079-ed4e-4cf6-8545-bb9574f516d2 | ntp-6ea2684c-115e-48a6-8453-ab52d1cecd73          | ntp service vNIC          | 2023-08-30 18:59:11.556418+00 | 2023-08-30 18:59:11.556418+00 | NULL         | 6ea2684c-115e-48a6-8453-ab52d1cecd73 | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000003 | 184993468883193 | 172.30.3.6 |    1 |    true
  99b759fc-8e2e-44b7-aca8-93c3b201974d | external-dns-edd99650-5df1-4241-815d-253e4ef2399c | external_dns service vNIC | 2023-08-30 18:59:11.405221+00 | 2023-08-30 18:59:11.405221+00 | NULL         | edd99650-5df1-4241-815d-253e4ef2399c | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000001 | 184993468887196 | 172.30.1.5 |    1 |    true
  a3e13dde-a2bc-4170-ad84-aad8085b6034 | nexus-65a11c18-7f59-41ac-b9e7-680627f996e7        | nexus service vNIC        | 2023-08-30 18:59:10.82071+00  | 2023-08-30 18:59:10.82071+00  | NULL         | 65a11c18-7f59-41ac-b9e7-680627f996e7 | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000002 | 184993468884611 | 172.30.2.5 |    1 |    true
  a4b9bacf-6c04-431a-81ad-9bf0302af96e | ntp-c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55          | ntp service vNIC          | 2023-08-30 18:59:11.077996+00 | 2023-08-30 18:59:11.077996+00 | NULL         | c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55 | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000003 | 184993468887634 | 172.30.3.5 |    1 |    true
  b0b42776-3914-4a69-889f-4831dc72327c | external-dns-f500d564-c40a-4eca-ac8a-a26b435f2037 | external_dns service vNIC | 2023-08-30 18:59:11.333446+00 | 2023-08-30 18:59:11.333446+00 | NULL         | f500d564-c40a-4eca-ac8a-a26b435f2037 | 001de000-074c-4000-8000-000000000000 | 001de000-c470-4000-8000-000000000001 | 184993468895412 | 172.30.1.6 |    1 |    true
(7 rows)

There are two issues here:

  1. We should fix service NIC insertion to allow the caller to specify the slot. (This affects RSS today, and will affect the reconfigurator as of Reconfigurator: Record external networking allocations when realizing a blueprint #5045.)
  2. We should fix up existing, incorrect service NIC slot numbers. I'm not sure if we should do this automatically (e.g., reconcile with the inventory system) or manually (e.g., how we do schema migrations during update).
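
For the "reconcile with the inventory system" option, the comparison could look roughly like the sketch below. It is only an illustration: `NicSlot`, `SlotMismatch`, and `slot_mismatches` are hypothetical names, not types from omicron, and the records are cut down to a NIC ID and a slot. It pairs inventory NICs with database NICs by ID and reports each pair whose slots differ.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down NIC record: only the fields needed to compare slots.
#[derive(Debug, Clone, PartialEq)]
pub struct NicSlot {
    pub id: String,
    pub slot: u8,
}

/// One NIC whose slot differs between the inventory and the database.
#[derive(Debug, PartialEq)]
pub struct SlotMismatch {
    pub id: String,
    pub inventory_slot: u8,
    pub db_slot: u8,
}

/// Returns every NIC present in both inputs whose slots disagree, sorted by ID.
/// NICs that appear in only one of the two inputs are ignored.
pub fn slot_mismatches(inventory: &[NicSlot], db: &[NicSlot]) -> Vec<SlotMismatch> {
    // Index the database rows by NIC ID for constant-time lookup.
    let db_slots: HashMap<&str, u8> =
        db.iter().map(|nic| (nic.id.as_str(), nic.slot)).collect();

    let mut mismatches: Vec<SlotMismatch> = inventory
        .iter()
        .filter_map(|inv| {
            let db_slot = *db_slots.get(inv.id.as_str())?;
            (db_slot != inv.slot).then(|| SlotMismatch {
                id: inv.id.clone(),
                inventory_slot: inv.slot,
                db_slot,
            })
        })
        .collect();

    mismatches.sort_by(|a, b| a.id.cmp(&b.id));
    mismatches
}
```

Applied to the dogfood data above, such a check would flag six of the seven service NICs, since only the NIC for nexus-2898657e has slot 0 in both tables.
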
jgallagher added a commit that referenced this issue Feb 15, 2024
…5065)

This makes several minor changes to plumb slots through:

* `IncompleteNetworkInterface` now stores an optional slot, just like it
stores an optional IP/MAC address
* In `network_interfaces::InsertQuery`, if the incoming slot is set, we
use it directly instead of running the `NextItem`-based subquery
* Adds a partial index to ensure uniqueness of a slot within a single
`parent_id` (I believe this is correct, but would love confirmation from
someone more familiar!)
* `IncompleteNetworkInterface::new_service()` now takes a _non-optional_
IP, MAC address, and slot. This matches how it was called in all
non-test code.
* Tweaked the Nexus internal API used for RSS handoff to include the
slot in the description of NICs.

This is a partial fix for #5056, and should produce correct behavior on
new systems that run through RSS, even without a fix for #5055 (because
we bypass `NextItem` altogether with this change). In particular, I
think this should unblock testing of #5045 on madrid / testbeds. It does
not address the already-recorded-NICs-with-incorrect-slots on systems
like dogfood; I'll take care of that in a subsequent PR.
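
The insertion behavior described above can be modeled in memory as a rough sketch. Everything here is an assumption made for illustration: `NicTable`, `InsertError`, and `MAX_SLOTS` are hypothetical names, the real logic lives in a SQL insert query rather than Rust collections, and the "lowest free slot" fallback is only a stand-in for whatever the `NextItem` subquery actually does. The sketch shows the two paths (caller-specified slot versus automatic choice) and the per-parent uniqueness rule that the partial index is meant to enforce.

```rust
use std::collections::{BTreeSet, HashMap};

/// Slots range over 0-7 per the issue text; the constant name is hypothetical.
const MAX_SLOTS: u8 = 8;

/// Hypothetical stand-in for the NIC table: occupied slots keyed by parent ID.
#[derive(Default)]
pub struct NicTable {
    slots_by_parent: HashMap<String, BTreeSet<u8>>,
}

#[derive(Debug, PartialEq)]
pub enum InsertError {
    SlotOutOfRange(u8),
    SlotInUse(u8),
    NoFreeSlot,
}

impl NicTable {
    /// Inserts a NIC for `parent_id` and returns the slot it occupies.
    ///
    /// If `requested` is `Some`, that slot is used directly and must be free
    /// for this parent. If it is `None`, the lowest free slot is chosen.
    pub fn insert(&mut self, parent_id: &str, requested: Option<u8>) -> Result<u8, InsertError> {
        let used = self.slots_by_parent.entry(parent_id.to_string()).or_default();
        let slot = match requested {
            Some(slot) if slot >= MAX_SLOTS => return Err(InsertError::SlotOutOfRange(slot)),
            // Uniqueness is scoped to one parent, mirroring the partial index.
            Some(slot) if used.contains(&slot) => return Err(InsertError::SlotInUse(slot)),
            Some(slot) => slot,
            None => (0..MAX_SLOTS)
                .find(|candidate| !used.contains(candidate))
                .ok_or(InsertError::NoFreeSlot)?,
        };
        used.insert(slot);
        Ok(slot)
    }
}
```

With this model, two different parents can both hold slot 0, a second request for slot 0 on the same parent is rejected, and a request with no slot falls back to the automatic choice.
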
jgallagher added a commit that referenced this issue Feb 15, 2024
…ist-eips` (#5064)

This PR has some omdb commands I wanted during dev/debug of #5045. It
expands the output of `list-eips` to include the owner's ID
(particularly useful when trying to determine the external IP of a
specific Nexus instance, for example):

```
 IP                PORTS        KIND       STATE     OWNER_KIND  OWNER_ID                              OWNER_DESCRIPTION
 10.1.1.3/32       0/65535      floating   Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 10.1.1.4/32       0/16383      SNAT       Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 10.1.1.5/32       0/65535      ephemeral  Attached  instance    4e6fb33a-7ba2-4a5e-abc5-dc9b047c01e0  v6/some-vm2
 172.20.26.1/32    0/65535      floating   Attached  service     edd99650-5df1-4241-815d-253e4ef2399c  ExternalDns
 172.20.26.2/32    0/65535      floating   Attached  service     f500d564-c40a-4eca-ac8a-a26b435f2037  ExternalDns
 172.20.26.3/32    0/65535      floating   Attached  service     65a11c18-7f59-41ac-b9e7-680627f996e7  Nexus
 172.20.26.4/32    0/65535      floating   Attached  service     20b100d0-84c3-4119-aa9b-0c632b0b6a3a  Nexus
 172.20.26.5/32    0/65535      floating   Attached  service     2898657e-4141-4c05-851b-147bffc6bbbd  Nexus
 172.20.26.6/32    0/16383      SNAT       Attached  service     c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55  Ntp
```

and adds a `list-vnics` command to show allocated vnics:

```
 IP                 MAC                SLOT  PRIMARY  KIND      SUBNET           PARENT_ID                             DESCRIPTION
 172.30.0.5/32      A8:40:25:F8:A5:8C  1     true     instance  172.30.0.0/22    2a4afdda-e269-48bc-913f-01ad57c50543  default primary interface for p4
 172.30.0.5/32      A8:40:25:F5:AF:F0  1     true     instance  172.30.0.0/22    be705808-d507-4693-9a97-186c92970e7b  default primary interface for updateinst
 172.30.0.5/32      A8:40:25:F7:3B:00  1     true     instance  172.30.0.0/22    0ab1939f-af6e-4ea2-a155-71f210e937fc  a sample nic
 172.30.1.5/32      A8:40:25:FF:B0:9C  1     true     service   172.30.1.0/24    edd99650-5df1-4241-815d-253e4ef2399c  external_dns service vNIC
 172.30.1.6/32      A8:40:25:FF:D0:B4  1     true     service   172.30.1.0/24    f500d564-c40a-4eca-ac8a-a26b435f2037  external_dns service vNIC
 172.30.2.5/32      A8:40:25:FF:A6:83  1     true     service   172.30.2.0/24    65a11c18-7f59-41ac-b9e7-680627f996e7  nexus service vNIC
 172.30.2.6/32      A8:40:25:FF:B4:C1  1     true     service   172.30.2.0/24    20b100d0-84c3-4119-aa9b-0c632b0b6a3a  nexus service vNIC
 172.30.2.7/32      A8:40:25:FF:C6:59  0     true     service   172.30.2.0/24    2898657e-4141-4c05-851b-147bffc6bbbd  nexus service vNIC
 172.30.3.5/32      A8:40:25:FF:B2:52  1     true     service   172.30.3.0/24    c3ec3d1a-3172-4d36-bfd3-f54a04d5ba55  ntp service vNIC
 172.30.3.6/32      A8:40:25:FF:A0:F9  1     true     service   172.30.3.0/24    6ea2684c-115e-48a6-8453-ab52d1cecd73  ntp service vNIC
```

(This command immediately revealed issues with the slot number recording
on dogfood, which led to opening #5056.)
jgallagher added a commit that referenced this issue Feb 15, 2024
This is the second half of the fix for #5056. #5065 (already merged)
fixed _how_ we were getting service NICs with nonzero slot values, and
this PR adds a schema migration to apply a one-time fix to any existing
service NICs with nonzero slot values. This matters to the
Reconfigurator, because currently the NICs sled-agent thinks it has
don't match the NICs recorded in CRDB (differing only by slot number).

Closes #5056.
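
The effect of such a one-time fix can be sketched in memory as follows. This is not the migration itself, which operates on the database; `NicKind`, `NicRow`, and `reset_service_nic_slots` are hypothetical names, and the sketch assumes that each service owns exactly one NIC, so that moving it to slot 0 cannot collide with another NIC of the same parent.

```rust
/// Hypothetical NIC kinds, mirroring the KIND column in the omdb output.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum NicKind {
    Instance,
    Service,
}

/// Hypothetical, trimmed-down row of the NIC table.
#[derive(Debug, Clone, PartialEq)]
pub struct NicRow {
    pub kind: NicKind,
    pub parent_id: String,
    pub slot: u8,
}

/// Sets the slot to 0 on every service NIC whose slot is nonzero and returns
/// how many rows were changed. Instance NICs are left untouched.
///
/// Assumption: each service has exactly one NIC, so slot 0 is always free.
pub fn reset_service_nic_slots(rows: &mut [NicRow]) -> usize {
    let mut changed = 0;
    for row in rows
        .iter_mut()
        .filter(|row| row.kind == NicKind::Service && row.slot != 0)
    {
        row.slot = 0;
        changed += 1;
    }
    changed
}
```

Running it a second time changes nothing, which is the property a one-time fix needs.
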