Track more kinds of datalinks on a sled #6208

Merged
merged 4 commits into from
Aug 7, 2024

@bnaecker bnaecker commented Aug 2, 2024

  • Drop the old physical_data_link:* timeseries in favor of an expanded sled_data_link:* timeseries. The new timeseries includes the sled identifiers and the kind of link, which covers physical, VNIC, and OPTE devices. Expunge the old timeseries.
  • Make the existing metrics manager into a small wrapper around a background task. Add message types for asking the task to start / stop tracking various things, for now just VNICs and OPTE ports. Physical links can also be tracked (but not untracked), which the sled agent does immediately after creating the task.
  • Add the metrics request queue to the instance manager, instance, and instance runner. The runner starts tracking the control VNIC and OPTE ports after booting the zone, and stops tracking them before stopping the zone.
  • Add the metrics request queue to the probe manager, which also starts / stops tracking the links in its zones.
  • Add the metrics queue to the service manager. This one is more complicated, because the service manager has to exist before the SledAgent object itself in order to start the switch zone, so it cannot be given the queue at construction. Instead, the manager is provided the queue when it's notified that the SledAgent exists, and at the same time it uses the queue to notify the metrics task about the control VNIC that must have already been plumbed into the switch zone. The service manager also tracks / untracks the VNICs and OPTE ports for the Omicron zones it starts, which is much simpler.
  • Add some helper methods to the {Running,Installed}Zone types for listing the names of the control VNIC, bootstrap VNIC, and any OPTE ports. These are used to tell the metrics task which links to track.
  • Clean up a few straggling comments or references to the VNICs that were previously required between OPTE ports and the guest Viona driver. Those VNICs were removed in #5989 ([sled-agent] Remove VNICs from XDE devices).
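
The "small wrapper around a background task" design in the list above can be sketched roughly as follows. This is an illustration only, not the code from this PR: it uses a plain thread and a `std::sync::mpsc` channel rather than the async runtime the sled agent actually uses, and every name in it (`MetricsManager`, `Message`, `LinkKind`, `request_queue`, `shutdown`) is an assumption made for the example. The zone name `global` for physical links is taken from the log output later in this conversation.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Kind of datalink being tracked (physical, VNIC, or OPTE port).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LinkKind {
    Physical,
    Vnic,
    Opte,
}

/// Requests understood by the background task. Hypothetical names.
/// Note there is deliberately no message for untracking a physical link.
#[derive(Debug)]
pub enum Message {
    TrackPhysical { name: String },
    TrackVnic { zone_name: String, name: String },
    UntrackVnic { name: String },
    TrackOptePort { zone_name: String, name: String },
    UntrackOptePort { name: String },
}

/// Map from link name to its kind and the zone it lives in.
pub type Tracked = HashMap<String, (LinkKind, String)>;

/// Thin handle around the background task that owns the tracked links.
pub struct MetricsManager {
    tx: Sender<Message>,
    task: JoinHandle<Tracked>,
}

impl MetricsManager {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel();
        let task = thread::spawn(move || {
            let mut tracked: Tracked = HashMap::new();
            // Runs until every sender has been dropped.
            for msg in rx {
                match msg {
                    Message::TrackPhysical { name } => {
                        let zone = "global".to_string();
                        // A link that is already tracked is left untouched.
                        tracked.entry(name).or_insert((LinkKind::Physical, zone));
                    }
                    Message::TrackVnic { zone_name, name } => {
                        tracked.entry(name).or_insert((LinkKind::Vnic, zone_name));
                    }
                    Message::TrackOptePort { zone_name, name } => {
                        tracked.entry(name).or_insert((LinkKind::Opte, zone_name));
                    }
                    Message::UntrackVnic { name }
                    | Message::UntrackOptePort { name } => {
                        tracked.remove(&name);
                    }
                }
            }
            tracked
        });
        Self { tx, task }
    }

    /// Clone of the request queue, to hand to the instance, probe, or
    /// service manager.
    pub fn request_queue(&self) -> Sender<Message> {
        self.tx.clone()
    }

    /// Drop our sender, wait for the task, and return the final link set.
    /// Blocks until all cloned queues have been dropped too.
    pub fn shutdown(self) -> Tracked {
        let MetricsManager { tx, task } = self;
        drop(tx);
        task.join().expect("metrics task panicked")
    }
}
```

Keeping all state inside the task means callers only ever hold a cheap, cloneable queue, which is what lets the queue be threaded through several managers without sharing locks.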

@bnaecker bnaecker force-pushed the track-more-sled-datalinks branch 2 times, most recently from 4c85d82 to 14bedbb Compare August 2, 2024 19:48
bnaecker commented Aug 2, 2024

In addition to the unit tests I added in the PR, I did a couple of quick tests installing the whole control plane on my dev machine. Here's what we see when starting everything up:

bnaecker@shale : ~/omicron $ reinstall_omicron && tail -F $(svcs -L sled-agent) | looker -c 'r.contains("link_name")'
    Finished `release` profile [optimized] target(s) in 3.07s
Logging to: /home/bnaecker/omicron/out/LOG

<SNIP>

20:55:50.607Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = physical
    link_name = net0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = global
20:55:50.607Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = physical
    link_name = net1
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = global
20:55:58.876Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_switch
20:55:58.876Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxBootstrap0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_switch
20:55:58.876Z DEBG SledAgent: received message to track VNIC, but it is already being tracked
    link_name = oxControlService0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
20:55:58.876Z DEBG SledAgent: received message to track VNIC, but it is already being tracked
    link_name = oxBootstrap0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
20:56:20.917Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService2
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_internal_dns_52b672f8-a33b-4aed-9d4c-06f1bb3ddfbf
20:56:21.516Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService1
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_internal_dns_e958b9e5-1a8e-4ee3-8044-dcc76c4b6dbc
20:56:21.723Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService3
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_internal_dns_fb337524-e11b-4766-9d60-bbee384a2909
20:56:48.212Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService4
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_ntp_93b97b95-4ef7-401e-b8db-7874785020dd
20:56:48.212Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_ntp_93b97b95-4ef7-401e-b8db-7874785020dd
20:58:09.676Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService9
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_cockroachdb_19210169-3ea2-448d-9123-ee66defa4b52
20:58:09.907Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService5
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_cockroachdb_7c8471bc-2394-471f-b0aa-d674eed50a6c
20:58:09.981Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService7
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_cockroachdb_b18a5327-b6b0-4ae7-b489-3c2e59681786
20:58:10.275Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService6
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_cockroachdb_3246ae8d-fb3e-48eb-9c6b-8fcc35095fac
20:58:10.605Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService8
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_cockroachdb_a501ca14-f030-410c-9b03-43ff9efb2c44
21:03:24.598Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService19
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_oximeter_1f9c3f3c-3b21-454d-a38f-43c72ef0ce97
21:03:24.600Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService17
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_9c01bd75-7fd9-4824-9a28-1f57dc3361ef
21:03:24.881Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService24
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_7eae326b-1a71-495b-887b-7994ee8b4d20
21:03:24.883Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte2
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_7eae326b-1a71-495b-887b-7994ee8b4d20
21:03:25.277Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService13
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_7fadeeed-b124-429c-bd5b-bbad9d789ba4
21:03:25.291Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService22
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_09931461-014c-486c-a759-172a65861080
21:03:25.321Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService15
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_pantry_caf2d809-d5a6-4acd-ba91-c9d3ca94ca3b
21:03:25.388Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService21
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_2c80d835-0fcd-4722-920a-ae3a4a078b0c
21:03:25.618Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService10
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_pantry_a3e901af-36c3-4ca7-b85e-8f75ee4e552b
21:03:26.034Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService12
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_59a92896-9f15-4684-8ce3-6fe1a67a6e35
21:03:26.244Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService20
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_pantry_fbfc92dd-198c-4ed4-9830-9554c615b723
21:03:26.460Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService26
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_acd49014-b0ce-4682-b932-87236aa48f41
21:03:26.460Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte1
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_acd49014-b0ce-4682-b932-87236aa48f41
21:03:26.554Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService28
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_external_dns_11f0f5d5-f181-45ac-9cd4-f336087b470a
21:03:26.554Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte5
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_external_dns_11f0f5d5-f181-45ac-9cd4-f336087b470a
21:03:29.306Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService23
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_29548ac9-34b9-4ae9-9bd1-3121c8b4e0ad
21:03:31.239Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService25
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_external_dns_585c82eb-ad8c-46da-a630-c2a827188662
21:03:31.240Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte3
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_external_dns_585c82eb-ad8c-46da-a630-c2a827188662
21:03:32.264Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService27
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_9ea7be9b-a306-4a77-b585-c61fc8dd3f94
21:03:32.264Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte4
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_nexus_9ea7be9b-a306-4a77-b585-c61fc8dd3f94
21:03:33.614Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService16
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_clickhouse_26375786-ffaf-409d-9b23-67d74d9b6121
21:06:16.768Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService18
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_e5b15d26-5831-4450-a982-9efb9efca96c
21:06:16.867Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService14
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_ab22c7ec-196d-4d2a-867d-73834eb32109
21:06:17.121Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlService11
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_crucible_dc47ebee-1556-4af9-924d-17e37370c424
21:14:51.557Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlInstance0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_propolis-server_2263d392-f79d-4d47-aa5f-5d1ecd8e6867
21:14:51.564Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte6
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_propolis-server_2263d392-f79d-4d47-aa5f-5d1ecd8e6867

I also created and then started / stopped an instance a few times, and we can see the following:

bnaecker@shale : ~/omicron $ tail -F $(svcs -L sled-agent) | looker -c 'r.contains("link_name")'
21:15:17.146Z DEBG SledAgent: Removed VNIC from tracked links
    link_name = oxControlInstance0
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
21:15:17.146Z DEBG SledAgent: Removed VNIC from tracked links
    link_name = opte6
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
21:15:41.845Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = vnic
    link_name = oxControlInstance1
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_propolis-server_6fca6a05-a7ca-429b-9a5c-2dbcf888ece6
21:15:41.845Z DEBG SledAgent: Added new link to kstat sampler
    link_kind = opte
    link_name = opte7
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
    zone_name = oxz_propolis-server_6fca6a05-a7ca-429b-9a5c-2dbcf888ece6
21:15:50.305Z DEBG SledAgent: Removed VNIC from tracked links
    link_name = oxControlInstance1
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15
21:15:50.305Z DEBG SledAgent: Removed VNIC from tracked links
    link_name = opte7
    sled_id = abc3f548-dca8-42f6-acd0-3e4ba9d60a15

That's the control VNIC and OPTE port for the guest being added / removed when the instance is started and stopped.
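
The track-on-boot / untrack-on-stop pattern visible in these logs could be sketched as below. This is a hedged illustration rather than the PR's code: `RunningZone` here is a simplified stand-in with made-up fields, and `LinkRequest`, `track_requests`, and `untrack_requests` are hypothetical names. The helper methods loosely mirror the ones the PR description mentions for listing a zone's control VNIC, bootstrap VNIC, and OPTE port names.

```rust
/// Simplified stand-in for a running zone; fields are illustrative only.
pub struct RunningZone {
    pub name: String,
    pub control_vnic: String,
    pub bootstrap_vnic: Option<String>,
    pub opte_ports: Vec<String>,
}

impl RunningZone {
    /// Name of the zone's control VNIC.
    pub fn control_vnic_name(&self) -> &str {
        &self.control_vnic
    }

    /// Name of the bootstrap VNIC, if the zone has one.
    pub fn bootstrap_vnic_name(&self) -> Option<&str> {
        self.bootstrap_vnic.as_deref()
    }

    /// Names of all OPTE ports in the zone.
    pub fn opte_port_names(&self) -> impl Iterator<Item = &str> + '_ {
        self.opte_ports.iter().map(String::as_str)
    }
}

/// Requests that would be placed on the metrics queue. Hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum LinkRequest {
    TrackVnic { zone_name: String, name: String },
    TrackOptePort { zone_name: String, name: String },
    UntrackVnic { name: String },
    UntrackOptePort { name: String },
}

/// Requests to send right after the zone has booted.
pub fn track_requests(zone: &RunningZone) -> Vec<LinkRequest> {
    let zone_name = zone.name.clone();
    let mut out = vec![LinkRequest::TrackVnic {
        zone_name: zone_name.clone(),
        name: zone.control_vnic_name().to_string(),
    }];
    if let Some(name) = zone.bootstrap_vnic_name() {
        out.push(LinkRequest::TrackVnic {
            zone_name: zone_name.clone(),
            name: name.to_string(),
        });
    }
    for port in zone.opte_port_names() {
        out.push(LinkRequest::TrackOptePort {
            zone_name: zone_name.clone(),
            name: port.to_string(),
        });
    }
    out
}

/// Requests to send just before the zone is stopped.
pub fn untrack_requests(zone: &RunningZone) -> Vec<LinkRequest> {
    let mut out = vec![LinkRequest::UntrackVnic {
        name: zone.control_vnic_name().to_string(),
    }];
    if let Some(name) = zone.bootstrap_vnic_name() {
        out.push(LinkRequest::UntrackVnic { name: name.to_string() });
    }
    for port in zone.opte_port_names() {
        out.push(LinkRequest::UntrackOptePort { name: port.to_string() });
    }
    out
}
```

For the first instance in the log above, a zone with control VNIC `oxControlInstance0` and OPTE port `opte6` would yield one VNIC request and one OPTE port request on boot, and the matching pair of untrack requests on stop.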


@zeeshanlakhani zeeshanlakhani left a comment

Still going through this @bnaecker, but adding the first runthrough.

illumos-utils/src/opte/port_manager.rs
oximeter/instruments/src/kstat/link.rs
oximeter/oximeter/schema/sled-data-link.toml
sled-agent/src/metrics.rs
sled-agent/src/metrics.rs (outdated)
sled-agent/src/services.rs (outdated)
sled-agent/src/services.rs (outdated)
- Hashmap over B-tree map
- Return indicator of send failure, log failures to send at call site.
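
The second item in this commit message, returning an indicator of send failure and logging at the call site, might look something like the following. This is a sketch under assumptions, not the change as merged: `MetricsRequestQueue`, `track_vnic`, and `boot_zone` are hypothetical names, a `std::sync::mpsc` channel stands in for the real queue, and a `Vec<String>` stands in for the logger.

```rust
use std::sync::mpsc::{self, Receiver, Sender};

/// Request type for the metrics task, reduced to one variant for brevity.
#[derive(Debug)]
pub enum Message {
    TrackVnic { name: String },
}

/// Cloneable handle used by callers to talk to the metrics task.
#[derive(Clone)]
pub struct MetricsRequestQueue(Sender<Message>);

impl MetricsRequestQueue {
    /// Create a queue along with the receiving end for the task.
    pub fn new() -> (Self, Receiver<Message>) {
        let (tx, rx) = mpsc::channel();
        (Self(tx), rx)
    }

    /// Returns `true` if the request was queued and `false` if the
    /// receiving task is gone. The queue itself does not log anything.
    pub fn track_vnic(&self, name: &str) -> bool {
        let msg = Message::TrackVnic { name: name.to_string() };
        self.0.send(msg).is_ok()
    }
}

/// Example call site: it has the context (which zone, which link) needed
/// to write a useful log line, so it decides how to report the failure.
pub fn boot_zone(queue: &MetricsRequestQueue, vnic: &str, log: &mut Vec<String>) {
    if !queue.track_vnic(vnic) {
        log.push(format!("failed to send request to track VNIC {vnic}"));
    }
}
```

Returning a plain success indicator keeps the queue free of logging concerns and lets each caller treat a failed send as non-fatal, since losing metrics for a link should not block zone startup.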
@bnaecker bnaecker requested a review from zeeshanlakhani August 6, 2024 19:02
bnaecker commented Aug 6, 2024

I'm going to merge with main once #6243 lands, to pick up the semantic merge fix. That'll also obviate the changes I made here to the instance-zone-boot-timeout test.

@zeeshanlakhani zeeshanlakhani left a comment

Thanks for updating from the feedback @bnaecker. This also acks waiting for #6243 first.

@bnaecker bnaecker enabled auto-merge (squash) August 6, 2024 22:18
@bnaecker bnaecker merged commit afe7040 into main Aug 7, 2024
22 checks passed
@bnaecker bnaecker deleted the track-more-sled-datalinks branch August 7, 2024 00:05