Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gateway] Add Oximeter HTTP service metrics #6432

Merged
merged 3 commits into from
Aug 26, 2024
Merged

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Aug 24, 2024

Now that #6354 has added an Oximeter producer endpoint to MGS for
publishing SP sensor metrics, it seemed like a nice idea to also
instrument the MGS HTTP server, similar to the existing instrumentation
for other control plane services. I don't think we'll be doing a lot of
tuning of MGS performance, but the metrics seem like they could still be
useful because they also include the distribution of HTTP status codes,
and in many cases, the latency measurements also serve as a proxy for
how long it takes the SP to perform a certain operation, which could
be a valuable signal.

This commit adds an oximeter_instruments::http::LatencyTracker to the
MGS HTTP servers.

To test that it works, I started a local Clickhouse and a standalone
Oximeter, and ran MGS and the SP simulator using cargo xtask mgs-dev run.
Then, I made a few HTTP requests to various MGS APIs using curl;
most of which were expected to succeed, and a few for SP slots that the
simulator wasn't configured to simulate a SP in (to ensure that the
request would fail). We can see the new metrics in OxQL:

0x〉\d
hardware_component:current
hardware_component:fan_speed
hardware_component:sensor_error_count
hardware_component:temperature
hardware_component:voltage
http_service:request_latency_histogram
oximeter_collector:collections
0x〉get http_service:request_latency_histogram | last 1

http_service:request_latency_histogram

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: ignition_list
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 400
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581]
0x〉exit

Now that #6354 has added an Oximeter producer endpoint to MGS for
publishing SP sensor metrics, it seemed like a nice idea to also
instrument the MGS HTTP server, similar to the existing instrumentation
for other control plane services. I don't think we'll be doing a lot of
tuning of MGS performance, but the metrics seem like they could still be
useful because they also include the distribution of HTTP status codes,
and in many cases, the latency measurements also serve as a proxy for
how long it takes the *SP* to perform a certain operation, which could
be a valuable signal.

This commit adds an `oximeter_instruments::http::LatencyTracker` to the
MGS HTTP servers.

To test that it works, I started a local Clickhouse and a standalone
Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev
run`. Then, I made a few HTTP requests to various MGS APIs using `curl`;
most of which were expected to succeed, and a few for SP slots that the
simulator wasn't configured to simulate a SP in (to ensure that the
request would fail). We can see the new metrics in OxQL:

```
0x〉\d
hardware_component:current
hardware_component:fan_speed
hardware_component:sensor_error_count
hardware_component:temperature
hardware_component:voltage
http_service:request_latency_histogram
oximeter_collector:collections
0x〉get http_service:request_latency_histogram | last 1

http_service:request_latency_histogram

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: ignition_list
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 400
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581]
0x〉exit
```
@hawkw hawkw requested review from bnaecker and jgallagher August 24, 2024 19:06
@bnaecker
Copy link
Collaborator

-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0

Oh, floating-point may you never change.

In all seriousness, we may want to consider changing this definition to be in nanos or micros or similar. The only real reason to keep it in float is to use the base SI unit of seconds. The values cannot be negative though, and it's pretty hard to read all this as-is. Thoughts?

Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to MGS LGTM. I'll defer to you and Ben for the Oximeter-specific details.

// TODO(eliza): I'm not sure whether there's a way to make
// `oximeter_instruments`'s HTTP latency tracker work with websockets
// requests? It would be nice to get the latency and any error returned
// prior to actually returning the websocket stream...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also possible we could drop this endpoint. Today if anyone attaches to the serial console it happens via faux-mgs, not MGS proper, and it doesn't seem like something we need to expose to the control plane in general.

(I think this method's existence long predates any discussions of the tech port, support keys, etc. It's possible to use it via gateway-cli usart-attach ..., but I strongly suspect that's never been used outside of its initial development.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to whether we want to keep this particular endpoint. I do think it's possible that we want to either implement dropshot::HttpResponse for WebsocketEndpointResult; or that we can loosen the requirements on instrument_dropshot_handler(). We previously needed the API metadata from the request, but that's no longer true. We only need the status code now. I'm not sure if either of those is a good idea, or which one is better.

If you do want to pursue either one, I think a follow-up PR is in order, since that will require changes outside this particular package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, that makes sense --- perhaps we should open a new issue to discuss deprecating/removing it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure - filed #6437

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty straightforward to me, nice work! I have a few small points and questions, but I'm on board overall and happy to merge at your discretion!

@@ -42,6 +42,7 @@ uuid.workspace = true
omicron-workspace-hack.workspace = true
oximeter.workspace = true
oximeter-producer.workspace = true
oximeter-instruments.workspace = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be a stickler about features, there's an http-instruments feature specifically for getting the bit you need for this PR.

Comment on lines +42 to +43
const START_LATENCY_DECADE: i16 = -6;
const END_LATENCY_DECADE: i16 = 3;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these appropriate here? I think they're definitely fine to use, since the cost is pretty low, but the upper bin especially seems a bit high. Do you expect MGS to service requests that take tens to hundreds of seconds?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe uploading an update image does take a long time! This might actually be low then! In general, I'd probably pick about an order of magnitude higher than you could reasonably imagine a request taking, or high enough that lumping everything together is "fine" and not something you care to distinguish.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be entirely honest, I just snagged them from Nexus because I don't have a super clear sense for what kinds of latency to expect in MGS. Perhaps the API endpoints for updating SP firmware can take multiple seconds? @jgallagher what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates are broken up into "start" / "status" / "abort", each of which should be relatively quick. (Callers are expected to poll "status" until the update is complete, in general.)

However, resetting an SP can take a while, especially if it's a sidecar, because we wait for it to come back online before returning. The timeout for this is currently 30 seconds per attempt, with some small number of retries possible. In practice I think a sidecar reset is currently around 15 seconds, assuming success.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, so I suppose that if we retry a reset three times before deciding it's failed, that's about 90 seconds...so I think let's leave this the way it is!

// Also, register the producer for the HTTP API metrics.
registry
.register_producer(apictx.latencies.clone())
// TODO(ben): do this one too pls
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wontfix :)

// TODO(eliza): I'm not sure whether there's a way to make
// `oximeter_instruments`'s HTTP latency tracker work with websockets
// requests? It would be nice to get the latency and any error returned
// prior to actually returning the websocket stream...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't speak to whether we want to keep this particular endpoint. I do think it's possible that we want to either implement dropshot::HttpResponse for WebsocketEndpointResult; or that we can loosen the requirements on instrument_dropshot_handler(). We previously needed the API metadata from the request, but that's no longer true. We only need the status code now. I'm not sure if either of those is a good idea, or which one is better.

If you do want to pursue either one, I think a follow-up PR is in order, since that will require changes outside this particular package.

@hawkw
Copy link
Member Author

hawkw commented Aug 26, 2024

In all seriousness, we may want to consider changing this definition to be in nanos or micros or similar. The only real reason to keep it in float is to use the base SI unit of seconds. The values cannot be negative though, and it's pretty hard to read all this as-is. Thoughts?

If we switched to nanoseconds, I imagine we could just use an integer number of nanoseconds, too --- a u64 of nanos would let us represent a maximum duration of 18446744073.709551615 seconds, or 584 years. So, if an HTTP request takes longer than a 584-year bucket, I think we have worse problems than maybe recording a wrong latency metric! :)

@bnaecker
Copy link
Collaborator

I imagine we could just use an integer number of nanoseconds

Yeah, that's the main benefit. It's unsigned, which is "more correct" and one could argue better uses the 64 bits available to us. The fixed width makes these goofy lines we show in oxql easier to read too. If you concur, I can file an issue to update the LatencyTracker?

@hawkw
Copy link
Member Author

hawkw commented Aug 26, 2024

I imagine we could just use an integer number of nanoseconds

Yeah, that's the main benefit. It's unsigned, which is "more correct" and one could argue better uses the 64 bits available to us. The fixed width makes these goofy lines we show in oxql easier to read too. If you concur, I can file an issue to update the LatencyTracker?

Ah, I didn't see this earlier! Yeah, let's open an issue about that!

@bnaecker
Copy link
Collaborator

@hawkw #6443

@hawkw hawkw merged commit 6e0bf12 into main Aug 26, 2024
23 checks passed
@hawkw hawkw deleted the eliza/mgs-dropshot-metrics branch August 26, 2024 23:19
nickryand pushed a commit to nickryand/omicron that referenced this pull request Aug 27, 2024
Now that oxidecomputer#6354 has added an Oximeter producer endpoint to MGS for
publishing SP sensor metrics, it seemed like a nice idea to also
instrument the MGS HTTP server, similar to the existing instrumentation
for other control plane services. I don't think we'll be doing a lot of
tuning of MGS performance, but the metrics seem like they could still be
useful because they also include the distribution of HTTP status codes,
and in many cases, the latency measurements also serve as a proxy for
how long it takes the *SP* to perform a certain operation, which could
be a valuable signal.

This commit adds an `oximeter_instruments::http::LatencyTracker` to the
MGS HTTP servers.

To test that it works, I started a local Clickhouse and a standalone
Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev
run`.
Then, I made a few HTTP requests to various MGS APIs using `curl`;
most of which were expected to succeed, and a few for SP slots that the
simulator wasn't configured to simulate a SP in (to ensure that the
request would fail). We can see the new metrics in OxQL:

```
0x〉\d
hardware_component:current
hardware_component:fan_speed
hardware_component:sensor_error_count
hardware_component:temperature
hardware_component:voltage
http_service:request_latency_histogram
oximeter_collector:collections
0x〉get http_service:request_latency_histogram | last 1

http_service:request_latency_histogram

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: ignition_list
 status_code: 200
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249]

 id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263
 name: management-gateway-service
 operation_id: sp_get
 status_code: 400
   [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581]
0x〉exit
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants