-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[gateway] Add Oximeter HTTP service metrics #6432
Conversation
Now that #6354 has added an Oximeter producer endpoint to MGS for publishing SP sensor metrics, it seemed like a nice idea to also instrument the MGS HTTP server, similar to the existing instrumentation for other control plane services. I don't think we'll be doing a lot of tuning of MGS performance, but the metrics seem like they could still be useful because they also include the distribution of HTTP status codes, and in many cases, the latency measurements also serve as a proxy for how long it takes the *SP* to perform a certain operation, which could be a valuable signal. This commit adds an `oximeter_instruments::http::LatencyTracker` to the MGS HTTP servers. To test that it works, I started a local Clickhouse and a standalone Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev run`. Then, I made a few HTTP requests to various MGS APIs using `curl`; most of which were expected to succeed, and a few for SP slots that the simulator wasn't configured to simulate a SP in (to ensure that the request would fail). We can see the new metrics in OxQL: ``` 0x〉\d hardware_component:current hardware_component:fan_speed hardware_component:sensor_error_count hardware_component:temperature hardware_component:voltage http_service:request_latency_histogram oximeter_collector:collections 0x〉get http_service:request_latency_histogram | last 1 http_service:request_latency_histogram id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: ignition_list status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 400 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581] 0x〉exit ```
Oh, floating-point may you never change. In all seriousness, we may want to consider changing this definition to be in nanos or micros or similar. The only real reason to keep it in float is to use the base SI unit of seconds. The values cannot be negative though, and it's pretty hard to read all this as-is. Thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to MGS LGTM. I'll defer to you and Ben for the Oximeter-specific details.
// TODO(eliza): I'm not sure whether there's a way to make | ||
// `oximeter_instruments`'s HTTP latency tracker work with websockets | ||
// requests? It would be nice to get the latency and any error returned | ||
// prior to actually returning the websocket stream... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also possible we could drop this endpoint. Today if anyone attaches to the serial console it happens via faux-mgs
, not MGS proper, and it doesn't seem like something we need to expose to the control plane in general.
(I think this method's existence long predates any discussions of the tech port, support keys, etc. It's possible to use it via gateway-cli usart-attach ...
, but I strongly suspect that's never been used outside of its initial development.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't speak to whether we want to keep this particular endpoint. I do think it's possible that we want to either implement dropshot::HttpResponse
for WebsocketEndpointResult
; or that we can loosen the requirements on instrument_dropshot_handler()
. We previously needed the API metadata from the request, but that's no longer true. We only need the status code now. I'm not sure if either of those is a good idea, or which one is better.
If you do want to pursue either one, I think a follow-up PR is in order, since that will require changes outside this particular package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, that makes sense --- perhaps we should open a new issue to discuss deprecating/removing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - filed #6437
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks pretty straightforward to me, nice work! I have a few small points and questions, but I'm on board overall and happy to merge at your discretion!
gateway/Cargo.toml
Outdated
@@ -42,6 +42,7 @@ uuid.workspace = true | |||
omicron-workspace-hack.workspace = true | |||
oximeter.workspace = true | |||
oximeter-producer.workspace = true | |||
oximeter-instruments.workspace = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be a stickler about features, there's an http-instruments
feature specifically for getting the bit you need for this PR.
const START_LATENCY_DECADE: i16 = -6; | ||
const END_LATENCY_DECADE: i16 = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these appropriate here? I think they're definitely fine to use, since the cost is pretty low, but the upper bin especially seems a bit high. Do you expect MGS to service requests that take tens to hundreds of seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, maybe uploading an update image does take a long time! This might actually be low then! In general, I'd probably pick about an order of magnitude higher than you could reasonably imagine a request taking, or high enough that lumping everything together is "fine" and not something you care to distinguish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be entirely honest, I just snagged them from Nexus because I don't have a super clear sense for what kinds of latency to expect in MGS. Perhaps the API endpoints for updating SP firmware can take multiple seconds? @jgallagher what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updates are broken up into "start" / "status" / "abort", each of which should be relatively quick. (Callers are expected to poll "status" until the update is complete, in general.)
However, resetting an SP can take a while, especially if it's a sidecar, because we wait for it to come back online before returning. The timeout for this is currently 30 seconds per attempt, with some small number of retries possible. In practice I think a sidecar reset is currently around 15 seconds, assuming success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, so I suppose that if we retry a reset three times before deciding it's failed, that's about 90 seconds...so I think let's leave this the way it is!
// Also, register the producer for the HTTP API metrics. | ||
registry | ||
.register_producer(apictx.latencies.clone()) | ||
// TODO(ben): do this one too pls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wontfix :)
// TODO(eliza): I'm not sure whether there's a way to make | ||
// `oximeter_instruments`'s HTTP latency tracker work with websockets | ||
// requests? It would be nice to get the latency and any error returned | ||
// prior to actually returning the websocket stream... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't speak to whether we want to keep this particular endpoint. I do think it's possible that we want to either implement dropshot::HttpResponse
for WebsocketEndpointResult
; or that we can loosen the requirements on instrument_dropshot_handler()
. We previously needed the API metadata from the request, but that's no longer true. We only need the status code now. I'm not sure if either of those is a good idea, or which one is better.
If you do want to pursue either one, I think a follow-up PR is in order, since that will require changes outside this particular package.
If we switched to nanoseconds, I imagine we could just use an integer number of nanoseconds, too --- a |
Yeah, that's the main benefit. It's unsigned, which is "more correct" and one could argue better uses the 64 bits available to us. The fixed width makes these goofy lines we show in |
Ah, I didn't see this earlier! Yeah, let's open an issue about that! |
Now that oxidecomputer#6354 has added an Oximeter producer endpoint to MGS for publishing SP sensor metrics, it seemed like a nice idea to also instrument the MGS HTTP server, similar to the existing instrumentation for other control plane services. I don't think we'll be doing a lot of tuning of MGS performance, but the metrics seem like they could still be useful because they also include the distribution of HTTP status codes, and in many cases, the latency measurements also serve as a proxy for how long it takes the *SP* to perform a certain operation, which could be a valuable signal. This commit adds an `oximeter_instruments::http::LatencyTracker` to the MGS HTTP servers. To test that it works, I started a local Clickhouse and a standalone Oximeter, and ran MGS and the SP simulator using `cargo xtask mgs-dev run`. Then, I made a few HTTP requests to various MGS APIs using `curl`; most of which were expected to succeed, and a few for SP slots that the simulator wasn't configured to simulate a SP in (to ensure that the request would fail). We can see the new metrics in OxQL: ``` 0x〉\d hardware_component:current hardware_component:fan_speed hardware_component:sensor_error_count hardware_component:temperature hardware_component:voltage http_service:request_latency_histogram oximeter_collector:collections 0x〉get http_service:request_latency_histogram | last 1 http_service:request_latency_histogram id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125731231]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 1, 0.0006000000000000001: 1, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000556233, max: 0.000603704, mean: 0.0005799685000000001, std_dev: 0.00002373549999999997, p50: 0, p90: 0.000603704, p99: 0.000603704] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: ignition_list status_code: 200 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.125290346]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 0, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 1, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000427249, max: 0.000427249, mean: 0.000427249, std_dev: 0, p50: 0, p90: 0.000427249, p99: 0.000427249] id: 1ac73746-2d3b-46d8-ac7c-44512c5f2263 name: management-gateway-service operation_id: sp_get status_code: 400 [2024-08-24 18:54:47.978590056, 2024-08-24 18:58:18.126114126]: [-179769313486231570000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000: 0, 0.000001: 0, 0.000002: 0, 0.000003: 0, 0.000004: 0, 0.000005: 0, 0.0000059999999999999985: 0, 0.000007: 0, 0.000008: 0, 0.000009: 0, 0.00001: 0, 0.00002: 2, 0.000030000000000000004: 0, 0.00004: 0, 0.00005: 0, 0.00006: 0, 0.00007000000000000001: 0, 0.00008: 0, 0.00009: 0, 0.0001: 0, 0.0002: 0, 0.0003: 0, 0.0004: 0, 0.0005: 0, 0.0006000000000000001: 0, 0.0007: 0, 0.0007999999999999999: 0, 0.0009: 0, 0.001: 0, 0.002: 0, 0.003: 0, 0.004: 0, 0.005: 0, 0.006: 0, 0.007: 0, 0.008: 0, 0.009000000000000001: 0, 0.01: 0, 0.020000000000000004: 0, 0.03000000000000001: 0, 0.04000000000000001: 0, 0.05000000000000001: 0, 0.06000000000000001: 0, 0.07: 0, 0.08: 0, 0.09000000000000001: 0, 0.1: 0, 0.2: 0, 0.30000000000000004: 0, 0.4: 0, 0.5: 0, 0.6: 0, 0.7000000000000001: 0, 0.8: 0, 0.9: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 20: 0, 30: 0, 40: 0, 50: 0, 60: 0, 70: 0, 80: 0, 90: 0, 100: 0, 200: 0, 300: 0, 400: 0, 500: 0, 600: 0, 700: 0, 800: 0, 900: 0, 1000: 0, min: 0.000020368, max: 0.000021581, mean: 0.0000209745, std_dev: 0.0000006064999999999992, p50: 0, p90: 0.000021581, p99: 0.000021581] 0x〉exit ```
Now that #6354 has added an Oximeter producer endpoint to MGS for
publishing SP sensor metrics, it seemed like a nice idea to also
instrument the MGS HTTP server, similar to the existing instrumentation
for other control plane services. I don't think we'll be doing a lot of
tuning of MGS performance, but the metrics seem like they could still be
useful because they also include the distribution of HTTP status codes,
and in many cases, the latency measurements also serve as a proxy for
how long it takes the SP to perform a certain operation, which could
be a valuable signal.
This commit adds an
oximeter_instruments::http::LatencyTracker
to theMGS HTTP servers.
To test that it works, I started a local Clickhouse and a standalone
Oximeter, and ran MGS and the SP simulator using
cargo xtask mgs-dev run
.Then, I made a few HTTP requests to various MGS APIs using
curl
;most of which were expected to succeed, and a few for SP slots that the
simulator wasn't configured to simulate a SP in (to ensure that the
request would fail). We can see the new metrics in OxQL: