Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[gateway] Add Oximeter HTTP service metrics #6432

Merged
merged 3 commits into from
Aug 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions gateway/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ uuid.workspace = true
omicron-workspace-hack.workspace = true
oximeter.workspace = true
oximeter-producer.workspace = true
oximeter-instruments = { workspace = true, features = ["http-instruments"] }

[dev-dependencies]
expectorate.workspace = true
Expand Down
16 changes: 16 additions & 0 deletions gateway/src/context.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,13 @@ pub struct ServerContext {
pub mgmt_switch: ManagementSwitch,
pub host_phase2_provider: Arc<InMemoryHostPhase2Provider>,
pub rack_id: OnceLock<Uuid>,
pub latencies: oximeter_instruments::http::LatencyTracker,
pub log: Logger,
}

impl ServerContext {
pub async fn new(
id: Uuid,
host_phase2_provider: Arc<InMemoryHostPhase2Provider>,
switch_config: SwitchConfig,
rack_id_config: Option<Uuid>,
Expand All @@ -37,7 +39,21 @@ impl ServerContext {
OnceLock::new()
};

const START_LATENCY_DECADE: i16 = -6;
const END_LATENCY_DECADE: i16 = 3;
Comment on lines +42 to +43
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these appropriate here? I think they're definitely fine to use, since the cost is pretty low, but the upper bin especially seems a bit high. Do you expect MGS to service requests that take tens to hundreds of seconds?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe uploading an update image does take a long time! This might actually be low then! In general, I'd probably pick about an order of magnitude higher than you could reasonably imagine a request taking, or high enough that lumping everything together is "fine" and not something you care to distinguish.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be entirely honest, I just snagged them from Nexus because I don't have a super clear sense for what kinds of latency to expect in MGS. Perhaps the API endpoints for updating SP firmware can take multiple seconds? @jgallagher what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates are broken up into "start" / "status" / "abort", each of which should be relatively quick. (Callers are expected to poll "status" until the update is complete, in general.)

However, resetting an SP can take a while, especially if it's a sidecar, because we wait for it to come back online before returning. The timeout for this is currently 30 seconds per attempt, with some small number of retries possible. In practice I think a sidecar reset is currently around 15 seconds, assuming success.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, so I suppose that if we retry a reset three times before deciding it's failed, that's about 90 seconds...so I think let's leave this the way it is!

let latencies =
oximeter_instruments::http::LatencyTracker::with_latency_decades(
oximeter_instruments::http::HttpService {
name: "management-gateway-service".into(),
id,
},
START_LATENCY_DECADE,
END_LATENCY_DECADE,
)
.expect("start and end decades are hardcoded and should be valid");

Ok(Arc::new(ServerContext {
latencies,
mgmt_switch,
host_phase2_provider,
rack_id,
Expand Down
Loading
Loading