
oximeter collector binary has surprisingly large heap usage after overnight instance creation loop #3808

Closed
gjcolombo opened this issue Aug 1, 2023 · 4 comments
Labels: bug, Metrics

Comments

gjcolombo commented Aug 1, 2023

Repro steps:

  • Set up a dev cluster on a machine with 64 GiB RAM, 16 GiB swapfile, 50% VMM reservoir
  • Run a loop that creates, starts, stops, and destroys the same instance; the instance has 2 vCPUs, 1 GiB RAM, and no disks or NICs
  • Let simmer overnight

Observed: A brand-new oximeter process's heap usage is about 8 MiB, per pmap(1). After the overnight run it has ballooned to ~2.1 GiB.

Expected: Oximeter's heap usage remains relatively modest.

I don't have a good theory on this one. The next step is likely to find or cook up a DTrace script that'll let us find the culprit stacks.
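
For reference, one possible starting point, assuming the process allocates through libc malloc, is a pid-provider aggregation over allocating stacks (the probe and process name here are illustrative, and the exact probe depends on the allocator actually in use):

```sh
# Sum bytes requested per allocating user stack; assumes libc malloc.
dtrace -n 'pid$target::malloc:entry { @[ustack()] = sum(arg0); }' \
    -p "$(pgrep oximeter)"
```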

gjcolombo added the bug label Aug 1, 2023
bnaecker commented Aug 1, 2023

I'm not actually that surprised by this one, though it's unfortunate. oximeter is designed to collect from each producer it's told about, forever: if it can't reach a producer, it emits a warning but keeps trying, the theory being that oximeter itself can't distinguish an instance that has been destroyed from one whose propolis server (which produces the metrics) is merely unreachable. This may be exacerbated by a choice I made early on to spawn a new tokio task for each producer, though that could be addressed relatively easily (I think, anyway).
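
As a rough illustration of that pattern (a minimal sketch with invented names, not oximeter's actual code), each registration spawns a task that retries on an interval forever:

```rust
use std::time::Duration;

// Hypothetical stand-in for a registered producer; not oximeter's real type.
struct ProducerEndpoint {
    address: String,
}

// Stand-in for the actual collection request made to the producer.
async fn collect_from(producer: &ProducerEndpoint) -> Result<(), String> {
    Err(format!("producer at {} is unreachable", producer.address))
}

// One task per registered producer, retrying forever. If the producer has
// been destroyed, the task never learns that, so tasks (and whatever they
// hold onto) accumulate as instances are created and destroyed.
fn spawn_collection_task(producer: ProducerEndpoint) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(10));
        loop {
            ticker.tick().await;
            if let Err(e) = collect_from(&producer).await {
                // Warn and keep trying: a destroyed instance looks the same
                // as one that's only transiently unreachable.
                eprintln!("warning: collection failed: {e}");
            }
        }
    });
}
```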

The proper solution here is to finally implement de-registration of metric producers in Nexus: when an instance is destroyed, Nexus should remove its metric producer from the right CRDB table and tell the producer's assigned oximeter to stop attempting to collect from it.
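
Sketched out (all names here are hypothetical stubs, not Nexus's real datastore or client API), that flow would look something like:

```rust
use uuid::Uuid;

// Hypothetical stubs for illustration; Nexus's real types differ.
struct DataStore;
struct OximeterClient;

impl DataStore {
    // Would delete the producer's row from the assignment table in CRDB.
    async fn delete_metric_producer(&self, _id: Uuid) -> Result<(), String> {
        Ok(())
    }
}

impl OximeterClient {
    // Would call a deletion endpoint on the assigned collector so it can
    // drop the producer's collection task.
    async fn producer_delete(&self, _id: Uuid) -> Result<(), String> {
        Ok(())
    }
}

// On instance destruction: remove the record, then notify the collector.
async fn unassign_producer(
    datastore: &DataStore,
    oximeter: &OximeterClient,
    producer_id: Uuid,
) -> Result<(), String> {
    datastore.delete_metric_producer(producer_id).await?;
    oximeter.producer_delete(producer_id).await?;
    Ok(())
}
```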

smklein commented Aug 1, 2023

@bnaecker WDYT about having something defensive on the oximeter side too -- namely, if a producer can't be queried for a certain amount of time, we stop trying to collect from it?
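
Hypothetically (the names and cutoff below are invented), the collector could track each producer's last successful collection and let the task expire:

```rust
use std::time::{Duration, Instant};

// Invented cutoff; picking a good value is part of the open question here.
const EXPIRATION: Duration = Duration::from_secs(24 * 60 * 60);

// Per-producer bookkeeping the collection task could keep.
struct ProducerState {
    last_success: Instant,
}

impl ProducerState {
    // True once the producer has been unreachable for longer than the
    // cutoff, at which point the collection task could exit.
    fn expired(&self) -> bool {
        self.last_success.elapsed() > EXPIRATION
    }
}
```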

bnaecker commented Aug 1, 2023

That might be OK, though I'm a bit nervous about (1) picking a duration and (2) getting oximeter to restart collection at some point. And as you're alluding to, that's really solving a different problem than this issue. Here, the instances are not transiently unreachable: they were intentionally and correctly destroyed, which is information Nexus already has and could make available to oximeter.

bnaecker commented

I'm opting to close this since (1) we understand the cause (producer-collector assignments are never removed) and (2) it will not get worse once #4495 is merged, barring the edge-case race noted in that PR thread. A coming PR will include schema updates that should mitigate the issue on existing deployments in the short term.
