
oximeter collector binary has surprisingly large heap usage after overnight instance creation loop #3808

Closed
gjcolombo opened this issue Aug 1, 2023 · 4 comments
Labels: bug, Metrics

Comments

gjcolombo commented Aug 1, 2023

Repro steps:

  • Set up a dev cluster on a machine with 64 GiB RAM, 16 GiB swapfile, 50% VMM reservoir
  • Run a loop that creates, starts, stops, and destroys the same instance; the instance has 2 vCPUs, 1 GiB RAM, and no disks or NICs
  • Let simmer overnight

Observed: A brand-new oximeter process's heap usage is about 8 MiB, per pmap(1). After the overnight run it has ballooned to ~2.1 GiB.

Expected: Oximeter's heap usage remains relatively modest.

I don't have a good theory on this one. The next step is likely to find or cook up a DTrace script that'll let us find the culprit stacks.
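
For reference, one possible starting point, assuming the process allocates through libc malloc, is a pid-provider aggregation over allocating stacks (the probe and process name here are illustrative, and the exact probe depends on the allocator actually in use):

```sh
# Sum bytes requested per allocating user stack; assumes libc malloc.
dtrace -n 'pid$target::malloc:entry { @[ustack()] = sum(arg0); }' \
    -p "$(pgrep oximeter)"
```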

gjcolombo added the bug label Aug 1, 2023
bnaecker commented Aug 1, 2023

I'm not actually that surprised by this one, though it's unfortunate. oximeter is designed to collect from each producer it's told about, forever: if it can't reach a producer, it emits a warning but keeps trying, the theory being that oximeter itself can't distinguish an instance that has been destroyed from one whose propolis server (which produces the metrics) is merely unreachable. This may be exacerbated by a choice I made early on to spawn a new tokio task for each producer, though that could be addressed relatively easily (I think, anyway).
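
As a rough illustration of that pattern (a minimal sketch with invented names, not oximeter's actual code), each registration spawns a task that retries on an interval forever:

```rust
use std::time::Duration;

// Hypothetical stand-in for a registered producer; not oximeter's real type.
struct ProducerEndpoint {
    address: String,
}

// Stand-in for the actual collection request made to the producer.
async fn collect_from(producer: &ProducerEndpoint) -> Result<(), String> {
    Err(format!("producer at {} is unreachable", producer.address))
}

// One task per registered producer, retrying forever. If the producer has
// been destroyed, the task never learns that, so tasks (and whatever they
// hold onto) accumulate as instances are created and destroyed.
fn spawn_collection_task(producer: ProducerEndpoint) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(10));
        loop {
            ticker.tick().await;
            if let Err(e) = collect_from(&producer).await {
                // Warn and keep trying: a destroyed instance looks the same
                // as one that's only transiently unreachable.
                eprintln!("warning: collection failed: {e}");
            }
        }
    });
}
```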

The proper solution here is to finally implement de-registration of metric producers in Nexus: when an instance is destroyed, Nexus should remove its metric producer from the right CRDB table and tell the producer's assigned oximeter to stop attempting to collect from it.
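
Sketched out (all names here are hypothetical stubs, not Nexus's real datastore or client API), that flow would look something like:

```rust
use uuid::Uuid;

// Hypothetical stubs for illustration; Nexus's real types differ.
struct DataStore;
struct OximeterClient;

impl DataStore {
    // Would delete the producer's row from the assignment table in CRDB.
    async fn delete_metric_producer(&self, _id: Uuid) -> Result<(), String> {
        Ok(())
    }
}

impl OximeterClient {
    // Would call a deletion endpoint on the assigned collector so it can
    // drop the producer's collection task.
    async fn producer_delete(&self, _id: Uuid) -> Result<(), String> {
        Ok(())
    }
}

// On instance destruction: remove the record, then notify the collector.
async fn unassign_producer(
    datastore: &DataStore,
    oximeter: &OximeterClient,
    producer_id: Uuid,
) -> Result<(), String> {
    datastore.delete_metric_producer(producer_id).await?;
    oximeter.producer_delete(producer_id).await?;
    Ok(())
}
```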

smklein commented Aug 1, 2023

@bnaecker WDYT about having something defensive on the oximeter side too -- namely, if a producer can't be queried for a certain amount of time, we stop trying to collect from it?
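
Hypothetically (the names and cutoff below are invented), the collector could track each producer's last successful collection and let the task expire:

```rust
use std::time::{Duration, Instant};

// Invented cutoff; picking a good value is part of the open question here.
const EXPIRATION: Duration = Duration::from_secs(24 * 60 * 60);

// Per-producer bookkeeping the collection task could keep.
struct ProducerState {
    last_success: Instant,
}

impl ProducerState {
    // True once the producer has been unreachable for longer than the
    // cutoff, at which point the collection task could exit.
    fn expired(&self) -> bool {
        self.last_success.elapsed() > EXPIRATION
    }
}
```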

bnaecker commented Aug 1, 2023

That might be OK, though I'm a bit nervous about (1) picking a duration and (2) getting oximeter to restart collection at some point. And as you're alluding to, that's really solving a different problem than this issue. Here, the instances are not transiently unreachable: they were intentionally and correctly destroyed, which is information Nexus already has and could make available to oximeter.

bnaecker commented

I'm opting to close this since (1) we understand the cause (producer-collector assignments are never removed) and (2) it will not get worse once #4495 is merged, barring the edge-case race noted in that PR thread. A coming PR will include schema updates that should mitigate the issue on existing deployments in the short term.
