Oximeter producer deletion resulted in many 404 errors with request latencies of 10+ seconds #5364
There are more examples of this same error in catacomb:/staff/core/omicron-5363/oxide-nexus_default.log.1711843200. There were 54 occurrences while I was deleting 75 instances with Terraform (which used 10 concurrent threads by default).
This might be caused by #5326: Oximeter now periodically refreshes its list of producers and deletes those that appear to be gone, which means we could get this sequence from
I think we'd expect this to be a relatively rare race, since Oximeter would have to refresh its producers in the window between 1 and 3, but it's very possible. I think we should do one of two things:
I have a moderate preference for option 1, because otherwise all callers of
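If option 1 amounts to treating an already-deleted producer as success, a minimal sketch of the caller side might look like the following. This is only an illustration: the helper name, URL shape, and use of `reqwest` are assumptions, not the actual Nexus/Oximeter client code.

```rust
use reqwest::{Client, StatusCode};

/// Hypothetical helper: delete a producer registration, treating 404 as
/// success so that a concurrent prune by Oximeter doesn't surface as an
/// error to the caller.
async fn delete_producer_idempotent(
    client: &Client,
    base_url: &str,
    producer_id: &str,
) -> Result<(), reqwest::Error> {
    let url = format!("{base_url}/producers/{producer_id}");
    let response = client.delete(&url).send().await?;
    match response.status() {
        // Already gone (e.g. Oximeter pruned it first): treat as success.
        StatusCode::NOT_FOUND => Ok(()),
        // Propagate any other error status; 2xx becomes Ok(()).
        _ => response.error_for_status().map(|_| ()),
    }
}
```

Whether this tolerance belongs in the caller or in the DELETE endpoint itself is the HTTP-semantics question debated later in the thread.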
I've uploaded the oximeter log file that corresponds to the Mar 30 22:30 to 23:30 time window to catacomb:/staff/core/omicron-5363.
It looks like this is what happened, although maybe my analysis of this having to happen inside a window of
I think this means my proposed fixes above would be correct, but this does make me ask two more questions:
The oximeter communication errors (the ones encountered during customer-support#116) did not show up during mass instance deletion. I'll file a separate issue so they aren't mixed up with @jgallagher's analysis above.
I agree we should do this. I thought it was already doing this, honestly!
Fair, and same, until I ran into this on the GC background task recently. I think what it's doing now is more correct from an HTTP point of view; trying to
I agree this is confusing. It does look like Nexus stalled out for a lot of them, and they were all picked up during the periodic
Ben and I looked at this more, and we believe both are explained by oxidecomputer/progenitor#755. When Oximeter refreshes its list of producers from Nexus, that bug causes it to see only 100 producers, so it prunes all the others (erroneously!). The 6.5-minute gap exists because the producer should not have been deleted by Oximeter in the first place.
It looks like at least part of the issue here is oxidecomputer/progenitor#755. As of #5326,
First, it's fetching only 100 tasks, which is suspiciously the same as the page limit. Second, it's removing 82 tasks! Those should only be removed as the corresponding producer goes away, such as when we destroy a VM instance. Removing 82 at once is possible, but definitely surprising. This leads to a bunch of thrashing, where
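For illustration, here is a minimal, self-contained sketch of the pagination pitfall behind the "only 100" symptom. The types, page size, and data are made up; this is not the progenitor-generated client, and the real fix is tracked in oxidecomputer/progenitor#755.

```rust
/// A made-up paginated response, standing in for a real list API.
struct Page {
    items: Vec<u32>,
    next_page: Option<usize>, // index of the next page, None when exhausted
}

const PAGE_LIMIT: usize = 100;

/// Stand-in for one paginated API call: returns at most PAGE_LIMIT items.
fn list_producers_page(all: &[u32], page: usize) -> Page {
    let start = page * PAGE_LIMIT;
    let end = (start + PAGE_LIMIT).min(all.len());
    Page {
        items: all[start..end].to_vec(),
        next_page: if end < all.len() { Some(page + 1) } else { None },
    }
}

fn main() {
    let all: Vec<u32> = (0..182).collect(); // e.g. 182 registered producers

    // Buggy pattern: take only the first page and treat it as the full set.
    let first_page_only = list_producers_page(&all, 0).items;
    assert_eq!(first_page_only.len(), 100); // everything else looks "gone"

    // Correct pattern: follow next_page until the collection is exhausted.
    let mut everything = Vec::new();
    let mut page = Some(0);
    while let Some(p) = page {
        let resp = list_producers_page(&all, p);
        everything.extend(resp.items);
        page = resp.next_page;
    }
    assert_eq!(everything.len(), all.len());
}
```

The point is only that stopping after the first page makes every producer beyond the page limit look like it has vanished, which is what triggers the erroneous pruning and the subsequent thrashing.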
The 404s here have been explained; they are harmless and not a release 7 blocker. In the attached
That is pretty surprising, since it means there is a Dropshot server listening at that address, but we're not hitting the right endpoint to fetch metric data. So what producer is that? Looking from the CRDB zone on the Dogfood rack, we see:
There isn't a service with that ID in the
Looking for files that
And if we look at the log files, we see this:
That shows that (1) there are some requests for metrics on
And that's why we have multiple producer records here, returning a 404:
This is exactly the kind of situation the forthcoming producer garbage-collection RPW is designed to handle. The
Until then, the 404s here will consume an
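As a rough illustration of the GC idea: drop producer records whose registration hasn't been renewed recently. The record shape, renewal model, and expiration window below are assumptions for illustration, not the actual RPW design.

```rust
use std::time::{Duration, Instant};

/// Assumed record shape: a producer registration plus the last time it was
/// renewed. The real records live in CRDB, not in memory.
#[allow(dead_code)]
struct ProducerRecord {
    id: u32,
    last_renewed: Instant,
}

/// Assumed expiration window; the real value would be a tunable.
const EXPIRATION: Duration = Duration::from_secs(10 * 60);

/// One GC pass: retain only records renewed recently enough.
fn gc_pass(records: &mut Vec<ProducerRecord>, now: Instant) {
    records.retain(|r| now.duration_since(r.last_renewed) < EXPIRATION);
}

fn main() {
    let registered = Instant::now();
    let mut records = vec![ProducerRecord { id: 1, last_renewed: registered }];
    // Pretend 20 minutes pass without a renewal.
    let later = registered + Duration::from_secs(20 * 60);
    gc_pass(&mut records, later);
    assert!(records.is_empty());
}
```

A background task running such a pass periodically would eventually remove stale producer records even when the explicit delete was missed, which is what would make the lingering 404-producing collection attempts go away.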
The request latencies here are likely explained by the bad interactions between Dogfood and racktest. We understand the 404s, which have been fixed by #5366, so I'm marking this closed.
While investigating #5363 and customer-support#116, I saw a number of 500 errors that resulted from oximeter producer removal errors, e.g.
They may have affected the overall responsiveness of stop-instance requests. It's unclear how they might cause other request timeouts.
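As a hedged sketch only (not how Nexus actually handles this), one way to keep a slow or failing producer-removal step from dominating an instance-stop request is to bound it with a timeout and treat it as best-effort cleanup:

```rust
use std::time::Duration;

/// Hypothetical wrapper around the producer-removal step of instance stop:
/// bound it with a timeout and log failures instead of letting them turn
/// the whole stop request into a 500. Best-effort semantics are an
/// assumption here, not established policy.
async fn remove_producer_best_effort<F>(cleanup: F)
where
    F: std::future::Future<Output = Result<(), String>>,
{
    match tokio::time::timeout(Duration::from_secs(5), cleanup).await {
        Ok(Ok(())) => {}
        Ok(Err(err)) => eprintln!("producer removal failed (continuing): {err}"),
        Err(_) => eprintln!("producer removal timed out (continuing)"),
    }
}
```

Whether that trade-off is acceptable depends on how much we care about the 500s versus leaving the producer record for a later cleanup pass.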