Nexus request latency histograms are too fine-grained #6331
Comments
This is great, @bnaecker. The only thing I'd add is that we can imagine interactive and/or dynamic analytics in the future to home in on specific problems. For example, if we see that there are latency outliers for a particular operation, we might want more info on what distinguishes those. It's impossible to imagine every type of question users might ask of the system, and I think it makes more sense to aggregate them rather than keeping every datum and incurring the cost of aggregation on every question. ... especially when many of these flow charts will inevitably end up in a call to Oxide support for the moment.
- Add TTLs to all field tables, by using a materialized column with the time each record is inserted. ClickHouse will retain the latest timestamp, so when we stop inserting, the TTL clock will start counting down on those timeseries records. (A rough sketch of this appears below.)
- Update Dropshot dependency.
- Add operation ID to HTTP service timeseries, remove other fields. Expunge the old timeseries too.
- Remove unnecessary stringifying of URIs in latency tracking.
- Fixes #6328 and #6331
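For the TTL piece, here is a minimal sketch of what that could look like in ClickHouse DDL, assuming one of oximeter's field tables; the column name and retention interval are placeholders, not necessarily what the PR actually does:

```sql
-- Illustrative only: add an insert-time materialized column to a field table,
-- then attach a TTL so rows expire once a timeseries stops being updated.
ALTER TABLE oximeter.fields_string
    ADD COLUMN IF NOT EXISTS last_updated_at DateTime MATERIALIZED now();

ALTER TABLE oximeter.fields_string
    MODIFY TTL last_updated_at + INTERVAL 30 DAY;
```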
Closed by #6352
We're currently tracking the latency to handle every request in Nexus. That uses a histogram with buckets defined here:
omicron/nexus/src/context.rs
Lines 147 to 162 in 5ccb386
The latency buckets there span from 1us to 1000s, with 10 buckets per decade (that's a power of 10, Adam). That equates to 83 buckets in the database. We're using `f64`s as the support, and `u64`s as the count, so that's 16 bytes per bucket in total. That works out to 1328 bytes per sample, which we collect every 10s from Nexus.

Given the definition of the timeseries itself, we keep track of different histograms for every concrete API route Nexus receives. Since we often use things like UUIDs in API routes, this means we are carrying around a histogram and generating a sample for endpoints that could only ever be called once! (Consider deleting an instance by ID.) This is all extremely inefficient.
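As a quick sanity check on that arithmetic, ClickHouse can report the bucket count and implied per-sample size directly. The table and timeseries names below are what I'd expect from oximeter's schema, so treat them as assumptions:

```sql
-- Bucket count and rough per-sample size: 16 bytes per bucket comes from an
-- 8-byte f64 bin edge plus an 8-byte u64 count.
SELECT
    any(length(bins)) AS n_buckets,
    any(length(bins)) * 16 AS approx_bytes_per_sample
FROM oximeter.measurements_histogramf64
WHERE timeseries_name = 'http_service:request_latency_histogram';
```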
First of all, we just have a ridiculous number of rows in the histogram table:
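A count along these lines shows the scale:

```sql
-- Total rows and distinct timeseries in the f64 histogram measurements table.
SELECT
    count() AS n_rows,
    uniqExact(timeseries_key) AS n_timeseries
FROM oximeter.measurements_histogramf64;
```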
All of these are the request latency histogram, to be clear, since we don't track any other histograms today.
What's the distribution of total counts in those histograms? This is a bit tricky to get at, since the histograms are cumulative. I first made a temporary table like so:
That generates a table with the timeseries key, timestamp, and total request count from the last sample of each timeseries when ordered by time. This table takes a while to build, so I made it a temporary table and kept it around while doing the rest of the analysis. Then we can look at the histogram of total counts with this:
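For example, with ClickHouse's `histogram` aggregate (the bin count of 20 is arbitrary):

```sql
-- Approximate histogram of the per-timeseries total request counts.
SELECT
    arrayJoin(histogram(20)(toFloat64(total_count))) AS bucket,
    round(bucket.1) AS lower,
    round(bucket.2) AS upper,
    bucket.3 AS n_timeseries
FROM total_counts;
```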
We can summarize this a little bit by just looking at the fraction of timeseries where the histogram actually compresses the data, i.e., where the total count in the buckets is higher than the number of buckets:
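Something like this, with 83 being the bucket count from earlier:

```sql
-- Fraction of timeseries whose total request count exceeds the number of
-- buckets, i.e. where the histogram actually compresses anything.
SELECT countIf(total_count > 83) / count() AS fraction_compressing
FROM total_counts;
```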
So roughly 1.6% of the histograms actually do the job of compressing the data. That's not great!
Why is it so bad? The real reason is the schema for this timeseries, and in particular, the fields:
omicron/oximeter/oximeter/schema/http-service.toml
Lines 20 to 38 in eeb723c
We're distinguishing the histograms on the basis of a few things, such as status code (good) and the actual API route (not so good). As mentioned above, that means we create a whole histogram for endpoints that can only be hit once, such as when deleting an instance. We also report that histogram in perpetuity, once it starts recording data, which increases this inefficiency dramatically. For example, let's look at one of those timeseries where the total request count is 1 (there was only 1 request ever):
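Reusing the `total_counts` table from above, a query along these lines picks out one such timeseries and counts the samples we stored for it anyway:

```sql
-- Find one timeseries whose final total count is 1, then count how many
-- histogram samples we recorded for it.
SELECT timeseries_key, count() AS n_samples
FROM oximeter.measurements_histogramf64
WHERE timeseries_name = 'http_service:request_latency_histogram'
  AND timeseries_key IN
    (SELECT timeseries_key FROM total_counts
     WHERE total_count = 1 ORDER BY timeseries_key LIMIT 1)
GROUP BY timeseries_key;
```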
So we recorded 5K samples for this one timeseries, only the very first one of which contained any actually useful data. And sure enough, this was deleting an instance:
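Looking up the string fields for that timeseries key shows the concrete route; the field table layout below follows oximeter's schema, with field names as I'd expect them from http-service.toml:

```sql
-- Show the field values (method, route, and so on) for that one-shot timeseries.
SELECT field_name, field_value
FROM oximeter.fields_string
WHERE timeseries_name = 'http_service:request_latency_histogram'
  AND timeseries_key IN
    (SELECT timeseries_key FROM total_counts
     WHERE total_count = 1 ORDER BY timeseries_key LIMIT 1);
```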
What do we do about this? After talking with @ahl at length, one of the better ideas is to change the fields we're reporting. Instead of the concrete API route in the HTTP request, what we really care about is the operation ID. I don't particularly care in this case which instance was being deleted -- I want the distribution of latencies for deleting any instance. That provides the overall system health information we want from this timeseries, and still gives us detailed enough information about particular API operations that may be misbehaving to know how to investigate further.