Nexus request latency histograms are too fine-grained #6331

Closed
bnaecker opened this issue Aug 14, 2024 · 2 comments
Labels: clickhouse (Related to the ClickHouse metrics DBMS), oximeter

Comments

bnaecker (Collaborator) commented Aug 14, 2024

We're currently tracking the latency to handle every request in Nexus. That uses a histogram with buckets defined here:

let create_tracker = |name: &str| {
    let target = HttpService {
        name: name.to_string().into(),
        id: config.deployment.id,
    };
    const START_LATENCY_DECADE: i16 = -6;
    const END_LATENCY_DECADE: i16 = 3;
    LatencyTracker::with_latency_decades(
        target,
        START_LATENCY_DECADE,
        END_LATENCY_DECADE,
    )
    .unwrap()
};
let internal_latencies = create_tracker("nexus-internal");
let external_latencies = create_tracker("nexus-external");

The latency buckets there span from 1us to 1000s, with 10 buckets per decade (that's a power of 10, Adam). That equates to 83 buckets in the database. We're using f64s as the support, and u64s as the count, so that's 16 bytes per bucket in total. That works out to 1328 bytes per sample, which we collect every 10s from Nexus.
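To put that size in perspective, here's a back-of-the-envelope check you can run directly in ClickHouse (an untested sketch; the 83-bucket count and the 10s collection interval are the figures above):

SELECT
    83 * (8 + 8) AS bytes_per_sample,
    bytes_per_sample * (86400 / 10) AS bytes_per_timeseries_per_day,
    formatReadableSize(bytes_per_timeseries_per_day) AS per_day_readable

That's roughly 11 MiB per day for every distinct timeseries, before any on-disk compression.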

Given the definition of the timeseries itself, we keep track of different histograms for every concrete API route Nexus receives. Since we often use things like UUIDs in API routes, this means we are carrying around a histogram and generating a sample for endpoints that could only ever be called once! (Consider deleting an instance by ID.) This is all extremely inefficient.

First of all, we just have a ridiculous number of rows in the histogram table:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select total_rows, formatReadableSize(total_rows) from system.tables where name = 'measurements_histogramf64';

SELECT
    total_rows,
    formatReadableSize(total_rows)
FROM system.tables
WHERE name = 'measurements_histogramf64'

Query id: 6e67a86d-69ac-4163-b057-14aaee9ffc80

┌─total_rows─┬─formatReadableSize(total_rows)─┐
│  302465504 │ 288.45 MiB                     │
└────────────┴────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

All of these are the request latency histogram, to be clear, since we don't track any other histograms today.
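(If you want to double-check that, an untested sketch like the following should show the row count broken down by timeseries name:)

SELECT
    timeseries_name,
    count() AS n_rows
FROM measurements_histogramf64
GROUP BY timeseries_name
ORDER BY n_rows DESC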

What's the distribution of total counts in those histograms? This is a bit tricky to get at, since the histograms are cumulative. I first made a temporary table like so:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) create temporary table total_request_count engine = Memory() as select timeseries_key, timestamp, total_count from (select timeseries_key, timestamp, arraySum(counts) as total_count from measurements_histogramf64 as meas join (select timeseries_key, max(timestamp) as last_timestamp from measurements_histogramf64 where timeseries_name = 'http_service:request_latency_histogram' group by timeseries_key) as t on timeseries_name = 'http_service:request_latency_histogram' and meas.timeseries_key = t.timeseries_key and meas.timestamp = t.last_timestamp);

CREATE TEMPORARY TABLE total_request_count
ENGINE = Memory AS
SELECT
    timeseries_key,
    timestamp,
    total_count
FROM
(
    SELECT
        timeseries_key,
        timestamp,
        arraySum(counts) AS total_count
    FROM measurements_histogramf64 AS meas
    INNER JOIN
    (
        SELECT
            timeseries_key,
            max(timestamp) AS last_timestamp
        FROM measurements_histogramf64
        WHERE timeseries_name = 'http_service:request_latency_histogram'
        GROUP BY timeseries_key
    ) AS t ON (timeseries_name = 'http_service:request_latency_histogram') AND (meas.timeseries_key = t.timeseries_key) AND (meas.timestamp = t.last_timestamp)
)

Query id: e5a6dc81-23c0-4368-9431-238ce90d8e8d

Ok.

0 rows in set. Elapsed: 33.326 sec. Processed 607.61 million rows, 242.44 GB (18.23 million rows/s., 7.27 GB/s.)
Peak memory usage: 2.78 GiB.

That generates a table with the timeseries key, timestamp, and total request count in the last bin of each timeseries when ordered by time. This table takes a while to make, so I made it temporarily while doing the rest of the analysis. Then we can look at the histogram of total counts with this:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select parts.1 as lower, parts.2 as upper, parts.3 as count from (select arrayJoin(histogram(16)(total_count)) as parts from total_request_count)

SELECT
    parts.1 AS lower,
    parts.2 AS upper,
    parts.3 AS count
FROM
(
    SELECT arrayJoin(histogram(16)(total_count)) AS parts
    FROM total_request_count
)

Query id: a33e1a88-5e93-47fe-805d-d87a7a3856ef

┌──────────────lower─┬──────────────upper─┬──count─┐
│                  1 │  610.4061919504644 │ 7349.5 │
│  610.4061919504644 │ 1647.8166666666666 │   1058 │
│ 1647.8166666666666 │  2735.583333333333 │    6.5 │
│  2735.583333333333 │ 3854.6666666666665 │  5.875 │
│ 3854.6666666666665 │  4689.388888888889 │  5.625 │
│  4689.388888888889 │  5633.388888888889 │  7.625 │
│  5633.388888888889 │             7326.5 │   2.75 │
│             7326.5 │               8810 │  1.125 │
│               8810 │              12462 │      1 │
│              12462 │            16667.5 │      1 │
│            16667.5 │            18305.5 │      1 │
│            18305.5 │              27451 │      1 │
│              27451 │            43005.5 │      1 │
│            43005.5 │        50994.71875 │  2.875 │
│        50994.71875 │        52444.71875 │ 12.375 │
│        52444.71875 │              53081 │   3.75 │
└────────────────────┴────────────────────┴────────┘

We can summarize this a little bit, by just looking at the fraction of timeseries where the histogram actually compresses the data, i.e., where the total count in the buckets is higher than the number of buckets:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select sum(total_count > 83) as compressed, count(*) as n_rows, (compressed / n_rows) as pct_compressed from total_request_count;

SELECT
    sum(total_count > 83) AS compressed,
    count(*) AS n_rows,
    compressed / n_rows AS pct_compressed
FROM total_request_count

Query id: 1f281efa-d6db-4fc5-a4b1-bf200629f689

┌─compressed─┬─n_rows─┬───────pct_compressed─┐
│        136 │   8461 │ 0.016073750147736675 │
└────────────┴────────┴──────────────────────┘

1 row in set. Elapsed: 0.011 sec. Processed 8.46 thousand rows, 67.69 KB (755.74 thousand rows/s., 6.05 MB/s.)
Peak memory usage: 269.76 KiB.

So roughly 1.6% of the histograms actually do the job of compressing the data. That's not great!
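The picture is arguably worse if you weight by rows rather than by timeseries, since a never-compressing timeseries still produces a row every 10s. Something like this (untested sketch, reusing the temporary table above) should give the fraction of histogram rows that belong to timeseries which never saw more than 83 requests:

SELECT
    sumIf(n_rows, total_count <= 83) / sum(n_rows) AS pct_rows_uncompressed
FROM
(
    SELECT
        timeseries_key,
        count() AS n_rows
    FROM measurements_histogramf64
    WHERE timeseries_name = 'http_service:request_latency_histogram'
    GROUP BY timeseries_key
) AS per_key
INNER JOIN total_request_count AS t ON per_key.timeseries_key = t.timeseries_key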

Why is it so bad? The real reason is the schema for this timeseries, and in particular, the fields:

[fields.name]
type = "string"
description = "The name of the HTTP server, or program running it"
[fields.id]
type = "uuid"
description = "UUID of the HTTP server"
[fields.route]
type = "string"
description = "HTTP route in the request"
[fields.method]
type = "string"
description = "HTTP method in the request"
[fields.status_code]
type = "i64"
description = "HTTP status code in the server's response"

We're distinguishing the histograms on the basis of a few things, such as status code (good) and the actual API route (not so good). As mentioned above, that means we create a whole histogram for endpoints that can only be hit once, such as when deleting an instance. We also report that histogram in perpetuity, once it starts recording data, which increases this inefficiency dramatically. For example, let's look at one of those timeseries where the total request count is 1 (there was only 1 request ever):

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select * from total_request_count where total_count = 1 limit 1;

SELECT *
FROM total_request_count
WHERE total_count = 1
LIMIT 1

Query id: 80aa83b3-5839-4b51-a2f4-810ff5dcb92f

┌───────timeseries_key─┬─────────────────────timestamp─┬─total_count─┐
│ 18115906329520737224 │ 2024-07-26 16:07:06.266371179 │           1 │
└──────────────────────┴───────────────────────────────┴─────────────┘

1 row in set. Elapsed: 0.002 sec. Processed 8.46 thousand rows, 203.06 KB (4.70 million rows/s., 112.73 MB/s.)
Peak memory usage: 2.71 KiB.

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select count(*) from measurements_histogramf64 where timeseries_name = 'http_service:request_latency_histogram' and timeseries_key = 18115906329520737224;

SELECT count(*)
FROM measurements_histogramf64
WHERE (timeseries_name = 'http_service:request_latency_histogram') AND (timeseries_key = 18115906329520737224)

Query id: 6ec5458f-7e6d-4aa0-83ea-e9909c9dc417

┌─count()─┐
│    5253 │
└─────────┘

1 row in set. Elapsed: 0.028 sec. Processed 52.77 thousand rows, 803.20 KB (1.90 million rows/s., 28.98 MB/s.)
Peak memory usage: 1.55 MiB.

So we recorded 5K samples for this one timeseries, only the very first one of which contained any actually useful data. And sure enough, this was deleting an instance:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select field_value from fields_string where (field_name = 'route' or field_name = 'method') and timeseries_name = 'http_service:request_latency_histogram' and timeseries_key = 18115906329520737224;

SELECT field_value
FROM fields_string
WHERE ((field_name = 'route') OR (field_name = 'method')) AND (timeseries_name = 'http_service:request_latency_histogram') AND (timeseries_key = 18115906329520737224)

Query id: 8d53bf2d-9090-40fb-88df-4046f7af59c4

┌─field_value────────────────────────────────────────┐
│ DELETE                                             │
│ /v1/instances/cd293c79-b923-437b-944f-033726e9b5a5 │
└────────────────────────────────────────────────────┘

2 rows in set. Elapsed: 0.063 sec. Processed 240.64 thousand rows, 3.56 MB (3.80 million rows/s., 56.23 MB/s.)
Peak memory usage: 1.62 MiB.
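To see how common that pattern is, counting the timeseries that only ever saw a single request is a one-liner against the same temporary table (untested sketch):

SELECT
    countIf(total_count = 1) AS single_request_timeseries,
    count() AS total_timeseries
FROM total_request_count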

What do we do about this? After talking with @ahl at length, one of the better ideas is to change the fields we're reporting. Instead of the actual concrete API route in the HTTP request, what we really care about is the operation ID. I don't particularly care in this case which instance was being deleted -- I want to know the distribution of latencies for deleting any instance. That provides the overall system health information we want from this timeseries, and also lets us get detailed enough information about particular API operations that could be misbehaving so that we can know how to further investigate.

bnaecker added the oximeter and clickhouse (Related to the ClickHouse metrics DBMS) labels Aug 14, 2024
ahl (Contributor) commented Aug 14, 2024

This is great, @bnaecker. The only thing I'd add is that we can imagine interactive and/or dynamic analytics in the future to home in on specific problems. For example, if we see that there are latency outliers for a particular operation, we might want more info on what distinguishes those. It's impossible to imagine every type of question users might ask of the system, and I think it makes more sense to aggregate the data rather than keeping every datum and incurring the cost of aggregation on every question.

... especially when many of these flow charts will inevitably end up in a call to Oxide support for the moment.

bnaecker added a commit that referenced this issue Aug 15, 2024
- Add TTLs to all field tables, by using a materialized column with the
  time each record is inserted. ClickHouse will retain the latest
  timestamp, so when we stop inserting, the TTL clock will start
  counting down on those timeseries records.
- Update Dropshot dependency.
- Add operation ID to HTTP service timeseries, remove other fields.
  Expunge the old timeseries too.
- Remove unnecessary stringifying of URIs in latency tracking.
- Fixes #6328 and #6331
bnaecker added a second commit with the same message that referenced this issue Aug 15, 2024
bnaecker (Collaborator, Author):
Closed by #6352
