Nexus request latency histograms are too fine-grained #6331
Comments
This is great, @bnaecker. The only thing I'd add is that we can imagine interactive and/or dynamic analytics in the future to home in on specific problems. For example, if we see that there are latency outliers for a particular operation, we might want more info on what distinguishes those. It's impossible to imagine every type of question users might ask of the system, and I think it makes more sense to aggregate them rather than keeping every datum and incurring the cost of aggregation on every question. ... especially when many of these flow charts will inevitably end up in a call to Oxide support for the moment.
- Add TTLs to all field tables, by using a materialized column with the time each record is inserted. ClickHouse will retain the latest timestamp, so when we stop inserting, the TTL clock will start counting down on those timeseries records. (A rough sketch of this appears below.)
- Update Dropshot dependency.
- Add operation ID to HTTP service timeseries, remove other fields. Expunge the old timeseries too.
- Remove unnecessary stringifying of URIs in latency tracking.
- Fixes #6328 and #6331
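For the TTL piece, here is a minimal sketch of what that could look like in ClickHouse DDL, assuming one of oximeter's field tables; the column name and retention interval are placeholders, not necessarily what the PR actually does:

```sql
-- Illustrative only: add an insert-time materialized column to a field table,
-- then attach a TTL so rows expire once a timeseries stops being updated.
ALTER TABLE oximeter.fields_string
    ADD COLUMN IF NOT EXISTS last_updated_at DateTime MATERIALIZED now();

ALTER TABLE oximeter.fields_string
    MODIFY TTL last_updated_at + INTERVAL 30 DAY;
```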
Closed by #6352
We're currently tracking the latency to handle every request in Nexus. That uses a histogram with buckets defined here:
omicron/nexus/src/context.rs
Lines 147 to 162 in 5ccb386
The latency buckets there span from 1us to 1000s, with 10 buckets per decade (that's a power of 10, Adam). That equates to 83 buckets in the database. We're using `f64`s as the support, and `u64`s as the count, so that's 16 bytes per bucket in total. That works out to 1328 bytes per sample, which we collect every 10s from Nexus.

Given the definition of the timeseries itself, we keep track of different histograms for every concrete API route Nexus receives. Since we often use things like UUIDs in API routes, this means we are carrying around a histogram and generating a sample for endpoints that could only ever be called once! (Consider deleting an instance by ID.) This is all extremely inefficient.
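As a quick sanity check on that arithmetic, ClickHouse can report the bucket count and implied per-sample size directly. The table and timeseries names below are what I'd expect from oximeter's schema, so treat them as assumptions:

```sql
-- Bucket count and rough per-sample size: 16 bytes per bucket comes from an
-- 8-byte f64 bin edge plus an 8-byte u64 count.
SELECT
    any(length(bins)) AS n_buckets,
    any(length(bins)) * 16 AS approx_bytes_per_sample
FROM oximeter.measurements_histogramf64
WHERE timeseries_name = 'http_service:request_latency_histogram';
```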
First of all, we just have a ridiculous number of rows in the histogram table:
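A count along these lines shows the scale:

```sql
-- Total rows and distinct timeseries in the f64 histogram measurements table.
SELECT
    count() AS n_rows,
    uniqExact(timeseries_key) AS n_timeseries
FROM oximeter.measurements_histogramf64;
```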
All of these are the request latency histogram, to be clear, since we don't track any other histograms today.
What's the distribution of total counts in those histograms? This is a bit tricky to get at, since the histograms are cumulative. I first made a temporary table like so:
That generates a table with the timeseries key, timestamp, and total request count from the last sample of each timeseries when ordered by time. This table takes a while to build, so I made it a temporary table and kept it around while doing the rest of the analysis. Then we can look at the histogram of total counts with this:
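For example, with ClickHouse's `histogram` aggregate (the bin count of 20 is arbitrary):

```sql
-- Approximate histogram of the per-timeseries total request counts.
SELECT
    arrayJoin(histogram(20)(toFloat64(total_count))) AS bucket,
    round(bucket.1) AS lower,
    round(bucket.2) AS upper,
    bucket.3 AS n_timeseries
FROM total_counts;
```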
We can summarize this a little bit by just looking at the fraction of timeseries where the histogram actually compresses the data, i.e., where the total count in the buckets is higher than the number of buckets:
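Something like this, with 83 being the bucket count from earlier:

```sql
-- Fraction of timeseries whose total request count exceeds the number of
-- buckets, i.e. where the histogram actually compresses anything.
SELECT countIf(total_count > 83) / count() AS fraction_compressing
FROM total_counts;
```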
So roughly 1.6% of the histograms actually do the job of compressing the data. That's not great!
Why is it so bad? The real reason is the schema for this timeseries, and in particular, the fields:
omicron/oximeter/oximeter/schema/http-service.toml
Lines 20 to 38 in eeb723c
We're distinguishing the histograms on the basis of a few things, such as status code (good) and the actual API route (not so good). As mentioned above, that means we create a whole histogram for endpoints that can only be hit once, such as when deleting an instance. We also report that histogram in perpetuity, once it starts recording data, which increases this inefficiency dramatically. For example, let's look at one of those timeseries where the total request count is 1 (there was only 1 request ever):
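Reusing the `total_counts` table from above, a query along these lines picks out one such timeseries and counts the samples we stored for it anyway:

```sql
-- Find one timeseries whose final total count is 1, then count how many
-- histogram samples we recorded for it.
SELECT timeseries_key, count() AS n_samples
FROM oximeter.measurements_histogramf64
WHERE timeseries_name = 'http_service:request_latency_histogram'
  AND timeseries_key IN
    (SELECT timeseries_key FROM total_counts
     WHERE total_count = 1 ORDER BY timeseries_key LIMIT 1)
GROUP BY timeseries_key;
```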
So we recorded 5K samples for this one timeseries, only the very first one of which contained any actually useful data. And sure enough, this was deleting an instance:
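Looking up the string fields for that timeseries key shows the concrete route; the field table layout below follows oximeter's schema, with field names as I'd expect them from http-service.toml:

```sql
-- Show the field values (method, route, and so on) for that one-shot timeseries.
SELECT field_name, field_value
FROM oximeter.fields_string
WHERE timeseries_name = 'http_service:request_latency_histogram'
  AND timeseries_key IN
    (SELECT timeseries_key FROM total_counts
     WHERE total_count = 1 ORDER BY timeseries_key LIMIT 1);
```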
What do we do about this? After talking with @ahl at length, one of the better ideas is to change the fields we're reporting. Instead of the concrete API route in the HTTP request, what we really care about is the operation ID. I don't particularly care in this case which instance was being deleted -- I want the distribution of latencies for deleting any instance. That provides the overall system health information we want from this timeseries, and still gives us detailed enough information about particular API operations that may be misbehaving to know how to investigate further.