Metrics API get requests return "Query must resolve to a single timeseries if limit is specified" #4221

Closed
askfongjojo opened this issue Oct 6, 2023 · 27 comments

@askfongjojo

Console is no longer getting back any metrics (for utilization or disk metrics). It's hitting 500 errors in Nexus, e.g.

01:23:15.075Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = Query must resolve to a single timeseries if limit is specified
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:841
    latency_us = 429136
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.42:58517
    req_id = 6620223f-f85a-4bba-9c1a-e4db78417764
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/disks/dbdata-ext4-primary/metrics/write?project=try&start_time=2023-10-05T01%3A23%3A14.591Z&end_time=2023-10-06T01%3A23%3A14.591Z&limit=3000

I invoked a similar query via the API without the limit parameter to see what would happen, and it returned the same error:

01:25:12.706Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = Query must resolve to a single timeseries if limit is specified
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:841
    latency_us = 139882
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.42:58551
    req_id = 5a8175d3-47b8-4add-bda7-f5d2734a954b
    response_code = 500
    uri = /v1/disks/c64f5fbe-8a2b-4f8c-a229-13a03dbf1c39/metrics/read_bytes?start_time=2023-10-04T16%3A34%3A22.508Z&end_time=2023-10-6T16%3A34%3A22.508Z
@askfongjojo askfongjojo added this to the 3 milestone Oct 6, 2023
@david-crespo

david-crespo commented Oct 6, 2023

Some notes on how this works. Will probably need @bnaecker's help to puzzle this out. What we'll probably want to do is look in Clickhouse to see why multiple timeseries with the same name (at least, I think that's what it is) exist.

I'm guessing you only see this for disk metrics and not for capacity and utilization metrics. Let me know whether that's true. I just saw you said both disk and utilization.


This is the top level call for disk metrics:

let result = nexus
    .select_timeseries(
        &format!("crucible_upstairs:{}", path.metric),
        &[&format!("upstairs_uuid=={}", authz_disk.id())],
        query,
        limit,
    )
    .await?;

So the timeseries names here are crucible_upstairs:write or crucible_upstairs:read_bytes, respectively.

Here is where the limit constraint is enforced. I implemented this check in #3269 when I added order by timestamp so that I could do reverse order limit 1 to get the one most recent data point.

let info = match query.field_query() {
    Some(field_query) => {
        self.select_matching_timeseries_info(&field_query, &schema)
            .await?
    }
    None => BTreeMap::new(),
};
if info.is_empty() {
    // allow queries that resolve to zero timeseries even with limit
    // specified because the results are not misleading
    Ok(vec![])
} else if limit.is_some() && info.len() != 1 {
    // Error if a limit is specified with a query that resolves to
    // multiple timeseries. Because query results are ordered in part by
    // timeseries key, results with a limit will select measurements in
    // a way that is arbitrary with respect to the query.
    Err(Error::InvalidLimitQuery)

The reason leaving off limit doesn't do anything is that we always call this function with a limit:

.select_timeseries_with(
    timeseries_name,
    criteria,
    Some(start_time),
    Some(end_time),
    Some(limit),
    order,
)

In order for this to become an issue when it wasn't before, somehow the timeseries name has started resolving to more than one timeseries in Clickhouse. I don't understand very well what uniquely determines a timeseries besides its name, and therefore what could cause the query here to resolve to two timeseries. When I run this test with some debug logging added

async fn test_select_timeseries_with_order() {

here is the SQL that runs to get the list of timeseries:

select
    filter0.timeseries_key as timeseries_key,
    filter0.field_name,
    filter0.field_value,
    filter1.field_name,
    filter1.field_value,
    filter2.field_name,
    filter2.field_value,
    filter3.field_name,
    filter3.field_value,
    filter4.field_name,
    filter4.field_value
from
    (
        select *
        from oximeter.fields_uuid
        where timeseries_name = 'service:request_latency' and field_name = 'id'
    ) as filter0
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'method'
            and field_value = 'GET'
    ) as filter1
    on (
        filter0.timeseries_name = filter1.timeseries_name
        and filter0.timeseries_key = filter1.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'name'
            and field_value = 'oximeter'
    ) as filter2
    on (
        filter1.timeseries_name = filter2.timeseries_name
        and filter1.timeseries_key = filter2.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'route'
            and field_value = '/a'
    ) as filter3
    on (
        filter2.timeseries_name = filter3.timeseries_name
        and filter2.timeseries_key = filter3.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_i64
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'status_code'
            and field_value = 200
    ) as filter4
    on (
        filter3.timeseries_name = filter4.timeseries_name
        and filter3.timeseries_key = filter4.timeseries_key
    )
order by (filter0.timeseries_name, filter0.timeseries_key) format jsoneachrow
;

The following function indicates that if the above query comes back with multiple rows with distinct values in the timeseries_key column, the resulting BTreeMap will have multiple entries, and that's how we're going to fail the info.len() != 1 check.

// Select the timeseries, including keys and field values, that match the given field-selection
// query.
async fn select_matching_timeseries_info(
    &self,
    field_query: &str,
    schema: &TimeseriesSchema,
) -> Result<BTreeMap<TimeseriesKey, (Target, Metric)>, Error> {
    let body = self.execute_with_body(field_query).await?;
    let mut results = BTreeMap::new();
    for line in body.lines() {
        let row: model::FieldSelectRow = serde_json::from_str(line)
            .expect("Unable to deserialize an expected row");
        let (id, target, metric) =
            model::parse_field_select_row(&row, schema);
        results.insert(id, (target, metric));
    }
    Ok(results)
}

@bnaecker

bnaecker commented Oct 6, 2023 via email

@askfongjojo

I'm guessing you only see this for disk metrics and not for capacity and utilization metrics. Let me know whether that's true. I just saw you said both disk and utilization.

Actually the issue only happens with objects that exist prior to the update. The metrics for newly created disks can be queried without problems but disks created previously hit the non-unique timeseries error. Would it be possible that there is a new key/format for ClickHouse timeseries and the newly inserted metrics are mixed with some older data, causing the query to see them as two different datasets of the same timeseries?

@bnaecker

bnaecker commented Oct 6, 2023 via email

@askfongjojo

@bnaecker - Here is an example of the bad ones (disks created pre-update):

BRM42220009 # zlogin oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759
[Connected to zone 'oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759' pts/3]
The illumos Project     helios-2.0.22216        October 2023
root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
oxControlService8/ll addrconf ok        fe80::8:20ff:fe43:6a81%oxControlService8/10
oxControlService8/omicron6 static ok    fd00:1122:3344:107::4/64
root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# echo "select timeseries_name, timeseries_key from fields_uuid where field_name = 'upstairs_uuid' and field_value = 'c64f5fbe-8a2b-4f8c-a229-13a03dbf1c39'" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
crucible_upstairs:activated     4790757247494827247
crucible_upstairs:activated     8189252800987326883
crucible_upstairs:extent_no_op  4790757247494827247
crucible_upstairs:extent_no_op  8189252800987326883
crucible_upstairs:extent_reopen 4790757247494827247
crucible_upstairs:extent_reopen 8189252800987326883
crucible_upstairs:extent_repair 4790757247494827247
crucible_upstairs:extent_repair 8189252800987326883
crucible_upstairs:flush 4790757247494827247
crucible_upstairs:flush 8189252800987326883
crucible_upstairs:flush_close   4790757247494827247
crucible_upstairs:flush_close   8189252800987326883
crucible_upstairs:read  4790757247494827247
crucible_upstairs:read  8189252800987326883
crucible_upstairs:read_bytes    4790757247494827247
crucible_upstairs:read_bytes    8189252800987326883
crucible_upstairs:write 4790757247494827247
crucible_upstairs:write 8189252800987326883
crucible_upstairs:write_bytes   4790757247494827247
crucible_upstairs:write_bytes   8189252800987326883

and a good one (created post-update):

root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# echo "select timeseries_name, timeseries_key from fields_uuid where field_name = 'upstairs_uuid' and field_value = '823b9822-a9c2-4379-9e9c-de99c78a2821'" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
crucible_upstairs:activated     3670736591403169362
crucible_upstairs:extent_no_op  3670736591403169362
crucible_upstairs:extent_reopen 3670736591403169362
crucible_upstairs:extent_repair 3670736591403169362
crucible_upstairs:flush 3670736591403169362
crucible_upstairs:flush_close   3670736591403169362
crucible_upstairs:read  3670736591403169362
crucible_upstairs:read_bytes    3670736591403169362
crucible_upstairs:write 3670736591403169362
crucible_upstairs:write_bytes   3670736591403169362

@david-crespo

there is some asynchrony to that deduplication

So in theory this could fix itself at any time, but we don't know when?

@bnaecker

bnaecker commented Oct 6, 2023 via email

@bnaecker

bnaecker commented Oct 6, 2023

So, I think this may be much worse than I originally feared. On dogfood I see a bunch of duplicate timeseries schema that I don't expect:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759 :) select * from timeseries_schema where timeseries_name = 'switch_table:updates';

SELECT *
FROM timeseries_schema
WHERE timeseries_name = 'switch_table:updates'

Query id: 03839295-73a7-445d-a75b-e0518fb65a5a

┌─timeseries_name──────┬─fields.name────────────────────────────────┬─fields.type─────────────────────┬─fields.source─────────────────────────┬─datum_type────┬───────────────────────created─┐
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-08-31 03:14:47.609267491 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-09-07 19:38:18.424051930 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-09-12 20:47:06.198727848 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-10-05 21:33:58.527238735 │
│ switch_table:updates │ ['rack_id','sled_id','sidecar_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-08-30 18:57:28.291684241 │
└──────────────────────┴────────────────────────────────────────────┴─────────────────────────────────┴───────────────────────────────────────┴───────────────┴───────────────────────────────┘

5 rows in set. Elapsed: 0.008 sec.

There are at least two things going on here, one of which I understand. The last record has the fields in a different order than the others. I think this is yet another consequence of this thread. We relied on the order of the fields in ways I had forgotten in the 2+ years between when I wrote the code and that PR.

We store the fields in the schema as an array, but I changed the storage when the samples are collected to a map after that comment. So we're inserting a duplicated schema, only because the order of the fields has changed. We need to sort all the fields, all the time, whenever we do a comparison. The ordering doesn't matter for identity, but the code implicitly relied on it. I changed one reliant piece, but not all of them.

I think this may also explain the issue we see here. As part of inserting a sample into the database, we need two pieces of data: the timeseries schema; and the actual, concrete instantiation of that timeseries, which the database calls the timeseries_key. The former is comprised of the timeseries name, plus the field names and types. The latter is comprised of those, plus the actual value for those fields, which we compute by hashing the fields. Again, that's dependent on the ordering, and always should have been sorting the fields before hashing. Since it does not, it gets the fields in the order the container iterates over them, which also changed as part of that PR. So, the same producer generating a sample that spans this change results in: (1) a new timeseries schema record and (2) a different timeseries key. We're kind of hosed here, since there's no way a client query can distinguish (or merge) the samples from those two timeseries.
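The order dependence described above can be sketched in isolation. This is not the actual oximeter hashing code; field names and values are made up, and the point is only that feeding the same fields to the hasher in a different order yields a different key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash fields in whatever order the caller supplies them, as the code
// implicitly did by iterating over its container.
fn key_for(fields: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, value) in fields {
        name.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // The same logical timeseries, but a Vec (insertion order) vs. a
    // sorted map would feed the fields in different orders:
    let insertion_order = key_for(&[("upstairs_uuid", "c64f..."), ("kind", "write")]);
    let sorted_order = key_for(&[("kind", "write"), ("upstairs_uuid", "c64f...")]);
    assert_ne!(insertion_order, sorted_order);
}
```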

What I'm not sure of is why there are multiple records with the same field order, but different created values. Those should compare equal, which means the client should not have inserted it more than once. I'll have to dig more into this part, once I stop kicking myself for the BS over the timeseries keys.

@bnaecker

bnaecker commented Oct 6, 2023

This is highly related to #4008. We're obviously going to need to revisit the hashing, and at least introduce another discontinuity into the system when we fix this bug. I want to try to bundle that up with moving to a stable hash, so that we can try to limit it to one.

@smklein

smklein commented Oct 9, 2023

So there are a number of different things that could be potential issues here. It may be worth splitting some of these out into separate issues.

As @bnaecker noted:

However, I don't think those issues caused these metrics APIs to fail.

Using this handy utility function:

function query() {
  echo "$1" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
}

I can run the following:

$ query "select * from timeseries_schema WHERE timeseries_name='crucible_upstairs:read_bytes' FORMAT JSONEachRow"
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-08-30 19:16:02.832396806"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-09-12 21:30:21.496972480"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-10-05 21:42:18.187637346"}

From this data, I can see "there is only one field", so the order of fields doesn't matter in this specific case.

Furthermore, looking at https://github.com/rust-lang/rust/blob/master/library/std/src/collections/hash/map.rs , it doesn't seem like the hashing function has changed in the past few months?

Sources: https://github.com/rust-lang/rust/blob/cdddcd3bea35049dab56888d4391cb9b5b1b4491/library/std/src/collections/hash/map.rs#L3147-L3170 , which depends on https://github.com/rust-lang/rust/blob/master/library/core/src/hash/sip.rs (and that file hasn't been touched in 6 months?)

@smklein

smklein commented Oct 9, 2023

However, there are other issues that appear to be present:

  • Nothing prevents duplicating the schema / fields tables in the DB. Specifically, the body of insert_samples appears to call verify_sample_schema to make the call on "is this a new schema or not". This relies on reading from an in-memory cache of the schema here, though I don't see why the schema would necessarily be populated at this point. It seems very possible for us to "insert samples" before the first "DB query", in which case we'd insert a new row for the sample, and for all subsequent insertions of the sample until the first query populates the in-memory cache.
  • Even if we could "read before the first insert", there's still a race condition that's present here, since there's nothing stopping duplicate schemas from being inserted:
    // Insert the new schema into the database
    //
    // TODO-robustness There's still a race possible here. If two distinct clients receive new
    // but conflicting schema, they will both try to insert those at some point into the schema
    // tables. It's not clear how to handle this, since ClickHouse provides no transactions.
    // This is unlikely to happen at this point, because the design is such that there will be
    // a single `oximeter` instance, which has one client object, connected to a single
    // ClickHouse server. But once we start replicating data, the window within which the race
    // can occur is much larger, since it includes the time it takes ClickHouse to replicate
    // data between nodes.
    //
    // NOTE: This is an issue even in the case where the schema don't conflict. Two clients may
    // receive a sample with a new schema, and both would then try to insert that schema.
    Clickhouse also does not have UNIQUE constraints, and Primary Keys are also not unique, so storing the schema there seems like it always risks having duplicates.
  • As @askfongjojo identified here, the issue really causing problems is multiple distinct values for timeseries_key on one of the fields. I do not know what's causing this -- If the hash changed, this would be the most obvious explanation, but I'm not sure that happened. I will need to investigate more.

@david-crespo

david-crespo commented Oct 9, 2023

We were able to see in the DB that measurements up to the last dogfood deploy have one timeseries key and measurements after have another. We will need to see if we can tell from stuff that’s currently in Clickhouse whether there were different inputs to the timeseries_key hash calculation or the hash algorithm itself changed. The former seems more likely given that we didn’t see any evidence of the default hasher changing. I suppose it’s conceivable something else about the environment changed that caused the hash algorithm to produce different results. If we can’t tell what changed from the DB contents we will have to spend more time reasoning about the code.

You can see the two timeseries_key entries here. This is the 2 != 1 that triggers the error response.

$ query "SELECT * FROM oximeter.fields_uuid WHERE timeseries_name = 'collection_target:ram_provisioned' AND field_name = 'id' AND field_value = '001de000-1334-4000-8000-000000000000' FORMAT JSONEachRow"
{"timeseries_name":"collection_target:ram_provisioned","timeseries_key":"5670937711533663662","field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
{"timeseries_name":"collection_target:ram_provisioned","timeseries_key":"11989520059948388015","field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}

You can see when we ask for the earliest and latest measurement for each timeseries_key, we get dates that correspond to the last dogfood deploy.

$ query 'select * from oximeter.measurements_i64 where timeseries_key =11989520059948388015 order by timestamp asc limit 1'
collection_target:virtual_disk_space_provisioned        11989520059948388015    2023-10-05 22:34:31.258015000   14095008923648
$ query 'select * from oximeter.measurements_i64 where timeseries_key =11989520059948388015 order by timestamp desc limit 1'
collection_target:virtual_disk_space_provisioned        11989520059948388015    2023-10-09 17:21:57.348964000   15600394960896

$ query 'select * from oximeter.measurements_i64 where timeseries_key =5670937711533663662 order by timestamp asc limit 1'
collection_target:virtual_disk_space_provisioned        5670937711533663662     2023-09-12 11:19:59.882556000   18495202918400
$ query 'select * from oximeter.measurements_i64 where timeseries_key =5670937711533663662 order by timestamp desc limit 1'
collection_target:ram_provisioned       5670937711533663662     2023-10-05 16:07:48.990993000   1415191724032

Shorter:

Range for timeseries_key 5670937711533663662:  2023-09-12 11:19:59.882556000 to 2023-10-05 16:07:48.990993000
Range for timeseries_key 11989520059948388015: 2023-10-05 22:34:31.258015000 to 2023-10-09 17:21:57.348964000

@smklein

smklein commented Oct 9, 2023

I'm not hashing the same values as oximeter, but as a simple test:

fn hash(fields: &[&str]) -> u64 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    for field in fields {
        field.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let h = hash(&["a", "b", "c"]);
    let expected = 10937667955699260984;
    assert_eq!(h, expected);
}

And:

#!/bin/bash

set -eu

VERSIONS=(
    "1.73.0"
    "1.72.1"
    "1.72.0"
    "1.71.1"
    "1.71.0"
    "1.70.0"
    "1.69.0"
    "1.68.2"
    "1.68.1"
    "1.68.0"
    "1.67.1"
    "1.67.0"
    "1.66.1"
    "1.66.0"
    "1.65.0"
    "1.64.0"
    "1.63.0"
    "1.62.1"
    "1.62.0"
)

for VERSION in "${VERSIONS[@]}"; do
    rustup install "$VERSION" && cargo "+$VERSION" run
done

This passes for all the specified Rust toolchain versions.

@david-crespo

By adding some debug prints to an integration test that fetches a system metric, I was able to figure out the inputs to the hash and generate the same timeseries_key=11989520059948388015 that we see on dogfood from 10/5 to now.

[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
             001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:344] hasher.finish() = 11989520059948388015

Now I'm going to check out an older version of omicron and see if I can get the old hash out of it.

@smklein

smklein commented Oct 10, 2023

Chatting with @david-crespo , we suspect #4091 is the source of the problem.

The "hash" algorithm didn't change -- the contents being hashed changed. We hash the FieldValue enum when calculating the timeseries_key, and this was modified significantly in #4091.
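This diagnosis can be reproduced in miniature. The two enums below are hypothetical before/after stand-ins for FieldValue, not the real definitions: derived Hash feeds the variant's discriminant into the hasher, so inserting a variant mid-enum renumbers every later variant and changes its hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// "Before": Uuid is variant index 1.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueOld {
    U64(u64),
    Uuid(u128),
}

// "After": a newly inserted variant shifts Uuid to index 2.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueNew {
    U64(u64),
    IpAddr(u32),
    Uuid(u128),
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Same logical value, different discriminant, different hash.
    assert_ne!(
        hash_of(&FieldValueOld::Uuid(42)),
        hash_of(&FieldValueNew::Uuid(42)),
    );
}
```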

@david-crespo

david-crespo commented Oct 10, 2023

Got em. Checking out 41fb704 (the commit before #4091), and running the same dbg!s as above (pushed up at 50dcd62) gets me — you guessed it — the old timeseries key, 5670937711533663662.

[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
            001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:344] hasher.finish() = 5670937711533663662

Now we have to figure out whether we can tweak the new code to give us the old key again without reverting the entire 6000 line change in #4091.

@david-crespo

And for my final trick, based on Sean's guess that the enum reordering here is the problem:

https://github.com/oxidecomputer/omicron/blame/47a6b42c986c65292ee61b0c79090fa57dec5fe9/oximeter/oximeter/src/types.rs#L96-L109

Moving the new entries to the end of the enum definition (see 41c74f6) and re-running my tests does in fact bring back the old timeseries_key of 5670937711533663662! 🥳 🎉

[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
            001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:344] hasher.finish() = 5670937711533663662

So now we just need to figure out how to write some tests that prevent this from happening again.

@smklein

smklein commented Oct 10, 2023

Short Term

  • We can shuffle the enum fields back to the old order, and append the new ones to the end. We can add some change-detector tests that the hashes are stable. Toolchain revs could still break the calculation of the timeseries_key, because the hashing algorithm could change, or because the memory layout could change.

This should let us unblock the release.

Long Term

If we work on #4008 , the hashing algorithm could become stable, and we'd be protected against rust stdlib changing the DefaultHasher algorithm.

I don't think this is enough for correctness. According to https://doc.rust-lang.org/std/hash/trait.Hash.html#portability :

Due to differences in endianness and type sizes, data fed by Hash to a Hasher should not be considered portable across platforms. Additionally the data passed by most standard library types should not be considered stable between compiler versions.

The change-detector tests should help us catch this, but even with a stable hasher, the contents being hashed could change between different toolchains / architectures.

I think we need a fundamentally different way to calculate the timeseries_key, that does not rely on these sensitive details of memory layout, if we want the generated value to be consistent.

@smklein

smklein commented Oct 10, 2023

Two approaches in the long-term, that could potentially mitigate this issue:

Hash it yourself

Manually implementing Hash for all types feeding into timeseries_key would provide stability. We'd have to be really careful with this approach:

  • All enums could be implemented as hash(variant number) + hash(contents of variant as bytes)
  • Structs like String would also need to be hashed manually for stability, as the byte-oriented nature of hashing could change if their fields are re-ordered
  • This would be really manual, and pretty rough to validate that we aren't accidentally using #[derive(Hash)] where we shouldn't be. That being said, the set of things to hash is small?
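A sketch of that first option, using a hypothetical stand-in for FieldValue (not the real type). Note this fixes input stability only: the hasher itself must still be stable for the key to survive toolchain changes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[allow(dead_code)]
enum FieldValue {
    U64(u64),
    String(String),
}

impl Hash for FieldValue {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            // Explicit stable tags plus explicit byte encodings, so
            // neither declaration order nor memory layout can leak in.
            FieldValue::U64(x) => {
                0u8.hash(state);
                x.to_le_bytes().hash(state);
            }
            FieldValue::String(s) => {
                1u8.hash(state);
                s.as_bytes().hash(state);
            }
        }
    }
}

fn hash_of(value: &FieldValue) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The same value always feeds the same bytes into the hasher.
    assert_eq!(hash_of(&FieldValue::U64(7)), hash_of(&FieldValue::U64(7)));
}
```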

Use a stable hash of a stably-serialized output?

Suppose we wanted to hash the contents of the Field struct (since we basically do -- we actually want to hash a bunch of them concatenated together, but we want to hash this struct):

/// A `Field` is a named aspect of a target or metric.
#[derive(
    Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
)]
pub struct Field {
    pub name: String,
    pub value: FieldValue,
}

I think we could probably pick a better serializer, but, as an example, we could do the following:

// Assume we start with some input to be hashed...
let value: FieldSet = ...;

// Convert from in-memory representation to a serialized representation that is stable.
// I HAVE NOT VALIDATED THAT SERDE_JSON WORKS HERE -- in fact, for maps, I know it
// would not necessarily be stable.
// However, this helps us avoid any changes to memory layout that could be imposed on us by
// https://doc.rust-lang.org/std/hash/trait.Hash.html#portability
let serialized_as_json = serde_json::to_string(&value).unwrap();

// Finally, feed this serialized format into a stable hasher.
// The serialized format should be stable, the hash should be stable.
let timeseries_key = stable_hash(&serialized_as_json);

@jgallagher mentioned that https://crates.io/crates/hubpack could be a good fit in this area, though the lack of String support means we'd need to figure out a workaround somehow.
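For illustration, here is a dependency-free sketch of the "stable hash of a stably-serialized output" idea: a hand-rolled FNV-1a stands in for a real stable hasher, and a simple length-prefixed byte encoding stands in for a canonical serializer like bcs. Names and types are illustrative, not the actual oximeter ones.

```rust
// FNV-1a: a simple, fully specified hash that does not depend on the
// Rust toolchain. (Illustrative only; a keyed hash would be preferable.)
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

// Length-prefix each component so the byte stream is unambiguous and
// independent of in-memory layout.
fn encode_field(buf: &mut Vec<u8>, name: &str, value: &str) {
    buf.extend_from_slice(&(name.len() as u64).to_le_bytes());
    buf.extend_from_slice(name.as_bytes());
    buf.extend_from_slice(&(value.len() as u64).to_le_bytes());
    buf.extend_from_slice(value.as_bytes());
}

fn timeseries_key(fields: &[(&str, &str)]) -> u64 {
    // Sort by field name so container iteration order can never leak
    // into the key.
    let mut sorted: Vec<_> = fields.to_vec();
    sorted.sort_by_key(|&(name, _)| name);
    let mut buf = Vec::new();
    for (name, value) in sorted {
        encode_field(&mut buf, name, value);
    }
    fnv1a(&buf)
}

fn main() {
    let a = timeseries_key(&[("rack_id", "r1"), ("sled_id", "s1")]);
    let b = timeseries_key(&[("sled_id", "s1"), ("rack_id", "r1")]);
    assert_eq!(a, b); // order-independent and layout-independent
}
```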

@bnaecker

Thank you all for tracking this down.

In the interest of unblocking the release, I feel like we should revert both 23eee1a and 8ffd476. The former of those introduced a change in the hash by reordering the variants of an enum, as Sean noted. The latter did so by changing the order in which fields are submitted to oximeter. We previously used a Vec, which was unsorted, but that commit switched to a BTreeMap, which implicitly sorts the fields. I don't think that'll be visible to customers, since the utilization metrics they can currently see have exactly 1 field. But I still think we should revert it to avoid adding duplicate keys.

Nothing prevents duplicating the schema / fields tables in the DB

This is just a latent bug. We have a method which looks up a timeseries schema, including fetching them all from the DB if it can't be found in the local cache. We could call that here, but it doesn't strictly prevent the race. I've considered moving schema to CRDB, which would address the consistency, but require a lot of other changes.

@smklein

smklein commented Oct 10, 2023

Though it is possible to revert 23eee1a, 8ffd476 was included in the last customer release, so reverting it might be a bit complicated?

I suspect that 23eee1a changed almost all timeseries_key hashes, whereas 8ffd476 only had an impact on timeseries_key hashes where the order changed. Both are instances of this bug, but one had a larger blast radius than the other.

@smklein

smklein commented Oct 10, 2023

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

@andrewjstone

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

I'm a big fan of bcs and was hoping to use it for trust quorum work in the future. I didn't use it for LRTQ because it doesn't work well with some serde macros, like those used for renaming (IIRC).

@smklein

smklein commented Oct 11, 2023

I have two PRs to address this issue.

  1. [oximeter] Use stable hash, stable format, for deriving timeseries_key #4251 addresses the issue with hash stability (e.g., what if DefaultHasher changes?) as well as input stability (e.g., what if the enum used as input gets a new variant?). I also added a regression test to catch this for all variants of the FieldValue.

This PR, just like the prior changes which caused this issue, also changes the derivation of the timeseries_key, and thus breaks backwards-compatibility. However, I'd also argue that "backwards compatibility has been broken in 23eee1a and 8ffd476 already, and we cannot revert both" (they cross a release boundary).

That being said, the stability change PR doesn't fix this issue, it just prevents it in the future. That's where my second PR comes in:

  2. [oximeter] Add minimal versioning support #4246 adds a single u64 to the ClickHouse schema, to be used as a version of the schema used by Oximeter. If we don't find a value, we treat it as 0. Otherwise, we hard-code the starting value of 1. In lieu of providing an "upgrade" mechanism, this PR also provides a default behavior where schema changes simply wipe the metrics database. This is a big hammer we hopefully won't need to use often, but considering that metrics data is transient anyway, this seemed like a reasonable tradeoff.
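The version gate described above might look roughly like this. The function name and return convention are illustrative, not the actual oximeter API; the point is only the decision rule (missing value means 0, any mismatch means wipe).

```rust
// Hard-coded starting version, per the PR description.
const EXPECTED_VERSION: u64 = 1;

// Returns true if the stored metrics DB schema version matches what this
// binary expects; false means the metrics database should be wiped and
// re-initialized at EXPECTED_VERSION.
fn schema_is_current(stored: Option<u64>) -> bool {
    stored.unwrap_or(0) == EXPECTED_VERSION
}

fn main() {
    assert!(!schema_is_current(None)); // pre-versioning database: wipe
    assert!(schema_is_current(Some(1))); // matches: keep data
    assert!(!schema_is_current(Some(2))); // unknown future version: wipe
}
```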

@smklein

smklein commented Oct 11, 2023

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

I'm a big fan of bcs and was hoping to use it for trust quorum work in the future. I didn't use it for LRTQ because it doesn't work well with some serde macros, like those used for renaming (IIRC).

For what it's worth - it looks independent of struct names / field variant names?

https://docs.rs/bcs/0.1.5/bcs/index.html#structures

smklein added a commit that referenced this issue Oct 12, 2023
#4251)

- Uses [bcs](https://crates.io/crates/bcs) as a stable binary format for
inputs to the `timeseries_key` function
- Uses [highway](https://crates.io/crates/highway) as a stable hash
algorithm (it's keyed, fast, portable, and well-distributed).
- Additionally, adds an EXPECTORATE test validating the stability of
timeseries_key values.

Fixes #4008 , and also
addresses the issue raised in
#4221 regarding stable
input

NOTE: This PR itself *also* breaks the stability of the `timeseries_key`
(hopefully for the last time), and will rely on
#4246 to wipe the metrics
DB for the next release.
smklein added a commit that referenced this issue Oct 12, 2023
Provides a mechanism to explicitly version Oximeter schema used within
Clickhouse, and to wipe that data
if any schema-breaking changes are made.

This is retroactively intended to solve the problems discussed within
#4221 , where the
`timeseries_key` generation was altered.
@smklein

smklein commented Oct 12, 2023

Both PRs have been merged. Once we validate this issue has been corrected on dogfood, I'll mark this as closed.

@smklein

smklein commented Oct 17, 2023

Confirmed that dogfood metrics are showing up -- I'll close this issue now.
