Metrics API get requests return "Query must resolve to a single timeseries if limit is specified" #4221

Closed
askfongjojo opened this issue Oct 6, 2023 · 27 comments

@askfongjojo

Console is no longer getting back any metrics (for utilization or disk metrics). It's hitting 500 errors in Nexus, e.g.

01:23:15.075Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = Query must resolve to a single timeseries if limit is specified
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:841
    latency_us = 429136
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.42:58517
    req_id = 6620223f-f85a-4bba-9c1a-e4db78417764
    response_code = 500
    uri = https://oxide.sys.rack2.eng.oxide.computer/v1/disks/dbdata-ext4-primary/metrics/write?project=try&start_time=2023-10-05T01%3A23%3A14.591Z&end_time=2023-10-06T01%3A23%3A14.591Z&limit=3000

I invoked a similar query via the API without the limit parameter to see what would happen, and it returned the same error:

01:25:12.706Z INFO 65a11c18-7f59-41ac-b9e7-680627f996e7 (dropshot_external): request completed
    error_message_external = Internal Server Error
    error_message_internal = Query must resolve to a single timeseries if limit is specified
    file = /home/build/.cargo/git/checkouts/dropshot-a4a923d29dccc492/fa728d0/dropshot/src/server.rs:841
    latency_us = 139882
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.42:58551
    req_id = 5a8175d3-47b8-4add-bda7-f5d2734a954b
    response_code = 500
    uri = /v1/disks/c64f5fbe-8a2b-4f8c-a229-13a03dbf1c39/metrics/read_bytes?start_time=2023-10-04T16%3A34%3A22.508Z&end_time=2023-10-6T16%3A34%3A22.508Z
@askfongjojo askfongjojo added this to the 3 milestone Oct 6, 2023
@david-crespo

david-crespo commented Oct 6, 2023

Some notes on how this works. Will probably need @bnaecker's help to puzzle this out. What we'll probably want to do is look in Clickhouse to see why multiple timeseries with the same name (at least, I think that's what it is) exist.

I'm guessing you only see this for disk metrics and not for capacity and utilization metrics. Let me know whether that's true. I just saw you said both disk and utilization.


This is the top level call for disk metrics:

let result = nexus
    .select_timeseries(
        &format!("crucible_upstairs:{}", path.metric),
        &[&format!("upstairs_uuid=={}", authz_disk.id())],
        query,
        limit,
    )
    .await?;

So the timeseries names here are crucible_upstairs:write or crucible_upstairs:read_bytes, respectively.

Here is where the limit constraint is enforced. I implemented this check in #3269 when I added order by timestamp so that I could do reverse order limit 1 to get the one most recent data point.

let info = match query.field_query() {
    Some(field_query) => {
        self.select_matching_timeseries_info(&field_query, &schema)
            .await?
    }
    None => BTreeMap::new(),
};
if info.is_empty() {
    // allow queries that resolve to zero timeseries even with limit
    // specified because the results are not misleading
    Ok(vec![])
} else if limit.is_some() && info.len() != 1 {
    // Error if a limit is specified with a query that resolves to
    // multiple timeseries. Because query results are ordered in part by
    // timeseries key, results with a limit will select measurements in
    // a way that is arbitrary with respect to the query.
    Err(Error::InvalidLimitQuery)

The reason leaving off limit doesn't do anything is that we always call this function with a limit:

.select_timeseries_with(
    timeseries_name,
    criteria,
    Some(start_time),
    Some(end_time),
    Some(limit),
    order,
)

In order for this to become an issue when it wasn't before, somehow the timeseries name has started resolving to more than one timeseries in Clickhouse. I don't understand very well what uniquely determines a timeseries besides its name, and therefore what could cause the query here to resolve to two timeseries. When I run this test with some debug logging added

async fn test_select_timeseries_with_order() {

here is the SQL that runs to get the list of timeseries:

select
    filter0.timeseries_key as timeseries_key,
    filter0.field_name,
    filter0.field_value,
    filter1.field_name,
    filter1.field_value,
    filter2.field_name,
    filter2.field_value,
    filter3.field_name,
    filter3.field_value,
    filter4.field_name,
    filter4.field_value
from
    (
        select *
        from oximeter.fields_uuid
        where timeseries_name = 'service:request_latency' and field_name = 'id'
    ) as filter0
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'method'
            and field_value = 'GET'
    ) as filter1
    on (
        filter0.timeseries_name = filter1.timeseries_name
        and filter0.timeseries_key = filter1.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'name'
            and field_value = 'oximeter'
    ) as filter2
    on (
        filter1.timeseries_name = filter2.timeseries_name
        and filter1.timeseries_key = filter2.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_string
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'route'
            and field_value = '/a'
    ) as filter3
    on (
        filter2.timeseries_name = filter3.timeseries_name
        and filter2.timeseries_key = filter3.timeseries_key
    )
inner join
    (
        select *
        from oximeter.fields_i64
        where
            timeseries_name = 'service:request_latency'
            and field_name = 'status_code'
            and field_value = 200
    ) as filter4
    on (
        filter3.timeseries_name = filter4.timeseries_name
        and filter3.timeseries_key = filter4.timeseries_key
    )
order by (filter0.timeseries_name, filter0.timeseries_key) format jsoneachrow
;

The following function indicates that if the above query comes back with multiple rows with distinct values in the timeseries_key column, the resulting BTreeMap will have multiple entries, and that's how we're going to fail the info.len() != 1 check.

// Select the timeseries, including keys and field values, that match the given field-selection
// query.
async fn select_matching_timeseries_info(
    &self,
    field_query: &str,
    schema: &TimeseriesSchema,
) -> Result<BTreeMap<TimeseriesKey, (Target, Metric)>, Error> {
    let body = self.execute_with_body(field_query).await?;
    let mut results = BTreeMap::new();
    for line in body.lines() {
        let row: model::FieldSelectRow = serde_json::from_str(line)
            .expect("Unable to deserialize an expected row");
        let (id, target, metric) =
            model::parse_field_select_row(&row, schema);
        results.insert(id, (target, metric));
    }
    Ok(results)
}

@bnaecker

bnaecker commented Oct 6, 2023 via email

@askfongjojo

I'm guessing you only see this for disk metrics and not for capacity and utilization metrics. Let me know whether that's true. I just saw you said both disk and utilization.

Actually the issue only happens with objects that exist prior to the update. The metrics for newly created disks can be queried without problems but disks created previously hit the non-unique timeseries error. Would it be possible that there is a new key/format for ClickHouse timeseries and the newly inserted metrics are mixed with some older data, causing the query to see them as two different datasets of the same timeseries?

@bnaecker

bnaecker commented Oct 6, 2023 via email

@askfongjojo

@bnaecker - Here is an example of the bad ones (disks created pre-update):

BRM42220009 # zlogin oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759
[Connected to zone 'oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759' pts/3]
The illumos Project     helios-2.0.22216        October 2023
root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
oxControlService8/ll addrconf ok        fe80::8:20ff:fe43:6a81%oxControlService8/10
oxControlService8/omicron6 static ok    fd00:1122:3344:107::4/64
root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# echo "select timeseries_name, timeseries_key from fields_uuid where field_name = 'upstairs_uuid' and field_value = 'c64f5fbe-8a2b-4f8c-a229-13a03dbf1c39'" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
crucible_upstairs:activated     4790757247494827247
crucible_upstairs:activated     8189252800987326883
crucible_upstairs:extent_no_op  4790757247494827247
crucible_upstairs:extent_no_op  8189252800987326883
crucible_upstairs:extent_reopen 4790757247494827247
crucible_upstairs:extent_reopen 8189252800987326883
crucible_upstairs:extent_repair 4790757247494827247
crucible_upstairs:extent_repair 8189252800987326883
crucible_upstairs:flush 4790757247494827247
crucible_upstairs:flush 8189252800987326883
crucible_upstairs:flush_close   4790757247494827247
crucible_upstairs:flush_close   8189252800987326883
crucible_upstairs:read  4790757247494827247
crucible_upstairs:read  8189252800987326883
crucible_upstairs:read_bytes    4790757247494827247
crucible_upstairs:read_bytes    8189252800987326883
crucible_upstairs:write 4790757247494827247
crucible_upstairs:write 8189252800987326883
crucible_upstairs:write_bytes   4790757247494827247
crucible_upstairs:write_bytes   8189252800987326883

and a good one (created post-update):

root@oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759:~# echo "select timeseries_name, timeseries_key from fields_uuid where field_name = 'upstairs_uuid' and field_value = '823b9822-a9c2-4379-9e9c-de99c78a2821'" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
crucible_upstairs:activated     3670736591403169362
crucible_upstairs:extent_no_op  3670736591403169362
crucible_upstairs:extent_reopen 3670736591403169362
crucible_upstairs:extent_repair 3670736591403169362
crucible_upstairs:flush 3670736591403169362
crucible_upstairs:flush_close   3670736591403169362
crucible_upstairs:read  3670736591403169362
crucible_upstairs:read_bytes    3670736591403169362
crucible_upstairs:write 3670736591403169362
crucible_upstairs:write_bytes   3670736591403169362

@david-crespo

there is some asynchrony to that deduplication

So in theory this could fix itself at any time, but we don't know when?

@bnaecker

bnaecker commented Oct 6, 2023 via email

@bnaecker

bnaecker commented Oct 6, 2023

So, I think this may be much worse than I originally feared. On dogfood I see a bunch of duplicate timeseries schema that I don't expect:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759 :) select * from timeseries_schema where timeseries_name = 'switch_table:updates';

SELECT *
FROM timeseries_schema
WHERE timeseries_name = 'switch_table:updates'

Query id: 03839295-73a7-445d-a75b-e0518fb65a5a

┌─timeseries_name──────┬─fields.name────────────────────────────────┬─fields.type─────────────────────┬─fields.source─────────────────────────┬─datum_type────┬───────────────────────created─┐
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-08-31 03:14:47.609267491 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-09-07 19:38:18.424051930 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-09-12 20:47:06.198727848 │
│ switch_table:updates │ ['rack_id','sidecar_id','sled_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-10-05 21:33:58.527238735 │
│ switch_table:updates │ ['rack_id','sled_id','sidecar_id','table'] │ ['Uuid','Uuid','Uuid','String'] │ ['Target','Target','Target','Target'] │ CumulativeI64 │ 2023-08-30 18:57:28.291684241 │
└──────────────────────┴────────────────────────────────────────────┴─────────────────────────────────┴───────────────────────────────────────┴───────────────┴───────────────────────────────┘

5 rows in set. Elapsed: 0.008 sec.

There are at least two things going on here, one of which I understand. The last record has the fields in a different order than the others. I think this is yet another consequence of this thread. We relied on the order of the fields in ways I had forgotten in the 2+ years between when I wrote the code and that PR.

We store the fields in the schema as an array, but I changed the storage when the samples are collected to a map after that comment. So we're inserting a duplicated schema, only because the order of the fields has changed. We need to sort all the fields, all the time, whenever we do a comparison. The ordering doesn't matter for identity, but the code implicitly relied on it. I changed one reliant piece, but not all of them.

I think this may also explain the issue we see here. As part of inserting a sample into the database, we need two pieces of data: the timeseries schema; and the actual, concrete instantiation of that timeseries, which the database calls the timeseries_key. The former is comprised of the timeseries name, plus the field names and types. The latter is comprised of those, plus the actual value for those fields, which we compute by hashing the fields. Again, that's dependent on the ordering, and always should have been sorting the fields before hashing. Since it does not, it gets the fields in the order the container iterates over them, which also changed as part of that PR. So, the same producer generating a sample that spans this change results in: (1) a new timeseries schema record and (2) a different timeseries key. We're kind of hosed here, since there's no way a client query can distinguish (or merge) the samples from those two timeseries.
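The order dependence described above can be sketched in isolation. This is not the actual oximeter hashing code; field names and values are made up, and the point is only that feeding the same fields to the hasher in a different order yields a different key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash fields in whatever order the caller supplies them, as the code
// implicitly did by iterating over its container.
fn key_for(fields: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, value) in fields {
        name.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // The same logical timeseries, but a Vec (insertion order) vs. a
    // sorted map would feed the fields in different orders:
    let insertion_order = key_for(&[("upstairs_uuid", "c64f..."), ("kind", "write")]);
    let sorted_order = key_for(&[("kind", "write"), ("upstairs_uuid", "c64f...")]);
    assert_ne!(insertion_order, sorted_order);
}
```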

What I'm not sure of is why there are multiple records with the same field order, but different created values. Those should compare equal, which means the client should not have inserted it more than once. I'll have to dig more into this part, once I stop kicking myself for the BS over the timeseries keys.

@bnaecker

bnaecker commented Oct 6, 2023

This is highly related to #4008. We're obviously going to need to revisit the hashing, and at least introduce another discontinuity into the system when we fix this bug. I want to try to bundle that up with moving to a stable hash, so that we can try to limit it to one.

@smklein

smklein commented Oct 9, 2023

So there are a number of different things that could be potential issues here. It may be worth splitting some of these out into separate issues.

As @bnaecker noted:

However, I don't think those issues caused these metrics APIs to fail.

Using this handy utility function:

function query() {
  echo "$1" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter'  --data-binary @-
}

I can run the following:

$ query "select * from timeseries_schema WHERE timeseries_name='crucible_upstairs:read_bytes' FORMAT JSONEachRow"
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-08-30 19:16:02.832396806"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-09-12 21:30:21.496972480"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-10-05 21:42:18.187637346"}

From this data, I can see "there is only one field", so the order of fields doesn't matter in this specific case.

Furthermore, looking at https://github.com/rust-lang/rust/blob/master/library/std/src/collections/hash/map.rs , it doesn't seem like the hashing function has changed in the past few months?

Sources: https://github.com/rust-lang/rust/blob/cdddcd3bea35049dab56888d4391cb9b5b1b4491/library/std/src/collections/hash/map.rs#L3147-L3170 , which depends on https://github.com/rust-lang/rust/blob/master/library/core/src/hash/sip.rs (and that file hasn't been touched in 6 months?)

@smklein

smklein commented Oct 9, 2023

However, there are other issues that appear to be present:

  • Nothing prevents duplicating the schema / fields tables in the DB. Specifically, the body of insert_samples appears to call verify_sample_schema to make the call on "is this a new schema or not". This relies on reading from an in-memory cache of the schema here, though I don't see why the schema would necessarily be populated at this point. It seems very possible for us to "insert samples" before the first "DB query", in which case we'd insert a new row for the sample, and for all subsequent insertions of the sample until the first query populates the in-memory cache.
  • Even if we could "read before the first insert", there's still a race condition that's present here, since there's nothing stopping duplicate schemas from being inserted:
    // Insert the new schema into the database
    //
    // TODO-robustness There's still a race possible here. If two distinct clients receive new
    // but conflicting schema, they will both try to insert those at some point into the schema
    // tables. It's not clear how to handle this, since ClickHouse provides no transactions.
    // This is unlikely to happen at this point, because the design is such that there will be
    // a single `oximeter` instance, which has one client object, connected to a single
    // ClickHouse server. But once we start replicating data, the window within which the race
    // can occur is much larger, since it includes the time it takes ClickHouse to replicate
    // data between nodes.
    //
    // NOTE: This is an issue even in the case where the schema don't conflict. Two clients may
    // receive a sample with a new schema, and both would then try to insert that schema.
    Clickhouse also does not have UNIQUE constraints, and Primary Keys are also not unique, so storing the schema there seems like it always risks having duplicates.
  • As @askfongjojo identified here, the issue really causing problems is multiple distinct values for timeseries_key on one of the fields. I do not know what's causing this -- If the hash changed, this would be the most obvious explanation, but I'm not sure that happened. I will need to investigate more.

@david-crespo

david-crespo commented Oct 9, 2023

We were able to see in the DB that measurements up to the last dogfood deploy have one timeseries key and measurements after have another. We will need to see if we can tell from stuff that’s currently in Clickhouse whether there were different inputs to the timeseries_key hash calculation or the hash algorithm itself changed. The former seems more likely given that we didn’t see any evidence of the default hasher changing. I suppose it’s conceivable something else about the environment changed that caused the hash algorithm to produce different results. If we can’t tell what changed from the DB contents we will have to spend more time reasoning about the code.

You can see the two timeseries_key entries here. This is the 2 != 1 that triggers the error response.

$ query "SELECT * FROM oximeter.fields_uuid WHERE timeseries_name = 'collection_target:ram_provisioned' AND field_name = 'id' AND field_value = '001de000-1334-4000-8000-000000000000' FORMAT JSONEachRow"
{"timeseries_name":"collection_target:ram_provisioned","timeseries_key":"5670937711533663662","field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
{"timeseries_name":"collection_target:ram_provisioned","timeseries_key":"11989520059948388015","field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}

You can see when we ask for the earliest and latest measurement for each timeseries_key, we get dates that correspond to the last dogfood deploy.

$ query 'select * from oximeter.measurements_i64 where timeseries_key =11989520059948388015 order by timestamp asc limit 1'
collection_target:virtual_disk_space_provisioned        11989520059948388015    2023-10-05 22:34:31.258015000   14095008923648
$ query 'select * from oximeter.measurements_i64 where timeseries_key =11989520059948388015 order by timestamp desc limit 1'
collection_target:virtual_disk_space_provisioned        11989520059948388015    2023-10-09 17:21:57.348964000   15600394960896

$ query 'select * from oximeter.measurements_i64 where timeseries_key =5670937711533663662 order by timestamp asc limit 1'
collection_target:virtual_disk_space_provisioned        5670937711533663662     2023-09-12 11:19:59.882556000   18495202918400
$ query 'select * from oximeter.measurements_i64 where timeseries_key =5670937711533663662 order by timestamp desc limit 1'
collection_target:ram_provisioned       5670937711533663662     2023-10-05 16:07:48.990993000   1415191724032

Shorter:

Range for timeseries_key 5670937711533663662:  2023-09-12 11:19:59.882556000 to 2023-10-05 16:07:48.990993000
Range for timeseries_key 11989520059948388015: 2023-10-05 22:34:31.258015000 to 2023-10-09 17:21:57.348964000

@smklein

smklein commented Oct 9, 2023

I'm not hashing the same values as oximeter, but as a simple test:

fn hash(fields: &[&str]) -> u64 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    for field in fields {
        field.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let h = hash(&["a", "b", "c"]);
    let expected = 10937667955699260984;
    assert_eq!(h, expected);
}

And:

#!/bin/bash

set -eu

VERSIONS=(
    "1.73.0"
    "1.72.1"
    "1.72.0"
    "1.71.1"
    "1.71.0"
    "1.70.0"
    "1.69.0"
    "1.68.2"
    "1.68.1"
    "1.68.0"
    "1.67.1"
    "1.67.0"
    "1.66.1"
    "1.66.0"
    "1.65.0"
    "1.64.0"
    "1.63.0"
    "1.62.1"
    "1.62.0"
)

for VERSION in "${VERSIONS[@]}"; do
    rustup install "$VERSION" && cargo "+$VERSION" run
done

This passes for all the specified Rust toolchain versions.

@david-crespo

By adding some debug prints to an integration test that fetches a system metric, I was able to figure out the inputs to the hash and generate the same timeseries_key=11989520059948388015 that we see on dogfood from 10/5 to now.

[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
             001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:344] hasher.finish() = 11989520059948388015

Now I'm going to check out an older version of omicron and see if I can get the old hash out of it.

@smklein

smklein commented Oct 10, 2023

Chatting with @david-crespo , we suspect #4091 is the source of the problem.

The "hash" algorithm didn't change -- the contents being hashed changed. We hash the FieldValue enum when calculating the timeseries_key, and this was modified significantly in #4091.
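This diagnosis can be reproduced in miniature. The two enums below are hypothetical before/after stand-ins for FieldValue, not the real definitions: derived Hash feeds the variant's discriminant into the hasher, so inserting a variant mid-enum renumbers every later variant and changes its hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// "Before": Uuid is variant index 1.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueOld {
    U64(u64),
    Uuid(u128),
}

// "After": a newly inserted variant shifts Uuid to index 2.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueNew {
    U64(u64),
    IpAddr(u32),
    Uuid(u128),
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Same logical value, different discriminant, different hash.
    assert_ne!(
        hash_of(&FieldValueOld::Uuid(42)),
        hash_of(&FieldValueNew::Uuid(42)),
    );
}
```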

@david-crespo

david-crespo commented Oct 10, 2023

Got em. Checking out 41fb704 (the commit before #4091), and running the same dbg!s as above (pushed up at 50dcd62) gets me — you guessed it — the old timeseries key, 5670937711533663662.

[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
            001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:344] hasher.finish() = 5670937711533663662

Now we have to figure out whether we can tweak the new code to give us the old key again without reverting the entire 6000 line change in #4091.

@david-crespo

And for my final trick, based on Sean's guess that the enum reordering here is the problem:

https://github.com/oxidecomputer/omicron/blame/47a6b42c986c65292ee61b0c79090fa57dec5fe9/oximeter/oximeter/src/types.rs#L96-L109

Moving the new entries to the end of the enum definition (see 41c74f6) and re-running my tests does in fact bring back the old timeseries_key of 5670937711533663662! 🥳 🎉

[oximeter/db/src/lib.rs:341] target_fields = [
    Field {
        name: "id",
        value: Uuid(
            001de000-1334-4000-8000-000000000000,
        ),
    },
]
[oximeter/db/src/lib.rs:341] metric_fields = []
[oximeter/db/src/lib.rs:344] hasher.finish() = 5670937711533663662

So now we just need to figure out how to write some tests that prevent this from happening again.

@smklein

smklein commented Oct 10, 2023

Short Term

  • We can shuffle the enum fields back to the old order, and append the new ones to the end. We can add some change-detector tests that the hashes are stable. Toolchain revs could still break the calculation of the timeseries_key, because the hashing algorithm could change, or because the memory layout could change.

This should let us unblock the release.

Long Term

If we work on #4008 , the hashing algorithm could become stable, and we'd be protected against rust stdlib changing the DefaultHasher algorithm.

I don't think this is enough for correctness. According to https://doc.rust-lang.org/std/hash/trait.Hash.html#portability :

Due to differences in endianness and type sizes, data fed by Hash to a Hasher should not be considered portable across platforms. Additionally the data passed by most standard library types should not be considered stable between compiler versions.

The change-detector tests should help us catch this, but even with a stable hasher, the contents being hashed could change between different toolchains / architectures.

I think we need a fundamentally different way to calculate the timeseries_key, that does not rely on these sensitive details of memory layout, if we want the generated value to be consistent.

@smklein

smklein commented Oct 10, 2023

Two approaches in the long-term, that could potentially mitigate this issue:

Hash it yourself

Manually implementing Hash for all types feeding into timeseries_key would provide stability. We'd have to be really careful with this approach:

  • All enums could be implemented as hash(variant number) + hash(contents of variant as bytes)
  • Structs like String would also need to be hashed manually for stability, as the byte-oriented nature of hashing could change if their fields are re-ordered
  • This would be really manual, and pretty rough to validate that we aren't accidentally using #[derive(Hash)] where we shouldn't be. That being said, the set of things to hash is small?
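A sketch of that first option, using a hypothetical stand-in for FieldValue (not the real type). Note this fixes input stability only: the hasher itself must still be stable for the key to survive toolchain changes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[allow(dead_code)]
enum FieldValue {
    U64(u64),
    String(String),
}

impl Hash for FieldValue {
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            // Explicit stable tags plus explicit byte encodings, so
            // neither declaration order nor memory layout can leak in.
            FieldValue::U64(x) => {
                0u8.hash(state);
                x.to_le_bytes().hash(state);
            }
            FieldValue::String(s) => {
                1u8.hash(state);
                s.as_bytes().hash(state);
            }
        }
    }
}

fn hash_of(value: &FieldValue) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The same value always feeds the same bytes into the hasher.
    assert_eq!(hash_of(&FieldValue::U64(7)), hash_of(&FieldValue::U64(7)));
}
```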

Use a stable hash of a stably-serialized output?

Suppose we wanted to hash the contents of the Field struct (since we basically do -- we actually want to hash a bunch of them concatenated together, but we want to hash this struct):

/// A `Field` is a named aspect of a target or metric.
#[derive(
    Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
)]
pub struct Field {
    pub name: String,
    pub value: FieldValue,
}

I think we could probably pick a better serializer, but, as an example, we could do the following:

// Assume we start with some input to be hashed...
let value: FieldSet = ...;

// Convert from in-memory representation to a serialized representation that is stable.
// I HAVE NOT VALIDATED THAT SERDE_JSON WORKS HERE -- in fact, for maps, I know it
// would not necessarily be stable.
// However, this helps us avoid any changes to memory layout that could be imposed on us by
// https://doc.rust-lang.org/std/hash/trait.Hash.html#portability
let serialized_as_json = serde_json::to_string(&value).unwrap();

// Finally, feed this serialized format into a stable hasher.
// The serialized format should be stable, the hash should be stable.
let timeseries_key = stable_hash(&serialized_as_json);

@jgallagher mentioned that https://crates.io/crates/hubpack could be a good fit in this area, though the lack of String support means we'd need to figure out a workaround somehow.
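For illustration, here is a dependency-free sketch of the "stable hash of a stably-serialized output" idea: a hand-rolled FNV-1a stands in for a real stable hasher, and a simple length-prefixed byte encoding stands in for a canonical serializer like bcs. Names and types are illustrative, not the actual oximeter ones.

```rust
// FNV-1a: a simple, fully specified hash that does not depend on the
// Rust toolchain. (Illustrative only; a keyed hash would be preferable.)
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

// Length-prefix each component so the byte stream is unambiguous and
// independent of in-memory layout.
fn encode_field(buf: &mut Vec<u8>, name: &str, value: &str) {
    buf.extend_from_slice(&(name.len() as u64).to_le_bytes());
    buf.extend_from_slice(name.as_bytes());
    buf.extend_from_slice(&(value.len() as u64).to_le_bytes());
    buf.extend_from_slice(value.as_bytes());
}

fn timeseries_key(fields: &[(&str, &str)]) -> u64 {
    // Sort by field name so container iteration order can never leak
    // into the key.
    let mut sorted: Vec<_> = fields.to_vec();
    sorted.sort_by_key(|&(name, _)| name);
    let mut buf = Vec::new();
    for (name, value) in sorted {
        encode_field(&mut buf, name, value);
    }
    fnv1a(&buf)
}

fn main() {
    let a = timeseries_key(&[("rack_id", "r1"), ("sled_id", "s1")]);
    let b = timeseries_key(&[("sled_id", "s1"), ("rack_id", "r1")]);
    assert_eq!(a, b); // order-independent and layout-independent
}
```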

@bnaecker

Thank you all for tracking this down.

In the interest of unblocking the release, I feel like we should revert both 23eee1a and 8ffd476. The former of those introduced a change in the hash by reordering the variants of an enum, as Sean noted. The latter did so by changing the order in which fields are submitted to oximeter. We previously used a Vec, which was unsorted, but that commit switched to a BTreeMap, which implicitly sorts the fields. I don't think that'll be visible to customers, since the utilization metrics they can currently see have exactly 1 field. But I still think we should revert it to avoid adding duplicate keys.

Nothing prevents duplicating the schema / fields tables in the DB

This is just a latent bug. We have a method which looks up a timeseries schema, including fetching them all from the DB if it can't be found in the local cache. We could call that here, but it doesn't strictly prevent the race. I've considered moving schema to CRDB, which would address the consistency, but require a lot of other changes.

@smklein

smklein commented Oct 10, 2023

Though it is possible to revert 23eee1a, 8ffd476 was included in the last customer release, so reverting it might be a bit complicated?

I suspect that 23eee1a changed almost all timeseries_key hashes, whereas 8ffd476 only had an impact on timeseries_key hashes where the order changed. Both are instances of this bug, but one had a larger blast radius than the other.

@smklein

smklein commented Oct 10, 2023

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

@andrewjstone

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

I'm a big fan of bcs and was hoping to use it for trust quorum work in the future. I didn't use it for LRTQ because it doesn't work well with some serde macros, like those used for renaming (IIRC).

@smklein

smklein commented Oct 11, 2023

I have two PRs to address this issue.

  1. [oximeter] Use stable hash, stable format, for deriving timeseries_key #4251 addresses the issue with hash stability (e.g., what if DefaultHasher changes?) as well as input stability (e.g., what if the enum used as input gets a new variant?). I also added a regression test to catch this for all variants of the FieldValue.

This PR, just like the prior changes which caused this issue, also changes the derivation of the timeseries_key, and thus breaks backwards-compatibility. However, I'd also argue that "backwards compatibility has been broken in 23eee1a and 8ffd476 already, and we cannot revert both" (they cross a release boundary).

That being said, the stability change PR doesn't fix this issue, it just prevents it in the future. That's where my second PR comes in:

  2. [oximeter] Add minimal versioning support #4246 adds a single u64 to the ClickHouse schema, to be used as a version of the schema used by Oximeter. If we don't find a value, we treat it as 0. Otherwise, we hard-code the starting value of 1. In lieu of providing an "upgrade" mechanism, this PR also provides a default behavior where schema changes simply wipe the metrics database. This is a big hammer we hopefully won't need to use often, but considering that metrics data is transient anyway, this seemed like a reasonable tradeoff.
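The version gate described above might look roughly like this. The function name and return convention are illustrative, not the actual oximeter API; the point is only the decision rule (missing value means 0, any mismatch means wipe).

```rust
// Hard-coded starting version, per the PR description.
const EXPECTED_VERSION: u64 = 1;

// Returns true if the stored metrics DB schema version matches what this
// binary expects; false means the metrics database should be wiped and
// re-initialized at EXPECTED_VERSION.
fn schema_is_current(stored: Option<u64>) -> bool {
    stored.unwrap_or(0) == EXPECTED_VERSION
}

fn main() {
    assert!(!schema_is_current(None)); // pre-versioning database: wipe
    assert!(schema_is_current(Some(1))); // matches: keep data
    assert!(!schema_is_current(Some(2))); // unknown future version: wipe
}
```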

@smklein

smklein commented Oct 11, 2023

Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as being "stably serialized output" format, which seems like a great fit

I'm a big fan of bcs and was hoping to use it for trust quorum work in the future. I didn't use it for LRTQ because it doesn't work well with some serde macros, like those used for renaming (IIRC).

For what it's worth - it looks independent of struct names / field variant names?

https://docs.rs/bcs/0.1.5/bcs/index.html#structures

smklein added a commit that referenced this issue Oct 12, 2023
#4251)

- Uses [bcs](https://crates.io/crates/bcs) as a stable binary format for
inputs to the `timeseries_key` function
- Uses [highway](https://crates.io/crates/highway) as a stable hash
algorithm (it's keyed, fast, portable, and well-distributed).
- Additionally, adds an EXPECTORATE test validating the stability of
timeseries_key values.

Fixes #4008 , and also
addresses the issue raised in
#4221 regarding stable
input

NOTE: This PR itself *also* breaks the stability of the `timeseries_key`
(hopefully for the last time), and will rely on
#4246 to wipe the metrics
DB for the next release.
smklein added a commit that referenced this issue Oct 12, 2023
Provides a mechanism to explicitly version Oximeter schema used within
Clickhouse, and to wipe that data
if any schema-breaking changes are made.

This is retroactively intended to solve the problems discussed within
#4221 , where the
`timeseries_key` generation was altered.
@smklein

smklein commented Oct 12, 2023

Both PRs have been merged. Once we validate this issue has been corrected on dogfood, I'll mark this as closed.

@smklein

smklein commented Oct 17, 2023

Confirmed that dogfood metrics are showing up -- I'll close this issue now.
