Metrics API get requests return "Query must resolve to a single timeseries if limit is specified" #4221
Comments
Some notes on how this works. Will probably need @bnaecker's help to puzzle this out. What we'll probably want to do is look in Clickhouse to see why multiple timeseries with the same name (at least, I think that's what it is) exist.
This is the top level call for disk metrics: omicron/nexus/src/external_api/http_entrypoints.rs Lines 1653 to 1660 in c76ca69
So those are the timeseries names involved. Here is where the limit constraint is enforced; I implemented this check in #3269 when I added `order by timestamp` so that I could do a reverse-order `limit 1` to get the single most recent data point: omicron/oximeter/db/src/client.rs Lines 128 to 145 in c76ca69
The reason leaving off omicron/nexus/src/app/oximeter.rs Lines 249 to 256 in c76ca69
In order for this to become an issue when it wasn't before, somehow the timeseries name has started resolving to more than one timeseries in ClickHouse. I don't understand very well what determines a timeseries uniquely besides its name, and therefore what could cause the query here to resolve to two timeseries. When I run this test with some debug logging added (omicron/oximeter/db/src/client.rs Line 2139 in c76ca69), here is the SQL that runs to get the list of timeseries:

```sql
select
filter0.timeseries_key as timeseries_key,
filter0.field_name,
filter0.field_value,
filter1.field_name,
filter1.field_value,
filter2.field_name,
filter2.field_value,
filter3.field_name,
filter3.field_value,
filter4.field_name,
filter4.field_value
from
(
select *
from oximeter.fields_uuid
where timeseries_name = 'service:request_latency' and field_name = 'id'
) as filter0
inner join
(
select *
from oximeter.fields_string
where
timeseries_name = 'service:request_latency'
and field_name = 'method'
and field_value = 'GET'
) as filter1
on (
filter0.timeseries_name = filter1.timeseries_name
and filter0.timeseries_key = filter1.timeseries_key
)
inner join
(
select *
from oximeter.fields_string
where
timeseries_name = 'service:request_latency'
and field_name = 'name'
and field_value = 'oximeter'
) as filter2
on (
filter1.timeseries_name = filter2.timeseries_name
and filter1.timeseries_key = filter2.timeseries_key
)
inner join
(
select *
from oximeter.fields_string
where
timeseries_name = 'service:request_latency'
and field_name = 'route'
and field_value = '/a'
) as filter3
on (
filter2.timeseries_name = filter3.timeseries_name
and filter2.timeseries_key = filter3.timeseries_key
)
inner join
(
select *
from oximeter.fields_i64
where
timeseries_name = 'service:request_latency'
and field_name = 'status_code'
and field_value = 200
) as filter4
on (
filter3.timeseries_name = filter4.timeseries_name
and filter3.timeseries_key = filter4.timeseries_key
)
order by (filter0.timeseries_name, filter0.timeseries_key) format jsoneachrow
;
```

The following function indicates that if the above query comes back with multiple rows with distinct values in the `timeseries_key` column, we hit this error: omicron/oximeter/db/src/client.rs Lines 315 to 332 in 230637a
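For illustration, here is a rough sketch of the shape of that check; the names and types are hypothetical, not the actual client.rs code:

```rust
use std::collections::BTreeSet;

// Hypothetical row type: one row per matching field, tagged with the key of
// the timeseries it belongs to.
struct FieldRow {
    timeseries_key: u64,
    // field name/value columns elided
}

// If the field query matched rows spanning more than one timeseries_key, a
// limit cannot be applied, because a limit is only meaningful for a single
// timeseries.
fn check_limit(rows: &[FieldRow], limit: Option<usize>) -> Result<(), String> {
    let keys: BTreeSet<u64> = rows.iter().map(|r| r.timeseries_key).collect();
    if limit.is_some() && keys.len() > 1 {
        return Err(
            "Query must resolve to a single timeseries if limit is specified".to_string(),
        );
    }
    Ok(())
}

fn main() {
    let rows = vec![FieldRow { timeseries_key: 1 }, FieldRow { timeseries_key: 2 }];
    assert!(check_limit(&rows, Some(1)).is_err());
    assert!(check_limit(&rows, None).is_ok());
}
```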
I need to investigate more to be sure, but I’m betting ClickHouse’s lack of unique primary keys is biting us here. We use a table engine which is supposed to remove duplicate rows, but there is some asynchrony to that deduplication. The queries themselves need to ask for distinct records if they want to be sure to get only the unique entries.
I thought I had put that in place a long time ago, but I’m obviously misremembering.
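As a rough illustration of that last point, each field subquery could ask for distinct rows; the function below is only a sketch, not oximeter's actual query builder:

```rust
// Sketch only: build one field-table subquery with an explicit DISTINCT, so
// that rows duplicated by parts the deduplicating table engine has not merged
// yet collapse to a single entry. Table and column names mirror the SQL
// above; the function itself is hypothetical.
fn field_subquery(timeseries_name: &str, field_name: &str, field_value: &str) -> String {
    format!(
        "(select distinct timeseries_name, timeseries_key, field_name, field_value \
         from oximeter.fields_string \
         where timeseries_name = '{timeseries_name}' \
         and field_name = '{field_name}' \
         and field_value = '{field_value}')"
    )
}

fn main() {
    println!("{}", field_subquery("service:request_latency", "method", "GET"));
}
```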
Actually, the issue only happens with objects that existed prior to the update. The metrics for newly created disks can be queried without problems, but disks created previously hit the non-unique timeseries error. Would it be possible that there is a new key/format for ClickHouse timeseries and the newly inserted metrics are mixed with some older data, causing the query to see them as two different datasets of the same timeseries?
I don’t believe there’s been a change to either the ClickHouse format or how we generate our own keys for timeseries. I’ve assiduously avoided that because we have no mechanisms for updating schema.
The fact that we only see this for older data is consistent with my theory about non-unique records. The database can choose to repack the data at basically any point. For the table engine we’re using for the fields, it deduplicates those records at those merge points. But I don’t believe there are requirements like “older records are merged first.” The database can basically decide to do that arbitrarily, although we do have ways of requesting a merge. And data may always remain unprocessed, I think.
We could confirm if this is the case by looking for fields coming from the offending timeseries, and comparing that to those which don’t trigger the failure case. Angela, could you provide an example Nexus API call that works and one that fails? What I expect to see is that the failing case ultimately fetches rows from one of the field tables in which there are duplicate timeseries keys. In contrast, the records from the successful query will have been merged and deduplicated.
I should also say, even if this particular theory does not explain this failure, it certainly can cause a failure like this in the future! The queries should select distinct matching rows only.
@bnaecker - Here is an example of the bad ones (disks created pre-update):
and a good one (created post-update):
So in theory this could fix itself at any time, but we don't know when?
In theory, yes, which is why the query needs to dedupe if that’s important. I’m also not sure this is the cause, since Angela’s data suggests there was a key change that slipped through. I need to look in more detail to better understand what’s going on.
So, I think this may be much worse than I originally feared. On dogfood I see a bunch of duplicate timeseries schema that I don't expect:
There are at least two things going on here, one of which I understand. The last record has the fields in a different order than the others. I think this is yet another consequence of this thread. We relied on the order of the fields in ways I had forgotten in the 2+ years between when I wrote the code and that PR. We store the fields in the schema as an array, but I changed the storage when the samples are collected to a map after that comment. So we're inserting a duplicated schema, only because the order of the fields has changed. We need to sort all the fields, all the time, whenever we do a comparison. The ordering doesn't matter for identity, but the code implicitly relied on it. I changed one reliant piece, but not all of them.

I think this may also explain the issue we see here. As part of inserting a sample into the database, we need two pieces of data: the timeseries schema; and the actual, concrete instantiation of that timeseries, which the database identifies with the `timeseries_key`.

What I'm not sure of is why there are multiple records with the same field order, but different `timeseries_key`s.
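As a self-contained illustration of the ordering problem (this is just a sketch, not oximeter's code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a list of (name, value) fields in the order given.
fn key_of(fields: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, value) in fields {
        name.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // The same fields, stored once as an array and once in map-iteration order.
    let from_array = [("method", "GET"), ("route", "/a")];
    let mut from_map = [("route", "/a"), ("method", "GET")];

    // Different order, different key, which looks like a brand-new timeseries.
    assert_ne!(key_of(&from_array), key_of(&from_map));

    // Sorting the fields before hashing makes the key order-independent.
    from_map.sort();
    assert_eq!(key_of(&from_array), key_of(&from_map));
}
```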
This is highly related to #4008. We're obviously going to need to revisit the hashing, and at least introduce another discontinuity into the system when we fix this bug. I want to try to bundle that up with moving to a stable hash, so that we can try to limit it to one.
So there are a number of different things that could be potential issues here. It may be worth splitting some of these out into separate issues. As @bnaecker noted:
However, I don't think those issues caused these metrics APIs to fail. Using this handy utility function:

```bash
function query() {
  echo "$1" | curl 'http://[fd00:1122:3344:107::4]:8123/?database=oximeter' --data-binary @-
}
```

I can run the following:

```
$ query "select * from timeseries_schema WHERE timeseries_name='crucible_upstairs:read_bytes' FORMAT JSONEachRow"
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-08-30 19:16:02.832396806"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-09-12 21:30:21.496972480"}
{"timeseries_name":"crucible_upstairs:read_bytes","fields.name":["upstairs_uuid"],"fields.type":["Uuid"],"fields.source":["Target"],"datum_type":"CumulativeI64","created":"2023-10-05 21:42:18.187637346"}
```

From this data, I can see "there is only one field", so the order of fields doesn't matter in this specific case. Furthermore, looking at https://github.com/rust-lang/rust/blob/master/library/std/src/collections/hash/map.rs , it doesn't seem like the hashing function has changed in the past few months? Sources: https://github.com/rust-lang/rust/blob/cdddcd3bea35049dab56888d4391cb9b5b1b4491/library/std/src/collections/hash/map.rs#L3147-L3170 , which depends on https://github.com/rust-lang/rust/blob/master/library/core/src/hash/sip.rs (and that file hasn't been touched in six months?)
However, there are other issues that appear to be present:
We were able to see in the DB that measurements up to the last dogfood deploy have one timeseries key and measurements after have another. We will need to see if we can tell from stuff that's currently in ClickHouse whether there were different inputs to the hash. You can see the two `timeseries_key`s here:
You can see when we ask for the earliest and latest measurement for each `timeseries_key`:
Shorter:
I'm not hashing the same values as oximeter, but as a simple test:

```rust
fn hash(fields: &[&str]) -> u64 {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut hasher = DefaultHasher::new();
    for field in fields {
        field.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let h = hash(&["a", "b", "c"]);
    let expected = 10937667955699260984;
    assert_eq!(h, expected);
}
```

And:

```bash
#!/bin/bash
set -eu
VERSIONS=(
"1.73.0"
"1.72.1"
"1.72.0"
"1.71.1"
"1.71.0"
"1.70.0"
"1.69.0"
"1.68.2"
"1.68.1"
"1.68.0"
"1.67.1"
"1.67.0"
"1.66.1"
"1.66.0"
"1.65.0"
"1.64.0"
"1.63.0"
"1.62.1"
"1.62.0"
)
for VERSION in "${VERSIONS[@]}"; do
rustup install "$VERSION" && cargo "+$VERSION" run
done
```

This passes for all of the specified Rust toolchain versions.
By adding some debug prints to an integration test that fetches a system metric, I was able to figure out the inputs to the hash and generate the same `timeseries_key` we see in the database.
Now I'm going to check out an older version of omicron and see if I can get the old hash out of it.
Chatting with @david-crespo, we suspect #4091 is the source of the problem. The "hash" algorithm didn't change -- the contents being hashed changed. We hash the field values, and #4091 changed the enum they're represented with.
Got em. Checking out 41fb704 (the commit before #4091), and running the same dbg!s as above (pushed up at 50dcd62) gets me — you guessed it — the old timeseries key, 5670937711533663662.
Now we have to figure out whether we can tweak the new code to give us the old key again without reverting the entire 6000-line change in #4091.

And for my final trick, based on Sean's guess that the enum reordering here is the problem: moving the new entries to the end of the enum definition (see 41c74f6) and re-running my tests does in fact bring back the old `timeseries_key`.

So now we just need to figure out how to write some tests that prevent this from happening again.
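A small self-contained example of why reordering enum variants changes a derived hash; the types here are made up, not oximeter's actual field types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// "Before": String is the second variant.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueV1 {
    Uuid(u128),
    String(String),
}

// "After": inserting a new variant above String shifts its discriminant.
#[derive(Hash)]
#[allow(dead_code)]
enum FieldValueV2 {
    Uuid(u128),
    IpAddr(std::net::IpAddr),
    String(String),
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Identical payloads, different hashes: the derived impl hashes the
    // variant's discriminant, which is assigned by declaration order.
    let old = hash_of(&FieldValueV1::String("GET".to_string()));
    let new = hash_of(&FieldValueV2::String("GET".to_string()));
    assert_ne!(old, new);
}
```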
**Short Term**

This should let us unblock the release.

**Long Term**

If we work on #4008, the hashing algorithm could become stable, and we'd be protected against the Rust stdlib changing the `DefaultHasher` implementation. I don't think this is enough for correctness. According to https://doc.rust-lang.org/std/hash/trait.Hash.html#portability :

The change-detector tests should help us catch this, but even with a stable hasher, the contents being hashed could change between different toolchains / architectures. I think we need a fundamentally different way to calculate the `timeseries_key`.
Two approaches in the long term could potentially mitigate this issue.

**Hash it yourself**

Manually implementing `Hash` ourselves, so we control exactly which bytes are fed to the hasher.
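A sketch of what that could look like, using hypothetical types; the point is that the byte stream fed to the hasher is spelled out explicitly rather than derived:

```rust
use std::hash::{Hash, Hasher};

struct Field {
    name: String,
    value: String,
}

impl Hash for Field {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // An explicit, versioned layout: adding or reordering variants or
        // struct members elsewhere cannot silently change these bytes.
        state.write_u8(1); // layout version
        // Length-prefix the strings so the encoding stays prefix-free.
        state.write_u64(self.name.len() as u64);
        state.write(self.name.as_bytes());
        state.write_u64(self.value.len() as u64);
        state.write(self.value.as_bytes());
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    let field = Field { name: "method".to_string(), value: "GET".to_string() };
    let mut hasher = DefaultHasher::new();
    field.hash(&mut hasher);
    println!("{}", hasher.finish());
}
```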
**Use a stable hash of a stably-serialized output?**

Suppose we wanted to hash the contents of the struct shown here: omicron/oximeter/oximeter/src/types.rs Lines 248 to 255 in 47a6b42
I think we could probably pick a better serializer, but, as an example, we could do the following:

```rust
// Assume we start with some input to be hashed...
let value: FieldSet = ...;

// Convert from in-memory representation to a serialized representation that is stable.
// I HAVE NOT VALIDATED THAT SERDE_JSON WORKS HERE -- in fact, for maps, I know it
// would not necessarily be stable.
// However, this helps us avoid any changes to memory layout that could be imposed on us by
// https://doc.rust-lang.org/std/hash/trait.Hash.html#portability
let serialized_as_json = serde_json::to_string(&value).unwrap();

// Finally, feed this serialized format into a stable hasher.
// The serialized format should be stable, so the hash should be stable.
let timeseries_key = stable_hash(&serialized_as_json);
```

@jgallagher mentioned that https://crates.io/crates/hubpack could be a good fit in this area, though the lack of
Thank you all for tracking this down. In the interest of unblocking the release, I feel like we should revert both 23eee1a and 8ffd476. The former of those introduced a change in the hash by reordering the variants of an enum, as Sean noted. The latter did so by changing the order in which fields are submitted to the hasher.
This is just a latent bug. We have a method which looks up a timeseries schema, including fetching them all from the DB if it can't be found in the local cache. We could call that here, but it doesn't strictly prevent the race. I've considered moving schema to CRDB, which would address the consistency, but require a lot of other changes.
Though it is possible to revert 23eee1a, 8ffd476 was included in the last customer release, so reverting it might be a bit complicated? I suspect that 23eee1a changed almost all `timeseries_key`s.
Also, for later, @sunshowers mentioned https://crates.io/crates/bcs as a "stably serialized output" format, which seems like a great fit.
I'm a big fan of bcs.
I have two PRs to address this issue.
This PR, just like the prior changes which caused this issue, also changes the derivation of the `timeseries_key`. That being said, the stability-change PR doesn't fix this issue; it just prevents it in the future. That's where my second PR comes in:
For what it's worth, it looks independent of struct names / field variant names?
#4251:
- Uses [bcs](https://crates.io/crates/bcs) as a stable binary format for inputs to the `timeseries_key` function
- Uses [highway](https://crates.io/crates/highway) as a stable hash algorithm (it's keyed, fast, portable, and well-distributed)
- Additionally, adds an EXPECTORATE test validating the stability of `timeseries_key` values

Fixes #4008, and also addresses the issue raised in #4221 regarding stable input. NOTE: This PR itself *also* breaks the stability of the `timeseries_key` (hopefully for the last time), and will rely on #4246 to wipe the metrics DB for the next release.
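In rough terms, the new derivation combines those two pieces. The sketch below uses a hypothetical field type and key material, not oximeter's actual code:

```rust
use highway::{HighwayHash, HighwayHasher, Key};
use serde::Serialize;

// Hypothetical stand-in for the field data that feeds the key; the real
// input is oximeter's field types.
#[derive(Serialize)]
struct Field {
    name: String,
    value: String,
}

fn timeseries_key(fields: &[Field]) -> u64 {
    // bcs produces a canonical, stable byte encoding of the fields...
    let bytes = bcs::to_bytes(&fields).expect("fields are serializable");
    // ...and HighwayHash, with a fixed key so the result is deterministic,
    // produces a stable, portable 64-bit hash of those bytes. The key here
    // is a placeholder, not the constant oximeter actually uses.
    HighwayHasher::new(Key([0, 1, 2, 3])).hash64(&bytes)
}

fn main() {
    let fields = vec![
        Field { name: "method".into(), value: "GET".into() },
        Field { name: "route".into(), value: "/a".into() },
    ];
    println!("timeseries_key = {}", timeseries_key(&fields));
}
```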
Provides a mechanism to explicitly version Oximeter schema used within Clickhouse, and to wipe that data if any schema-breaking changes are made. This is retroactively intended to solve the problems discussed within #4221 , where the `timeseries_key` generation was altered.
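The gist of that mechanism, reduced to a hypothetical in-memory sketch (the real implementation talks to ClickHouse):

```rust
// Hypothetical stand-in for the metrics database state.
struct MetricsDb {
    stored_version: Option<u64>,
    data_wiped: bool,
}

// Bumped whenever a schema-breaking change is made, such as altering how
// timeseries_key is derived. The value here is illustrative.
const EXPECTED_VERSION: u64 = 2;

fn ensure_schema_version(db: &mut MetricsDb) {
    if db.stored_version != Some(EXPECTED_VERSION) {
        // Old-format data cannot be mixed with new-format data, so wipe it
        // and start fresh at the expected version.
        db.data_wiped = true;
        db.stored_version = Some(EXPECTED_VERSION);
    }
}

fn main() {
    let mut db = MetricsDb { stored_version: Some(1), data_wiped: false };
    ensure_schema_version(&mut db);
    assert!(db.data_wiped);
    assert_eq!(db.stored_version, Some(EXPECTED_VERSION));
}
```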
Both PRs have been merged. Once we validate this issue has been corrected on dogfood, I'll mark this as closed.
Confirmed that dogfood metrics are showing up -- I'll close this issue now.
Console is no longer getting back any metrics (for utilization or disk metrics). It's hitting 500 errors in Nexus, e.g.
I invoked a similar query via API without the `limit` parameter to see what would happen, and it's returning the same error: