-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A+
oximeter/db/src/lib.rs
Outdated
use std::hash::{Hash, Hasher}; | ||
let mut hasher = DefaultHasher::new(); | ||
let mut hasher = HighwayHasher::default(); | ||
for field in target_fields.chain(metric_fields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to sort these before hashing.
With the change in 8ffd476, these should be sorted when the oximeter
collector gets them, since they're deserialized into a BTreeMap
. But that was one of the pitfalls of that change, that the key-generation should have been sorting them in the first place. Let's do it now, to avoid an implicit reliance on it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this in 2e91139 -- as expected, this does not change the hashes, since things are already sorted, but it does make it explicit.
I opted to alter the interface rather than re-sorting internally.
oximeter/db/src/lib.rs
Outdated
), | ||
); | ||
|
||
// However, order still matters ("a, b" != "b, a") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test does accurately reflect the original implementation, but I think we don't want order to matter. What matters is the unordered set of field names, types, and values. Permuting them should not change the key, IMO, because that's really the same timeseries. So if we sort the keys in the function above, we should then assert_eq
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed this test -- "stability of a sample" is more important -- but yes, this is fixed.
For what it's worth, we're sorting "the targets fields" and "the metric fields" separately, but I think we're on the same page here.
oximeter/db/src/lib.rs
Outdated
@@ -335,11 +335,13 @@ pub(crate) fn timeseries_key_for<'a>( | |||
target_fields: impl Iterator<Item = &'a Field>, | |||
metric_fields: impl Iterator<Item = &'a Field>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So one other issue I've been ambivalent about is including the timeseries name in the key. This impl currently will generate the same key for instances of two different timeseries that happen to have the same fields. The database currently includes an ordering key on fields and measurement tables that looks like (timeseries_name, timeseries_key, ...)
, so the key only needs to be unique within a timeseries. I'm basically wondering whether we want it globablly unique. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to disambiguate and I don't see much downside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thing -- I think the hashing is currently based purely from the Field
structs:
omicron/oximeter/oximeter/src/types.rs
Lines 248 to 255 in 7e88bdf
/// A `Field` is a named aspect of a target or metric. | |
#[derive( | |
Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize, | |
)] | |
pub struct Field { | |
pub name: String, | |
pub value: FieldValue, | |
} |
However, the target and metric also have names, which could be distinct from the fields:
omicron/oximeter/oximeter/src/types.rs
Lines 644 to 649 in 7e88bdf
// A helper type for representing the name and fields derived from targets and metrics | |
#[derive(Clone, Debug, PartialEq, JsonSchema, Deserialize, Serialize)] | |
pub(crate) struct FieldSet { | |
pub name: String, | |
pub fields: BTreeMap<String, Field>, | |
} |
Should I be hashing this value too?
So in the case of:
Sample {
measurement: N/A
timeseries_name: "MyTimeseries",
target: FieldSet {
name: "MyTarget",
fields: [ ("a", valueA), ("b", valueB)],
},
metric: FieldSet {
name: "MyMetric",
fields: [ ("c", valueC), ("d", valueD)],
},
}
The thing we have been hashing -- even before this PR -- are the Field
values, which are:
[ ("a", valueA), ("b", valueB), ("c", valueC), ("d", valueD)]
(NOTE: This means editing the "MyTarget" or "MyMetric" names will not change the hash, currently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(EDIT: UPDATED BASED ON FEEDBACK BELOW)
WDYT about hashing the following for a sample, in-order:
- The timeseries_name (which itself is derived from "target name + metric name")
- The target fields
- The metric fields
- The measurement datum type
The WIP I had to address this before the urgency of the 500s did that basically. Hash the timeseries name; target field names, types, and values; and then metric field names, types, and values; and datum type.
I think with the nice canonical serialization strategy you’ve got here, we don’t need the types of the target and metric I think? That should be included by the serialized variant index, based on by reading of the bcs documentation.
We don’t strictly need the names of the targets and metrics themselves. Those are concatenated into the timeseries name, so I think we can go without them.
So definitely the timeseries name, and field names and values for the target and metric. I strongly believe each of those lists of name-value pairs needs to be sorted before hashing. I also think we need to include the datum type. If that we’re to change, I think we’d consider that a different timeseries.
|
Yep, sorting target fields and metric fields separately, then concatenations and hashing. Same page!
|
Oh, I see, we're saying "the timeseries name" is derived from the "target name + metric name", so it's implicit.
Gotcha, updated in c0110b7 |
Sweet, this all looks very good! Thanks for addressing everything, and adding lots of comments around the details of the implementation.
I’m happy with the whole change, so feel free to merge when you’re ready.
|
timeseries_key
functionFixes #4008 , and also addresses the issue raised in #4221 regarding stable input
NOTE: This PR itself also breaks the stability of the
timeseries_key
(hopefully for the last time), and will rely on #4246 to wipe the metrics DB for the next release.