[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

Merged: 5 commits, Oct 12, 2023
Changes from 1 commit
21 changes: 21 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -199,6 +199,7 @@ headers = "0.3.9"
heck = "0.4"
hex = "0.4.3"
hex-literal = "0.4.1"
highway = "1.1.0"
hkdf = "0.12.3"
http = "0.2.9"
httptest = "0.15.4"
4 changes: 4 additions & 0 deletions oximeter/db/Cargo.toml
@@ -8,10 +8,12 @@ license = "MPL-2.0"
[dependencies]
anyhow.workspace = true
async-trait.workspace = true
bcs.workspace = true
bytes = { workspace = true, features = [ "serde" ] }
chrono.workspace = true
clap.workspace = true
dropshot.workspace = true
highway.workspace = true
oximeter.workspace = true
regex.workspace = true
reqwest = { workspace = true, features = [ "json" ] }
@@ -28,9 +30,11 @@ uuid.workspace = true
omicron-workspace-hack.workspace = true

[dev-dependencies]
expectorate.workspace = true
itertools.workspace = true
omicron-test-utils.workspace = true
slog-dtrace.workspace = true
strum.workspace = true

[[bin]]
name = "oxdb"
94 changes: 91 additions & 3 deletions oximeter/db/src/lib.rs
@@ -335,11 +335,13 @@ pub(crate) fn timeseries_key_for<'a>(
target_fields: impl Iterator<Item = &'a Field>,
metric_fields: impl Iterator<Item = &'a Field>,
Collaborator
So one other issue I've been ambivalent about is including the timeseries name in the key. This impl will currently generate the same key for instances of two different timeseries that happen to have the same fields. The database currently includes an ordering key on the fields and measurement tables that looks like (timeseries_name, timeseries_key, ...), so the key only needs to be unique within a timeseries. I'm basically wondering whether we want it globally unique. Thoughts?

Contributor
I think it would be nice to disambiguate and I don't see much downside.

@smklein (Collaborator Author), Oct 12, 2023
One other thing -- I think the hashing is currently based purely on the Field structs:

/// A `Field` is a named aspect of a target or metric.
#[derive(
Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
)]
pub struct Field {
pub name: String,
pub value: FieldValue,
}

However, the target and metric also have names, which could be distinct from the fields:

// A helper type for representing the name and fields derived from targets and metrics
#[derive(Clone, Debug, PartialEq, JsonSchema, Deserialize, Serialize)]
pub(crate) struct FieldSet {
pub name: String,
pub fields: BTreeMap<String, Field>,
}

Should I be hashing this value too?

So in the case of:

Sample {
  measurement: N/A
  timeseries_name: "MyTimeseries",
  target: FieldSet {
    name: "MyTarget",
    fields: [ ("a", valueA), ("b", valueB)],
  },
  metric: FieldSet {
    name: "MyMetric",
    fields: [ ("c", valueC), ("d", valueD)],
  },
}

What we have been hashing -- even before this PR -- is the combined list of Field values, which is:

  • [ ("a", valueA), ("b", valueB), ("c", valueC), ("d", valueD)]

(NOTE: This means editing the "MyTarget" or "MyMetric" names will not change the hash, currently)

@smklein (Collaborator Author), Oct 12, 2023
(EDIT: UPDATED BASED ON FEEDBACK BELOW)

WDYT about hashing the following for a sample, in order (sketched below):

  1. The timeseries_name (which itself is derived from "target name + metric name")
  2. The target fields
  3. The metric fields
  4. The measurement datum type
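
For concreteness, a minimal sketch of what that ordering might look like, reusing the bcs-plus-HighwayHasher approach from the diff below. The free function name, the signature, and the assumption that DatumType implements Serialize are all illustrative, not the final implementation:

use highway::HighwayHasher;
use oximeter::{DatumType, Field};
use std::hash::{Hash, Hasher};

// Hypothetical sketch: hash the full identity of a sample in a fixed order.
// bcs provides a canonical byte encoding; HighwayHasher provides a portable,
// stable 64-bit hash of those bytes.
fn timeseries_key_for_sample<'a>(
    timeseries_name: &str,
    target_fields: impl Iterator<Item = &'a Field>,
    metric_fields: impl Iterator<Item = &'a Field>,
    datum_type: DatumType,
) -> u64 {
    let mut hasher = HighwayHasher::default();
    // 1. The timeseries name ("target_name:metric_name").
    bcs::to_bytes(&timeseries_name)
        .expect("serializing a string cannot fail")
        .hash(&mut hasher);
    // 2. The target fields, assumed to arrive sorted by field name.
    for field in target_fields {
        bcs::to_bytes(field)
            .expect("serializing a field cannot fail")
            .hash(&mut hasher);
    }
    // 3. The metric fields, also assumed sorted.
    for field in metric_fields {
        bcs::to_bytes(field)
            .expect("serializing a field cannot fail")
            .hash(&mut hasher);
    }
    // 4. The measurement datum type (assumes DatumType: Serialize).
    bcs::to_bytes(&datum_type)
        .expect("serializing a datum type cannot fail")
        .hash(&mut hasher);
    hasher.finish()
}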

) -> TimeseriesKey {
use std::collections::hash_map::DefaultHasher;
use highway::HighwayHasher;
use std::hash::{Hash, Hasher};
let mut hasher = DefaultHasher::new();
let mut hasher = HighwayHasher::default();
for field in target_fields.chain(metric_fields) {
Collaborator
I think we need to sort these before hashing.

With the change in 8ffd476, these should be sorted by the time the oximeter collector gets them, since they're deserialized into a BTreeMap. But one of the pitfalls of that change was that key generation should have been doing the sorting in the first place. Let's do it now, to avoid an implicit reliance on it again.
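
As a rough sketch of that suggestion, the sorting could live inside the key function itself, so the result never depends on the order in which callers supply fields (the PR ultimately made sorting explicit by changing the interface instead, per the reply below). The helper name here is hypothetical:

use highway::HighwayHasher;
use oximeter::Field;
use std::hash::{Hash, Hasher};

// Illustrative only: collect and sort each group of fields by name before
// hashing, so permuting the caller's input cannot change the resulting key.
fn timeseries_key_for_sorted<'a>(
    target_fields: impl Iterator<Item = &'a Field>,
    metric_fields: impl Iterator<Item = &'a Field>,
) -> u64 {
    let mut targets: Vec<_> = target_fields.collect();
    targets.sort_by(|a, b| a.name.cmp(&b.name));
    let mut metrics: Vec<_> = metric_fields.collect();
    metrics.sort_by(|a, b| a.name.cmp(&b.name));

    let mut hasher = HighwayHasher::default();
    for field in targets.into_iter().chain(metrics) {
        bcs::to_bytes(field)
            .expect("failed to serialize field to bytes")
            .hash(&mut hasher);
    }
    hasher.finish()
}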

Collaborator Author
I've added this in 2e91139 -- as expected, this does not change the hashes, since things are already sorted, but it does make it explicit.

I opted to alter the interface rather than re-sorting internally.

field.hash(&mut hasher);
bcs::to_bytes(&field)
.expect("Failed to serialized field to bytes")
.hash(&mut hasher);
}
hasher.finish()
}
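
Read without the diff interleaving, the updated function in this commit amounts to roughly the following; this is a reconstruction from the hunks above (relying on the file's existing Field and TimeseriesKey definitions), not a verbatim copy of the source:

use highway::HighwayHasher;
use std::hash::{Hash, Hasher};

pub(crate) fn timeseries_key_for<'a>(
    target_fields: impl Iterator<Item = &'a Field>,
    metric_fields: impl Iterator<Item = &'a Field>,
) -> TimeseriesKey {
    // HighwayHash is a documented, portable algorithm, whereas DefaultHasher's
    // algorithm is explicitly unspecified and may change across Rust releases,
    // which would silently break keys already persisted in the database.
    let mut hasher = HighwayHasher::default();
    for field in target_fields.chain(metric_fields) {
        // bcs gives a canonical serialization of each field, so the hashed
        // bytes no longer depend on the derived Hash impl of Field.
        bcs::to_bytes(&field)
            .expect("Failed to serialized field to bytes")
            .hash(&mut hasher);
    }
    hasher.finish()
}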
@@ -393,4 +395,90 @@ mod tests {
assert!(TimeseriesName::try_from("a:").is_err());
assert!(TimeseriesName::try_from("123").is_err());
}

#[test]
fn test_timeseries_key_generation_hashes_fields_sequentially() {
use super::timeseries_key_for;
use oximeter::{Field, FieldValue};

let f = |name: &str, value| Field { name: name.to_string(), value };

// Confirm that "targets" and "metrics" are interchangeable,
// we just hash everything sequentially.
assert_eq!(
timeseries_key_for(
[&f("a", FieldValue::String("a".to_string()))].into_iter(),
[&f("b", FieldValue::String("b".to_string()))].into_iter(),
),
timeseries_key_for(
[
&f("a", FieldValue::String("a".to_string())),
&f("b", FieldValue::String("b".to_string())),
]
.into_iter(),
[].into_iter(),
),
);

// However, order still matters ("a, b" != "b, a")
Collaborator
I think this test does accurately reflect the original implementation, but I think we don't want order to matter. What matters is the unordered set of field names, types, and values. Permuting them should not change the key, IMO, because that's really the same timeseries. So if we sort the keys in the function above, we should then assert_eq here.

Collaborator Author
I have removed this test -- "stability of a sample" is more important -- but yes, this is fixed.

For what it's worth, we're sorting "the target fields" and "the metric fields" separately, but I think we're on the same page here.

assert_ne!(
timeseries_key_for(
[&f("a", FieldValue::String("a".to_string()))].into_iter(),
[&f("b", FieldValue::String("b".to_string()))].into_iter(),
),
timeseries_key_for(
[&f("b", FieldValue::String("b".to_string()))].into_iter(),
[&f("a", FieldValue::String("a".to_string()))].into_iter(),
),
);
}

#[test]
fn test_timeseries_key_stability() {
use super::timeseries_key_for;
use oximeter::{Field, FieldValue};
use strum::EnumCount;

let values = [
("string", FieldValue::String(String::default())),
("i8", FieldValue::I8(-0x0A)),
("u8", FieldValue::U8(0x0A)),
("i16", FieldValue::I16(-0x0ABC)),
("u16", FieldValue::U16(0x0ABC)),
("i32", FieldValue::I32(-0x0ABC_0000)),
("u32", FieldValue::U32(0x0ABC_0000)),
("i64", FieldValue::I64(-0x0ABC_0000_0000_0000)),
("u64", FieldValue::U64(0x0ABC_0000_0000_0000)),
(
"ipaddr",
FieldValue::IpAddr(std::net::IpAddr::V4(
std::net::Ipv4Addr::LOCALHOST,
)),
),
("uuid", FieldValue::Uuid(uuid::Uuid::nil())),
("bool", FieldValue::Bool(true)),
];

// Exhaustively testing enums is a bit tricky. Although it's easy to
// check "all variants of an enum are matched", it harder to test "all
// variants of an enum have been supplied".
//
// We use this as a proxy, confirming that each variant is represented
// here for the purposes of tracking stability.
assert_eq!(values.len(), FieldValue::COUNT);

let mut output = vec![];
for (name, value) in values {
let key = timeseries_key_for(
[&Field { name: name.to_string(), value }].into_iter(),
[].into_iter(),
);
output.push(format!("{name} -> {key}"));
}

expectorate::assert_contents(
"test-output/timeseries-keys.txt",
&output.join("\n"),
);
}
}
12 changes: 12 additions & 0 deletions oximeter/db/test-output/timeseries-keys.txt
@@ -0,0 +1,12 @@
string -> 11027713821408113420
i8 -> 11040867383141339877
u8 -> 15607688254753473406
i16 -> 13295860386975871341
u16 -> 2145565374036862015
i32 -> 10931977742674674182
u32 -> 6138968585984979982
i64 -> 981603688282082647
u64 -> 4754979326506678930
ipaddr -> 17297273002898199682
uuid -> 9564253805848631232
bool -> 13392625023595096799
1 change: 1 addition & 0 deletions oximeter/oximeter/Cargo.toml
@@ -13,6 +13,7 @@ omicron-common.workspace = true
oximeter-macro-impl.workspace = true
schemars = { workspace = true, features = [ "uuid1", "bytes", "chrono" ] }
serde.workspace = true
strum.workspace = true
thiserror.workspace = true
uuid.workspace = true
omicron-workspace-hack.workspace = true
10 changes: 9 additions & 1 deletion oximeter/oximeter/src/types.rs
@@ -90,7 +90,15 @@ impl_field_type_from! { bool, FieldType::Bool }

/// The `FieldValue` contains the value of a target or metric field.
#[derive(
Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
Clone,
Debug,
Hash,
PartialEq,
Eq,
JsonSchema,
Serialize,
Deserialize,
strum::EnumCount,
)]
#[serde(tag = "type", content = "value", rename_all = "snake_case")]
pub enum FieldValue {