[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

smklein · 2023-10-11T04:32:43Z

Uses bcs as a stable binary format for inputs to the timeseries_key function
Uses highway as a stable hash algorithm (it's keyed, fast, portable, and well-distributed).
Additionally, adds an EXPECTORATE test validating the stability of timeseries_key values.

Fixes #4008 , and also addresses the issue raised in #4221 regarding stable input

NOTE: This PR itself also breaks the stability of the timeseries_key (hopefully for the last time), and will rely on #4246 to wipe the metrics DB for the next release.

oximeter/db/src/lib.rs

david-crespo

A+

bnaecker · 2023-10-11T18:12:01Z

oximeter/db/src/lib.rs

    use std::hash::{Hash, Hasher};
-    let mut hasher = DefaultHasher::new();
+    let mut hasher = HighwayHasher::default();
    for field in target_fields.chain(metric_fields) {


I think we need to sort these before hashing.

With the change in 8ffd476, these should be sorted when the oximeter collector gets them, since they're deserialized into a BTreeMap. But that was one of the pitfalls of that change, that the key-generation should have been sorting them in the first place. Let's do it now, to avoid an implicit reliance on it again.

I've added this in 2e91139 -- as expected, this does not change the hashes, since things are already sorted, but it does make it explicit.

I opted to alter the interface rather than re-sorting internally.

bnaecker · 2023-10-11T18:14:34Z

oximeter/db/src/lib.rs

+            ),
+        );
+
+        // However, order still matters ("a, b" != "b, a")


I think this test does accurately reflect the original implementation, but I think we don't want order to matter. What matters is the unordered set of field names, types, and values. Permuting them should not change the key, IMO, because that's really the same timeseries. So if we sort the keys in the function above, we should then assert_eq here.

I have removed this test -- "stability of a sample" is more important -- but yes, this is fixed.

For what it's worth, we're sorting "the targets fields" and "the metric fields" separately, but I think we're on the same page here.

bnaecker · 2023-10-11T18:26:51Z

oximeter/db/src/lib.rs

@@ -335,11 +335,13 @@ pub(crate) fn timeseries_key_for<'a>(
    target_fields: impl Iterator<Item = &'a Field>,
    metric_fields: impl Iterator<Item = &'a Field>,


So one other issue I've been ambivalent about is including the timeseries name in the key. This impl currently will generate the same key for instances of two different timeseries that happen to have the same fields. The database currently includes an ordering key on fields and measurement tables that looks like (timeseries_name, timeseries_key, ...), so the key only needs to be unique within a timeseries. I'm basically wondering whether we want it globablly unique. Thoughts?

I think it would be nice to disambiguate and I don't see much downside.

One other thing -- I think the hashing is currently based purely from the Field structs:

omicron/oximeter/oximeter/src/types.rs

Lines 248 to 255 in 7e88bdf

/// A `Field` is a named aspect of a target or metric.

#[derive(

Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,

)]

pub struct Field {

pub name: String,

pub value: FieldValue,

}

However, the target and metric also have names, which could be distinct from the fields:

omicron/oximeter/oximeter/src/types.rs

Lines 644 to 649 in 7e88bdf

// A helper type for representing the name and fields derived from targets and metrics

#[derive(Clone, Debug, PartialEq, JsonSchema, Deserialize, Serialize)]

pub(crate) struct FieldSet {

pub name: String,

pub fields: BTreeMap<String, Field>,

}

Should I be hashing this value too?

So in the case of:

Sample { measurement: N/A timeseries_name: "MyTimeseries", target: FieldSet { name: "MyTarget", fields: [ ("a", valueA), ("b", valueB)], }, metric: FieldSet { name: "MyMetric", fields: [ ("c", valueC), ("d", valueD)], }, }

The thing we have been hashing -- even before this PR -- are the Field values, which are:

[ ("a", valueA), ("b", valueB), ("c", valueC), ("d", valueD)]

(NOTE: This means editing the "MyTarget" or "MyMetric" names will not change the hash, currently)

(EDIT: UPDATED BASED ON FEEDBACK BELOW)

WDYT about hashing the following for a sample, in-order:

The timeseries_name (which itself is derived from "target name + metric name")

The target fields

The metric fields

The measurement datum type

bnaecker · 2023-10-12T19:54:04Z

The WIP I had to address this before the urgency of the 500s did that basically. Hash the timeseries name; target field names, types, and values; and then metric field names, types, and values; and datum type. I think with the nice canonical serialization strategy you’ve got here, we don’t need the types of the target and metric I think? That should be included by the serialized variant index, based on by reading of the bcs documentation. We don’t strictly need the names of the targets and metrics themselves. Those are concatenated into the timeseries name, so I think we can go without them. So definitely the timeseries name, and field names and values for the target and metric. I strongly believe each of those lists of name-value pairs needs to be sorted before hashing. I also think we need to include the datum type. If that we’re to change, I think we’d consider that a different timeseries.

bnaecker · 2023-10-12T20:00:25Z

Yep, sorting target fields and metric fields separately, then concatenations and hashing. Same page!

smklein · 2023-10-12T20:10:10Z

I think with the nice canonical serialization strategy you’ve got here, we don’t need the types of the target and metric I think? That should be included by the serialized variant index, based on by reading of the bcs documentation. We don’t strictly need the names of the targets and metrics themselves. Those are concatenated into the timeseries name, so I think we can go without them.

Oh, I see, we're saying "the timeseries name" is derived from the "target name + metric name", so it's implicit.

So definitely the timeseries name, and field names and values for the target and metric. I strongly believe each of those lists of name-value pairs needs to be sorted before hashing. I also think we need to include the datum type. If that we’re to change, I think we’d consider that a different timeseries.

Gotcha, updated in c0110b7

bnaecker · 2023-10-12T20:33:19Z

Sweet, this all looks very good! Thanks for addressing everything, and adding lots of comments around the details of the implementation. I’m happy with the whole change, so feel free to merge when you’re ready.

Use stable hash, stable format, for deriving timeseries_key

9c5a2ec

smklein mentioned this pull request Oct 11, 2023

Metrics API get requests return "Query must resolve to a single timeseries if limit is specified" #4221

Closed

smklein requested review from bnaecker and david-crespo October 11, 2023 04:44

smklein commented Oct 11, 2023

View reviewed changes

oximeter/db/src/lib.rs Outdated Show resolved Hide resolved

david-crespo reviewed Oct 12, 2023

View reviewed changes

oximeter/db/src/lib.rs Show resolved Hide resolved

david-crespo reviewed Oct 12, 2023

View reviewed changes

oximeter/db/src/lib.rs Show resolved Hide resolved

david-crespo approved these changes Oct 12, 2023

View reviewed changes

smklein added 2 commits October 12, 2023 10:26

Merge branch 'main' into oximeter-stable

c544c46

Better tests

929e761

smklein enabled auto-merge (squash) October 12, 2023 18:39

bnaecker reviewed Oct 12, 2023

View reviewed changes

smklein disabled auto-merge October 12, 2023 18:46

Sorted fields

2e91139

Hash party

c0110b7

bnaecker approved these changes Oct 12, 2023

View reviewed changes

smklein merged commit 7d55382 into main Oct 12, 2023
20 of 23 checks passed

smklein deleted the oximeter-stable branch October 12, 2023 21:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

smklein commented Oct 11, 2023 •

edited

Loading

david-crespo left a comment

bnaecker Oct 11, 2023

smklein Oct 12, 2023

bnaecker Oct 11, 2023

smklein Oct 12, 2023

bnaecker Oct 11, 2023

david-crespo Oct 12, 2023

smklein Oct 12, 2023 •

edited

Loading

smklein Oct 12, 2023 •

edited

Loading

bnaecker commented Oct 12, 2023 via email

bnaecker commented Oct 12, 2023 via email

smklein commented Oct 12, 2023

bnaecker commented Oct 12, 2023 via email

		@@ -335,11 +335,13 @@ pub(crate) fn timeseries_key_for<'a>(
		target_fields: impl Iterator<Item = &'a Field>,
		metric_fields: impl Iterator<Item = &'a Field>,

	/// A `Field` is a named aspect of a target or metric.
	#[derive(
	Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
	)]
	pub struct Field {
	pub name: String,
	pub value: FieldValue,
	}

	// A helper type for representing the name and fields derived from targets and metrics
	#[derive(Clone, Debug, PartialEq, JsonSchema, Deserialize, Serialize)]
	pub(crate) struct FieldSet {
	pub name: String,
	pub fields: BTreeMap<String, Field>,
	}

[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

Conversation

smklein commented Oct 11, 2023 • edited Loading

david-crespo left a comment

Choose a reason for hiding this comment

bnaecker Oct 11, 2023

Choose a reason for hiding this comment

smklein Oct 12, 2023

Choose a reason for hiding this comment

bnaecker Oct 11, 2023

Choose a reason for hiding this comment

smklein Oct 12, 2023

Choose a reason for hiding this comment

bnaecker Oct 11, 2023

Choose a reason for hiding this comment

david-crespo Oct 12, 2023

Choose a reason for hiding this comment

smklein Oct 12, 2023 • edited Loading

Choose a reason for hiding this comment

smklein Oct 12, 2023 • edited Loading

Choose a reason for hiding this comment

bnaecker commented Oct 12, 2023 via email

bnaecker commented Oct 12, 2023 via email

smklein commented Oct 12, 2023

bnaecker commented Oct 12, 2023 via email

smklein commented Oct 11, 2023 •

edited

Loading

smklein Oct 12, 2023 •

edited

Loading

smklein Oct 12, 2023 •

edited

Loading