Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[oximeter] Use stable hash, stable format, for deriving timeseries_key #4251

Merged
merged 5 commits into from
Oct 12, 2023

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Oct 11, 2023

  • Uses bcs as a stable binary format for inputs to the timeseries_key function
  • Uses highway as a stable hash algorithm (it's keyed, fast, portable, and well-distributed).
  • Additionally, adds an EXPECTORATE test validating the stability of timeseries_key values.

Fixes #4008 , and also addresses the issue raised in #4221 regarding stable input

NOTE: This PR itself also breaks the stability of the timeseries_key (hopefully for the last time), and will rely on #4246 to wipe the metrics DB for the next release.

oximeter/db/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@david-crespo david-crespo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A+

@smklein smklein enabled auto-merge (squash) October 12, 2023 18:39
use std::hash::{Hash, Hasher};
let mut hasher = DefaultHasher::new();
let mut hasher = HighwayHasher::default();
for field in target_fields.chain(metric_fields) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to sort these before hashing.

With the change in 8ffd476, these should be sorted when the oximeter collector gets them, since they're deserialized into a BTreeMap. But that was one of the pitfalls of that change, that the key-generation should have been sorting them in the first place. Let's do it now, to avoid an implicit reliance on it again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this in 2e91139 -- as expected, this does not change the hashes, since things are already sorted, but it does make it explicit.

I opted to alter the interface rather than re-sorting internally.

),
);

// However, order still matters ("a, b" != "b, a")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test does accurately reflect the original implementation, but I think we don't want order to matter. What matters is the unordered set of field names, types, and values. Permuting them should not change the key, IMO, because that's really the same timeseries. So if we sort the keys in the function above, we should then assert_eq here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this test -- "stability of a sample" is more important -- but yes, this is fixed.

For what it's worth, we're sorting "the targets fields" and "the metric fields" separately, but I think we're on the same page here.

@@ -335,11 +335,13 @@ pub(crate) fn timeseries_key_for<'a>(
target_fields: impl Iterator<Item = &'a Field>,
metric_fields: impl Iterator<Item = &'a Field>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one other issue I've been ambivalent about is including the timeseries name in the key. This impl currently will generate the same key for instances of two different timeseries that happen to have the same fields. The database currently includes an ordering key on fields and measurement tables that looks like (timeseries_name, timeseries_key, ...), so the key only needs to be unique within a timeseries. I'm basically wondering whether we want it globablly unique. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to disambiguate and I don't see much downside.

Copy link
Collaborator Author

@smklein smklein Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thing -- I think the hashing is currently based purely from the Field structs:

/// A `Field` is a named aspect of a target or metric.
#[derive(
Clone, Debug, Hash, PartialEq, Eq, JsonSchema, Serialize, Deserialize,
)]
pub struct Field {
pub name: String,
pub value: FieldValue,
}

However, the target and metric also have names, which could be distinct from the fields:

// A helper type for representing the name and fields derived from targets and metrics
#[derive(Clone, Debug, PartialEq, JsonSchema, Deserialize, Serialize)]
pub(crate) struct FieldSet {
pub name: String,
pub fields: BTreeMap<String, Field>,
}

Should I be hashing this value too?

So in the case of:

Sample {
  measurement: N/A
  timeseries_name: "MyTimeseries",
  target: FieldSet {
    name: "MyTarget",
    fields: [ ("a", valueA), ("b", valueB)],
  },
  metric: FieldSet {
    name: "MyMetric",
    fields: [ ("c", valueC), ("d", valueD)],
  },
}

The thing we have been hashing -- even before this PR -- are the Field values, which are:

  • [ ("a", valueA), ("b", valueB), ("c", valueC), ("d", valueD)]

(NOTE: This means editing the "MyTarget" or "MyMetric" names will not change the hash, currently)

Copy link
Collaborator Author

@smklein smklein Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(EDIT: UPDATED BASED ON FEEDBACK BELOW)

WDYT about hashing the following for a sample, in-order:

  1. The timeseries_name (which itself is derived from "target name + metric name")
  2. The target fields
  3. The metric fields
  4. The measurement datum type

@smklein smklein disabled auto-merge October 12, 2023 18:46
@bnaecker
Copy link
Collaborator

bnaecker commented Oct 12, 2023 via email

@bnaecker
Copy link
Collaborator

bnaecker commented Oct 12, 2023 via email

@smklein
Copy link
Collaborator Author

smklein commented Oct 12, 2023

I think with the nice canonical serialization strategy you’ve got here, we don’t need the types of the target and metric I think? That should be included by the serialized variant index, based on by reading of the bcs documentation. We don’t strictly need the names of the targets and metrics themselves. Those are concatenated into the timeseries name, so I think we can go without them.

Oh, I see, we're saying "the timeseries name" is derived from the "target name + metric name", so it's implicit.

So definitely the timeseries name, and field names and values for the target and metric. I strongly believe each of those lists of name-value pairs needs to be sorted before hashing. I also think we need to include the datum type. If that we’re to change, I think we’d consider that a different timeseries.

Gotcha, updated in c0110b7

@bnaecker
Copy link
Collaborator

bnaecker commented Oct 12, 2023 via email

@smklein smklein merged commit 7d55382 into main Oct 12, 2023
20 of 23 checks passed
@smklein smklein deleted the oximeter-stable branch October 12, 2023 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timeseries need to use a stable hash algorithm for computing keys
3 participants