Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort fields when extracting timeseries schema #4312

Merged
merged 3 commits into from
Oct 24, 2023

Conversation

bnaecker
Copy link
Collaborator

  • Fields are reported in a sample via the implementations of Target and Metric, which may or may not be sorted. They'll be in declaration order, if folks derive the traits. When deriving a schema from the sample, collect fields into a set to ignore order.
  • Convert between a DbFieldList and BTreeSet when inserting / reading fields from the nested tables in ClickHouse.
  • Add sanity test that we're sorting field schema correctly.
  • Errors for schema mismatches report entire schema, not just fields.

@bnaecker
Copy link
Collaborator Author

This is intended to help smoke out #4220, but may actually resolve that as well. We collect fields in the order their defined from the sample, which may not be sorted. This stores them in a set to ensure sorting by name.

@bnaecker
Copy link
Collaborator Author

Ok, I'm looking more at this and am confused. The fields that are causing the issue are all from the target. A Sample does in fact store and return those sorted by name, so that means we must have derived a schema where the fields are sorted by name. Unless I'm missing something, this means we have data in the database where they're not sorted by name, so that the derived sample doesn't match. That's pretty surprising, since we previously wiped the database as part of the updates around #4008.

I'm going to wait for dogfood to come back up and look in the database. I may scale this back to only change how the error is printed if I can't get a good theory for where this error comes from.

@bnaecker bnaecker force-pushed the improve-oximeter-error-reporting branch 2 times, most recently from faa91d9 to 24fddc9 Compare October 23, 2023 16:00
@bnaecker
Copy link
Collaborator Author

I've updated this to be slightly larger than originally intended. It addresses the point @smklein and I discussed here, where we don't actually check the database for existing schema when a sample is seen for the first time. It also sorts all the fields, from both the target and the metric, when deriving the schema. It also does that when reading from the DB, to handle schema in the DB which were inserted prior to this change. Sorting all fields should give us a consistent representation of the schema. I'm going to deploy this on madrid to make sure that didn't break anything else I'm not considering.

@bnaecker bnaecker force-pushed the improve-oximeter-error-reporting branch from 24fddc9 to 1a4fe9e Compare October 23, 2023 19:39
- Fields are reported in a sample sorted within a target or metric, but
  not necessarily between them. When we derive a schema from a sample,
  this commit collects fields into a set to impose a total order.
- Convert between a `DbFieldList` and `BTreeSet` when inserting /
  reading fields from the nested tables in ClickHouse.
- Add sanity test that we're sorting field schema correctly, including
  on read from the database.
- Errors for schema mismatches report entire schema, not just fields.
- Logic around new schema was pretty complicated, and it was difficult
  to reason about its correctness. Make the lock async, and check both
  the internal cache and database when looking for a schema for a new
  sample.
- Add test that we don't add a schema to the DB again when it exists in
  the DB, but not the internal cache. Instead, we update the cache.
@bnaecker bnaecker force-pushed the improve-oximeter-error-reporting branch from 1a4fe9e to aa81e57 Compare October 23, 2023 19:56
@bnaecker
Copy link
Collaborator Author

Ok, I've tested this on madrid several times. The big upshot is that #4220 would have been much easier to debug:

20:09:13.761Z ERRO oximeter (oximeter-agent): timeseries schema mismatch, sample will be skipped
    actual = TimeseriesSchema { timeseries_name: TimeseriesName("data_link:link_up"), field_schema: {FieldSchema { name: "link_id", ty: I64, source: Target }, FieldSchema { name: "port_id", ty: String, source: Target }, FieldSchema { name: "rack_id", ty: Uuid, source: Target }, FieldSchema { name: "sidecar_id", ty: Uuid, source: Target }, FieldSchema { name: "sled_id", ty: Uuid, source: Target }}, datum_type: CumulativeI64, created: 2023-10-23T20:09:13.761288693Z }
    collector_id = 82b3c525-cf15-40e1-8a9d-9bf5eb4cb3af
    expected = TimeseriesSchema { timeseries_name: TimeseriesName("data_link:link_up"), field_schema: {FieldSchema { name: "link_id", ty: I64, source: Target }, FieldSchema { name: "port_id", ty: String, source: Target }, FieldSchema { name: "rack_id", ty: Uuid, source: Target }, FieldSchema { name: "sidecar_id", ty: Uuid, source: Target }, FieldSchema { name: "sled_id", ty: Uuid, source: Target }}, datum_type: Bool, created: 2023-10-20T21:20:50.954995640Z }
    file = oximeter/db/src/client.rs:401
    id = 17b64209-cdcd-490c-b961-2c4212821cbd
    sample = Sample { measurement: Measurement { timestamp: 2023-10-23T20:09:13.630990476Z, datum: CumulativeI64(Cumulative { start_time: 2023-10-20T23:06:25.888054024Z, value: 1 }) }, timeseries_name: "data_link:link_up", target: FieldSet { name: "data_link", fields: {"link_id": Field { name: "link_id", value: I64(0) }, "port_id": Field { name: "port_id", value: String("qsfp0") }, "rack_id": Field { name: "rack_id", value: Uuid(0b26ba18-7ca7-42c6-8450-d0fcdcaad0c0) }, "sidecar_id": Field { name: "sidecar_id", value: Uuid(00000000-0000-0000-0802-5dbb797061d4) }, "sled_id": Field { name: "sled_id", value: Uuid(c3bce666-0593-4e24-baca-c8bd121c9cda) }} }, metric: FieldSet { name: "link_up", fields: {} } }

Second, we now do not insert additional copies of the timeseries schema into the database, each time we see a sample for the first time. Instead, we check the internal cache and the database, only inserting if the schema appears in neither. I have a unit test which checks this, but I've also seen that the number of entries in the schema table the live madrid ClickHosue database does not increase if we restart oximeter.

@bnaecker bnaecker requested a review from david-crespo October 24, 2023 20:11
@bnaecker bnaecker enabled auto-merge (squash) October 24, 2023 20:35
@bnaecker bnaecker merged commit 9daf99b into main Oct 24, 2023
20 of 21 checks passed
@bnaecker bnaecker deleted the improve-oximeter-error-reporting branch October 24, 2023 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants