Sort fields when extracting timeseries schema #4312

bnaecker · 2023-10-21T21:00:59Z

Fields are reported in a sample via the implementations of Target and Metric, which may or may not be sorted. They'll be in declaration order, if folks derive the traits. When deriving a schema from the sample, collect fields into a set to ignore order.
Convert between a DbFieldList and BTreeSet when inserting / reading fields from the nested tables in ClickHouse.
Add sanity test that we're sorting field schema correctly.
Errors for schema mismatches report entire schema, not just fields.

bnaecker · 2023-10-21T21:02:43Z

This is intended to help smoke out #4220, but may actually resolve that as well. We collect fields in the order their defined from the sample, which may not be sorted. This stores them in a set to ensure sorting by name.

bnaecker · 2023-10-21T23:33:44Z

Ok, I'm looking more at this and am confused. The fields that are causing the issue are all from the target. A Sample does in fact store and return those sorted by name, so that means we must have derived a schema where the fields are sorted by name. Unless I'm missing something, this means we have data in the database where they're not sorted by name, so that the derived sample doesn't match. That's pretty surprising, since we previously wiped the database as part of the updates around #4008.

I'm going to wait for dogfood to come back up and look in the database. I may scale this back to only change how the error is printed if I can't get a good theory for where this error comes from.

bnaecker · 2023-10-23T19:01:17Z

I've updated this to be slightly larger than originally intended. It addresses the point @smklein and I discussed here, where we don't actually check the database for existing schema when a sample is seen for the first time. It also sorts all the fields, from both the target and the metric, when deriving the schema. It also does that when reading from the DB, to handle schema in the DB which were inserted prior to this change. Sorting all fields should give us a consistent representation of the schema. I'm going to deploy this on madrid to make sure that didn't break anything else I'm not considering.

- Fields are reported in a sample sorted within a target or metric, but not necessarily between them. When we derive a schema from a sample, this commit collects fields into a set to impose a total order. - Convert between a `DbFieldList` and `BTreeSet` when inserting / reading fields from the nested tables in ClickHouse. - Add sanity test that we're sorting field schema correctly, including on read from the database. - Errors for schema mismatches report entire schema, not just fields. - Logic around new schema was pretty complicated, and it was difficult to reason about its correctness. Make the lock async, and check both the internal cache and database when looking for a schema for a new sample. - Add test that we don't add a schema to the DB again when it exists in the DB, but not the internal cache. Instead, we update the cache.

bnaecker · 2023-10-23T20:10:57Z

Ok, I've tested this on madrid several times. The big upshot is that #4220 would have been much easier to debug:

20:09:13.761Z ERRO oximeter (oximeter-agent): timeseries schema mismatch, sample will be skipped
    actual = TimeseriesSchema { timeseries_name: TimeseriesName("data_link:link_up"), field_schema: {FieldSchema { name: "link_id", ty: I64, source: Target }, FieldSchema { name: "port_id", ty: String, source: Target }, FieldSchema { name: "rack_id", ty: Uuid, source: Target }, FieldSchema { name: "sidecar_id", ty: Uuid, source: Target }, FieldSchema { name: "sled_id", ty: Uuid, source: Target }}, datum_type: CumulativeI64, created: 2023-10-23T20:09:13.761288693Z }
    collector_id = 82b3c525-cf15-40e1-8a9d-9bf5eb4cb3af
    expected = TimeseriesSchema { timeseries_name: TimeseriesName("data_link:link_up"), field_schema: {FieldSchema { name: "link_id", ty: I64, source: Target }, FieldSchema { name: "port_id", ty: String, source: Target }, FieldSchema { name: "rack_id", ty: Uuid, source: Target }, FieldSchema { name: "sidecar_id", ty: Uuid, source: Target }, FieldSchema { name: "sled_id", ty: Uuid, source: Target }}, datum_type: Bool, created: 2023-10-20T21:20:50.954995640Z }
    file = oximeter/db/src/client.rs:401
    id = 17b64209-cdcd-490c-b961-2c4212821cbd
    sample = Sample { measurement: Measurement { timestamp: 2023-10-23T20:09:13.630990476Z, datum: CumulativeI64(Cumulative { start_time: 2023-10-20T23:06:25.888054024Z, value: 1 }) }, timeseries_name: "data_link:link_up", target: FieldSet { name: "data_link", fields: {"link_id": Field { name: "link_id", value: I64(0) }, "port_id": Field { name: "port_id", value: String("qsfp0") }, "rack_id": Field { name: "rack_id", value: Uuid(0b26ba18-7ca7-42c6-8450-d0fcdcaad0c0) }, "sidecar_id": Field { name: "sidecar_id", value: Uuid(00000000-0000-0000-0802-5dbb797061d4) }, "sled_id": Field { name: "sled_id", value: Uuid(c3bce666-0593-4e24-baca-c8bd121c9cda) }} }, metric: FieldSet { name: "link_up", fields: {} } }

Second, we now do not insert additional copies of the timeseries schema into the database, each time we see a sample for the first time. Instead, we check the internal cache and the database, only inserting if the schema appears in neither. I have a unit test which checks this, but I've also seen that the number of entries in the schema table the live madrid ClickHosue database does not increase if we restart oximeter.

oximeter/db/src/client.rs

bnaecker force-pushed the improve-oximeter-error-reporting branch 2 times, most recently from faa91d9 to 24fddc9 Compare October 23, 2023 16:00

bnaecker force-pushed the improve-oximeter-error-reporting branch from 24fddc9 to 1a4fe9e Compare October 23, 2023 19:39

bnaecker force-pushed the improve-oximeter-error-reporting branch from 1a4fe9e to aa81e57 Compare October 23, 2023 19:56

bnaecker requested review from davepacheco, smklein and david-crespo October 23, 2023 20:12

david-crespo reviewed Oct 24, 2023

View reviewed changes

oximeter/db/src/client.rs Show resolved Hide resolved

Review feedback

a67f226

bnaecker requested a review from david-crespo October 24, 2023 20:11

david-crespo reviewed Oct 24, 2023

View reviewed changes

oximeter/db/src/client.rs Show resolved Hide resolved

david-crespo approved these changes Oct 24, 2023

View reviewed changes

Improve comment around locking schema map

c1b2875

bnaecker enabled auto-merge (squash) October 24, 2023 20:35

bnaecker merged commit 9daf99b into main Oct 24, 2023
20 of 21 checks passed

bnaecker deleted the improve-oximeter-error-reporting branch October 24, 2023 21:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sort fields when extracting timeseries schema #4312

Sort fields when extracting timeseries schema #4312

bnaecker commented Oct 21, 2023

bnaecker commented Oct 21, 2023

bnaecker commented Oct 21, 2023

bnaecker commented Oct 23, 2023

bnaecker commented Oct 23, 2023

Sort fields when extracting timeseries schema #4312

Sort fields when extracting timeseries schema #4312

Conversation

bnaecker commented Oct 21, 2023

bnaecker commented Oct 21, 2023

bnaecker commented Oct 21, 2023

bnaecker commented Oct 23, 2023

bnaecker commented Oct 23, 2023