-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sort fields when extracting timeseries schema #4312
Conversation
This is intended to help smoke out #4220, but may actually resolve that as well. We collect fields in the order their defined from the sample, which may not be sorted. This stores them in a set to ensure sorting by name. |
Ok, I'm looking more at this and am confused. The fields that are causing the issue are all from the target. A I'm going to wait for dogfood to come back up and look in the database. I may scale this back to only change how the error is printed if I can't get a good theory for where this error comes from. |
faa91d9
to
24fddc9
Compare
I've updated this to be slightly larger than originally intended. It addresses the point @smklein and I discussed here, where we don't actually check the database for existing schema when a sample is seen for the first time. It also sorts all the fields, from both the target and the metric, when deriving the schema. It also does that when reading from the DB, to handle schema in the DB which were inserted prior to this change. Sorting all fields should give us a consistent representation of the schema. I'm going to deploy this on |
24fddc9
to
1a4fe9e
Compare
- Fields are reported in a sample sorted within a target or metric, but not necessarily between them. When we derive a schema from a sample, this commit collects fields into a set to impose a total order. - Convert between a `DbFieldList` and `BTreeSet` when inserting / reading fields from the nested tables in ClickHouse. - Add sanity test that we're sorting field schema correctly, including on read from the database. - Errors for schema mismatches report entire schema, not just fields. - Logic around new schema was pretty complicated, and it was difficult to reason about its correctness. Make the lock async, and check both the internal cache and database when looking for a schema for a new sample. - Add test that we don't add a schema to the DB again when it exists in the DB, but not the internal cache. Instead, we update the cache.
1a4fe9e
to
aa81e57
Compare
Ok, I've tested this on
Second, we now do not insert additional copies of the timeseries schema into the database, each time we see a sample for the first time. Instead, we check the internal cache and the database, only inserting if the schema appears in neither. I have a unit test which checks this, but I've also seen that the number of entries in the schema table the live |
Target
andMetric
, which may or may not be sorted. They'll be in declaration order, if folks derive the traits. When deriving a schema from the sample, collect fields into a set to ignore order.DbFieldList
andBTreeSet
when inserting / reading fields from the nested tables in ClickHouse.