-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add basic Oximeter query language #5273
Conversation
0365db0
to
e1f882c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben! This is a massive chunk of work and I can't wait to use OxQL. A few comments/questions from reading through the code - but in general I found this very approachable due to the nice structure of the code and lots of great comments and tests.
Ok, I believe I've addressed everything except for fixing the |
@rcgoodfellow Thanks for your patience while I tidied up the grouping operations. I've added an exhaustive test for the various kinds of inputs we expect to handle, at the end of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben! 🚢 🚢 🚢
Hmmm the |
I've also found a couple of big holes in the logic being used to fetch data from the DB. The code is trying to be clever, in order to minimize the amount of data we get, but it's not currently fetching the right datasets for more complicated logical conditions. I'll probably spend a day or two working on that and adding a bunch more tests before soliciting more feedback. (In addition to trying to flush out the failed update test.) |
This is probably #4949 |
Thanks @jgallagher, I hadn't yet looked at existing flakes! |
Alright, I think the latest commit (5dde3c3) is in a good place. I'd like to explain what was going on, since this was one of the most mind-bending bits of code I've written in a while. The problemSuppose we get an OxQL query like this:
That means, select all the data for the link
That is, we're selecting the last 1 minute of everything. What gives! The causeWhen we run an OxQL query, specifically the To get the keys, we take the parts of the filter that apply to the fields, and run them against all the joined field tables for that timeseries. So the query above returns two keys: Armed with a set of keys, we turn to fetching the measurements. Easy, just add the timestamp filters to an SELECT timeseries_key, start_time, timestamp, datum
FROM oximeter.measurements_cumulativeu64
WHERE greater(timestamp, '2024-03-27 03:58:13.436067166') AND
timeseries_name = 'physical_data_link:bytes_sent' AND timeseries_key IN (14331077795809521820,14801769476763106306)
ORDER BY timeseries_key, start_time, timestamp
LIMIT 1000001
FORMAT JSONEachRow This seems right, and it is, if the filtering clause only has conjunctions. But if you have disjunctions in there, then this is wrong -- we're running just one query against the fields, and then using all those keys together in one query that selects a time range. But the query at hand explicitly asks for different time ranges for the two keys! The solutionThis involved a long foray in to Boolean algebra, and more recursion than I care to admit. Basically, you need to rewrite the query into disjunctive normal form (DNF), which means it's expressed as only disjunctions of conjunctions: So instead of one query for the fields, we run as many as there are terms in the DNF-reduced version of the query. We then keep track of that filter expression and the keys it resolves, and add them together into the measurements query. That is, instead of doing one clause in the measurements query that looks like SELECT timeseries_key, start_time, timestamp, datum
FROM oximeter.measurements_cumulativeu64
WHERE timeseries_name = 'physical_data_link:bytes_sent' AND (
timeseries_key IN (14801769476763106306) OR
(greater(timestamp, '2024-03-27 04:09:17.907475882') AND timeseries_key IN (14331077795809521820))
)
ORDER BY timeseries_key, start_time, timestamp
LIMIT 1000001
FORMAT JSONEachRow So each of those consistent key groups is a separate conjunction of disjunctions itself in the SQL query, keeping the actual filter that applies to the measurements (timestamps, in this case) matched up to the timeseries keys that go with it. And when you run the query on the corrected code, you get this:
Note that I shortened the two time ranges a bit so things would fit on the page. But the important parts are:
TestsI added several more tests in response to this issue. The most relevant one is here, which tests this exactly query: selecting all of one timeseries, and part of another. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the hard work on this Ben! Things look like they're coming together nicely. Just a few comments/questions since my previous review.
// This makes me really nervous, so I'm adding an escape hatch that we | ||
// only allow a few iterations. If we've not simplified within that, | ||
// we'll just declare the expression too complicated to handle. | ||
const EXPR_COMPLEXITY_LIMIT: usize = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6 seems perhaps overconservative. I could imagine fairly basic queries that need more than 6 de Morgan transforms. I'd consider increasing this by several orders of magnitude.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I've been thinking about this. I don't actually think this is the right way to impose a limit, because the function is recursive! We need to pass in the current level, and just refuse to recurse more than some number of levels deep.
This value prevents handling expressions like this: a && (b || c || d .. )
, where there are more than 6 items in the right sub-expression. Each one requires a separate call to simplify_to_dnf()
to resolve, but at the same layer of the recursive call stack. That I agree can be much larger, though I'm not sure by how much. The recursive limit should be a separate thing, governing how deep the parse tree get when trying to simplify it. I don't have a good intuition for a limit there either. But both being O(100) seems like way more than enough, and still reasonably safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, O(100) seems good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welp, I've found a new way you can DOS the query parser! Just make an expression that's a few right-nested ANDs, such as a == 0 && (a == 0 && (a == 0 && ...)))
and BOOM, exponentially long parse times. Fun.
I think that means I need to add a stricter check in the top-level Query::new()
method, which does the parsing internally. That could do something very dumb, like check the number of &&
or parentheses, and bail because it looks suspicious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd check what's going on with the PEG/packrat parser there. I would not expect that kind of behavior for this grammar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also surprised. I've asked it to print debug information about the rules as it matches them. It doesn't seem obviously wrong, but it's also possible I've written them very poorly in a way that requires much more work than I expected. I'll keep digging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've a solution, and a small amount more insight. The Rust crate we're using here, peg
, is not a caching / packrat parser by default. But you can make it that way by decorating rules you expect caching to help. I don't think I quite appreciated the degree to which this helps us. When you nest 4 right-deep AND expressions, this goes from 8.16s down to 0.1s when we cache only one rule, the factor()
rule that matches pretty high-level things like !(foo == "bar")
. So I'll definitely be adding that!
What I don't quite get is why the rule is being called so many times at the exact same locations. Again in that 4-level nesting case, it's being called 139 times for the most deeply-nested expression. I mean, it sort of makes sense that this would be called frequently: factor()
itself as a rule shows up in several places, and is nested right inside itself. But this seems way beyond excessive. With caching, it's down to 7 times at most.
Another mark for writing our own parser. I'd like to defer that, since this PR is already way too big. But I'm on board.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added caching and new tests for complexity limits in f27906b
// If this is an XOR, rewrite it to a disjunction of conjunctions. | ||
// | ||
// If it is not, return a clone of self. | ||
fn rewrite_xor_to_disjunction(&self) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious - do we have to expand xor into more primitive boolean operators b/c we can't express it in the SQL directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can express it in SQL, but it's not easy to tell which parts of the query are independent in that case. The goal of the DNF craziness is to have a bunch of disjunctions, each of which can be run separately to get the group of timeseries keys consistent with it. So we rewrite a ^ b
into (a && !b) || (!a && b)
so that we have two such groups.
I could also be missing something though, and this isn't needed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the need to transform a query in a distributed conjunctive form into a distributed disjunctive form. But if there is no distribution in play because the operation can be executed as a primitive in the target execution environment (SQL in this case) do we really need a transformation that factors a disjunctive operation out of what is essentially a primitive operation from the perspective of the executor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I'm not misunderstanding your question, but I think we do need to rewrite XORs.
The purpose of the conversion to DNF is not because ClickHouse has any limitations in its SQL engine. All of the logical expressions OxQL supports can be written directly in ClickHouse SQL. We're doing the transformation so that the OxQL layer can figure out how to merge the data from the field tables with those from the measurements table, without first doing a complete join between them. I tried that with the SQL prototype, and it falls over pretty quickly as the data gets larger.
The example in one of my long-winded comments above is relevant here. Suppose you want to select all of one timeseries, and part of another. (That's not an XOR, but it is an OR, so has similar properties.) To do that, you have two choices:
- Run a complete inner join between the field tables and the measurements table, and then just select whatever was written in the OxQL query. No transformation needed! This is how the SQL prototype from a few months works, and it's great, until the data gets to even moderate size. At that point, the JOIN is enormous -- it takes up a huge amount of memory to duplicate all the fields for every measurement, and then we have to pay the cost to move all those over the network.
- Somehow select the subset you care about from both the field tables and measurement table, and only transfer those.
That latter choice is what I'm going for here. But to do that, you need to know which records from the field table the query implies, and which time ranges those records correspond to. Those time ranges can be different for different subsets of the field table, as they are if you want to select all of one timeseries and part of another. The code needs to store a mapping from (list of keys -> time range).
The disjunctive normal form is about knowing how to define each list of keys. Every disjunction is independent, so it is logically correct to independently fetch the keys for each of those, and map that to the time range that corresponds to it.
It's true that an XOR is a primitive in the target environment. It's also logically equivalent to an OR, meaning there could be two independent time ranges at play here. So I'm rewriting the XOR to an OR so that we can make sure we don't accidentally merge the wrong data.
I think there's a good optimization opportunity here! The conversion to DNF is really only needed if there is more than one time range at play in the query. If not, then we can do the original version of the query without any conversion. I certainly think we should do that at a later time, but it is an optimization and I wanted to get it correct first.
I hope that answers the question!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I'm starting to slowly come around. Maybe the thing that caused me to look at this a bit sideways was the fact that we're unconditionally factoring out the xor. But I think I see the issue where the xor is a part of a more complex expression and we need to fully normalize everything to ensure we get the filtering correct. Sorry for the confusion, still trying to wrap my head around the semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all, it's subtle and took me a long time to be able to even describe the problem! I'll draw up a figure too, a visual explanation might be more helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Ben! Latest additions all LGTM
- Add basic grammar for an Oximeter query language. Includes support for numeric, string, boolean, UUID, timestamp, IP address, and duration literals. Queries are constructed in a pipeline of "table operations", each of which operates on a set of timeseries and produces another set. - Implement temporal alignment, currently supporting one method that generates output samples from the mean of the inputs over the alignment period. - Add basic subquery support, for fetching multiple timeseries and joining them - Implement filtering on fields and timestamps, both in the DB as much as possible, and the query pipeline; and implement filtering on data values in code. - Implement group-by support, where we can currently reduce values within a group by summing or computing the mean. - Add public Nexus API endpoints for listing timeseries schema, and running an OxQL query. Both are currently restricted to fleet readers, until a more thorough authz process is fleshed out. - This also reorganizes the internals of the `oximeter_db::client` module, which were starting to get too unwieldy and littered with conditional compilation directives.
- Typos and comment cleanup. - Keep SQL prototype as default feature - Consts for common strings - Some DCE - Fixes to point cast method, and added tests for all supported casts
- Add query ID to the OxQL `QueryResult` object - Add the query ID to USDT probes - Add the query ID to a child query logger, which is used in all future operations for the query, so we have that ID associated with all messages. - Clarify that duration constants may be approximate.
This Implements two new limits on query complexity: - Maximum amount by which we allow an alignment query to upsample the data, and - Maximum number of total measurement values fetched from ClickHouse The first is easy to check during the `align` table operation, and I've added tests for it. The second is checked by continually updating the total number of rows we receive from the database on each measurement query. This also changes the way we _ensure_ that limit. Previously, we ran the measurements query twice: once to check that it didn't return too much data, and the second to fetch the actual results. That's an obvious TOCTOU, and inefficient, but it was simple. Instead, this commit changes way we do that check, by applying a limit clause to the next measurement query. The limit we use is one more than the remainder in our allotment. That way, we can tell if the query is too big, without running the query multiple times or returning far more data than we want to allow. We don't know how _much_ bigger the actual result set is than what we want, but we do know that it is bigger.
During early development, I implemented a short-cut for group_by operations, which grouped values by _index_ not timestamp. That worked because the inputs were aligned and usually from timeseries with exactly overlapping points. It's not really correct, however. We _want_ to group data points in a timeseries both by their group identity and _also_ by lining up those that occur at the same time. This commit implements correct grouping by finding matching output timestamps in aligned input timeseries, and averaging or summing only those at the same time. Care is taken to not include missing points when computing the averages (or sums). This also found a bug in the alignment code. That used the `Iterator::sum` implementation when averaging values within a time window. That method doesn't return `None` if there is no data, it always returns the zero value for the input. That caused the grouping code to convert missing values to zeros, which when you're summing is fine, but not when averaging.
- Add XOR support. This requires a richer return type from the functions applying a filter to a field, since we need to know if it's true, false, or dont-care. The latter was not representable before. - Support negated filtering _expressions_, not just atoms - Year and month duration literals are capitalized, the rest are lowercase
…er use predicate logic
- Fix up some comments - Break filter expression complexity limits into recursive and iterative limits, and add tests for them - Add caching to `peg` grammar, to avoid wildly exponential runtimes
7c080d6
to
6f31b08
Compare
Thanks for all your feedback and testing @rcgoodfellow! I've fixed up conflicts with main and set to automerge. |
oximeter_db::client
module, which were starting to get too unwieldy and littered with conditional compilation directives.