Rethinking query semantics for aggregated caching #8283
Labels: 💬 discussion · 🚀 performance (Optimization, memory use, etc) · project-many-entities-perf · 🔍 re_query (affects re_query itself)
We want to be able to cache aggregations of Chunks (where the aggregated data can take any form, as long as it is sharded along its source primary Chunks), regardless of how these chunks came together.
This will be a very important part of our upcoming Chunk processing primitives, which are themselves an important part of making Rerun work with many entities.
Specifically, we want to be able to cache Chunks that were range-zipped together.
`RangeZip` is a well-defined, deterministic operation that we do all over the place, any time we need to interlace data from multiple streams together. It is the core joining operation in Rerun: grab columnar data from different components, and zip them together over time (where "time" really means "index" and is defined as a tuple of `(Timeline, RowId)` [1]).

Unfortunately, as of today, range-zipping will seemingly behave differently depending on how the data was queried / put together: iterating raw Chunks vs. `AggregatedLatestAt` vs. `AggregatedRange`, etc. How is it even possible that the origin of the data impacts this operation at all, and what can we do to get out of this mess?
Fixing all these problems is a prerequisite for being able to cache aggregated Chunks.
This document summarizes everything there is to know about existing query pitfalls, and proposes semantic changes in order to make aggregated caching attainable.
It also implicitly names a bunch of concepts that were until now anonymous, and defines a syntax to talk about Chunks, and aggregations of Chunks.
[1] Well, sometimes... Read on.
## Background: range-zipping Chunks
Say you have the following two chunks:
Range-zipping is the operation that will merge these two streams of data into a single logical stream. It is a form of aggregation, and it's used by all of our visualizers in order to merge the many different streams of components they get from executing `LatestAt` and/or `Range` queries.

I will refer to a series of `LatestAt`/`Range` queries followed by a `RangeZip` as `AggregatedLatestAt`/`AggregatedRange` from now on. More on these later.

Range-zipping requires a point-of-view, from which the joining is performed. The resulting stream of data will yield one entry for each index where the point-of-view was updated, in any of the source streams.
As of today, we only ever rely on a single point-of-view, though that is expected to change in the future.
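The point-of-view semantics just described can be sketched with a minimal model (an illustration only, not Rerun's actual API; the streams and values below are made up):

```rust
/// Minimal model of range-zip semantics -- an illustration only, not
/// Rerun's actual API. Each stream is a list of `(index, value)` pairs,
/// sorted by index. We yield one row per update of the point-of-view
/// stream, attaching the most recent secondary value seen so far.
fn range_zip<'a>(
    pov: &[(i64, &'a str)],
    secondary: &[(i64, &'a str)],
) -> Vec<(i64, &'a str, Option<&'a str>)> {
    let mut out = Vec::new();
    let mut latest = None;
    let mut i = 0;
    for &(idx, pov_val) in pov {
        // Fold in every secondary update at or before the current index.
        while i < secondary.len() && secondary[i].0 <= idx {
            latest = Some(secondary[i].1);
            i += 1;
        }
        out.push((idx, pov_val, latest));
    }
    out
}

fn main() {
    // Hypothetical chunk contents: positions at #10 and #20, radii at #10 and #15.
    let positions = [(10, "Position(1,1,1)"), (20, "Position(2,2,2)")];
    let radii = [(10, "Radius(1.0)"), (15, "Radius(2.0)")];

    // PoV = Position: one row per position update; radii are carried forward.
    for row in range_zip(&positions, &radii) {
        println!("{row:?}");
    }
}
```

Note how the secondary stream never produces rows of its own: it is only sampled whenever the point-of-view updates.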
There are two unique components in this specific example, and as such there are two possible `RangeZip` operations:

- `RangeZip(PoV: Position, chunks: [C0, C1])`
- `RangeZip(PoV: Radius, chunks: [C0, C1])`

`RangeZip(PoV: Position, chunks: [C0, C1])` yields:

`RangeZip(PoV: Radius, chunks: [C0, C1])` yields:

> [!NOTE]
> I've omitted `RowId`s in this example for brevity's sake. We will come back to `RowId`s later on.

## `RangeZip` is not globally-defined!

TODO: aka susceptible to inner filtering
Range-zipping is not a globally-defined operation -- it can and will yield contradictory results if the data streams get offset or filtered in any way.
Consider the two Chunks from the example above once again, but this time let's assume a `.filtered(#5..=#20)` has been applied to them:

`RangeZip(PoV: Position, chunks: [C0.filtered(#5..=#20), C1.filtered(#5..=#20)])` now yields:

The same operation on the same Chunks yielded the same rows (indices `#10` and `#20`), but with different data!

> [!CAUTION]
> Because `RangeZip` is not globally-defined, it is impossible to cache the aggregated results of two or more Chunks, as that might or might not yield the correct results depending on which time range the querier is interested in.

This is the source of a lot of complexity, and has caused numerous bugs, particularly around caching.
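To make the pitfall concrete, here is a toy demonstration (hypothetical data; `range_zip` and `filtered` are simplified stand-ins for the real operations):

```rust
/// Toy range-zip: one row per point-of-view update, carrying the most
/// recent secondary value (illustration only, not Rerun's actual API).
fn range_zip<'a>(
    pov: &[(i64, &'a str)],
    secondary: &[(i64, &'a str)],
) -> Vec<(i64, &'a str, Option<&'a str>)> {
    let mut out = Vec::new();
    let (mut latest, mut i) = (None, 0);
    for &(idx, pov_val) in pov {
        while i < secondary.len() && secondary[i].0 <= idx {
            latest = Some(secondary[i].1);
            i += 1;
        }
        out.push((idx, pov_val, latest));
    }
    out
}

/// Stand-in for `Chunk::filtered(range)`: drop rows outside `[min, max]`.
fn filtered<'a>(stream: &[(i64, &'a str)], min: i64, max: i64) -> Vec<(i64, &'a str)> {
    stream
        .iter()
        .copied()
        .filter(|&(idx, _)| (min..=max).contains(&idx))
        .collect()
}

fn main() {
    // A radius logged at #2, i.e. before the #5..=#20 filter window.
    let positions = [(10, "Position(1,1,1)"), (20, "Position(2,2,2)")];
    let radii = [(2, "Radius(1.0)"), (15, "Radius(2.0)")];

    let raw = range_zip(&positions, &radii);
    let clipped = range_zip(&filtered(&positions, 5, 20), &filtered(&radii, 5, 20));

    // Same row index (#10), different data: the radius from #2 is gone.
    assert_eq!(raw[0], (10, "Position(1,1,1)", Some("Radius(1.0)")));
    assert_eq!(clipped[0], (10, "Position(1,1,1)", None));
}
```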
The obvious fix is to always apply range-zipping on raw, unaltered Chunks... but that's not quite good enough: what about the Chunks you don't necessarily know about?
## `RangeZip` is not bootstrapped

TODO: aka susceptible to outer filtering
Say you have the following two chunks:
We've already established what range-zipping those two Chunks looks like:
But what if there is yet another Chunk out there such that:
Obviously, taking that third Chunk into account during range-zipping will change the aggregated output.
This has nothing to do with `RangeZip` nor Chunks though:

- `CHUNK(RangeZip(PoV: Position, chunks: [C0, C1]))` is deterministic.
- `CHUNK(RangeZip(PoV: Position, chunks: [C0, C1, C2]))` is also deterministic.
- `CHUNK(RangeZip(PoV: Position, chunks: [C0, C1]))` != `CHUNK(RangeZip(PoV: Position, chunks: [C0, C1, C2]))`, which is fine.

This is really just another manifestation of "`RangeZip` is not globally-defined", just one layer above (i.e. filtering whole Chunks vs. filtering rows of Chunks), so why do we care?

The problem is that which Chunks you are or are not aware of during zipping is highly dependent on the kind of query you run... In fact, even the data within these Chunks might change across queries.
> [!CAUTION]
> Because `RangeZip` isn't globally-defined, and because different queries can return different Chunks, and even modify the data in those Chunks, caching Chunks across different queries is also impossible.

## Background: aggregated queries
We define an aggregated query as a series of single-component `LatestAt` and/or `Range` queries (reminder: there is no such thing as a multi-component `LatestAt`/`Range`), whose results (Chunks) are then aggregated using `RangeZip`.

To avoid confusion, I will refer to single-component/low-level queries as `LatestAt` and `Range`, whereas multi-component/aggregated/range-zipped/high-level queries will be referred to as `AggregatedLatestAt` and `AggregatedRange`, respectively.

Aggregated queries are the core building block of our visualizers:

- Since they build on `RangeZip`, they inherit all the pitfalls above.
- Since they build on `LatestAt` and `Range` queries, they also inherit their semantic quirks.

In particular, there are two very specific aggregated-query semantic pitfalls that make it impossible to cache range-zipped Chunks at the moment:
### Our `AggregatedLatestAt` queries are globally deterministic, whereas our `AggregatedRange` queries are merely locally deterministic

What that means in practice is that for a given fixed, immutable dataset (i.e. no Chunks are actively being added nor removed from the store), the results you get at timestamp `ts_result`, for a `LatestAt` query at timestamp `ts_query`, are completely deterministic, regardless of what `ts_query` you use.

This is not true for `Range` queries.

Say you have the following data residing in storage:
Any `AggregatedLatestAt` query executed for `#10 <= t < #15` will always yield the same results:

Now consider an `AggregatedRange` query -- it only takes two examples to see it all fall apart:

Both `AggregatedRange` queries yield data from the same index (`#10`), but with different values: `AggregatedLatestAt` queries are globally deterministic, whereas `AggregatedRange` queries are merely locally deterministic.

Nothing surprising: it's a straightforward manifestation of `RangeZip`'s pitfalls, just in the context of an `AggregatedRange` query.

This is a natural and easy-to-understand consequence of `AggregatedRange` queries not being bootstrapped -- but you can extrapolate the effect that something like this has on caching. In fact, this is one of the main reasons why our range-query cache doesn't actually cache queries at all, but rather the underlying Chunks necessary to compute the results: caching actual range queries would be extremely painful (we know, we've been there).
> [!CAUTION]
> Consider what happens when range-zipping the two Chunks above:
>
> Notice that `[#10: (Position(1, 1, 1), None)]`, which is a possible outcome of running a `Range` query on the dataset, is not a possible value when blindly zipping the Chunks together without further context.
>
> Aggregated caching is therefore impossible.
### Our `AggregatedLatestAt` queries break the rules of indexing by peeking into the future, whereas our `AggregatedRange` queries don't

Say you have the following data residing in storage:
If you were to run an `AggregatedLatestAt` query on top of that data at `t=#10`, you'd get the following (hint: take a close look at the `Radius`):

Compare that to an `AggregatedRange` query (hint: look at the `Radius`):

What the `AggregatedLatestAt` query is doing is technically illegal: somehow we're saying that our `Position` at index `(#10, 1001)` has an associated `Radius` at index `(#10, 1099)` -- that is, from the future.

Of course this is no mistake: the viewer would be pretty much unusable otherwise (imagine having to meticulously execute your log calls in the perfect order when trying to log multiple components to the same `frame_nr`).

This is very much intended behavior, and the only reason it works at all is because we explicitly monkey-patch the `RowId`s at the last second in the visualizers to let them think that the data is not coming from the future:

`rerun/crates/viewer/re_space_view/src/results_ext.rs`, lines 368 to 373 in `94d545b`
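The referenced snippet is elided here, but the gist of that patching can be sketched as follows (a hypothetical reconstruction, not the actual code from `results_ext.rs`; `TIME_STATIC` and `ROW_ID_ZERO` stand in for `TimeInt::STATIC` and `RowId::ZERO`):

```rust
/// Simplified index: `(time, row_id)`.
type Index = (i64, u64);

const TIME_STATIC: i64 = i64::MIN; // stand-in for `TimeInt::STATIC`
const ROW_ID_ZERO: u64 = 0; // stand-in for `RowId::ZERO`

/// Overwrite every secondary component's index so that, from the
/// visualizer's point of view, the data looks static instead of
/// appearing to come "from the future".
fn patch_indices(rows: &mut [(Index, &str)]) {
    for (index, _value) in rows.iter_mut() {
        *index = (TIME_STATIC, ROW_ID_ZERO);
    }
}

fn main() {
    // A radius whose RowId (1099) is newer than the position's (1001).
    let mut radii = [((10, 1099), "Radius(2.0)")];
    patch_indices(&mut radii);

    // The visualizer now sees it as static data, not future data.
    assert_eq!(radii[0].0, (TIME_STATIC, ROW_ID_ZERO));
}
```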
> [!CAUTION]
> Consider what happens when range-zipping the two Chunks above:
>
> Notice that `#10: (Position(1, 1, 1), Radius(2.0))`, which is a possible outcome of running an `AggregatedLatestAt` query on the dataset, is not a possible value when blindly zipping the Chunks together without further context.
>
> This in turn makes aggregated caching impossible.
TODO: All of the above is even true at the inter-timestamp level. This is closely related to the complicated rules around bootstrapping.
## Background: Summary of existing queries
There exist 5 different queries used within Rerun:

- `LatestAt`: A low-level latest-at query.
- `Range`: A low-level range query.
- `AggregatedLatestAt`: A multi-component latest-at query, aggregating results from many `LatestAt`s (for implementing Visualizers).
- `AggregatedRange`: A multi-component range query, aggregating results from many `Range`s (for implementing Visualizers).
- `Dataframe`: A high-level dataframe query.

*(Comparison table omitted: it maps each of these queries to its entities, components, yield semantics, intra-timestamp semantics, bootstrapping, densification, join semantics, and where it is used, as defined in the terminology below.)*
Terminology:

- `Query kind`: which query are we talking about?
  - `LatestAt`: A low-level latest-at query.
  - `Range`: A low-level range query.
  - `AggregatedLatestAt`: A multi-component latest-at query, aggregating results from many `LatestAt`s (for implementing Visualizers).
  - `AggregatedRange`: A multi-component range query, aggregating results from many `Range`s (for implementing Visualizers).
  - `Dataframe`: A high-level dataframe query.
- `Entities`: how many entities can be queried at once?
- `Components`: how many components can be queried at once?
- `Yield semantics`: how often does the query yield a new row?
  - `Per timestamp, primary only`: for every unique timestamp for which there is at least one row of data where the primary component is non-null.
  - `Per index, primary only`: for every unique index (`(timestamp, rowid)`) for which there is at least one row of data where the primary component is non-null.
  - `Per timestamp, primary & secondaries`: for every unique timestamp for which there is at least one row of data where either the primary or any secondary component is non-null.
- `Intra-timestamp semantics`: what are the semantics used when the data contains multiple rows for a single timestamp?
  - `Yield latest only`: only yield the latest value for that timestamp, according to the full index (i.e. time + rowid).
  - `Yield all`: yield all values available for that timestamp.
- `Bootstrapping`: what initial state is used to bootstrap the results?
  - `None`: none.
  - `Global LatestAt`: build up initial state by running global-scope `LatestAt` queries ("(implied)" means that the query bootstraps itself by its very nature).
- `Densification`: how are empty cells about to be yielded filled with data?
  - `Accumulated`: data is accumulated from previous iterations.
  - `Global LatestAt`: data is fetched via a global-scope `LatestAt`.
- `Join semantics`: how are multiple component streams joined together into one?
  - `Vanilla RangeZip`: our well-defined range-zip, without any further shenanigans.
  - `Index-patched RangeZip`: like `RangeZip`, but all the indices are monkey-patched as `(TimeInt::STATIC, RowId::ZERO)`.
  - `Per-timestamp join`: dataframe-specific code that joins rows with the exact same timestamp, without cross-timestamp accumulation.
- `Used in`: where is this used?

By now you should have enough background to understand what all of these mean, and be able to figure out how all of these things might or might not interact with aggregated caching.
> [!CAUTION]
> Query semantics are way too complicated, making them very hard to reason about, even for core Rerun team members.

> [!CAUTION]
> Query semantics vary in small but important ways across our different queries, affecting their Chunk output, and therefore rendering aggregated caching close to impossible.
## Proposal
TODO:
TODO: we will leave the dataframe APIs out of it for now.
TODO: should overlap be a column?
Move from this:

*(table of the current per-query semantics omitted)*
to this (changes indicated in ❗bold):

*(table of the proposed per-query semantics omitted)*
TODO: should all of them be "Index-patched bootstrapped"?
## TL;DR

- Bring `Range` semantics much closer to `LatestAt`'s.

TODO
Say you have the following data residing in storage:
Index-patched bootstrap:
Any `AggregatedLatestAt` query executed for `#10 <= t < #15` will still always yield the same results:

But now, so will an `AggregatedRange` query:

TODO: okay but how does that help?
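To illustrate one way this could work (my reading of "index-patched bootstrap", not final semantics): seed each range with a latest-at result as of the range start, with its index patched onto the start of the window. Two different windows then agree on what is visible at their respective starts:

```rust
/// Latest value at or before `t` (stream sorted by index).
fn latest_at<'a>(stream: &[(i64, &'a str)], t: i64) -> Option<(i64, &'a str)> {
    stream.iter().copied().take_while(|&(idx, _)| idx <= t).last()
}

/// Range query with an index-patched bootstrap: the latest-at result as
/// of `start` is prepended, patched to sit at index `start`. A sketch of
/// the proposal under stated assumptions, not a definitive implementation.
fn bootstrapped_range<'a>(
    stream: &[(i64, &'a str)],
    start: i64,
    end: i64,
) -> Vec<(i64, &'a str)> {
    let mut out = Vec::new();
    if let Some((_, value)) = latest_at(stream, start) {
        out.push((start, value));
    }
    out.extend(
        stream
            .iter()
            .copied()
            .filter(|&(idx, _)| idx > start && idx <= end),
    );
    out
}

fn main() {
    let radii = [(2, "Radius(1.0)"), (15, "Radius(2.0)")];

    // Without bootstrapping, a #10..=#30 window would miss the radius from #2.
    // With it, both windows agree on the value visible at their start:
    assert_eq!(
        bootstrapped_range(&radii, 10, 30),
        vec![(10, "Radius(1.0)"), (15, "Radius(2.0)")]
    );
    assert_eq!(
        bootstrapped_range(&radii, 5, 30),
        vec![(5, "Radius(1.0)"), (15, "Radius(2.0)")]
    );
}
```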
TODO: so then, we always bootstrap, and specifically we always bootstrap with a patch
TODO: how does that solve the problems above?
TODO: ok but what happens with the future-peeking nonsense?
TODO: we should always indicate the PoV everywhere
TODO: why does range zip even need a pov???