Chunk processors with cached output #8221

Open
Wumpf opened this issue Nov 26, 2024 · 6 comments

Labels: 🚀 performance, project-many-entities-perf, ⛃ re_datastore, 📺 re_viewer

Comments

@Wumpf (Member) commented Nov 26, 2024

Related to:

Problem statement

Today, visualizers operate exclusively on typically narrow queries (latest-at or range). This makes it exceedingly hard to do any kind of caching on a visualizer's outputs:

  • the data potentially changes every frame as the time cursor moves on
    • but often it doesn't change at all! E.g. a large point cloud that was logged a single time
  • sometimes there is very little data, making bookkeeping overheads significant - e.g. a visualizer for a single point per entity

However, at the same time the processing is performance intensive because...

  • a large amount of data needs to be combined & iterated on: e.g. a large point cloud
  • many small pieces of scattered data need to be combined, as in e.g. single box/point/thing per entity scenarios

Paradigm shift to chunk processors

If we disregard the multitude of inputs a visualizer receives from the viewer, we find that at its core it processes slices of chunks into a visualizable outcome, eventually resulting in egui primitives, re_renderer draw data, and picking & scene information.

In its simplest form, we can extract this into a compute kernel that takes a single chunk and outputs a single piece of cacheable data.
This data can be recreated at any point using the input chunk and is invalidated exactly when the chunk is removed from the store. This makes it well defined for a cache, allowing for LRU & GC strategies.
The granularity of this processing operation is governed by chunk granularity, which is already subject to compaction strategies!
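
As a rough sketch in Rust (every name here is a hypothetical stand-in, not an actual rerun API), such a kernel could look like:

```rust
// Hypothetical sketch -- `Chunk` and the trait are illustrative
// stand-ins, not actual rerun types.
struct Chunk {
    /* component columns, time columns, ... */
}

/// A pure compute kernel: one chunk in, one cacheable artifact out.
trait ChunkProcessor {
    /// The cacheable output, e.g. GPU-ready point cloud buffers.
    type Output;

    /// Must be deterministic: the output depends only on the chunk's
    /// contents, so it can be recreated at any point and is invalidated
    /// exactly when the chunk is removed from the store.
    fn process(&self, chunk: &Chunk) -> Self::Output;
}
```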

Note that just like chunks are sliced & iterated by queries, the processing result typically has to be sliced to be usable by a visualizer. There is, however, no unified strategy for this, as it is highly dependent on what the processor outputs.

In this model, the visualizer's execute method is still invoked as before, but it makes use of data from arbitrary chunk processors.

Joined chunk processing

In practice we always join a multitude of chunks. However, all joins depend on a "required component" as the primary. This primary translates to chunk processors as a "primary chunk":
The processor is thought of as operating on the range of the primary chunk, using any data from other chunks that may straddle its time range. In practice, processing many components most of the time doesn't need more than the primary chunk, since all components are typically sent together in the same chunk. However, processing a single chunk may still need input from an arbitrary number of other chunks.

-> A chunk processor has to advertise all chunks that are involved via a query function (it is not allowed to access anything else)
-> If the chunks returned by this query change, the processing output is invalid
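
Sketched loosely in Rust (illustrative names only, nothing here is an existing rerun type):

```rust
// Illustrative only -- not actual rerun types.
use std::collections::BTreeSet;

struct Chunk;
struct ChunkStore;
struct Query;
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ChunkId(u64);

trait JoinedChunkProcessor {
    type Output;

    /// Advertise every chunk involved in processing the given primary.
    /// The processor is not allowed to access anything outside this set.
    fn involved_chunks(
        &self,
        store: &ChunkStore,
        primary: ChunkId,
        query: &Query,
    ) -> BTreeSet<ChunkId>;

    /// Process the primary chunk together with its advertised inputs.
    /// If `involved_chunks` returns a different set than it did when a
    /// cached output was produced, that output is stale.
    fn process(&self, primary: &Chunk, secondaries: &[&Chunk]) -> Self::Output;
}
```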

Latest-at vs range semantics

Unfortunately, the chunks involved in a processing operation depend on the query semantics.
To illustrate, take this store content as an example:

| Radius chunk |
                    | Position chunk |
t0-------------t1---t2--------------t3---------->
                 time

A visualizer doing a latest-at query at t2 operates on both the position and the radius chunk. A visualizer doing a range query from t2 to t3 operates exclusively on the position chunk.
Therefore, if we want to use a chunk processor that operates on the position chunk as a primary, we have to take the radius chunk into account iff we're querying with latest-at semantics.

-> Either:

  • have separate chunk processors for latest-at semantics on/off
  • chunk processors always use latest-at semantics for their advertised query but handle differences in outcome transparently

Dealing with blueprints & non-chunk inputs

Visualizers have various different inputs:

  • query range
  • store (chunks)
  • blueprint (chunks)
  • view state
  • application state

For chunk processors to operate correctly, it is paramount that they are insulated from unpredictable inputs. We have to add context in steps:

  • start with store chunks -> data only
  • add blueprint queries, but invalidate on any blueprint store change
    • starting point for tracking blueprint changes in a more fine-grained way
  • create signatures for fallback provider outputs (they depend on various states and may be volatile), allowing us to invalidate chunk-processing caches when fallback provider outputs change - see the sketch at the end of this section
    • check when fallback provider outputs are used in the first place?
    • should we distinguish constant from volatile?
    • better track fallback provider outputs? restrict them?
    • debugging utils for overly volatile fallback providers?

(yes, beyond blueprint this is an open question)

Generally, we'll have to explore how far we can get with hermetic chunk processing while adding other context ad-hoc in execute!
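
To make the signature idea a bit more concrete, here is one possible shape for a cache key that folds in the non-chunk context. Every field below is a hypothetical illustration of the steps above, not a committed design:

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ChunkId(u64);

/// Hypothetical cache key for a chunk-processing output.
#[derive(Clone, PartialEq, Eq, Hash)]
struct ProcessorCacheKey {
    /// The primary chunk the output was computed from.
    primary: ChunkId,
    /// All other store chunks that fed into the computation.
    secondaries: BTreeSet<ChunkId>,
    /// Coarse first step: bumped on *any* blueprint store change,
    /// refined later for more fine-grained blueprint tracking.
    blueprint_generation: u64,
    /// Signature (hash) of all fallback provider outputs consumed, so
    /// that volatile fallbacks transparently invalidate the entry.
    fallback_signature: u64,
}
```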

Other use cases for chunk processors

While this is motivated primarily by the complex requirements of visualizers, there are other compelling use cases for this kind of processing, oftentimes much simpler since the inputs are more limited:

  • meta statements over chunks, like num_events_cumulative_per_unique_time (a major perf bottleneck for the timeline data density graph for unsorted chunks)
  • component data conversion
  • user-facing compute kernels

Common to all of these is that the amount of data would be too much to keep around indefinitely for all chunks!

Where to start

We need to prototype this concept step by step!
Let's start by re-implementing the points3d visualizer using this new paradigm and try to formalize it in the process.

Wumpf added the ⛃ re_datastore, 📺 re_viewer, and 🚀 performance labels Nov 26, 2024
Wumpf changed the title from "Chunk processors with LRU cache" to "Chunk processors with cached output" Nov 26, 2024
@teh-cmc (Member) commented Nov 26, 2024

Other example use cases:

  • Anything related to secondary caching: mesh cache, JPEG cache, etc.
  • Column de-structuring, possibly? (e.g. mypoints.[1] and/or mypoints.y).
  • Real-time ingestion of external, pre-indexed data.

Other misc notes:

  • Because ChunkIds are globally unique, all these caches can be global.
  • By working at the (primary) chunk level, this model makes secondary incremental caching possible (and perhaps even beyond secondary, time will tell).
  • Interactions with higher and higher level APIs will be hard. Dataframe queries for example tend to aggregate -- incremental caching of aggregated data is pretty much an open problem (see e.g. https://github.com/frankmcsherry/blog).

@teh-cmc (Member) commented Nov 26, 2024

Latest-at vs range semantics

Unfortunately, the chunks involved in a processing operation depend on the query semantics. To illustrate, take this store content as an example:

| Radius chunk |
                    | Position chunk |
t0-------------t1---t2--------------t3---------->
                 time

A visualizer doing a latest-at query at t2 operates on both the position and the radius chunk. A visualizer doing a range query from t2 to t3 operates exclusively on the position chunk. Therefore, if we want to use a chunk processor that operates on the position chunk as a primary, we have to take the radius chunk into account iff we're querying with latest-at semantics.

-> Either:

  • have separate chunk processors for latest-at semantics on/off
  • chunk processors always use latest-at semantics for their advertised query but handle differences in outcome transparently

Since these caches work at the Chunk level, they do not have any particular query semantics, and therefore no query-based invalidation either. For that reason I don't think the difference between LatestAt and Range semantics particularly matters?

I.e. a Chunk processor always runs a query, any query, and gets back a set of ChunkIds in return (made up of 1 primary ChunkId + N secondary ChunkIds).
From there it checks its caches to see if secondary state is already available for this primary + secondary ChunkId set; otherwise it computes it and caches it (i.e., ignoring blueprint/fallbacks/etc, the cached value is the result of joining the primary chunk with all the secondary chunks -- range zip + clamp zip).
If the set of ChunkIds returned was different, the query was implicitly invalidated, which will automatically translate into a new cache entry without any further intervention (the old one will be LRU evicted or w/e).
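
A minimal sketch of that lookup-or-compute flow, with hypothetical types (the real thing would also track LRU state):

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ChunkId(u64);

/// The cached join of one primary chunk with its secondaries
/// (the range-zip + clamp-zip result).
struct CacheEntry;

#[derive(Default)]
struct ProcessorCache {
    entries: HashMap<(ChunkId, BTreeSet<ChunkId>), CacheEntry>,
}

impl ProcessorCache {
    /// Either reuse the cached join for this exact chunk-id set, or
    /// compute and insert it. A query whose chunk set changed simply
    /// maps to a new key; the old entry lingers until evicted (LRU).
    fn get_or_compute(
        &mut self,
        primary: ChunkId,
        secondaries: BTreeSet<ChunkId>,
        compute: impl FnOnce() -> CacheEntry,
    ) -> &CacheEntry {
        self.entries
            .entry((primary, secondaries))
            .or_insert_with(compute)
    }
}
```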

From what I can tell this just works, irrespective of query semantics, since it doesn't matter how the ChunkIds got there; it only matters which ones you got (which will be different for LatestAt vs. Range, but that's fine).

Still it would probably be nice to align the semantics of LatestAt, Range and Dataframe a bit more -- but that seems like an orthogonal problem.

@teh-cmc (Member) commented Nov 26, 2024

(More ramblings) All in all this is very similar to the existing QueryCache; the only change is the cache-key mapping in use.

The QueryCache maps a single ChunkId to a cache entry (Map<ChunkId, CacheEntry>), which means there are three possible invalidation paths:

  1. Garbage collection removed a ChunkId from the store -> the query cache reflects that by removing the associated entry.
  2. A query that used to return a certain ChunkId now returns another one -> the query cache creates a new entry for that ChunkId. The previous one is still around, although effectively "soft-invalidated".
  3. The query cache LRU kicks in, removing arbitrary entries (not implemented yet, unfortunately).

In the case of the visualizers, we're caching aggregations of Chunks, so the cache-key mapping is not 1:1 anymore, it's 1:N or, more specifically, 1:(M:N) where M is the number of primary components and N the number of secondaries (Map<(ChunkId, ChunkIdSet), CacheEntry>).
The invalidation paths are the same though:

  1. Garbage collection removed a ChunkId from the store -> the query cache reflects that by removing any entry containing that ChunkId (both in M or in N, doesn't matter).
  2. A query that used to return a certain (ChunkId, ChunkIdSet) now returns another one -> the query cache creates a new entry for that (ChunkId, ChunkIdSet). The previous one is still around, although effectively "soft-invalidated".
  3. The query cache LRU kicks in, removing arbitrary entries (not implemented yet, unfortunately).
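
Sketching invalidation path 1. against the hypothetical ProcessorCache from my previous comment (keeping in mind the note just below, which argues GC-driven eviction should eventually go away):

```rust
// (Extends the hypothetical `ProcessorCache` sketch above.)
impl ProcessorCache {
    /// Invalidation path 1: a chunk was garbage-collected, so drop every
    /// entry that referenced it, whether as primary (M) or secondary (N).
    fn on_chunk_removed(&mut self, removed: ChunkId) {
        self.entries.retain(|(primary, secondaries), _entry| {
            *primary != removed && !secondaries.contains(&removed)
        });
    }
}
```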

Related: cache eviction based on garbage collection events is a relic from the past; it needs to go away once we introduce LRUs.

Removing cache entries because the underlying data was GC'd is counter-productive in two ways:

  • Lots of compute spent reacting to removed Chunks even though it doesn't matter -- that's what the LRU is for, when we need space, it'll give us space.
  • Pre-emptively removing cached data that might still have value sucks, especially if there is nothing yet to replace it with -- again, just let the LRU do its job.

@Wumpf (Member, Author) commented Nov 26, 2024

For that reason I don't think the difference between LatestAt and Range semantics particularly matters? [...]

I think your line of argumentation makes a lot of sense - if we physically separate the ChunkId generation (~= query) from the processing, the processor should be able to obliviously generate data. And given the perf budgets we work with, we don't care about invalidating the world if someone switches their latest-at to a range (and we don't have to invalidate the world either)

However, the internal joins that a processor performs are still different. Example:

|radius chunk               |
| r0         r1           r3|

       |position chunk |
       |p0           p1|

---------------time---------------->

In this example, both a range query over position and a latest-at query will involve the same chunks.
However, which positions are paired up with which radii differs:

  • latest-at at the respective times of p0 and p1: (p0, r0), (p1, r1)
  • range query over the position range: (p0, NONE), (p1, r1)

@teh-cmc (Member) commented Nov 26, 2024

Riiiight, I think I get it. This is effectively the issue that we solve today by monkey-patching the RowIds on the LatestAt path (because the visualizers only ever see Range results!).

We should sync up and see if we can eliminate this abomination at the source.

Wumpf added a commit that referenced this issue Nov 28, 2024
### Related

* Further optimization likely depends on #8221

### What

Make the time panel faster. Still plenty of room for improvement though.
Achieved by...
* a store subscriber to keep track of the cumulative chunks needed for a given entity & timeline
    * plus some statistics (less important)
* a bit faster `num_events_cumulative_per_unique_time_unsorted`
* an improved `DensityGraph::add_range` (this one was surprisingly significant)

### Testing

TL;DR: The minimized time panel takes less than half the time it used to. Other cases are better as well, but more noisy.

----

All testing done on the 2h airtraffic dataset (pre-saved rrd) with all
panels hidden and views removed (unless noted otherwise) on my M1 Max.
Note that timeline perf is very dependent on the amount of screen real estate - these tests were done maximized on the 14'' Mac's screen.
(Did some throw-away testing on other configs, but these are the ones
we're comparing here!)

This round of optimization focused mostly on the "idle" case of having the time panel minimized. There are also some gains for the expanded case, but they're less significant - as illustrated by the profiler screenshots, that case is largely dominated by `num_events_cumulative_per_unique_time`, which I hope we can solve with chunk processors (and some untracked overhead I haven't looked into yet).

**Before:**

Frame cpu time without profiler attached hovers around 4.2ms with some
outliers.
<img width="169" alt="image"
src="https://github.com/user-attachments/assets/0cfead2d-b485-45e8-b864-390cc8acd341">

Attaching the profiler doesn't tell us much since the profiler overhead
drowns out everything else:

![image](https://github.com/user-attachments/assets/db688dc3-c0bc-449a-a9d7-cce192cfec30)

Restarting without the profiler, then expanding the time panel to its full height, gives us about 12ms with frequent spikes beyond 13ms

<img width="1798" alt="image"
src="https://github.com/user-attachments/assets/5732a1ed-9911-4dbe-ae15-c52c9d0e4eeb">

Profiler overhead is ironically not _as_ significant. Averaging a few
frames tells us the time panel is at 11.5ms

![image](https://github.com/user-attachments/assets/6186da15-faa9-469c-8e80-b12184ae1689)


**After**

Frame cpu time without the profiler attached hovers between 1.5ms and 2.8ms, rather unsteadily
<img width="124" alt="image"
src="https://github.com/user-attachments/assets/9d709888-0b48-4404-a570-417677175202">

Averaging a bunch of frames tells us that the data_density_graph now takes 0.55ms (imho still pretty bad for what it is)

![image](https://github.com/user-attachments/assets/37cf9d75-7023-41d9-967c-7555b6fc0740)

Restarting without the profiler, then expanding the time panel to its full height, gives us around 11ms

<img width="1798" alt="Screenshot 2024-11-26 at 15 45 20"
src="https://github.com/user-attachments/assets/f4eab17f-db0e-4a21-86d9-5ac47560d7d0">

(important: this picture hasn't changed otherwise!)


The time panel now takes 9.4ms (that is definitely still very bad!); profiler overhead is still significant, but it's manageable:


![image](https://github.com/user-attachments/assets/0f4375e8-cc76-40c1-9f7f-049e5a4c4640)

@teh-cmc (Member) commented Dec 2, 2024

⚠️ THIS IS A WORK IN PROGRESS -- IT DOES NOT NECESSARILY MAKE ANY SENSE. I'LL REMOVE THIS DISCLAIMER ONCE IT FINALLY DOES. ⚠️

Context

We've had a long discussion with @Wumpf specifically regarding the caching aspect.
Chunk Processors will require us to cache aggregated data (range-zipped data, to be exact), which is a first for us.

The following proposal first requires rethinking our query semantics, which is what makes caching aggregated Chunks semantically possible in the first place.
You should familiarize yourself with that document before continuing, as well as with all the previous posts in this thread.

To make caching of aggregated Chunks possible, we will need to introduce two new types of queries: UnslicedAggregatedLatestAt and UnslicedAggregatedRange.
UnslicedAggregatedLatestAt and UnslicedAggregatedRange are exactly what they sound like -- they behave like their AggregatedLatestAt/AggregatedRange counterparts, but never slice the results down to the exact bounds.

To implement UnslicedAggregatedLatestAt and UnslicedAggregatedRange, we first need to implement UnslicedLatestAt and UnslicedRange.
So all in all, we add 4 more queries: UnslicedLatestAt, UnslicedRange, UnslicedAggregatedLatestAt, UnslicedAggregatedRange. Fortunately these are all almost identical to their sliced counterparts.
(Feel free to propose a better name than Unsliced 🫠 ).

It is up to the caller to then slice those aggregations further. This allows for caching any aggregation of data at the Chunk level, which in turn allows for a lot of optimization (see previous comments in this issue).

New queries

For the rest of this section, assume the following store:

CHUNK_STORE
  frame_nr   component
  --------   ---------

  CHUNK CR1
    #0       Radius(0.0)
    #10      Radius(10.0)

  CHUNK CR2
    #5       Radius(5.0)
    #30      Radius(30.0)

  CHUNK CP1
    #10      Position(10, 10, 10)
    #20      Position(20, 20, 20)

  CHUNK CP2
    #0       Position(0, 0, 0)
    #30      Position(30, 30, 30)

UnslicedLatestAt

UnslicedLatestAt is a LatestAt query that always returns the raw, unaltered Chunk that contains the result, as opposed to a unit-slice of it that only contains the row we're interested in.
I.e. LatestAt can be implemented in terms of UnslicedLatestAt simply by slicing down the resulting Chunk.

Consider e.g. LatestAt(at: #15, comp: Position) vs. UnslicedLatestAt(at: #15, comp: Position).

LatestAt(at: #15, comp: Position):

  frame_nr   component
  --------   ---------

CHUNK(LatestAt(at: #15, comp: Position))
    #15      Position(10, 10, 10)

UnslicedLatestAt(at: #15, comp: Position):

  frame_nr   component
  --------   ---------

CHUNK(UnslicedLatestAt(at: #15, comp: Position)) == {CP1}
  CHUNK CP1
    #10      Position(10, 10, 10)
    #20      Position(20, 20, 20)

Note

Similar to LatestAt, UnslicedLatestAt is guaranteed to return a single Chunk, always -- even in case of overlaps (such as demonstrated in this example).
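
To sketch the "implemented in terms of" relationship in Rust: every type and helper below is a hypothetical stand-in, not the actual store API.

```rust
// Hypothetical stand-ins -- not the actual rerun store API.
struct ChunkStore;
struct Chunk;
type TimeInt = i64;
type ComponentName = &'static str;

impl Chunk {
    /// Index of the latest row at-or-before `at`, if any.
    fn latest_row_index_at(&self, _at: TimeInt) -> Option<usize> {
        unimplemented!()
    }
    /// A copy of this chunk reduced to `len` rows starting at `row`.
    fn row_sliced(&self, _row: usize, _len: usize) -> Chunk {
        unimplemented!()
    }
}

/// Stand-in for the proposed UnslicedLatestAt query:
/// always returns the single raw chunk containing the result.
fn unsliced_latest_at(
    _store: &ChunkStore,
    _at: TimeInt,
    _comp: ComponentName,
) -> Option<Chunk> {
    unimplemented!()
}

/// LatestAt in terms of UnslicedLatestAt: run the unsliced query, then
/// slice the raw chunk down to the unit row we're interested in.
fn latest_at(store: &ChunkStore, at: TimeInt, comp: ComponentName) -> Option<Chunk> {
    let chunk = unsliced_latest_at(store, at, comp)?;
    let row = chunk.latest_row_index_at(at)?;
    Some(chunk.row_sliced(row, 1))
}
```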

UnslicedRange

UnslicedRange is a Range query that always returns the raw, unaltered Chunks that contain the results, as opposed to sliced-down Chunks that only contain the rows we're interested in.
I.e. Range can be implemented in terms of UnslicedRange by slicing down the resulting Chunks.

Consider e.g. Range(in: #5..=#25, comp: Position) vs. UnslicedRange(in: #5..=#25, comp: Position).

Range(in: #5..=#25, comp: Position):

  frame_nr   component
  --------   ---------

CHUNK(Range(in: #5..=#25, comp: Position))
    #10      Position(10, 10, 10)
    #20      Position(20, 20, 20)

UnslicedRange(in: #5..=#25, comp: Position):

  frame_nr   component
  --------   ---------

CHUNK(UnslicedRange(in: #5..=#25, comp: Position)) == {CP1, CP2}
  CHUNK CP1
    #10      Position(10, 10, 10)
    #20      Position(20, 20, 20)

  CHUNK CP2
    #0       Position(0, 0, 0)
    #30      Position(30, 30, 30)

Note

Similar to Range, UnslicedRange might return any number of Chunks, depending on overlap.

UnslicedAggregatedLatestAt

UnslicedAggregatedLatestAt is an AggregatedLatestAt query that range-zips the raw, unaltered Chunks that contain the results, as opposed to range-zipping the unit-sliced Chunks with only the rows we're interested in.
I.e. AggregatedLatestAt can be implemented in terms of UnslicedAggregatedLatestAt by slicing down the resulting aggregated Chunk.

Consider e.g. AggregatedLatestAt(at: #15, comps: [Position, Radius]) vs. UnslicedAggregatedLatestAt(at: #15, comps: [Position, Radius]).

AggregatedLatestAt(at: #15, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(AggregatedLatestAt(at: #15, comps: [Position, Radius]))
    #15      Position(10, 10, 10)    Radius(10.0)

UnslicedAggregatedLatestAt(at: #15, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(UnslicedAggregatedLatestAt(at: #15, comps: [Position, Radius])) == RangeZip(PoV: Position, chunks: [CP1, CR1])
  // Primary: CP1
  // Dependencies: [CR1]
  CHUNK(RangeZip(PoV: Position, chunks: [CP1, CR1]))
    #10      Position(10, 10, 10)    Radius(10.0)
    #20      Position(20, 20, 20)    Radius(10.0)

Note

Note that there doesn't exist a query (whether it's an AggregatedLatestAt or an AggregatedRange) that you could run on the underlying store that would yield different data for any of the indices present in this aggregated Chunk. I.e. we can cache this.

As you can see from looking at the results, UnslicedAggregatedLatestAt also reports the dependency graph of its aggregated results.
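
The result shape this implies could look roughly like so (hypothetical names):

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ChunkId(u64);
struct Chunk;

/// One aggregated (range-zipped) chunk plus the provenance that makes
/// it cacheable: the primary it was sharded on and every chunk whose
/// data was zipped into it.
struct AggregatedChunk {
    primary: ChunkId,
    dependencies: BTreeSet<ChunkId>,
    /// The range-zipped data itself.
    chunk: Chunk,
}
```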

UnslicedAggregatedRange

UnslicedAggregatedRange is an AggregatedRange query that always returns the raw, unaltered Chunks that contain the results, as opposed to sliced-down Chunks that only contain the rows we're interested in.
I.e. AggregatedRange can be implemented in terms of UnslicedAggregatedRange by slicing down the resulting Chunks.

Note

UnslicedAggregatedRange always shards the data according to the returned primary chunks. This is what makes caching possible in the first place.

Consider e.g. AggregatedRange(in: #5..=#25, comps: [Position, Radius]) vs. UnslicedAggregatedRange(in: #5..=#25, comps: [Position, Radius]).

AggregatedRange(in: #5..=#25, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(AggregatedRange(in: #5..=#25, comps: [Position, Radius]))
    #10      Position(10, 10, 10)    Radius(10.0)
    #20      Position(20, 20, 20)    Radius(10.0)

UnslicedAggregatedRange(in: #5..=#25, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(UnslicedAggregatedRange(in: #5..=#25, comps: [Position, Radius])) == {CP1, CP2, CR1, CR2}
  // Primary: CP1
  // Dependencies: [CR1, CR2]
  CHUNK(RangeZip(PoV: Position, chunks: [CP1, CR1, CR2]))
    #10      Position(10, 10, 10)    Radius(10.0)
    #20      Position(20, 20, 20)    Radius(10.0)

  // Primary: CP2
  // Dependencies: [CR1, CR2]
  CHUNK(RangeZip(PoV: Position, chunks: [CP2, CR1, CR2]))
    #0       Position(0, 0, 0)       Radius(0.0)
    #30      Position(30, 30, 30)    Radius(30.0)

Note

Note that there doesn't exist a query (whether it's an AggregatedLatestAt or an AggregatedRange) that you could run on the underlying store that would yield different data for any of the indices present in this aggregated Chunk. I.e. we can cache this.

TODO: do we somehow need to bootstrap each primary chunk then?
TODO: how does bootstrapping fit into the dependencies?

TODO: we need a dependency-reporting RangeZip.

Dependency tracking

Consider this example from above:
UnslicedAggregatedLatestAt(at: #15, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(UnslicedAggregatedLatestAt(at: #15, comps: [Position, Radius])) == RangeZip(PoV: Position, chunks: [CP1, CR1])
  // Primary: CP1
  // Dependencies: [CR1]
  CHUNK(RangeZip(PoV: Position, chunks: [CP1, CR1]))
    #10      Position(10, 10, 10)    Radius(10.0)
    #20      Position(20, 20, 20)    Radius(10.0)

Consider this other example from above:
UnslicedAggregatedRange(in: #5..=#25, comps: [Position, Radius]):

  frame_nr   component (pov)         component
  --------   ---------------         ---------

CHUNK(UnslicedAggregatedRange(in: #5..=#25, comps: [Position, Radius])) == {CP1, CP2, CR1, CR2}
  // Primary: CP1
  // Dependencies: [CR1, CR2]
  CHUNK(RangeZip(PoV: Position, chunks: [CP1, CR1, CR2]))
    #10      Position(10, 10, 10)    Radius(10.0)
    #20      Position(20, 20, 20)    Radius(10.0)

  // Primary: CP2
  // Dependencies: [CR1, CR2]
  CHUNK(RangeZip(PoV: Position, chunks: [CP2, CR1, CR2]))
    #0       Position(0, 0, 0)       Radius(0.0)
    #30      Position(30, 30, 30)    Radius(30.0)

This query yielded two aggregated Chunks that we can cache:

  • CHUNK(RangeZip(PoV: Position, chunks: [CP1, CR1, CR2]))
  • CHUNK(RangeZip(PoV: Position, chunks: [CP2, CR1, CR2]))

Both of these aggregated Chunks have two dependencies (i.e. other Chunks whose data was used as part of the aggregation).
In this specific example, they happen to share the same ones.

Aggregating non-chunk data

The actual data you aggregate into can be whatever you want it to be, as long as:

  • Each aggregate corresponds to one aggregated primary Chunk.
  • TODO: Each aggregate can be invalidated

Invalidation

There is actually no need for invalidation per se:

  • Half of it is the LRU.
  • The other half is handled just-in-time, by running the queries every frame.
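
A loose sketch of that per-frame, query-driven flow, with hypothetical names throughout:

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct ChunkId(u64);

/// The cached range-zipped aggregate for one primary chunk.
struct Aggregate;

/// No invalidation events: just re-run the query every frame and look
/// up whatever (primary, dependencies) sets it hands back. A changed
/// set is a new key; stale entries are never touched again and are
/// left for the LRU to evict.
fn run_frame(
    cache: &mut HashMap<(ChunkId, BTreeSet<ChunkId>), Aggregate>,
    query_results: Vec<(ChunkId, BTreeSet<ChunkId>)>,
    compute: impl Fn(ChunkId, &BTreeSet<ChunkId>) -> Aggregate,
) {
    for (primary, deps) in query_results {
        cache
            .entry((primary, deps.clone()))
            .or_insert_with(|| compute(primary, &deps));
    }
}
```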

TODO: all very similar to the query cache, see my other comment in this thread.

Bootstrapping shenanigans

TODO: demonstrate a failing case

Thinking about the future

TODO: what do multi-primaries look like?
TODO: What about dataframe APIs? How would that look?
TODO: Path to a generalized Chunk compute graph?


TODO: how does all of this look in the table of death?
TODO: is there anything we should be worrying about storage-node-wise or whatever?
