# Chunk processors with cached output #8221
Other example use cases:
Other misc notes:
---
Since these caches work at the Chunk level, they do not have any particular query semantics, and therefore no query-based invalidation either. For that reason I don't think the difference between LatestAt and Range semantics particularly matters? I.e. a Chunk processor always runs a query, any query, and gets back a set of Chunks. From what I can tell this just works, irrelevant of query semantics, since it doesn't matter how the Chunks were obtained. Still, it would probably be nice to align the semantics of LatestAt, Range and Dataframe a bit more -- but that seems like an orthogonal problem.
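To make that concrete, a chunk-level cache reduces to a map keyed by chunk id alone. A minimal sketch, using toy stand-in types rather than rerun's actual API:

```rust
use std::collections::HashMap;

// Toy stand-in; not rerun's actual type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ChunkId(u64);

/// Cache keyed purely by chunk id: there are no query semantics in the
/// key, hence no query-based invalidation to worry about.
struct ChunkCache<T> {
    per_chunk: HashMap<ChunkId, T>,
}

impl<T> ChunkCache<T> {
    fn new() -> Self {
        Self { per_chunk: HashMap::new() }
    }

    /// Compute-or-fetch: the result only ever depends on the chunk itself.
    fn get_or_compute(&mut self, chunk: ChunkId, compute: impl FnOnce(ChunkId) -> T) -> &T {
        self.per_chunk.entry(chunk).or_insert_with(|| compute(chunk))
    }

    /// The only invalidation event: the chunk left the store (or got
    /// evicted by an LRU policy).
    fn on_chunk_removed(&mut self, chunk: ChunkId) {
        self.per_chunk.remove(&chunk);
    }
}
```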
(More ramblings) All in all this is very similar to the existing caches, where the mapping from cache key to cached data is 1:1.
In the case of the visualizers, we're caching aggregations of Chunks, so the cache-key mapping is not 1:1 anymore, it's 1:N or, more specifically, 1:(M:N) where M is the number of primary components and N the number of secondaries.
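In cache-key terms, that could look something like the following hypothetical shape (toy types, invented for illustration):

```rust
// Toy stand-in; not rerun's actual type.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkId(u64);

/// Hypothetical key for a cached aggregation: M primary chunks, each of
/// which was joined against some set of secondary chunks.
#[derive(Clone, PartialEq, Eq, Hash)]
struct AggregationKey {
    /// One entry per primary chunk (M entries), paired with the sorted
    /// ids of the secondary chunks it was joined with (N entries each).
    primaries: Vec<(ChunkId, Vec<ChunkId>)>,
}
```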
Related: cache eviction based on garbage collection events is a relic from the past; it needs to go away once we introduce LRUs. Removing cache entries because the underlying data was GC'd is counter-productive in two ways:
---
I think your line of argumentation makes a lot of sense: if we physically separate the chunk-id generation (~= the query) from the processing, the processor should be able to obliviously generate data. And given the perf budgets we work with, we don't care about invalidating the world if someone switches their latest-at to a range (and we don't have to invalidate the world either). However, the internal joins that a processor performs are still different. Example:
In this example, both a range query over position and a latest-at query will involve the same chunks.
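A concrete store layout that exhibits this (hypothetical, for illustration; same diagram style as the store example further down in the thread):

```
CHUNK_STORE
    frame_nr  component
    --------  ---------
    CHUNK C1
    #10       Position(10, 10, 10)
    #20       Position(20, 20, 20)
    CHUNK C2
    #15       Radius(15.0)
```

Both a latest-at query at `#20` and a range query over `#10..#20` (with position as the primary) involve exactly `{C1, C2}`, but the joins differ: latest-at yields a single joined row (`Position@#20` with `Radius@#15`), while the range yields one row per position, each joined with the latest radius at or before it (no radius yet at `#10`, `Radius@#15` at `#20`).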
---
Riiiight, I think I get it. This is effectively the issue that we solve today by monkey-patching the
We should sync up and see if we can eliminate this abomination at the source.
### Related

* More optimization likely depends on #8221

### What

Make the time panel faster. Still plenty of room for improvement though.

Achieved by...

* a store subscriber to keep track of cumulative chunks needed for a given entity & timeline
* plus some statistics (less important)
* a bit faster `num_events_cumulative_per_unique_time_unsorted`
* improved `DensityGraph::add_range` (this one was surprisingly significant)

### Testing

TL;DR: the minimized time panel takes less than half the time it used to. Other cases are better as well, but more noisy.

----

All testing done on the 2h airtraffic dataset (pre-saved rrd) with all panels hidden and views removed (unless noted otherwise) on my M1 Max. Note that the timeline perf is very dependent on the amount of screen real estate; these tests were done maximized on the 14'' Mac's screen. (Did some throw-away testing on other configs, but these are the ones we're comparing here!)

This round of optimization focused mostly on the "idle" case of having the time panel minimized. There are also some gains for the expanded case, but they're less significant; as illustrated by the profiler screenshots, that case is largely dominated by `num_events_cumulative_per_unique_time`, which I hope we can solve with chunk processors (and some untracked overhead I haven't looked into yet).

**Before:**

Frame cpu time without profiler attached hovers around 4.2ms with some outliers.

<img width="169" alt="image" src="https://github.com/user-attachments/assets/0cfead2d-b485-45e8-b864-390cc8acd341">

Attaching the profiler doesn't tell us much since the profiler overhead drowns out everything else:

![image](https://github.com/user-attachments/assets/db688dc3-c0bc-449a-a9d7-cce192cfec30)

Restarting without the profiler and expanding the time panel to make it as high as possible gives us about 12ms, with frequent spikes beyond 13ms.

<img width="1798" alt="image" src="https://github.com/user-attachments/assets/5732a1ed-9911-4dbe-ae15-c52c9d0e4eeb">

Profiler overhead is ironically not _as_ significant. Averaging a few frames tells us the time panel is at 11.5ms.

![image](https://github.com/user-attachments/assets/6186da15-faa9-469c-8e80-b12184ae1689)

**After:**

Frame cpu time without profiler attached hovers between 1.5ms and 2.8ms, rather unsteady.

<img width="124" alt="image" src="https://github.com/user-attachments/assets/9d709888-0b48-4404-a570-417677175202">

Averaging a bunch of frames tells us that the data_density_graph now takes 0.55ms (imho still pretty bad for what it is).

![image](https://github.com/user-attachments/assets/37cf9d75-7023-41d9-967c-7555b6fc0740)

Restarting without the profiler and expanding the time panel to make it as high as possible gives us around 11ms.

<img width="1798" alt="Screenshot 2024-11-26 at 15 45 20" src="https://github.com/user-attachments/assets/f4eab17f-db0e-4a21-86d9-5ac47560d7d0">

(important: this picture hasn't changed otherwise!)

The time panel now takes 9.4ms (that is definitely still very bad!); profiler overhead is still significant but manageable:

![image](https://github.com/user-attachments/assets/0f4375e8-cc76-40c1-9f7f-049e5a4c4640)
### Context

We've had a long discussion with @Wumpf specifically regarding the caching aspect. The following proposal first requires rethinking our query semantics, which is what makes caching aggregated Chunks semantically possible in the first place.

To make caching of aggregated Chunks possible, we will need to introduce two new types of queries. It is up to the caller to then slice those aggregations further.

This allows for caching any aggregation of data at the Chunk level, which in turn allows for a lot of optimization (see previous comments in this issue).

### New queries

For the rest of this section, assume the following store:

```
CHUNK_STORE
    frame_nr  component
    --------  ---------
    CHUNK CR1
    #0        Radius(0.0)
    #10       Radius(10.0)
    CHUNK CR2
    #5        Radius(5.0)
    #30       Radius(30.0)
    CHUNK CP1
    #10       Position(10, 10, 10)
    #20       Position(20, 20, 20)
    CHUNK CP2
    #0        Position(0, 0, 0)
    #30       Position(30, 30, 30)
```
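As a rough sketch of how chunk-level (as opposed to row-level) query semantics could behave on this store, a range query returns every overlapping chunk whole, leaving slicing to the caller. All names below are invented for illustration, not the actual proposed API:

```rust
// Toy stand-ins for illustration; not the actual proposed API.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ChunkId(&'static str);

/// A chunk's time coverage for one component on the `frame_nr` timeline.
struct ChunkCoverage {
    id: ChunkId,
    min_frame: i64,
    max_frame: i64,
}

/// Hypothetical chunk-level range query: any chunk overlapping the range
/// is returned *whole*; slicing the rows is the caller's job.
fn range_chunks(chunks: &[ChunkCoverage], from: i64, to: i64) -> Vec<ChunkId> {
    chunks
        .iter()
        .filter(|c| c.min_frame <= to && c.max_frame >= from)
        .map(|c| c.id)
        .collect()
}

fn main() {
    // The Position chunks from the store above.
    let positions = [
        ChunkCoverage { id: ChunkId("CP1"), min_frame: 10, max_frame: 20 },
        ChunkCoverage { id: ChunkId("CP2"), min_frame: 0, max_frame: 30 },
    ];
    // A range over #5..#15 involves both chunks, returned in full even
    // though only some of their rows fall inside the range.
    assert_eq!(
        range_chunks(&positions, 5, 15),
        vec![ChunkId("CP1"), ChunkId("CP2")]
    );
}
```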
---
Related to:
### Problem statement
Today, visualizers operate exclusively on typically narrow queries (latest-at or range). This makes it exceedingly hard to do any kind of caching on the outputs of a visualizer:
However, at the same time the processing done is performance intensive by means of...
### Paradigm shift to chunk processors
If we disregard the multitude of inputs a visualizer has from the viewer, we find that at its core it processes slices of chunks into a visualizable outcome, eventually resulting in egui primitives, re_renderer draw data, and picking & scene information.
In its simplest form, we can extract this into a compute kernel that takes a single chunk and outputs a single piece of cacheable data.
This data can be recreated at any point using the input chunk and is invalidated exactly when the chunk is removed from the store. This makes it well-defined for a cache, allowing for LRU & GC strategies.
The granularity of this processing operation is governed by chunk granularity, which is already subject to compaction strategies!
Note that just like chunks are sliced & iterated by queries, the processing result typically has to be sliced to be usable by a visualizer. There is, however, no unified strategy for this, as it is highly dependent on the processor's output.
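A minimal sketch of that kernel shape (toy types, not rerun's actual API):

```rust
use std::collections::HashMap;

// Toy stand-ins; not rerun's actual types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChunkId(u64);
struct Chunk {
    id: ChunkId,
    // ...columns, timelines, etc.
}

/// The compute kernel: one chunk in, one piece of cacheable data out.
/// The output depends on nothing but the chunk itself.
trait ChunkProcessor {
    type Output;
    fn process(&self, chunk: &Chunk) -> Self::Output;
}

/// The cache is then trivially well-defined: keyed by chunk id, and
/// invalidated exactly when that chunk is removed (GC) or evicted (LRU).
struct ProcessorCache<P: ChunkProcessor> {
    processor: P,
    outputs: HashMap<ChunkId, P::Output>,
}

impl<P: ChunkProcessor> ProcessorCache<P> {
    fn get_or_process(&mut self, chunk: &Chunk) -> &P::Output {
        let Self { processor, outputs } = self;
        outputs
            .entry(chunk.id)
            .or_insert_with(|| processor.process(chunk))
    }

    fn on_chunk_removed(&mut self, id: ChunkId) {
        self.outputs.remove(&id);
    }
}
```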
In this model, a visualizer's `execute` method is still invoked as before, but makes use of data from arbitrary chunk processors.

### Joined chunk processing
In practice we always join a multitude of chunks. However, all joins are dependent on a "required component" as a primary. This primary translates to chunk processors as a "primary chunk":
The processor is thought of as operating on the range of the primary chunk, using any data from other chunks that may straddle its time range. In practice, processing many components usually doesn't need more than the primary chunk, as all components are typically sent together in the same chunk. However, processing a single chunk may need input from an arbitrary number of other chunks.
-> A chunk processor has to advertise all chunks that are involved as a query function (it is not allowed to access anything else)
-> If the chunks returned by this query change, the processing output is invalid
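Sketched as a contract (hypothetical trait, toy types; not rerun's actual API):

```rust
// Toy stand-in; not rerun's actual type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ChunkId(u64);

/// A processor that joins data across chunks must advertise, for a given
/// primary chunk, every chunk involved in processing it.
trait JoinedChunkProcessor {
    type Output;

    /// The advertised query: which chunks does processing `primary`
    /// involve? The processor may not read anything not listed here.
    fn involved_chunks(&self, primary: ChunkId) -> Vec<ChunkId>;

    fn process(&self, primary: ChunkId, involved: &[ChunkId]) -> Self::Output;
}

/// A cache entry remembers the chunk set it was computed from; if
/// re-running the advertised query yields a different set, the cached
/// output is invalid and must be recomputed.
struct CachedJoin<T> {
    involved: Vec<ChunkId>,
    output: T,
}

fn is_still_valid<P: JoinedChunkProcessor>(
    processor: &P,
    primary: ChunkId,
    cached: &CachedJoin<P::Output>,
) -> bool {
    processor.involved_chunks(primary) == cached.involved
}
```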
### Latest-at vs range semantics
Unfortunately, the chunks involved in a processing operation depend on the query semantics.
To illustrate, take this store content as an example:
A visualizer doing a latest-at query at `t2` operates on both the position and the radius chunk. A visualizer doing a range query from `t2` to `t3` operates exclusively on the position chunk.

Therefore, if we want to use a chunk processor that operates on the position chunk as a primary, we have to take the radius chunk into account iff we're querying with latest-at semantics.
-> Either:
### Dealing with blueprints & non-chunk inputs
Visualizers have various different inputs:
For chunk processors to operate correctly, it is paramount that they are insulated from unpredictable inputs. We have to add context in steps:
(yes, beyond blueprint this is an open question)
Generally, we'll have to explore how far we can get with hermetic chunk processing while adding other context ad-hoc in `execute`!

### Other use cases for chunk processors
While this is motivated primarily by the complex requirements of visualizers, there are other compelling use cases for this kind of processing, oftentimes much simpler since the inputs are more limited:

* `num_events_cumulative_per_unique_time` (a major perf bottleneck for the timeline data density graph for unsorted chunks)

Common to these is that the amount of data would be too much to keep around indefinitely for all chunks!
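To make the `num_events_cumulative_per_unique_time` case concrete: a per-chunk processor could cache something as small as a per-unique-time event count for each chunk, and cumulative counts would then be derived by merging these small cached maps instead of re-scanning raw chunk data every frame. A minimal sketch, not the actual implementation:

```rust
use std::collections::BTreeMap;

/// Per-chunk kernel: count events at each unique time within one chunk.
/// Works on unsorted chunks too, which is exactly the expensive case for
/// the data density graph today.
fn events_per_unique_time(times: &[i64]) -> BTreeMap<i64, u64> {
    let mut counts = BTreeMap::new();
    for &t in times {
        *counts.entry(t).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let chunk_times = [30, 10, 10, 20]; // an unsorted chunk
    let counts = events_per_unique_time(&chunk_times);
    assert_eq!(counts.get(&10), Some(&2));

    // Cumulative counts per unique time, derived from the cached map:
    let mut cumulative = 0u64;
    for (t, n) in &counts {
        cumulative += n;
        println!("frame #{t}: {cumulative} events so far");
    }
}
```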
### Where to start
We need to prototype this concept step by step!
Let's start by re-implementing the points3d visualizer using this new paradigm and try to formalize it in the process.