Caching tiers, memory overhead and garbage collection #5974
This is a living document that contains a brain dump regarding the state of caching in the viewer, some of the major issues on that front, some thoughts for the future, etc.
Mostly so I can organize my thoughts and have something to point to rather than writing book-length comments in the code again and again.
State of caching
(I'm going to assume that promises have been implemented already for the remainder of this document)
The viewer's data stack is currently organized as a series of caches sitting on top of each other.
Pulling on the top-most caching layer should implicitly trigger a passthrough chain reaction that updates all the layers, if and where required.
Every layer specifies 3 behaviors: how it gets filled, how it gets invalidated, and how it gets garbage collected.
The cache stack, from bottom to top, looks like this:
0. .rrd file (raw data)
The raw data, as it sits on disk storage. The only thing that's not a cache, at present.
Some entries in that file might actually be promises: either cheap datatype conversions, or full-blown promises whose entry is just a reference to the real raw data, which lives somewhere else.
1. Datastore
Raw data from an rrd file gets loaded into the Datastore, which can be seen as a cached/denormalized index of an rrd file.
The index adds some fixed overhead for each entry in the rrd file, which can range from irrelevant to extremely problematic depending on the kind of data we're talking about.
Fill
The store gets filled by pushing data into it in a streaming fashion. It gets no say in what should or shouldn't be indexed -- it just blindly ingests everything.
Invalidation
There is no invalidation involved: the store is always up-to-date since the raw rrd data is immutable/append-only.
Garbage collection
Store data gets evicted in RowId order (!) by the garbage collector, triggered based on the resident memory size of the entire app.
Using resident memory pressure as a trigger can lead to nasty problems e.g. if another part of the app ends up leaking memory, or simply if another caching layer happens to be hogging memory.
Once data gets evicted from the store, it's gone. The only way to get it back is to re-log it, e.g. by reloading the source rrd file, if possible.
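To make the RowId-ordered eviction concrete, here's a toy sketch (all names are hypothetical, the real store is considerably more involved): rows live in a map keyed by RowId, and the GC walks them oldest-first until the memory budget is met.

```rust
use std::collections::BTreeMap;

/// Toy model of the store's GC: rows keyed by RowId (which is time-ordered),
/// evicted oldest-first until the store fits the budget. Illustrative only.
struct ToyStore {
    rows: BTreeMap<u64, Vec<u8>>, // RowId -> raw cell bytes
}

impl ToyStore {
    fn size_bytes(&self) -> usize {
        self.rows.values().map(|bytes| bytes.len()).sum()
    }

    /// Evict rows in RowId order until `size_bytes() <= budget`.
    /// Evicted data is *gone*: the only way back is to re-log it.
    fn gc(&mut self, budget: usize) -> usize {
        let mut freed = 0;
        while self.size_bytes() > budget {
            // `pop_first` removes the entry with the lowest RowId,
            // i.e. the oldest data gets evicted first.
            let Some((_, bytes)) = self.rows.pop_first() else { break };
            freed += bytes.len();
        }
        freed
    }
}

fn main() {
    let mut store = ToyStore {
        rows: (0..4u64).map(|i| (i, vec![0u8; 100])).collect(),
    };
    // 400 bytes resident against a 250 byte budget: rows 0 and 1 get evicted.
    let freed = store.gc(250);
    assert_eq!(freed, 200);
    assert_eq!(store.rows.keys().copied().collect::<Vec<_>>(), vec![2, 3]);
}
```

Note that this toy version is triggered by its own size, whereas the real store reacts to the resident memory of the entire app, which is exactly what makes the cross-layer interference described above possible.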
2. Promise cache
The promise cache keeps track of the data that resulted from resolving a promise.
Fill
The cache is filled by resolving promises, which is currently a just-in-time process that happens when a space view (or any other system) decodes the results of a query.
Invalidation
There's nothing to invalidate: promises are identified by their unique, immutable RowId, so the resolved data is assumed to always be up-to-date. That is, the viewer assumes that promise resolution is pure: in the successful case, resolving the same promise always returns the same result.
Garbage collection
We don't have a GC strategy in place for promises yet, but the promise cache will have to respond to memory pressure, one way or another.
Since we assume that promise resolution is pure, it's always fine to garbage collect resolved data -- we can always resolve it again (as long as the raw promise itself doesn't get GC'd from the datastore!).
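A minimal sketch of that idea, assuming a pure `resolve_fn` (all names here are hypothetical, not the viewer's actual API): resolution is memoized by RowId, and responding to memory pressure simply clears the memo.

```rust
use std::collections::HashMap;

/// Toy promise cache, keyed by the promise's immutable RowId. Because
/// resolution is assumed pure, evicting an entry is always safe: resolving
/// again yields the same bytes. Illustrative only.
struct PromiseCache {
    resolved: HashMap<u64, Vec<u8>>, // RowId -> resolved data
    resolutions: usize,              // how many times we actually ran a resolution
}

impl PromiseCache {
    /// Just-in-time resolution: return the memoized data, or resolve and memoize.
    fn resolve(&mut self, row_id: u64, resolve_fn: impl Fn(u64) -> Vec<u8>) -> &[u8] {
        if !self.resolved.contains_key(&row_id) {
            self.resolutions += 1;
            self.resolved.insert(row_id, resolve_fn(row_id));
        }
        &self.resolved[&row_id]
    }

    /// Dropping resolved data under memory pressure is always correct --
    /// as long as the raw promise itself is still in the datastore.
    fn evict_all(&mut self) {
        self.resolved.clear();
    }
}

fn main() {
    let resolver = |row_id: u64| vec![row_id as u8; 8]; // stand-in for real resolution
    let mut cache = PromiseCache { resolved: HashMap::new(), resolutions: 0 };
    cache.resolve(7, resolver);
    cache.resolve(7, resolver); // second call is a cache hit: no new resolution
    assert_eq!(cache.resolutions, 1);
    cache.evict_all(); // safe: resolution is pure, we can always resolve again
    assert_eq!(cache.resolve(7, resolver), &[7u8; 8]);
    assert_eq!(cache.resolutions, 2);
}
```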
3. Primary cache
The primary cache is composed of two parts: a query cache and a deserialization cache.
The query cache's role is, unsurprisingly, to keep track of the raw results of executing a query. In practice, that means: for a given entity path, components and query time, keep track of the resulting data time and associated data cells.
The query cache does not copy any data, it's all pointers (buried in datacells).
There are two situations where the query cache comes in handy:
- Every query adds some small overhead, which very quickly adds up once you get to hundreds of entities or more.
- Gathering all the cells for a large range query can be very costly in and of itself.
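Conceptually, then, the query cache is a map from (entity path, component, query time) to (data time, cell). A sketch with hypothetical types, using reference counting to stress that no data is copied:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical types -- the real viewer's keys and cells are more involved.
type EntityPath = String;
type ComponentName = &'static str;
type DataCell = Arc<Vec<u8>>; // Arc: the cache stores pointers, never copies data

#[derive(PartialEq, Eq, Hash)]
struct QueryKey {
    entity_path: EntityPath,
    component: ComponentName,
    query_time: i64,
}

/// (data time, cell): the time the data was actually logged at, plus a
/// pointer to the cell the query resolved to.
type QueryResult = (i64, DataCell);

struct QueryCache {
    results: HashMap<QueryKey, QueryResult>,
}

fn main() {
    let cell: DataCell = Arc::new(vec![1, 2, 3]);
    let mut cache = QueryCache { results: HashMap::new() };
    cache.results.insert(
        QueryKey {
            entity_path: "points/cloud".into(),
            component: "Position3D",
            query_time: 42,
        },
        (40, Arc::clone(&cell)), // a query at t=42 resolved to data logged at t=40
    );
    // Caching the result did not copy the cell's bytes:
    // both handles share a single allocation.
    assert_eq!(Arc::strong_count(&cell), 2);
}
```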
The second part of this, the deserialization cache, is a horrible hack. Its role is to overcome the overhead of unpeeling lots of small data cells, and lay out the data in a way that's optimal for the CPU cache.
Somehow we need to get rid of it, but right now it's what makes the time series view's performance somewhat bearable.
The primary cache effectively doubles the memory overhead of the data, since it exists both as arrow data sitting in a cell in the store (or in the promise cache, same difference), as well as native data in the deserialization cache.
In practice this generally matters very little, for two reasons:
Still, doubling the memory overhead of large point clouds definitely sucks.
Fill
The query cache is filled by running queries through the cached APIs, which happens both in space views and all over the UI code because of our immediate mode architecture.
It's a completely untyped process since it's just shuffling cells around.
The deserialization cache is filled when iterating over the query results, at which point the data is converted to a native type.
Note: what gets deserialized and cached is the resolved data, which is why the primary cache lives on top of the promise cache.
Invalidation
The primary cache subscribes to the datastore's changelog for invalidation purposes.
Invalidation of range queries is pretty coarse, and can be quite costly, compute-wise.
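To illustrate the coarseness, here's a toy model (hypothetical, not the actual invalidation code): a new row arriving at time `t` conservatively kills every cached range that contains `t`, regardless of how little of the cached result the new row actually affects.

```rust
use std::collections::HashMap;

/// Toy range-query cache: (start, end) time span -> cached result bytes.
/// Illustrative only.
struct RangeCache {
    ranges: HashMap<(i64, i64), Vec<u8>>,
}

impl RangeCache {
    /// Coarse invalidation: a new row at `t` drops every cached range
    /// containing `t`. Returns how many entries were thrown away.
    fn invalidate_at(&mut self, t: i64) -> usize {
        let before = self.ranges.len();
        self.ranges.retain(|&(start, end), _| !(start <= t && t <= end));
        before - self.ranges.len()
    }
}

fn main() {
    let mut cache = RangeCache { ranges: HashMap::new() };
    cache.ranges.insert((0, 10), vec![1]);
    cache.ranges.insert((5, 20), vec![2]);
    cache.ranges.insert((30, 40), vec![3]);
    // New data at t=7 invalidates both ranges that contain it...
    assert_eq!(cache.invalidate_at(7), 2);
    // ...but leaves the (30, 40) range intact.
    assert_eq!(cache.ranges.len(), 1);
}
```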
Garbage collection
There's no garbage collection as of yet. The primary cache just grows indefinitely as more and more queries are run.
It tends to not be an issue in practice (yet) because most recordings hit the memory limit from the indexing overhead before the primary caching overhead can ever become an issue.
4. Secondary caches
The general idea behind secondary caches is that they store arbitrary user-data (where the user in this case is a system, whether a space view or otherwise), and automatically benefit from the invalidation logic of the primary cache.
We unfortunately don't have any secondary caches up and running yet, but it's easy to think of scenarios where they would fit nicely:
- egui primitives in the time series view (although subpixel aggregation might make that pretty tough)
Fill
Secondary caches are expected to be filled by the end system itself, right at the edge, since it (the system) is the only entity that knows about the transformation the data should go through.
For CPU-based artifacts, the memory is handled by the cache subsystem as usual.
For GPU-based artifacts, we're probably going to have to find a way to centralize VRAM management eventually... As a starting point, we can just let the end system deal with it, like we already do today.
Invalidation
Invalidation of secondary caches is driven by the primary cache (itself subscribed to the datastore changelog).
The system doing the filling doesn't need to run any extra invalidation logic of its own, which is nice.
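A sketch of that arrangement, with hypothetical names: the primary cache keeps a list of invalidation callbacks, and datastore changelog events cascade through it down to the secondary caches, so the systems filling them never run invalidation logic of their own.

```rust
/// Toy invalidation fan-out (hypothetical API): the primary cache subscribes
/// to the store's changelog and forwards invalidations to registered
/// secondary caches.
struct PrimaryCache {
    subscribers: Vec<Box<dyn FnMut(u64)>>, // each called with the invalidated RowId
    invalidated: Vec<u64>,
}

impl PrimaryCache {
    /// A secondary cache registers a callback to be driven on invalidation.
    fn subscribe(&mut self, on_invalidate: impl FnMut(u64) + 'static) {
        self.subscribers.push(Box::new(on_invalidate));
    }

    /// Entry point for the datastore's changelog events.
    fn on_store_event(&mut self, row_id: u64) {
        self.invalidated.push(row_id); // invalidate our own entries first...
        for subscriber in &mut self.subscribers {
            subscriber(row_id); // ...then drive the secondary caches
        }
    }
}

fn main() {
    use std::cell::RefCell;
    use std::rc::Rc;

    // A "secondary cache" here is just some user-data keyed by RowId.
    let secondary = Rc::new(RefCell::new(vec![(7u64, "expensive artifact")]));

    let mut primary = PrimaryCache { subscribers: Vec::new(), invalidated: Vec::new() };
    let handle = Rc::clone(&secondary);
    primary.subscribe(move |row_id| handle.borrow_mut().retain(|(id, _)| *id != row_id));

    // A store change for RowId 7 cascades down and clears the derived artifact.
    primary.on_store_event(7);
    assert!(secondary.borrow().is_empty());
}
```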
Garbage collection
That one needs to be figured out.
Secondary caches are interesting in that aspect for several reasons:
5. Ad-hoc caches (JPEG, Encoded Meshes)
These are ad-hoc secondary caches that predate the introduction of generalized caching throughout the app.
They are meant to be replaced by proper secondary caches... at some point.
Fill
I think they get filled just-in-time by the image and mesh views, respectively.
Invalidation & Garbage collection
There is no invalidation logic per se, but the cache keys do include the RowId of the data.
The cache also keeps track of which cache keys were used during the last frame, and clears out everything that hasn't been used for a couple frames if the cache's memory limit has been reached. I think.
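That scheme boils down to stamping each entry with the frame it was last touched on, then sweeping stale entries once over the memory limit. A toy version, hedged accordingly (the real ad-hoc caches differ in the details):

```rust
use std::collections::HashMap;

/// Toy last-used-frame eviction. The cache key includes the RowId of the
/// source data, which stands in for invalidation: new data means a new RowId
/// means a new entry. Illustrative only.
struct FrameCache {
    entries: HashMap<u64, (Vec<u8>, u64)>, // RowId -> (decoded bytes, last-used frame)
    current_frame: u64,
    memory_limit: usize,
}

impl FrameCache {
    /// A hit stamps the entry with the current frame, decoding on a miss.
    fn touch(&mut self, row_id: u64, decode: impl FnOnce() -> Vec<u8>) {
        let frame = self.current_frame;
        self.entries
            .entry(row_id)
            .and_modify(|(_, last_used)| *last_used = frame)
            .or_insert_with(|| (decode(), frame));
    }

    fn size_bytes(&self) -> usize {
        self.entries.values().map(|(bytes, _)| bytes.len()).sum()
    }

    /// At end of frame: if over the memory limit, drop everything that
    /// hasn't been used for a couple (here: 2+) frames.
    fn end_frame(&mut self) {
        if self.size_bytes() > self.memory_limit {
            let frame = self.current_frame;
            self.entries.retain(|_, (_, last_used)| frame - *last_used < 2);
        }
        self.current_frame += 1;
    }
}

fn main() {
    let mut cache = FrameCache { entries: HashMap::new(), current_frame: 0, memory_limit: 150 };
    cache.touch(1, || vec![0; 100]); // frame 0
    cache.end_frame();
    cache.touch(2, || vec![0; 100]); // frame 1: row 1 goes untouched
    cache.end_frame();               // over limit, but nothing is stale enough yet
    cache.end_frame();               // frame 2: row 1 (last used on frame 0) is swept
    assert!(cache.entries.contains_key(&2));
    assert!(!cache.entries.contains_key(&1));
}
```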
Garbage collecting the caches
Clearly we will need to come up with a way of garbage collecting the different caching layers, decorrelated from the invalidation process.
The difficulty there is to come up with a precedence/priority algorithm that makes sense for most use cases, given the tension caused by the different layers all sharing the same memory resources.
It also isn't clear that one should always prioritize giving memory to the datastore.
E.g. in real-time scenarios where one is only interested in the last few frames of data, the user would prefer dedicating more memory to the secondary caches rather than keeping more, older datastore entries alive.
Maybe this can be solved by exposing more configuration knobs for power users... but we know how that generally turns out.
The intermediary promise and primary caches effectively become dead weight once the views have computed their secondary caches accordingly, assuming they have any.
There are a lot of gotchas with that train of thought, though.
Still, at first glance it does kind of seem that the caches on the edge, whether it's the store at the bottom or the secondary caches at the top, should take priority over the intermediate ones in most cases.
What about multi-recording & multi-app scenarios?
Something we haven't really spent any time on until now is the behavior of garbage collection in multi-recording and multi-app scenarios.
I don't even remember what we do today. I think we always start with the oldest recording, irrespective of which app it belongs to.
That seems good enough for a while, although it'd be nice to document this somewhere.