External data references, promises, larger-than-RAM blob store #3119
Comments
Really love this suggestion! It could really be a great way of enabling recordings that are much bigger than RAM and it also seems quite extensible. One thought / question: Would it make sense to make the An example use case would be:
We've started calling these external data references promises. They contain some data (e.g. a URI) that lets some plugin find the data for the viewer. A simple promise could be a file name; a more complicated one could be an S3 bucket with login credentials. Promises would be a great solution for the case of Big Data, Small Index. By "Index" we mean "what data was logged when". For instance, when logging a lot of big images, the data grows quickly, but the index stays small. In comparison, when logging millions of scalars per second, both the index and the data volume grow, and a promise would not help at all. The
Entity links ==
Related to #5247
Fine-grained promises (i.e. at the cell/componentbatch level) have proven to be the wrong tool for the job: we've since removed all remaining traces of them. External data references, promises and larger-than-RAM storage can all be solved much more efficiently using a combination of our new storage node and Chunk-level processors. See #8221 for more information about Chunk processors.
Rationale
There are many instances where logging data in our datastore is inefficient at best, or simply unmanageable at worst.
(And sometimes just nonsensical: why relog something that is already stored somewhere else?)
A clear instance of this is video data: logging every frame of a video separately is extremely inefficient memory-wise. The inefficiency stems from losing the compression gained between frames, where a codec stores only what changed from one frame to the next.
When users try to log multiple 4K streams to Rerun, the task quickly becomes overwhelming.
Instead of this cumbersome process, it would be far more practical for users to point to data stored elsewhere, be it on a local disk, in cloud storage, on a file server, or on any other storage medium.
Strategy
One approach is to make it so any given Component can serve as either actual data (as it functions now) or as a URI pointing to that data; roughly:
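The snippet that originally followed did not survive on this page. As a minimal sketch of the idea, assuming a hypothetical `MaybeRef` wrapper (not an actual Rerun type), a component value could be modelled as an enum with an inline variant and a URI variant:

```rust
/// Hypothetical wrapper (an assumption, not a Rerun type): a component value
/// is either stored inline or referenced by a URI that a resolver can fetch.
#[derive(Debug, Clone, PartialEq)]
pub enum MaybeRef<T> {
    /// The data itself, as components work today.
    Inline(T),
    /// A pointer to data stored elsewhere (local file, HTTP, S3, ...).
    Uri(String),
}

impl<T> MaybeRef<T> {
    /// Returns true when the data must be fetched before it can be used.
    pub fn is_reference(&self) -> bool {
        matches!(self, MaybeRef::Uri(_))
    }

    /// Returns the inline data, if any, without touching disk or network.
    pub fn inline(&self) -> Option<&T> {
        match self {
            MaybeRef::Inline(data) => Some(data),
            MaybeRef::Uri(_) => None,
        }
    }
}
```

With this shape, existing code paths keep working on the inline variant, and only the URI variant needs a fetch step.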
To put this into perspective with a concrete example, consider video frames:
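The original example is missing here. One way to picture it, as a sketch only: instead of logging decoded pixels, each frame becomes a small reference into a video file. The `FrameRef` struct, the `frame_refs` helper, and the use of a `ts` query parameter in milliseconds are illustrative assumptions.

```rust
/// Hypothetical per-frame reference: which video, and which timestamp in it.
#[derive(Debug, Clone, PartialEq)]
pub struct FrameRef {
    pub video_uri: String,
    pub ts_ms: u64,
}

impl FrameRef {
    /// Formats the reference as a single URI with a `ts` query parameter.
    pub fn to_uri(&self) -> String {
        format!("{}?ts={}", self.video_uri, self.ts_ms)
    }
}

/// Builds one reference per frame for a video with a constant frame rate.
/// Panics if `fps` is zero.
pub fn frame_refs(video_uri: &str, frame_count: u64, fps: u64) -> Vec<FrameRef> {
    assert!(fps > 0, "fps must be positive");
    (0..frame_count)
        .map(|i| FrameRef {
            video_uri: video_uri.to_string(),
            // Integer milliseconds since the start of the video.
            ts_ms: i * 1000 / fps,
        })
        .collect()
}
```

Each logged frame then costs a few dozen bytes in the store, while the compressed video stays wherever it already lives.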
While this approach effectively addresses the size and inefficiency concerns, it does introduce a new challenge: the need for prefetching and buffering.
Of course, scrubbing randomly will now incur delays (at least for this specific component of this specific entity), which isn't any different from any video player on the web!
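To make the prefetching concern concrete, here is a rough sketch assuming a fixed look-ahead window measured in frames. The function names and the windowing policy are assumptions for illustration, not a description of how Rerun buffers data.

```rust
use std::collections::BTreeSet;
use std::ops::Range;

/// Frames that should be buffered when playback is at `current`,
/// looking `lookahead` frames ahead and clamped to the video length.
pub fn prefetch_window(current: u64, lookahead: u64, frame_count: u64) -> Range<u64> {
    let start = current.min(frame_count);
    let end = current
        .saturating_add(lookahead)
        .saturating_add(1)
        .min(frame_count);
    start..end
}

/// Frames in the window that are not buffered yet and must be fetched.
pub fn frames_to_fetch(window: Range<u64>, buffered: &BTreeSet<u64>) -> Vec<u64> {
    window.filter(|frame| !buffered.contains(frame)).collect()
}
```

During linear playback most of the window is already buffered, so only the newest frame needs fetching. After a random scrub the buffer holds nothing useful and the whole window must be fetched, which is where the delay comes from.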
Opportunities for extension hooks
Like other subsystems in Rerun, there are opportunities for plugins to customize the behavior of references: add support for new URI schemes (git://, http://, ...) or replace existing ones with a custom implementation (e.g. custom logic to fetch an exact frame).
Moreover, any data that is referred to by URIs needs to be a known Rerun datatype, i.e. data-format plugins are again very relevant here.
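A sketch of what such a hook could look like, assuming a hypothetical registry keyed by URI scheme. `Resolver` and `ResolverRegistry` are made-up names for illustration; nothing here reflects an existing Rerun plugin API.

```rust
use std::collections::HashMap;

/// Hypothetical resolver signature: takes a full URI, returns the raw bytes.
pub type Resolver = Box<dyn Fn(&str) -> Result<Vec<u8>, String>>;

/// Maps URI schemes (e.g. "http", "git") to the plugin that handles them.
#[derive(Default)]
pub struct ResolverRegistry {
    by_scheme: HashMap<String, Resolver>,
}

impl ResolverRegistry {
    /// Registers a resolver; returns true if it replaced an existing one.
    pub fn register(&mut self, scheme: &str, resolver: Resolver) -> bool {
        self.by_scheme.insert(scheme.to_string(), resolver).is_some()
    }

    /// Dispatches on the text before "://" and calls the matching resolver.
    pub fn resolve(&self, uri: &str) -> Result<Vec<u8>, String> {
        let (scheme, _rest) = uri
            .split_once("://")
            .ok_or_else(|| format!("not a URI: {}", uri))?;
        let resolver = self
            .by_scheme
            .get(scheme)
            .ok_or_else(|| format!("no resolver for scheme: {}", scheme))?;
        resolver(uri)
    }
}
```

Registering a second resolver under the same scheme replaces the first, which covers the "replace existing ones with a custom implementation" case.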
Hotswapping data
A nice side-effect of all of this is that the data behind the URI reference can easily be swapped out for something else.
Imagine e.g. that you're working on some computer vision tool for football, and your blueprint always contains a stadium mesh: using a URI not only prevents bloating all your blueprints with the data for that mesh, it also makes it possible to update the mesh remotely:
E.g. Mesh::Uri(http://footballcvanalytics.com/assets/latest/stadium.glb), which redirects to http://footballcvanalytics.com/assets/0.2.1/stadium.glb ... for now!
Packaging
On the surface, using URI references seems to disrupt the convenience of creating self-contained rrd archives.
However, it would be entirely feasible to embed external data directly within the rrd archive and then reference it using e.g. a new URI protocol like: rrd://embedded/assets/myvideo.mp4?ts=1648832391
This would essentially give us the best of both worlds.
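As an illustration of how such an embedded reference might be taken apart, here is a small parser for the URI shape shown above. The `EmbeddedRef` struct and the interpretation of `ts` as an unsigned integer are assumptions; the proposal does not specify the format beyond the single example.

```rust
/// Hypothetical parsed form of an `rrd://embedded/...` reference.
#[derive(Debug, PartialEq)]
pub struct EmbeddedRef {
    /// Path of the asset inside the archive, e.g. "assets/myvideo.mp4".
    pub path: String,
    /// Value of the `ts` query parameter, if present and numeric.
    pub ts: Option<u64>,
}

/// Parses `rrd://embedded/<path>[?ts=<number>]`; returns None for anything else.
pub fn parse_embedded(uri: &str) -> Option<EmbeddedRef> {
    let rest = uri.strip_prefix("rrd://embedded/")?;
    let (path, query) = match rest.split_once('?') {
        Some((path, query)) => (path, Some(query)),
        None => (rest, None),
    };
    if path.is_empty() {
        return None;
    }
    // Look for a `ts=<number>` pair among `&`-separated query parameters.
    let ts = query.and_then(|q| {
        q.split('&')
            .filter_map(|pair| pair.strip_prefix("ts="))
            .find_map(|value| value.parse::<u64>().ok())
    });
    Some(EmbeddedRef {
        path: path.to_string(),
        ts,
    })
}
```

A viewer could try this parser first and fall back to external resolution for any URI it does not recognise.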