External data references, promises, larger-than-RAM blob store #3119

Closed
teh-cmc opened this issue Aug 28, 2023 · 5 comments

@teh-cmc
Member

teh-cmc commented Aug 28, 2023

Rationale

There are many instances where logging data in our datastore is inefficient at best, or simply unmanageable at worst.
(And sometimes just nonsensical: why relog something that is already stored somewhere else?)

A clear instance of this is with video data: logging every frame of a video separately is extremely inefficient memory-wise. This inefficiency stems from the loss of compression benefits between frames, such as those from run encodings.
When users try to log multiple 4K streams to Rerun, the task quickly becomes overwhelming.

Instead of this cumbersome process, it would be far more practical for users to point to data stored elsewhere, be it on a local disk, in cloud storage, on a file server, or on any other storage medium.

Strategy

One approach is to make it so any given Component can serve as either actual data (as it functions now) or as a URI pointing to that data; roughly:

/// Placeholder for a real URI type (assumption, for illustration only).
type Uri = String;

enum Component<T> {
    Data(T),
    Reference(Uri),
}

To put this into perspective with a concrete example, consider video frames:

/// An entity can either be an actual image or a URI that indicates where the image can be found.
enum Image {
    Data(TensorData),

    /// Some examples for clarity:
    /// `http://cloud.storage.com/mybucket/myimage.png`
    /// `ftp://cloud.storage.com/mybucket/myvideo.mp4?frame_id=765398`
    /// `file:///home/Downloads/myvideo.mp4?ts=1648832391`
    External(Uri),
}
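
A minimal sketch of how a consumer might treat both variants uniformly. The `Fetch` trait, the `resolve` method and `InMemoryFetcher` are hypothetical names introduced here for illustration, not existing Rerun APIs, and `Uri` is assumed to be a plain string:

```rust
use std::collections::HashMap;

/// Stand-in for a real URI type (assumption: a plain string is enough here).
type Uri = String;

/// A component that is either inline data or a pointer to it, as proposed above.
enum Component<T> {
    Data(T),
    Reference(Uri),
}

/// Hypothetical hook that knows how to turn a URI into a value of type `T`.
trait Fetch<T> {
    fn fetch(&self, uri: &str) -> Option<T>;
}

impl<T: Clone> Component<T> {
    /// Returns the inline data, or asks the fetcher to resolve the reference.
    fn resolve(&self, fetcher: &dyn Fetch<T>) -> Option<T> {
        match self {
            Component::Data(data) => Some(data.clone()),
            Component::Reference(uri) => fetcher.fetch(uri),
        }
    }
}

/// Toy fetcher backed by an in-memory map, standing in for disk or network access.
struct InMemoryFetcher(HashMap<Uri, Vec<u8>>);

impl Fetch<Vec<u8>> for InMemoryFetcher {
    fn fetch(&self, uri: &str) -> Option<Vec<u8>> {
        self.0.get(uri).cloned()
    }
}
```

With this shape, callers only ever see `Option<T>`: a `Data` variant resolves immediately, while a `Reference` succeeds only if the fetcher can find something at that URI.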

While this approach effectively addresses the size and inefficiency concerns, it does introduce a new challenge: the need for prefetching and buffering.
Of course, scrubbing randomly will now incur delays (at least for this specific component of this specific entity), which isn't any different from any video player on the web!
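
To make the prefetching and buffering idea concrete, here is a rough sketch of a look-ahead frame cache. Everything in it (`FramePrefetcher`, the `fetch` closure, the fixed window) is an assumption made for illustration; a real implementation would most likely fetch asynchronously rather than block as this one does:

```rust
use std::collections::HashMap;

/// Hypothetical cache of decoded frames with a fixed look-ahead window.
struct FramePrefetcher<F: Fn(u64) -> Vec<u8>> {
    fetch: F,                     // stand-in for a slow fetch + decode of one frame
    lookahead: u64,               // how many frames past the cursor to keep warm
    cache: HashMap<u64, Vec<u8>>, // frames currently held in memory
    fetch_count: usize,           // number of slow fetches performed so far
}

impl<F: Fn(u64) -> Vec<u8>> FramePrefetcher<F> {
    fn new(fetch: F, lookahead: u64) -> Self {
        Self { fetch, lookahead, cache: HashMap::new(), fetch_count: 0 }
    }

    /// Returns the frame at `cursor` and makes sure the following
    /// `lookahead` frames are cached as well.
    fn get(&mut self, cursor: u64) -> Vec<u8> {
        let end = cursor + self.lookahead;
        for frame in cursor..=end {
            if !self.cache.contains_key(&frame) {
                let data = (self.fetch)(frame);
                self.cache.insert(frame, data);
                self.fetch_count += 1;
            }
        }
        // Evict everything outside the window so memory use stays bounded.
        self.cache.retain(|frame, _| *frame >= cursor && *frame <= end);
        self.cache[&cursor].clone()
    }
}
```

Sequential playback then pays for one new frame per step, whereas a random jump has to refill the whole window, which is the scrubbing delay described above.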

Opportunities for extension hooks

Like other subsystems in Rerun, there are opportunities for plugins to customize the behavior of references:

  • URI Protocols: implement plugins to support new protocols (e.g. git://, http://, ...) or replace existing ones with a custom implementation (e.g. custom logic to fetch an exact frame).
  • Prefetching and Buffering Logic: Depending on the specific use case, different prefetching and buffering strategies can be developed and implemented.
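
As a sketch of what the protocol hook could look like, the snippet below keeps one resolver per URI scheme and lets a later registration replace an earlier one. The `UriResolver` trait, `ResolverRegistry` and `Fixed` are hypothetical names, not part of any existing Rerun plugin API:

```rust
use std::collections::HashMap;

/// Hypothetical plugin interface: one resolver per URI scheme.
trait UriResolver {
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String>;
}

/// Registry mapping a scheme (the part before "://") to its resolver.
#[derive(Default)]
struct ResolverRegistry {
    resolvers: HashMap<String, Box<dyn UriResolver>>,
}

impl ResolverRegistry {
    /// Registers a resolver, replacing any existing one for the same scheme.
    fn register(&mut self, scheme: &str, resolver: Box<dyn UriResolver>) {
        self.resolvers.insert(scheme.to_string(), resolver);
    }

    /// Dispatches to the resolver registered for the URI's scheme.
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String> {
        let (scheme, _) = uri
            .split_once("://")
            .ok_or_else(|| format!("not a URI: {}", uri))?;
        let resolver = self
            .resolvers
            .get(scheme)
            .ok_or_else(|| format!("no resolver for scheme `{}`", scheme))?;
        resolver.resolve(uri)
    }
}

/// Toy resolver that returns a fixed payload, standing in for real I/O.
struct Fixed(&'static [u8]);

impl UriResolver for Fixed {
    fn resolve(&self, _uri: &str) -> Result<Vec<u8>, String> {
        Ok(self.0.to_vec())
    }
}
```

Registering a second resolver under `"http"` overrides the first, which mirrors the "replace existing ones with a custom implementation" case; unknown schemes surface as errors instead of being silently ignored.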

Moreover, any data that is referred to by URIs needs to be a known Rerun datatype, i.e. data-format plugins are again very relevant here.

Hotswapping data

A nice side-effect of all of this is that the data behind the URI reference can easily be swapped out for something else.

Imagine e.g. that you're working on some computer vision tool for football, and your blueprint always contains a stadium mesh: not only does using a URI prevent bloating all your blueprints with the data for that mesh, it also makes it possible to update the mesh remotely:

E.g. Mesh::Uri(http://footballcvanalytics.com/assets/latest/stadium.glb) which redirects to http://footballcvanalytics.com/assets/0.2.1/stadium.glb... for now!

Packaging

On the surface, using URI references seems to disrupt the convenience of creating self-contained rrd archives.

However, it would be entirely feasible to embed external data directly within the rrd archive and then reference it using e.g. a new URI protocol like: rrd://embedded/assets/myvideo.mp4?ts=1648832391.

This would essentially give us the best of both worlds.
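
A small sketch of how such an embedded reference might be taken apart, assuming the `rrd://embedded/` prefix from the example above. `EmbeddedRef` and `parse_embedded` are hypothetical, and the query string is kept raw rather than interpreted:

```rust
/// Parts of a hypothetical `rrd://embedded/...` reference.
#[derive(Debug, PartialEq)]
struct EmbeddedRef {
    path: String,          // path of the blob inside the archive
    query: Option<String>, // raw selector, e.g. "ts=1648832391"
}

/// Splits an `rrd://embedded/` URI into a blob path and an optional query string.
/// Returns `None` for any other scheme or authority, or for an empty path.
fn parse_embedded(uri: &str) -> Option<EmbeddedRef> {
    let rest = uri.strip_prefix("rrd://embedded/")?;
    let (path, query) = match rest.split_once('?') {
        Some((path, query)) => (path, Some(query.to_string())),
        None => (rest, None),
    };
    if path.is_empty() {
        return None;
    }
    Some(EmbeddedRef { path: path.to_string(), query })
}
```

For `rrd://embedded/assets/myvideo.mp4?ts=1648832391` this yields the path `assets/myvideo.mp4` and the query `ts=1648832391`; anything that is not an embedded reference falls through to whatever handles external URIs.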

@nikolausWest
Member

Really love this suggestion! It could really be a great way of enabling recordings that are much bigger than RAM and it also seems quite extensible. One thought / question:

Would it make sense to make the Reference optional rather than making the Component a union? That way, the user could choose to send the data and its URI (for recreating it) together to avoid the roundtrip of SDK-ref->viewer-ref->URI-data->viewer. When we have the Reference for a component, we know we can GC the Data whenever it makes sense, because we can recreate it later.

An example use case would be:

  • User has a fast image processing pipeline that processes images at 600Hz.
  • They would like to have a live view of what's going on but also be able to scrub back in time and analyze more closely.
  • They therefore log the output of the pipeline (for example object detections) at the full frame rate.
  • The images, by contrast, they write to a local video file (fast), and log:
    • Reference + Data at 30Hz
    • Reference only for all other frames
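
One way to picture the "optional Reference" variant and the GC rule it enables is sketched below. `Cell`, `is_evictable` and `gc` are hypothetical names, and the eviction order (first evictable cell first) is an arbitrary choice made for the example:

```rust
/// Sketch of the "data plus optional reference" layout (all names hypothetical).
struct Cell {
    data: Option<Vec<u8>>,     // inline payload, if currently resident
    reference: Option<String>, // URI from which the payload can be recreated
}

impl Cell {
    /// Data may be dropped only if a reference can bring it back later.
    fn is_evictable(&self) -> bool {
        self.data.is_some() && self.reference.is_some()
    }
}

/// Drops inline data from evictable cells until at most `budget` bytes remain
/// resident, or nothing evictable is left. Returns the number of bytes freed.
fn gc(cells: &mut [Cell], budget: usize) -> usize {
    let mut resident: usize = cells
        .iter()
        .filter_map(|cell| cell.data.as_ref())
        .map(|data| data.len())
        .sum();
    let mut freed = 0;
    for cell in cells.iter_mut() {
        if resident <= budget {
            break;
        }
        if cell.is_evictable() {
            let len = cell.data.take().map_or(0, |data| data.len());
            resident -= len;
            freed += len;
        }
    }
    freed
}
```

In the 600Hz scenario, the 30Hz frames logged as Reference + Data are exactly the evictable ones: the viewer can show them immediately and still reclaim their memory, while data logged without a reference is never dropped by this pass.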

@emilk
Member

emilk commented Jan 11, 2024

We've started calling these external data references promises. They contain some data (e.g. a URI) that lets some plugin find the data for the viewer. A simple promise could be a file name; a more complicated one could be an S3 bucket with login credentials.

Promises would be a great solution for the case of Big Data, Small Index. By "Index" we mean "What data was logged when". For instance, when logging a lot of big images, the data grows quickly, but the index stays small. In comparison, when logging millions of scalars per second, both the index and data volumes grow, and a promise would not help at all.

The Promise could either be a datatype or a component.

Promise as a datatype

If Promise is a datatype, then the high-level index looks the same as if you didn't use a Promise. This means the streams panel looks the same, for instance, and all heuristics would work as expected. The Promise would be resolved early, so that the visualizers would just see the resolved data, and be ignorant of the fact it was backed by a Promise rather than inline arrow data. This would work well for cases where a single component can be huge (e.g. a TensorData datatype).

There should also be some way to replace a whole component array with a single promise, e.g. replace all the positions in a point cloud with a single Datatype (not quite a splat, but similar!).

Promise as a component

If Promise is a component, it could represent a whole entity. It should probably contain the names of the components it will resolve to.

This will produce a very different index for the user. For instance, the Promise component would probably show up in the streams panel.

More here: https://www.notion.so/rerunio/Larger-than-RAM-Seeking-plugins-Promises-and-Resolvers-1dbc3e223d2947db8a8e49cf8773c068?pvs=4

emilk changed the title from "Support for external data references" to "External data references, promises, larger-than-RAM blob store" on Jan 11, 2024
@nikolausWest
Member

Entity links == Promise as component?

Thinking about this a bit more, there are a lot of parallels with something else we've been discussing: entity links. If a reference is a component that contains a URI + a list of expected components, that could work as an internal reference as well. If the URI is just an entity path, then get the listed components from that entity; if it's an https URL, then query that URL with the list of components as parameters, and so on.
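
A rough sketch of that dispatch, under the assumption that anything not starting with `https://` is an entity path. `Reference`, `Plan` and `plan` are hypothetical, as is the `components` query parameter name, and no URL encoding is attempted:

```rust
/// Hypothetical reference component: where to look, and which components to expect.
struct Reference {
    uri: String,
    components: Vec<String>,
}

/// What resolving a reference would involve, depending on the shape of its URI.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Read the listed components from another entity in the same store.
    EntityLookup { entity_path: String, components: Vec<String> },
    /// Query a remote URL, passing the component names as a parameter.
    RemoteQuery { url: String },
}

fn plan(reference: &Reference) -> Plan {
    if reference.uri.starts_with("https://") {
        // Assumption: component names go into a `components` query parameter.
        let sep = if reference.uri.contains('?') { '&' } else { '?' };
        Plan::RemoteQuery {
            url: format!(
                "{}{}components={}",
                reference.uri,
                sep,
                reference.components.join(",")
            ),
        }
    } else {
        Plan::EntityLookup {
            entity_path: reference.uri.clone(),
            components: reference.components.clone(),
        }
    }
}
```

The same `Reference` value thus covers both the internal (entity link) and the external (blob store) case; only the resolution plan differs.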

The operation of moving data out to a separate blob storage could then consist of adding a reference with a uri to the right place in the external blob store and a list of the components on the entity. We could then have a new GC step that starts by looking at all references, and drops data that can be recreated/fetched first. I'm not sure about the details but maybe this actually shifts it from a separate blob storage to a separate row storage?

@bedilbek

Related to #5247

@teh-cmc
Member Author

teh-cmc commented Nov 26, 2024

Fine-grained promises (i.e. at the cell/componentbatch level) have proven to be the wrong tool for the job: we've since removed all remaining traces of them.

External data references, promises and larger-than-RAM storage can all be solved much more efficiently using a combination of our new storage node and Chunk-level processors.

See #8221 for more information about Chunk processors.

teh-cmc closed this as not planned on Nov 26, 2024