Feature idea: Jelly stream transcoders #225

Ostrzyciel · 2024-11-21T14:49:28Z

I've run into a use case where we want to expose a single, contiguous Jelly stream. The source data for this stream is not kept in memory, but instead it's serialized on disk (currently with a W3C serialization). So, to expose the Jelly stream, we first have to parse the W3C files into (for example) RDF4J Statements and then reserialize them as Jelly.

What if we could use Jelly to store the data on disk? Then, when serving the stream we could somehow re-encode the Jelly stream data without deserializing it fully to RDF statements. Instead, we could process only the intermediate Protobuf classes.

Such a transcoder would function much like a Jelly encoder, but would take RdfStreamRows (or RdfStreamFrames) as input. Then:

- If we don't need to change anything in the row, just pass it on.
- If we need to change it (e.g., different prefix ids), then construct a new instance of it and pass that on.
- If we don't need the row at all (e.g., duplicated prefix entry), then ignore it.
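
The three outcomes above can be sketched roughly as follows. This is only an illustration under strong assumptions: `Row`, `PrefixEntry`, `IriRow`, `SketchTranscoder` and `startNewInput` are hypothetical stand-ins, not Jelly's actual Protobuf classes or API, and the sketch ignores Jelly's id compression rules and any lookup eviction (the output lookup is unbounded here).

```scala
import scala.collection.mutable

object RowDecisionSketch {
  // Hypothetical, heavily simplified stand-ins for the Protobuf row types.
  sealed trait Row
  final case class PrefixEntry(id: Int, value: String) extends Row
  final case class IriRow(prefixId: Int, localName: String) extends Row

  final class SketchTranscoder {
    // Prefix value -> id in the output stream (assumed unbounded, no eviction).
    private val outputIdByValue = mutable.HashMap.empty[String, Int]
    // Id in the current input stream -> id in the output stream.
    private val inputToOutput = mutable.HashMap.empty[Int, Int]

    /** Returns the row to emit, or None if the row should be dropped. */
    def ingestRow(row: Row): Option[Row] = row match {
      case entry @ PrefixEntry(inputId, value) =>
        outputIdByValue.get(value) match {
          case Some(outputId) =>
            // Duplicate entry: remember the mapping, emit nothing.
            inputToOutput(inputId) = outputId
            None
          case None =>
            val outputId = outputIdByValue.size + 1
            outputIdByValue(value) = outputId
            inputToOutput(inputId) = outputId
            // Pass the same instance on if the id did not change.
            if (outputId == inputId) Some(entry)
            else Some(PrefixEntry(outputId, value))
        }
      case iri @ IriRow(inputId, localName) =>
        // Assumes a valid input: the id was defined by an earlier entry.
        val outputId = inputToOutput(inputId)
        if (outputId == inputId) Some(iri)
        else Some(IriRow(outputId, localName))
    }

    /** Forget the input-side ids when a new input stream begins. */
    def startNewInput(): Unit = inputToOutput.clear()
  }
}
```

Rows from the first input pass through untouched; rows from later inputs are rewritten only when their ids collide with ones already taken in the output.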

The major part of this is of course the lookup table remapping. To speed this up, we could have fixed-size arrays to hold the mappings. For example: int[] prefixMapping would have n + 1 elements, where n is the size of the prefix lookup in the input. Value prefixMapping[i] tells us how to map prefix id i from the input stream to the output stream. If the mapping is not there (value 0), then use the slow path – get or insert the prefix in the encoder's lookup.
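
A minimal sketch of that fast path / slow path split, assuming an unbounded output lookup keyed by the prefix string. Apart from `prefixMapping`, every name here (`PrefixRemapper`, `onInputEntry`, `remap`, `slowPathHits`) is made up for illustration and is not part of Jelly.

```scala
import scala.collection.mutable

object PrefixRemapSketch {
  final class PrefixRemapper(inputLookupSize: Int) {
    // n + 1 slots so that input ids 1..n index directly; 0 means "no mapping yet".
    private val prefixMapping = new Array[Int](inputLookupSize + 1)
    // Prefix strings currently held by the input lookup, indexed by input id.
    private val inputValues = new Array[String](inputLookupSize + 1)
    // Stand-in for the encoder's prefix lookup (assumption: never evicts).
    private val outputLookup = mutable.HashMap.empty[String, Int]
    // Counter kept only to make the slow path observable in this sketch.
    var slowPathHits = 0

    /** The input stream (re)defines a prefix id, so any cached mapping is stale. */
    def onInputEntry(inputId: Int, value: String): Unit = {
      inputValues(inputId) = value
      prefixMapping(inputId) = 0
    }

    /** Maps an input prefix id to an output prefix id. */
    def remap(inputId: Int): Int = {
      val cached = prefixMapping(inputId)
      if (cached != 0) cached // fast path: one array read
      else {
        // Slow path: get or insert the prefix in the output lookup.
        slowPathHits += 1
        val value = inputValues(inputId)
        val outputId = outputLookup.get(value) match {
          case Some(id) => id
          case None =>
            val id = outputLookup.size + 1
            outputLookup(value) = id
            id
        }
        prefixMapping(inputId) = outputId
        outputId
      }
    }
  }
}
```

The slow path runs once per (re)defined input id; every later reference to the same id is a plain array read.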

I'm thinking of a few possible use cases for these transcoders, and thus a few different APIs that would need to be exposed:

It should not only be possible to concatenate multiple input streams into one, but also to split a single input stream. I'm not sure whether all of this should be stuffed into one API, but one possible way to do this would be something like this:

def ingestRow(row: RdfStreamRow): Iterable[RdfStreamRow]

def ingestFrame(frame: RdfStreamFrame): RdfStreamFrame

// Resets the internal encoder. Further rows will be encoded as if in a new Jelly stream.
def splitStream(): Unit

The transcoder would automatically detect when a new input stream is started based on the presence of RdfStreamOptions. The ingestFrame method is just there for convenience.
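
To make the boundary handling concrete, here is a rough sketch of how such an API might behave. `Row`, `Options`, `Data` and `BoundaryTranscoder` are hypothetical simplifications, not the real RdfStreamRow or RdfStreamOptions types. Emitting the options row only once per output stream is an assumption of this sketch, not something decided above.

```scala
object StreamBoundarySketch {
  // Hypothetical stand-ins for an options row and a data row.
  sealed trait Row
  final case class Options(streamName: String) extends Row
  final case class Data(payload: String) extends Row

  final class BoundaryTranscoder {
    private var optionsEmitted = false
    private var inputsSeen = 0

    def ingestRow(row: Row): Iterable[Row] = row match {
      case o: Options =>
        // An options row marks the start of a new input stream.
        // A real transcoder would also reset its input-side mappings here.
        inputsSeen += 1
        if (optionsEmitted) Nil // assumption: one options row per output stream
        else {
          optionsEmitted = true
          List(o)
        }
      case d: Data => List(d)
    }

    // Further rows are treated as if they belonged to a new output stream.
    def splitStream(): Unit = {
      optionsEmitted = false
    }

    def inputStreamsSeen: Int = inputsSeen
  }
}
```

Concatenating two inputs then yields a single options row followed by the data rows of both, and calling `splitStream()` lets the next options row through again.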

Ostrzyciel · 2024-11-23T08:49:32Z

Merging streams is relatively easy, especially if we make a few assumptions. I'm going to go with two: (1) the input stream is valid (wrt. the spec), and (2) the input's lookup sizes are smaller than or equal to the output's. Then, the number of entries in the output lookup is less than or equal to the number of lookup entries in the input, simplifying the algorithm greatly.

Splitting streams, however, turns out to be pretty hard. After splitting the stream we'd have to re-emit lookup entries, but only those that will be used in the output. This would have to be tracked somehow, and honestly, I'm not sure if it's worth it. At least in the aforementioned use case, when splitting the input stream we have to fully parse it anyway, so any savings here would not be huge.
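
One conceivable way to do that tracking, shown only to illustrate the extra bookkeeping involved: hold back every lookup entry and emit it lazily, right before its first use after a split. The names below (`Splitter`, `PrefixEntry`, `IriRow`) are hypothetical, and the sketch assumes a valid input in which every id is defined before it is referenced.

```scala
object SplitTrackingSketch {
  // Hypothetical, simplified row types.
  sealed trait Row
  final case class PrefixEntry(id: Int, value: String) extends Row
  final case class IriRow(prefixId: Int, localName: String) extends Row

  final class Splitter(lookupSize: Int) {
    // Current content of the input lookup, indexed by id (slot 0 unused).
    private val values = new Array[String](lookupSize + 1)
    // Which entries were already emitted into the current output stream.
    private val emitted = new Array[Boolean](lookupSize + 1)

    def ingestRow(row: Row): List[Row] = row match {
      case PrefixEntry(id, value) =>
        // Remember the entry, but defer emitting it until it is used.
        values(id) = value
        emitted(id) = false
        Nil
      case iri @ IriRow(id, _) =>
        if (emitted(id)) List(iri)
        else {
          emitted(id) = true
          List(PrefixEntry(id, values(id)), iri)
        }
    }

    // After a split nothing has been emitted into the new output stream yet.
    def splitStream(): Unit = java.util.Arrays.fill(emitted, false)
  }
}
```

Even in this toy form, every row that references the lookup has to be inspected, which is in line with the doubt above about whether splitting would save much over a full parse.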

Rather than building a stream splitter, we could investigate resetting the ProtoEncoder instead of destroying and re-allocating it.

Ostrzyciel · 2024-11-23T12:42:51Z

> (2) the input's lookup sizes are smaller than or equal to the output's. Then, the number of entries in the output lookup is less than or equal to the number of lookup entries in the input, simplifying the algorithm greatly.

Not exactly sure about this one; I guess it depends on the lookup eviction policy in use. TODO: check this.

Ostrzyciel added the enhancement New feature or request label Nov 21, 2024

Ostrzyciel added this to the 2.4.0 milestone Nov 21, 2024

Ostrzyciel mentioned this issue Nov 22, 2024

Consider using Jelly for inter-service communication knowledgepixels/nanopub-registry#29

Open

Ostrzyciel self-assigned this Nov 22, 2024

Ostrzyciel added a commit that referenced this issue Nov 23, 2024

WIP: ProtoTranscoder implementation

2c3d45b

Issue: #225 Maybe it works? TODO: - Lookup tests - Transcoder tests in core - Integration tests (extensive) - Code audit – performance, security - Scaladoc/javadoc

Ostrzyciel added a commit that referenced this issue Dec 18, 2024

WIP: ProtoTranscoder implementation

498a5e2

Issue: #225 Maybe it works? TODO: - Lookup tests - Transcoder tests in core - Integration tests (extensive) - Code audit – performance, security - Scaladoc/javadoc

Ostrzyciel linked a pull request Dec 18, 2024 that will close this issue

Add Jelly stream transcoders #236

Draft
