Feature idea: Jelly stream transcoders #225

Open
Ostrzyciel opened this issue Nov 21, 2024 · 2 comments · May be fixed by #236
Labels
enhancement New feature or request

@Ostrzyciel

Ostrzyciel commented Nov 21, 2024

I've run into a use case where we want to expose a single, contiguous Jelly stream. The source data for this stream is not kept in memory; instead, it's serialized on disk (currently with a W3C serialization). So, to expose the Jelly stream, we first have to parse the W3C files into (for example) RDF4J Statements and then reserialize them as Jelly.

What if we could use Jelly to store the data on disk? Then, when serving the stream we could somehow re-encode the Jelly stream data without deserializing it fully to RDF statements. Instead, we could process only the intermediate Protobuf classes.

Such a transcoder would function much like a Jelly encoder, but would take RdfStreamRows (or RdfStreamFrames) as input. Then:

  • If we don't need to change anything in the row, just pass it on.
  • If we need to change it (e.g., different prefix ids), then construct a new instance of it and pass that on instead.
  • If we don't need the row at all (e.g., duplicated prefix entry), then ignore it.
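To make the three cases concrete, here is a rough sketch. Everything in it is hypothetical: `Row`, `PrefixEntry` and `IriRef` are simplified stand-ins rather than the real Jelly Protobuf classes, the output lookup never evicts anything, and the special meaning of id 0 in the real format is ignored. It also assumes a valid input (every referenced prefix id was declared earlier).

```scala
import scala.collection.mutable

object RowSketch {
  // Hypothetical, simplified stand-ins for the Protobuf rows (NOT the real Jelly classes).
  sealed trait Row
  final case class PrefixEntry(id: Int, value: String) extends Row
  final case class IriRef(prefixId: Int, localName: String) extends Row

  final class Transcoder {
    // Output side: prefix string -> id used in the output stream (no eviction in this sketch).
    private val outIds = mutable.Map.empty[String, Int]
    // Input side: prefix id in the current input -> prefix id in the output.
    private val inToOut = mutable.Map.empty[Int, Int]

    def ingestRow(row: Row): Iterable[Row] = row match {
      case PrefixEntry(inId, value) =>
        outIds.get(value) match {
          case Some(outId) =>
            // Case 3: the output already knows this prefix, so the row is dropped.
            inToOut(inId) = outId
            Nil
          case None =>
            val outId = outIds.size + 1
            outIds(value) = outId
            inToOut(inId) = outId
            // Case 1 if the ids happen to match, case 2 otherwise.
            if (outId == inId) List(row) else List(PrefixEntry(outId, value))
        }
      case ref @ IriRef(inId, _) =>
        val outId = inToOut(inId) // assumes the prefix was declared earlier in the input
        // Case 1: pass the same instance on; case 2: build a remapped copy.
        if (outId == inId) List(row) else List(ref.copy(prefixId = outId))
    }
  }
}
```

In the pass-through case the returned row is the same instance that came in, so nothing is allocated for it beyond the small result list.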

The major part of this is of course the lookup table remapping. To speed this up, we could have fixed-size arrays to hold the mappings. For example, int[] prefixMapping would have n + 1 elements, where n is the size of the prefix lookup in the input. The value prefixMapping[i] tells us how to map prefix id i from the input stream to the output stream. If the mapping is not there (value 0), then use the slow path: get or insert the prefix in the encoder's lookup.
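A minimal sketch of that fast path / slow path split, under some loud assumptions: `PrefixRemapper`, `onInputEntry` and `remap` are made-up names, the "encoder's lookup" is replaced by a plain mutable map that never evicts, and `slowPathHits` exists only to make the behaviour observable.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Jelly. inputLookupSize is "n" from the text above.
final class PrefixRemapper(inputLookupSize: Int) {
  // Index 0 is unused; a stored value of 0 means "no mapping cached yet".
  private val prefixMapping = new Array[Int](inputLookupSize + 1)
  // Input id -> prefix string, as declared by the input's lookup entries.
  private val inputValues = new Array[String](inputLookupSize + 1)
  // Stand-in for the output encoder's prefix lookup (no eviction in this sketch).
  private val outputLookup = mutable.Map.empty[String, Int]

  var slowPathHits = 0

  // Called when the input stream (re)defines a prefix id.
  def onInputEntry(inputId: Int, prefix: String): Unit = {
    inputValues(inputId) = prefix
    prefixMapping(inputId) = 0 // the old mapping no longer applies
  }

  // Maps a prefix id from the input stream to a prefix id in the output stream.
  def remap(inputId: Int): Int = {
    val cached = prefixMapping(inputId)
    if (cached != 0) cached // fast path: a single array read
    else {
      // Slow path: get or insert the prefix in the output lookup, then cache the result.
      slowPathHits += 1
      val outId = outputLookup.getOrElseUpdate(inputValues(inputId), outputLookup.size + 1)
      prefixMapping(inputId) = outId
      outId
    }
  }
}
```

One thing the sketch leaves out: if the output lookup evicts an entry, every cached `prefixMapping` slot pointing at it would have to be zeroed as well.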

I'm thinking of a few possible use cases for these transcoders, and thus a few different APIs that would need to be exposed:

It should be possible not only to concatenate multiple input streams into one, but also to split a single input stream. I'm not sure if all of this should be stuffed into one API, but one possible way to do it would be something like this:

def ingestRow(row: RdfStreamRow): Iterable[RdfStreamRow]

def ingestFrame(frame: RdfStreamFrame): RdfStreamFrame

// Resets the internal encoder. Further rows will be encoded as if in a new Jelly stream.
def splitStream(): Unit

The transcoder would automatically detect when a new input stream starts, based on the presence of RdfStreamOptions. The ingestFrame method is there just for convenience.
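Here is how the stream-boundary handling of such an API could look, as a sketch only. `Row`, `Options`, `Data` and `Frame` are hypothetical simplified types (not RdfStreamRow, RdfStreamOptions or RdfStreamFrame), lookup remapping is left out entirely, and `inputStreamsSeen` is only there for illustration.

```scala
object TranscoderApiSketch {
  // Hypothetical, simplified stand-ins for the Protobuf messages.
  sealed trait Row
  final case class Options(name: String) extends Row
  final case class Data(payload: String) extends Row
  final case class Frame(rows: Seq[Row])

  final class Transcoder(outputOptions: Options) {
    // True until the current output stream has received its options row.
    private var needsOptions = true
    var inputStreamsSeen = 0

    def ingestRow(row: Row): Iterable[Row] = {
      // Every output stream starts with exactly one options row.
      val header =
        if (needsOptions) { needsOptions = false; List(outputOptions) }
        else Nil
      row match {
        case _: Options =>
          // An options row marks the start of a new input stream;
          // input-side state (e.g. id mappings) would be reset here.
          inputStreamsSeen += 1
          header
        case other =>
          header :+ other
      }
    }

    // Convenience wrapper: transcode all rows of a frame.
    def ingestFrame(frame: Frame): Frame = Frame(frame.rows.flatMap(ingestRow))

    // Further rows will be emitted as if in a new output stream.
    def splitStream(): Unit = {
      needsOptions = true
    }
  }
}
```

With this shape, concatenating inputs needs no extra call: the options rows of the second and later inputs are simply swallowed, while `splitStream()` makes the next emitted row start a fresh output stream.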

@Ostrzyciel

Merging streams is relatively easy, especially if we make a few assumptions. I'm going to go with two: (1) the input stream is valid (with respect to the spec), and (2) the input's lookup sizes are smaller than or equal to the output's. Then, the number of lookup entries in the output is less than or equal to the number of lookup entries in the input, simplifying the algorithm greatly.
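Assumption (2) is cheap to check up front when the input's options arrive. A sketch, where `LookupSizes` is a hypothetical case class mirroring the three lookup-size settings in the stream options (name, prefix, datatype), not the real options class:

```scala
object MergePrecondition {
  // Hypothetical mirror of the name/prefix/datatype lookup sizes from the stream options.
  final case class LookupSizes(name: Int, prefix: Int, datatype: Int)

  // True if every input lookup is no larger than the corresponding output lookup.
  def fitsInto(input: LookupSizes, output: LookupSizes): Boolean =
    input.name <= output.name &&
      input.prefix <= output.prefix &&
      input.datatype <= output.datatype
}
```

An input that fails this check could be rejected or sent through the regular parse-and-reencode path instead.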

Splitting streams, however, turns out to be pretty hard. After splitting the stream we'd have to re-emit lookup entries, but only those that will be used in the output. This would have to be tracked somehow, and honestly, I'm not sure if it's worth it. At least in the aforementioned use case, when splitting the input stream we have to fully parse it anyway, so any savings here would not be huge.

Instead of the stream splitter, we could investigate resetting ProtoEncoder rather than destroying and re-allocating it.
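To illustrate what "resetting" could mean for the lookup part, here is a sketch with a made-up `ReusableLookup` class. It is not ProtoEncoder's actual structure or API, and eviction is out of scope. The point is only that the backing array is allocated once and cleared in place.

```scala
// Hypothetical lookup table that can be reused across output streams.
final class ReusableLookup(capacity: Int) {
  // Allocated once; index 0 is unused so that ids start at 1.
  private val entries = new Array[String](capacity + 1)
  private var used = 0

  // Appends a value and returns its id.
  def add(value: String): Int = {
    require(used < capacity, "lookup full; eviction is out of scope for this sketch")
    used += 1
    entries(used) = value
    used
  }

  def get(id: Int): String = entries(id)

  def size: Int = used

  // Clears the occupied slots in place instead of allocating a new array.
  def reset(): Unit = {
    var i = 1
    while (i <= used) {
      entries(i) = null
      i += 1
    }
    used = 0
  }
}
```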

Ostrzyciel added a commit that referenced this issue Nov 23, 2024
Issue: #225

Maybe it works?

TODO:

- Lookup tests
- Transcoder tests in core
- Integration tests (extensive)
- Code audit – performance, security
- Scaladoc/javadoc
@Ostrzyciel

(2) the input's lookup sizes are smaller than or equal to the output's. Then, the number of lookup entries in the output is less than or equal to the number of lookup entries in the input, simplifying the algorithm greatly.

I'm not exactly sure about this one. I guess it depends on the lookup eviction policy that is used. TODO: check this.

Ostrzyciel added a commit that referenced this issue Dec 18, 2024
@Ostrzyciel Ostrzyciel linked a pull request Dec 18, 2024 that will close this issue