Consider using Jelly for inter-service communication #29

Open
Ostrzyciel opened this issue Nov 19, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@Ostrzyciel
Contributor

This is a follow-up of what we discussed on the last Nano Session.

The idea is to use Jelly, a high-performance RDF serialization format and streaming protocol for communication between microservices in the next generation of the Nanopub infrastructure.

Jelly can act as a simple serialization format (like N-Triples or Turtle), turning a bunch of triples/quads into bytes. But it can also serialize a stream of RDF graphs/datasets, which is where its main advantage lies. Let's say we have a series of nanopubs (RDF datasets) – if we serialized them as a bunch of TriG files, we would have to repeat the same prefixes, property names, classes, etc. in every file. If we could avoid repeating ourselves, we could reduce the size of the serialized data and also speed things up (less work = faster). See the benchmarks below for numbers.
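
A toy sketch of the "don't repeat yourself" idea, in Python for brevity. It is not Jelly's actual wire format: the `StreamEncoder` class and the row layout are hypothetical and only show how a lookup table shared across the whole stream lets later datasets refer to already-seen terms by ID.

```python
from typing import Iterable

Triple = tuple[str, str, str]


class StreamEncoder:
    """Toy encoder: each distinct term is written once, then referenced by ID."""

    def __init__(self) -> None:
        # Lookup table that lives for the whole stream, not for one dataset.
        self.ids: dict[str, int] = {}

    def encode(self, dataset: Iterable[Triple]) -> list[tuple]:
        rows: list[tuple] = []
        for triple in dataset:
            refs = []
            for term in triple:
                if term not in self.ids:
                    # First occurrence in the stream: emit the string once.
                    self.ids[term] = len(self.ids)
                    rows.append(("term", self.ids[term], term))
                refs.append(self.ids[term])
            rows.append(("triple", *refs))
        return rows


def string_bytes(rows: list[tuple]) -> int:
    """Count the bytes spent on term strings in the encoded rows."""
    return sum(len(r[2].encode()) for r in rows if r[0] == "term")


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NP_CLASS = "http://www.nanopub.org/nschema#Nanopublication"

# Two hypothetical one-triple datasets that share a property and a class.
np1 = [("http://example.org/np1", RDF_TYPE, NP_CLASS)]
np2 = [("http://example.org/np2", RDF_TYPE, NP_CLASS)]

encoder = StreamEncoder()
first = encoder.encode(np1)
second = encoder.encode(np2)
```

Here `first` carries three term strings, while `second` carries only the new subject IRI and refers to the property and class by ID.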

Jelly is currently implemented for Jena 5 and RDF4J 5, with a full integration with the relevant I/O APIs. The implementation is open-source (Apache 2.0 license) and exhaustively tested in CI/CD. When you just want to serialize one RDF dataset, all you need to do is to add a Maven dependency on Jelly and use the standard RDF4J Rio API. That's all, it just works.

To implement streams of RDF datasets, you can either reuse Jelly's jelly-stream module, which is based on Apache Pekko Streams, or write something on your own. jelly-stream can be trivially integrated with gRPC (module jelly-grpc provides a ready pub/sub gRPC service), Kafka, WebSocket, MQTT, or whatever else is supported by Pekko.

Design

Okay, so where could it be used? In short – anywhere there are microservices talking to each other. Quick mockup:

nanopub_jelly_mockup

  • Nanopub registry -> registry replication.
    • We could send new nanopubs in batches – one HTTP GET response would correspond to several new nanopubs.
    • Alternatively, we could maintain a real-time stream with gRPC (HTTP/2 transport) or WebSocket that would send the new nanopubs exactly at the moment they are added. This would remove polling entirely. Latencies would drop dramatically.
  • Nanopub registry -> nanopub query
    • Same mechanism as above
  • Nanopub registry dumps
    • It should be much faster with Jelly to make a huge nanopublication dump than with other formats. It will also be way faster to load such a dump when spinning up a new registry server.

Technical

I'd rate this as "very doable", because Nanopub Registry already uses RDF4J 5.0.2, with which Jelly is fully compatible. I've already implemented similar stuff for another RDF4J app (sorry, private code, can't share it), and it works fine with MQTT, Kafka, and long-running gRPC streams.

Benchmarks

The benchmarks on the Jelly website were conducted with a mix of 13 different datasets, one of which is a nanopublication dump. Here I post the disaggregated results for the nanopub dataset only.

The scenario considered here is "grouped RDF streaming", so transmitting a series of discrete RDF datasets. In our case, 1 dataset = 1 nanopub. The benchmarks without the network stack were repeated 15 times, first 5 runs discarded to account for JVM warmup. With network: 8 runs, first 3 discarded.
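
The warmup handling described above amounts to the following calculation. This is a minimal sketch, not the benchmark code linked below: the function name is hypothetical and the throughput numbers are made up purely to show the shape of the computation.

```python
from statistics import mean


def steady_state_mean(throughputs: list[float], warmup_runs: int) -> float:
    """Average the runs that remain after dropping the first `warmup_runs`."""
    if warmup_runs >= len(throughputs):
        raise ValueError("no measured runs left after discarding warmup")
    return mean(throughputs[warmup_runs:])


# Made-up numbers: 5 slow warmup runs followed by 10 steady runs (15 in total).
runs = [1.0, 2.0, 3.0, 4.0, 5.0] + [10.0] * 10
result = steady_state_mean(runs, warmup_runs=5)
```

For the networked benchmarks the same function would be called with 8 runs and `warmup_runs=3`.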

Hardware: AMD Ryzen 9 7900 (12-core, 24-thread, 5.0 GHz); 64 GB RAM (DDR5 5600 MT/s). The disk was not used during the benchmarks (all data was in memory). The throughput benchmarks are single-threaded, but the JVM was allowed to use all available cores for garbage collection, JIT compilation, and other tasks.

Software: Linux kernel 6.10.11, Oracle GraalVM 23.0.1+11.1, Apache Jena 5.2.0, Eclipse RDF4J 5.0.2, Jelly-JVM 2.2.2. Benchmark code: https://github.com/Jelly-RDF/jvm-benchmarks/tree/dd58f5de0916c1223ca115052567c1fb39f4cd62

Serialization (writing) throughput

Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a null byte stream.

Higher is better.

Jena:

Pasted image 20241119180925

RDF4J:

Pasted image 20241119180645

Deserializing (parsing) throughput

Deserializing 100k nanopubs from a memory-backed byte stream to a series of Iterable[Statement], where each iterable is one RDF dataset. Constructing an RDF4J Model object is not included here, because that involves creating hashmaps and whatnots and is not the concern of the serialization format itself.

Jena:

Pasted image 20241119181003

RDF4J:

Pasted image 20241119180742

Serialized representation size

Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a byte-counting stream.

Less is better.

Pasted image 20241119181511

Streaming over the network – Kafka

One producer sending 10k nanopubs over Kafka (1 RDF dataset = 1 Kafka message) to one consumer. All software is on the same host (unlimited bandwidth). I have this benchmark only for Jena, but the results should be very similar for RDF4J.

Higher is better.

Pasted image 20241119181840

Same, with producer network bandwidth limited to 100 Mbit/s, with a 10 ms one-way latency. The consumer had unlimited network to the broker.

Pasted image 20241119181942

These results aren't that great, but it's possible to get more with gRPC. Also, you could use better compression than gzip.
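
A back-of-the-envelope way to see why serialized size matters once bandwidth is capped. The function below is a hypothetical helper, and the byte sizes in the example are illustrative assumptions, not measured values. It gives only an upper bound: it ignores latency, protocol overhead, and acknowledgements.

```python
def max_datasets_per_second(bandwidth_mbit_s: float, avg_dataset_bytes: float) -> float:
    """Upper bound on datasets/s that fit through a link of the given bandwidth."""
    bytes_per_second = bandwidth_mbit_s * 1_000_000 / 8
    return bytes_per_second / avg_dataset_bytes


# Hypothetical average sizes per serialized nanopub, for illustration only.
bound_large = max_datasets_per_second(100, avg_dataset_bytes=2500)
bound_small = max_datasets_per_second(100, avg_dataset_bytes=1250)
```

On a 100 Mbit/s link, halving the average serialized size doubles the ceiling on nanopubs per second.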

Streaming over the network – gRPC

Same case as in Kafka, but this is a direct point-to-point connection using gRPC. Here there are only results for Jelly, because gRPC is more-or-less directly integrated with Protocol Buffers, so it would be pretty hard to integrate the other formats with gRPC.

Unlimited network:

Pasted image 20241119185600

100 Mbit/s, 10 ms one-way latency:

Pasted image 20241119185606

My involvement

I would be glad to help with implementing this. I have already implemented Jelly in a few Jena/RDF4J apps, so this shouldn't be too hard.

If you want me to test Jelly with a specific network environment (bandwidth, latency), streaming protocol, and compression settings, let me know. Currently the tests only use gzip, but it's a horrible compression algorithm. I think the best-best combination would be zstd + gRPC.

So...

What do you think?

@Ostrzyciel
Contributor Author

@tkuhn pinging, just in case :)

@tkuhn
Contributor

tkuhn commented Nov 21, 2024

Thanks for this very detailed issue description. Yes, makes a lot of sense, would be great to use Jelly!

I think to start the best place would be to look at inter-registry communication, so when a new Nanopub Registry is gathering all nanopublications it is interested in, then it could connect to other Nanopub Registries and get the respective nanopubs as a Jelly stream.

Currently, only the "core" nanopublications are loaded, which are agent intros and approvals. And they are then partitioned based on pubkey and type. For example here: https://registry.np.kpxl.org/list/1162349fdeaf431e71ab55898cb2a425b971d466150c2aa5b3c1beb498045a37/4a9a4dc2fa939033dd444d4f630fec3450eee305d98dd9945f110b2a6bb51317 (all approval-nanopubs for one of my own pubkeys)

The lists aren't very long at the moment. But in the near future I will implement the full loading of nanopublications (apart from the agent intros and approvals), and then we'll have more.

As a start we could make the registry provide each such list as a Jelly stream, so we could try to consume these streams with one of your tools. And then as a next step, we could implement the registries consuming these streams themselves. And then when we transition Nanopub Query to get its nanopubs from Nanopub Registry (it currently still gets them from the older nanopub-servers), we can implement this directly as Jelly streams. How does that sound?

@Ostrzyciel
Contributor Author

Makes sense!

So, regarding the transport – we stick with just plain HTTP requests, yes?

So, I'd have a request like:

And the response would be an arbitrarily large Jelly file with all nanopubs in this list. The response would be streamed, of course. I can implement this without adding crazy dependencies like Pekko, just jelly-core, jelly-rdf4j and the protobuf library. I'll try to whip up a prototype for this when I have a moment.
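
One common way to put many messages into a single streamed response is varint length-prefixed framing, the technique Protocol Buffers uses for delimited messages. The sketch below shows that technique in plain Python; the function names are hypothetical, and whether the prototype frames its response exactly this way is an assumption.

```python
import io
from typing import BinaryIO, Iterable, Iterator, Optional


def write_varint(out: BinaryIO, value: int) -> None:
    """Write an unsigned integer as a base-128 varint (7 bits per byte)."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))  # high bit set: more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_varint(inp: BinaryIO) -> Optional[int]:
    """Read one varint; return None on a clean end of stream."""
    result, shift = 0, 0
    while True:
        raw = inp.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a length prefix")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def write_frames(out: BinaryIO, frames: Iterable[bytes]) -> None:
    """Write each frame as <varint length><payload>."""
    for frame in frames:
        write_varint(out, len(frame))
        out.write(frame)


def read_frames(inp: BinaryIO) -> Iterator[bytes]:
    """Yield frames one at a time, without loading the whole stream."""
    while (length := read_varint(inp)) is not None:
        frame = inp.read(length)
        if len(frame) != length:
            raise EOFError("stream ended inside a frame")
        yield frame


# Placeholder payloads standing in for serialized nanopubs.
buffer = io.BytesIO()
write_frames(buffer, [b"nanopub-1", b"nanopub-2"])
buffer.seek(0)
decoded = list(read_frames(buffer))
```

Because the reader only needs the next length prefix to know where a frame ends, the consumer can process nanopubs as they arrive instead of waiting for the full response.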

gRPC would make sense in pub/sub scenarios, because it eliminates polling. But, it also introduces a lot of additional complexity. I guess this could be investigated later.

@Ostrzyciel
Contributor Author

OK, I made a quick PoC... it actually works #30

But the TriG representation in the DB will be a bottleneck (see the PR).

@tkuhn
Contributor

tkuhn commented Nov 22, 2024

Fantastic, thank you! Yes, we should store the nanopub in Jelly format too.

@Ostrzyciel
Contributor Author

Continuing the discussion from #30:

1.

I've opened an issue to implement Jelly transcoders – this will be needed to efficiently transform multiple Jelly files stored in MongoDB into a single stream: Jelly-RDF/jelly-jvm#225
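
To illustrate what such a transcoder has to do, here is a toy sketch. It is not the design from the linked issue: the `EncodedFile` layout and `merge_files` are hypothetical. Each stored file has its own term table, so merging means remapping local IDs onto one shared table, without parsing the triples back into RDF objects.

```python
IdTriple = tuple[int, int, int]
# (local term table, triples expressed as indexes into that table)
EncodedFile = tuple[list[str], list[IdTriple]]


def merge_files(files: list[EncodedFile]) -> EncodedFile:
    """Merge several encoded files into one, remapping local IDs to shared IDs."""
    shared_ids: dict[str, int] = {}
    merged_triples: list[IdTriple] = []
    for terms, triples in files:
        # Map this file's local IDs onto the shared table, adding unseen terms.
        remap = [shared_ids.setdefault(term, len(shared_ids)) for term in terms]
        merged_triples.extend((remap[s], remap[p], remap[o]) for s, p, o in triples)
    # Dicts keep insertion order, so the keys are already sorted by shared ID.
    return list(shared_ids), merged_triples


# Two hypothetical stored files with differently ordered term tables.
file_a: EncodedFile = (["ex:a", "rdf:type", "np:Nanopublication"], [(0, 1, 2)])
file_b: EncodedFile = (["rdf:type", "ex:b", "np:Nanopublication"], [(1, 0, 2)])

merged_terms, merged_triples = merge_files([file_a, file_b])
```

The shared table ends up with four terms instead of six, and the second file's triple is rewritten to point at the shared IDs.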

2.

The only thing I realized is that Jelly doesn't preserve the prefixes. Is there a way to keep them? Otherwise nanopubs will look ugly or we have to guess prefixes, both of which are not ideal.

At the moment, Jelly does not have a way to preserve prefix declarations. It only uses prefixes internally for compressing the IRIs, but these prefixes are arbitrary and do not have names.

Honestly, I just wanted to cover the entire RDF Abstract Syntax with Jelly, and prefix declarations are not there. But, it's true that many RDF formats (including binary ones) have this feature for the sake of convenience. I think we could add this to Jelly 1.1, but disable the feature by default, to keep backward compatibility.

3.

I think we definitely can/should store them in Jelly format too. In principle we could even store them only in Jelly format. But not sure how stable this is? TriG seems a bit of a safer choice as a proper standard. But we can have both side-by-side, which is a bit redundant, but storage space is the cheapest of all resources (compared to memory, bandwidth, ...), so that seems a good choice.

Like you said, storage is cheap, and the TriG representation is used in a lot of places, so I think it would make sense to store both. It's also good to have a human-readable fallback and backup in case something glitches out horribly.

If Mongo compresses its records, then the size impact should not be huge. Most of the file size in Jelly is strings (IRIs and literals), and these are stored raw. So, the byte-level compressor should be able to pick up the same byte patterns in the TriG and Jelly serializations. But, to be honest, I don't really know how Mongo stores its data.

What's next?

I will be travelling for the next few weeks (conferences and stuff – including promoting Jelly :) ), so I'll have less time for this. I will work on points 1. and 2. mentioned here and then I will come back to work on storing the Nanopubs in Jelly.

@tkuhn
Contributor

tkuhn commented Nov 22, 2024

Sounds great!

I will try to find time to look into 3.

With respect to prefixes, we could also find a workaround. Such as providing triples like NP x:hasPrefix "abc: <...>" for each nanopublication (but outside of the nanopub graphs).

@Ostrzyciel
Contributor Author

With respect to prefixes, we could also find a workaround. Such as providing triples like NP x:hasPrefix "abc: <...>" for each nanopublication (but outside of the nanopub graphs).

I think there is already a vocabulary for this, as part of SHACL: https://www.w3.org/TR/shacl/#sparql-prefixes

I'm not sure about this, it's always a trade-off, because this is another layer of syntax on top of raw nanopublications, increasing complexity. Libraries like Jena and RDF4J actually expect RDF formats to support prefix declarations natively (this is of course optional). If we put the prefix declarations as triples in the graph, they start to be part of the data itself, and something like Jena would not be able to interpret that.
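
For concreteness, this is roughly what the workaround would produce with the SHACL terms linked above (`sh:declare`, `sh:prefix`, `sh:namespace`). The helper name, the subject IRI, and the blank node label are hypothetical, and the sketch says nothing about where these triples would sit relative to the nanopub graphs.

```python
SH = "http://www.w3.org/ns/shacl#"
XSD_ANY_URI = "http://www.w3.org/2001/XMLSchema#anyURI"


def prefix_declaration(subject_iri: str, prefix: str, namespace: str, bnode: str) -> list[str]:
    """Return N-Triples lines declaring `prefix` for `subject_iri` via sh:declare."""
    return [
        f"<{subject_iri}> <{SH}declare> _:{bnode} .",
        f'_:{bnode} <{SH}prefix> "{prefix}" .',
        f'_:{bnode} <{SH}namespace> "{namespace}"^^<{XSD_ANY_URI}> .',
    ]


# Hypothetical nanopub IRI and prefix, for illustration only.
lines = prefix_declaration("http://example.org/np1", "ex", "http://example.org/", "b0")
```

Each prefix costs three ordinary triples, which is the trade-off described above: the declarations become data that a parser like Jena or RDF4J will not treat as prefix mappings.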

@tkuhn
Contributor

tkuhn commented Nov 22, 2024

Yes, you are right. If the next version of Jelly can preserve prefixes, that'd of course be the ideal solution.
