Consider using Jelly for inter-service communication #29
@tkuhn pinging, just in case :)
Thanks for this very detailed issue description. Yes, this makes a lot of sense – it would be great to use Jelly!

I think the best place to start would be inter-registry communication: when a new Nanopub Registry is gathering all the nanopublications it is interested in, it could connect to other Nanopub Registries and get the respective nanopubs as a Jelly stream. Currently, only the "core" nanopublications are loaded, which are agent intros and approvals. They are then partitioned based on pubkey and type, for example here: https://registry.np.kpxl.org/list/1162349fdeaf431e71ab55898cb2a425b971d466150c2aa5b3c1beb498045a37/4a9a4dc2fa939033dd444d4f630fec3450eee305d98dd9945f110b2a6bb51317 (all approval nanopubs for one of my own pubkeys). The lists aren't very long at the moment, but in the near future I will implement the full loading of nanopublications (beyond the agent intros and approvals), and then we'll have more.

As a start, we could make the registry provide each such list as a Jelly stream, so we could try to consume these streams with one of your tools. As a next step, we could implement the registries consuming these streams themselves. And then, when we transition Nanopub Query to get its nanopubs from Nanopub Registry (it currently still gets them from the older nanopub-servers), we can implement this directly as Jelly streams.

How does that sound?
Makes sense! Regarding the transport – we stick with plain HTTP requests, yes? So I'd have a request like:
And the response would be an arbitrarily large Jelly file with all nanopubs in this list. The response would be streamed, of course. I can implement this without adding crazy dependencies like Pekko. gRPC would make sense in pub/sub scenarios, because it eliminates polling, but it also introduces a lot of additional complexity. I guess this could be investigated later.
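To make the transport idea above concrete, here is a minimal sketch of what such a request could look like, using only the JDK's built-in HTTP client. The list URL pattern is taken from the registry link earlier in this thread (with PUBKEY_HASH/TYPE_HASH as placeholders), and the "application/x-jelly-rdf" Accept header is an assumption – the registry might instead pick a different content-negotiation mechanism.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class JellyListRequest {
    // Build a GET request for a registry list, asking for Jelly via content
    // negotiation. The "application/x-jelly-rdf" media type is an assumption
    // here; the actual mechanism is up to the registry implementation.
    static HttpRequest buildRequest(String listUrl) {
        return HttpRequest.newBuilder()
                .uri(URI.create(listUrl))
                .header("Accept", "application/x-jelly-rdf")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // PUBKEY_HASH and TYPE_HASH stand in for the hashes in the list URLs.
        HttpRequest req = buildRequest(
                "https://registry.np.kpxl.org/list/PUBKEY_HASH/TYPE_HASH");
        System.out.println(req.method() + " " + req.uri());
        System.out.println("Accept: "
                + req.headers().firstValue("Accept").orElse(""));
    }
}
```

The response body would then be consumed as a stream (e.g. `HttpResponse.BodyHandlers.ofInputStream()`), so the whole Jelly file never has to fit in memory at once.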
OK, I made a quick PoC... and it actually works: #30. But the TriG representation in the DB will be a bottleneck (see the PR).
Fantastic, thank you! Yes, we should store the nanopub in Jelly format too.
Continuing the discussion from #30:

1. I've opened an issue to implement Jelly transcoders – this will be needed to efficiently transform multiple Jelly files stored in MongoDB into a single stream: Jelly-RDF/jelly-jvm#225
2. At the moment, Jelly does not have a way to preserve prefix declarations. It only uses prefixes internally for compressing the IRIs, but these prefixes are arbitrary and do not have names. Honestly, I just wanted to cover the entire RDF Abstract Syntax with Jelly, and prefix declarations are not part of it. But it's true that many RDF formats (including binary ones) have this feature for the sake of convenience. I think we could add this in Jelly 1.1, but disable the feature by default to keep backward compatibility.
3. Like you said, storage is cheap, and the TriG representation is used in a lot of places, so I think it would make sense to store both. It's also good to have a human-readable fallback and backup in case something glitches out horribly. If Mongo compresses its records, then the size impact should not be huge: most of the file size in Jelly is strings (IRIs and literals), and these are stored raw, so a byte-level compressor should be able to pick up the same byte patterns in the TriG and Jelly serializations. But, to be honest, I don't really know how Mongo stores its data.

What's next? I will be travelling for the next few weeks (conferences and stuff – including promoting Jelly :) ), so I'll have less time for this. I will work on points 1 and 2 mentioned here, and then I will come back to work on storing the nanopubs in Jelly.
Sounds great! I will try to find time to look into 3. With respect to prefixes, we could also find a workaround, such as providing triples like:
I think there is already a vocabulary for this, as part of SHACL: https://www.w3.org/TR/shacl/#sparql-prefixes I'm not sure about this, though; it's always a trade-off, because this is another layer of syntax on top of raw nanopublications, increasing complexity. Libraries like Jena and RDF4J actually expect RDF formats to support prefix declarations natively (this is of course optional). If we put the prefix declarations as triples in the graph, they become part of the data itself, and something like Jena would not be able to interpret that.
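For illustration, the SHACL vocabulary referenced above would express a prefix declaration as data roughly like this (a sketch based on the SHACL spec; the ex: subject name is made up):

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# Hypothetical example: declaring the "np" prefix as triples,
# using SHACL's sh:declare / sh:prefix / sh:namespace terms.
ex:nanopubPrefixes
    sh:declare [
        sh:prefix "np" ;
        sh:namespace "http://www.nanopub.org/nschema#"^^xsd:anyURI ;
    ] .
```

This shows the trade-off discussed here: the declaration is now ordinary triples, so a parser has no way to treat it as syntax-level prefix information.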
Yes, you are right. If the next version of Jelly can preserve prefixes, that'd of course be the ideal solution.
This is a follow-up of what we discussed on the last Nano Session.
The idea is to use Jelly, a high-performance RDF serialization format and streaming protocol, for communication between microservices in the next generation of the Nanopub infrastructure.
Jelly can act as a simple serialization format (like N-Triples or Turtle), turning a bunch of triples/quads into bytes. But it can also serialize a stream of RDF graphs/datasets, which is where its main advantage lies. Say we have a series of nanopubs (RDF datasets): if we serialized them as a bunch of TriG files, we would have to repeat the same prefixes, property names, classes, etc. each time. If we avoided repeating ourselves, we could reduce the size of the serialized data and also speed things up (less work = faster). See the benchmarks below for numbers.
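The "don't repeat yourself" effect can be illustrated with a toy dictionary encoder: the first time a term appears it is sent in full, and afterwards only a small integer ID is sent. This is NOT Jelly's actual wire format (Jelly uses Protobuf lookup tables, split by prefix/name/datatype), just the intuition behind why repeated terms across many nanopubs compress well.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DictionaryIntuition {
    // Toy model: a term costs its full string length plus a 4-byte ID on
    // first occurrence, and only the 4-byte ID on every repeat.
    static int encodedSize(List<String> terms) {
        Map<String, Integer> table = new HashMap<>();
        int bytes = 0;
        for (String term : terms) {
            if (!table.containsKey(term)) {
                table.put(term, table.size());
                bytes += term.length() + 4; // full string + new ID
            } else {
                bytes += 4; // just the ID
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        // Simulate a stream of nanopubs: the same predicate IRI repeated
        // 1000 times, interleaved with 1000 unique subject IRIs.
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            terms.add("http://www.nanopub.org/nschema#hasAssertion"); // repeated
            terms.add("http://example.org/np/" + i);                  // unique
        }
        int naive = terms.stream().mapToInt(String::length).sum();
        System.out.println("naive: " + naive + " bytes, dictionary: "
                + encodedSize(terms) + " bytes");
    }
}
```

In this toy model the repeated predicate is paid for once and then costs 4 bytes per occurrence, which is why a continuous stream beats serializing each nanopub as an independent TriG file.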
Jelly is currently implemented for Jena 5 and RDF4J 5, with a full integration with the relevant I/O APIs. The implementation is open-source (Apache 2.0 license) and exhaustively tested in CI/CD. When you just want to serialize one RDF dataset, all you need to do is to add a Maven dependency on Jelly and use the standard RDF4J Rio API. That's all, it just works.
To implement streams of RDF datasets, you can either reuse Jelly's jelly-stream module, which is based on Apache Pekko Streams, or write something on your own. jelly-stream can be trivially integrated with gRPC (the jelly-grpc module provides a ready pub/sub gRPC service), Kafka, WebSocket, MQTT, or whatever else is supported by Pekko.

Design
Okay, so where could it be used? In short – anywhere microservices talk to each other. Quick mockup:
Technical
I'd rate this as "very doable", because Nanopub Registry already uses RDF4J 5.0.2, with which Jelly is fully compatible. I've already implemented similar stuff for another RDF4J app (sorry, private code, can't share it), and it works fine with MQTT, Kafka, and long-running gRPC streams.
Benchmarks
The benchmarks on the Jelly website were conducted with a mix of 13 different datasets, one of which is a nanopublication dump. Here I post the disaggregated results for the nanopub dataset only.
The scenario considered here is "grouped RDF streaming", so transmitting a series of discrete RDF datasets. In our case, 1 dataset = 1 nanopub. The benchmarks without the network stack were repeated 15 times, first 5 runs discarded to account for JVM warmup. With network: 8 runs, first 3 discarded.
Hardware: AMD Ryzen 9 7900 (12-core, 24-thread, 5.0 GHz); 64 GB RAM (DDR5 5600 MT/s). The disk was not used during the benchmarks (all data was in memory). The throughput benchmarks are single-threaded, but the JVM was allowed to use all available cores for garbage collection, JIT compilation, and other tasks.
Software: Linux kernel 6.10.11, Oracle GraalVM 23.0.1+11.1, Apache Jena 5.2.0, Eclipse RDF4J 5.0.2, Jelly-JVM 2.2.2. Benchmark code: https://github.com/Jelly-RDF/jvm-benchmarks/tree/dd58f5de0916c1223ca115052567c1fb39f4cd62
Serialization (writing) throughput
Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a null byte stream. Higher is better.
Jena:
RDF4J:
Deserializing (parsing) throughput
Deserializing 100k nanopubs from a memory-backed byte stream to a series of Iterable[Statement] objects, where each iterable is one RDF dataset. Constructing an RDF4J Model object is not included here, because that involves creating hashmaps and whatnot, and is not the concern of the serialization format itself.

Jena:
RDF4J:
Serialized representation size
Serializing 100k nanopubs from a series of Jena DatasetGraph / RDF4J Model objects to a byte-counting stream. Less is better.
Streaming over the network – Kafka
One producer sending 10k nanopubs over Kafka (1 RDF dataset = 1 Kafka message) to one consumer. All software is on the same host (unlimited bandwidth). I have this benchmark only for Jena, but the results should be very similar for RDF4J.
Higher is better.
Same, with producer network bandwidth limited to 100 Mbit/s, with a 10 ms one-way latency. The consumer had unlimited network to the broker.
These results aren't that great, but it's possible to get more with gRPC. Also, you could use better compression than gzip.
Streaming over the network – gRPC
Same case as with Kafka, but using a direct point-to-point gRPC connection. There are only results for Jelly here: gRPC is more or less directly integrated with Protocol Buffers, so it would be pretty hard to plug the other formats into gRPC.
Unlimited network:
100 Mbit/s, 10 ms one-way latency:
My involvement
I would be glad to help with implementing this. I have already implemented Jelly in a few Jena/RDF4J apps, so this shouldn't be too hard.
If you want me to test Jelly with a specific network environment (bandwidth, latency), streaming protocol, or compression settings, let me know. Currently the tests only use gzip, but it's a weak compression algorithm by modern standards. I think the best combination would be zstd + gRPC.
So...
What do you think?