diff --git a/tasks/flat-compression/index.md b/tasks/flat-compression/index.md
new file mode 100644
index 0000000..c46e243
--- /dev/null
+++ b/tasks/flat-compression/index.md
@@ -0,0 +1,35 @@
+A benchmark task measuring the compression efficiency of flat RDF serializations.
+
+## Methodology
+
+### Data
+
+Flat distributions of any dataset in the `flat` category of RiverBench may be used for this task.
+
+### Workload
+
+The task consists of serializing RDF data to bytes and measuring the size of the obtained representation.
+
+In this task, the time taken to serialize and deserialize the data is not considered – see the [`flat-serialization-throughput`](../flat-serialization-throughput/index.md) and [`flat-deserialization-throughput`](../flat-deserialization-throughput/index.md) tasks for that aspect.
+
+### Metrics
+
+- The primary metric is the serialized representation size of the RDF data, in bytes.
+- Additionally, the compression ratio can be calculated as the ratio of the reference size to the compressed size, as illustrated in the sketch below. The reference size is the size of the same data serialized using a baseline method, e.g., the N-Triples serialization format.
+    - In the RDF literature, the "compression ratio" is often defined as the inverse of the above definition and expressed as a percentage. For example, a compression ratio of 50% means that the compressed data is half the size of the reference data.
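+
+To make the two definitions above concrete, below is a minimal sketch using Apache Jena. The library choice, the input file name, and the use of Turtle as the tested format are illustrative assumptions, not part of the task definition.
+
+```java
+import org.apache.jena.rdf.model.Model;
+import org.apache.jena.riot.Lang;
+import org.apache.jena.riot.RDFDataMgr;
+
+import java.io.ByteArrayOutputStream;
+
+public class CompressionRatioSketch {
+    public static void main(String[] args) {
+        // Load a flat distribution (here assumed to be an N-Triples dump).
+        Model model = RDFDataMgr.loadModel("dataset.nt");
+
+        long referenceSize = serializedSize(model, Lang.NTRIPLES); // baseline
+        long compressedSize = serializedSize(model, Lang.TURTLE);  // tested format
+
+        // Ratio as defined in this task: reference size divided by compressed size.
+        double ratio = (double) referenceSize / compressedSize;
+        // Inverse definition common in the RDF literature, expressed as a percentage.
+        double ratioPercent = 100.0 * compressedSize / referenceSize;
+
+        System.out.printf("reference=%d B, compressed=%d B, ratio=%.2f (%.1f%%)%n",
+                referenceSize, compressedSize, ratio, ratioPercent);
+    }
+
+    // Serialize the model in the given format and return the output size in bytes.
+    static long serializedSize(Model model, Lang lang) {
+        ByteArrayOutputStream out = new ByteArrayOutputStream();
+        RDFDataMgr.write(out, model, lang);
+        return out.size();
+    }
+}
+```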
+
+## Results
+
+There are no results with RiverBench available for this task yet.
+
+## Examples and references
+
+- In a paper about an RDF compression method using MapReduce, a compression benchmark was performed in Section 5.1. The authors measured the output size of their method (in gigabytes) in comparison to the input data size.
+    - Urbani, J., Maassen, J., Drost, N., Seinstra, F., & Bal, H. (2013). Scalable RDF data compression with MapReduce. Concurrency and Computation: Practice and Experience, 25(1), 24-39.
+    - https://doi.org/10.1002/cpe.2840
+- In the paper about the HDT binary format, such a benchmark was performed in Section 5. The "Compression Ratio" metric there refers to the ratio between the compressed data size and the reference data size, with N-Triples used as the reference.
+    - Fernández, J. D., Martínez-Prieto, M. A., Gutiérrez, C., Polleres, A., & Arias, M. (2013). Binary RDF representation for publication and exchange (HDT). Journal of Web Semantics, 19, 22-41.
+    - https://doi.org/10.1016/j.websem.2013.01.002
+- In the paper about the ERI format, such a benchmark can be found in Section 4.2. The "Compression Ratio" metric there refers to the ratio between the compressed data size and the reference data size, with N-Triples used as the reference.
+    - Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II (pp. 244-259). Springer International Publishing.
+    - https://doi.org/10.1007/978-3-319-11915-1_16
diff --git a/tasks/flat-compression/metadata.ttl b/tasks/flat-compression/metadata.ttl
new file mode 100644
index 0000000..15c9849
--- /dev/null
+++ b/tasks/flat-compression/metadata.ttl
@@ -0,0 +1,22 @@
+@prefix : .
+@prefix dcterms: .
+@prefix foaf: .
+@prefix rb: .
+@prefix rbdoc: .
+
+:task
+    # General information
+    a rb:Task ;
+    dcterms:conformsTo ;
+    dcterms:identifier "flat-compression" ;
+    dcterms:title "RDF compression"@en ;
+    dcterms:description "A benchmark task measuring the compression efficiency of flat RDF serializations."@en ;
+
+    # Authors
+    dcterms:creator [
+        foaf:name "Piotr Sowiński" ;
+        foaf:nick "Ostrzyciel" ;
+        foaf:homepage , ;
+        rbdoc:hasDocWeight 1 ;
+    ]
+.
diff --git a/tasks/flat-deserialization-throughput/index.md b/tasks/flat-deserialization-throughput/index.md
new file mode 100644
index 0000000..4156bc1
--- /dev/null
+++ b/tasks/flat-deserialization-throughput/index.md
@@ -0,0 +1,30 @@
+A benchmark task measuring the throughput (in statements per second) of deserializing a byte stream into a flat sequence of RDF triples or RDF quads.
+
+## Methodology
+
+### Data
+
+Flat distributions of any dataset in the `flat` category of RiverBench may be used for this task.
+
+### Workload
+
+The task consists of deserializing serialized RDF data stored in memory into a flat sequence of RDF statements. To isolate the performance of the deserializer itself, the following steps are taken:
+
+- The data (serialized RDF statements in the tested serialization format) is preloaded into memory before the benchmark starts.
+- The deserialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time compilation, and other external factors.
+- The deserialized statements are not inserted into any data structure or database, but only temporarily stored in memory and immediately discarded. This avoids the overhead of maintaining additional data structures.
+    - See the [`flat-rdf-store-loading`](../flat-rdf-store-loading/index.md) task for a benchmark that measures the time taken to load the deserialized data into a database.
+
+### Metrics
+
+The primary metric is the throughput of the deserialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements deserialized divided by the total time taken to deserialize them.
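+
+As an illustration, below is a minimal sketch of such a measurement using Apache Jena. The library choice, the input file name, and the N-Triples input format are illustrative assumptions; a real benchmark would also add warm-up iterations and repeated measurements.
+
+```java
+import org.apache.jena.graph.Triple;
+import org.apache.jena.riot.Lang;
+import org.apache.jena.riot.RDFParser;
+import org.apache.jena.riot.system.StreamRDFBase;
+
+import java.io.ByteArrayInputStream;
+import java.nio.file.Files;
+import java.nio.file.Path;
+
+public class DeserializationThroughputSketch {
+    // Sink that counts the parsed statements and immediately discards them.
+    static final class CountingSink extends StreamRDFBase {
+        long count = 0;
+        @Override public void triple(Triple triple) { count++; }
+    }
+
+    public static void main(String[] args) throws Exception {
+        // Preload the serialized data into memory before the measurement starts.
+        byte[] data = Files.readAllBytes(Path.of("dataset.nt"));
+
+        CountingSink sink = new CountingSink();
+        long start = System.nanoTime();
+        RDFParser.create()
+                .source(new ByteArrayInputStream(data))
+                .lang(Lang.NTRIPLES)
+                .parse(sink);
+        double seconds = (System.nanoTime() - start) / 1e9;
+
+        System.out.printf("%d statements in %.3f s = %.0f statements/s%n",
+                sink.count, seconds, sink.count / seconds);
+    }
+}
+```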
+
+## Results
+
+There are no results with RiverBench available for this task yet.
+
+## Examples and references
+
+- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Decompression time" and measures the time taken to deserialize the entire sequence of triples.
+    - Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II (pp. 244-259). Springer International Publishing.
+    - https://doi.org/10.1007/978-3-319-11915-1_16
diff --git a/tasks/flat-deserialization-throughput/metadata.ttl b/tasks/flat-deserialization-throughput/metadata.ttl
new file mode 100644
index 0000000..eb57148
--- /dev/null
+++ b/tasks/flat-deserialization-throughput/metadata.ttl
@@ -0,0 +1,22 @@
+@prefix : .
+@prefix dcterms: .
+@prefix foaf: .
+@prefix rb: .
+@prefix rbdoc: .
+
+:task
+    # General information
+    a rb:Task ;
+    dcterms:conformsTo ;
+    dcterms:identifier "flat-deserialization-throughput" ;
+    dcterms:title "Deserialization throughput"@en ;
+    dcterms:description "A benchmark task measuring the throughput (in statements per second) of deserializing a byte stream into a flat sequence of RDF triples or RDF quads."@en ;
+
+    # Authors
+    dcterms:creator [
+        foaf:name "Piotr Sowiński" ;
+        foaf:nick "Ostrzyciel" ;
+        foaf:homepage , ;
+        rbdoc:hasDocWeight 1 ;
+    ]
+.
diff --git a/tasks/flat-rdf-store-loading/index.md b/tasks/flat-rdf-store-loading/index.md
new file mode 100644
index 0000000..c068886
--- /dev/null
+++ b/tasks/flat-rdf-store-loading/index.md
@@ -0,0 +1,34 @@
+A benchmark task measuring the time taken and resources used by RDF stores when loading flat RDF data (triples or quads).
+
+## Methodology
+
+### Data
+
+Flat distributions of any dataset in the `flat` category of RiverBench may be used for this task.
+
+### Workload
+
+In this task, an RDF store is set up (for example, [Apache Jena TDB2](https://jena.apache.org/documentation/tdb2/index.html) or Virtuoso) and then loaded with a flat dump of RDF statements (triples or quads).
+
+- When comparing multiple RDF stores, identical input data (serialized in the same format) should be used for all stores.
+- The benchmark includes the time taken to deserialize the input data and insert the resulting RDF statements into the store, covering the entire process and the impact of the underlying I/O.
+- Input data may be either batched or streamed, depending on the capabilities of the RDF store being tested and the specific research questions being addressed.
+    - When using batched input, the input dataset is split into chunks, and each chunk is processed by the system separately. The metrics are typically calculated per chunk.
+    - When using streamed input, the input dataset is processed as a continuous stream of statements. The metrics are typically calculated at regular time intervals or after processing a certain number of statements.
+
+### Metrics
+
+- Time taken to deserialize the input data and insert the resulting RDF statements into the store. From this measurement, the insertion throughput (in statements per second) can be calculated, as in the sketch below.
+- Memory usage during and after the loading process.
+- Storage space used by the RDF store during and after the loading process.
+- Total CPU time used during the loading process.
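+
+Below is a minimal sketch of a load-time measurement using Apache Jena TDB2, which the workload above names as one possible store. The store location, input file name, and N-Quads format are illustrative assumptions; memory, storage, and CPU usage would be sampled by external tooling.
+
+```java
+import org.apache.jena.query.Dataset;
+import org.apache.jena.query.ReadWrite;
+import org.apache.jena.riot.RDFDataMgr;
+import org.apache.jena.tdb2.TDB2Factory;
+
+public class StoreLoadingSketch {
+    public static void main(String[] args) {
+        // Connect to (or create) a TDB2 store in the given directory.
+        Dataset dataset = TDB2Factory.connectDataset("target/tdb2-benchmark");
+
+        long start = System.nanoTime();
+        dataset.begin(ReadWrite.WRITE);
+        try {
+            // Parse the flat dump and insert all statements into the store in one transaction.
+            RDFDataMgr.read(dataset, "dataset.nq");
+            dataset.commit();
+        } finally {
+            dataset.end();
+        }
+        double seconds = (System.nanoTime() - start) / 1e9;
+
+        // Insertion throughput = number of loaded statements / seconds.
+        System.out.printf("Loaded the dump in %.1f s%n", seconds);
+    }
+}
+```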
+
+## Results
+
+There are no results with RiverBench available for this task yet.
+
+## Examples and references
+
+- Such a benchmark was performed in a paper comparing several RDF stores on IoT devices (Section 8). There, the authors measured the time taken to load the data and the memory usage.
+    - Le-Tuan, A., Hayes, C., Hauswirth, M., & Le-Phuoc, D. (2020). Pushing the Scalability of RDF Engines on IoT Edge Devices. Sensors, 20(10), 2788.
+    - https://doi.org/10.3390/s20102788
diff --git a/tasks/flat-rdf-store-loading/metadata.ttl b/tasks/flat-rdf-store-loading/metadata.ttl
new file mode 100644
index 0000000..05d38ae
--- /dev/null
+++ b/tasks/flat-rdf-store-loading/metadata.ttl
@@ -0,0 +1,22 @@
+@prefix : .
+@prefix dcterms: .
+@prefix foaf: .
+@prefix rb: .
+@prefix rbdoc: .
+
+:task
+    # General information
+    a rb:Task ;
+    dcterms:conformsTo ;
+    dcterms:identifier "flat-rdf-store-loading" ;
+    dcterms:title "Loading data into an RDF store"@en ;
+    dcterms:description "A benchmark task measuring the time taken and resources used by RDF stores when loading flat RDF data (triples or quads)."@en ;
+
+    # Authors
+    dcterms:creator [
+        foaf:name "Piotr Sowiński" ;
+        foaf:nick "Ostrzyciel" ;
+        foaf:homepage , ;
+        rbdoc:hasDocWeight 1 ;
+    ]
+.
diff --git a/tasks/flat-serialization-throughput/index.md b/tasks/flat-serialization-throughput/index.md
new file mode 100644
index 0000000..6e25f85
--- /dev/null
+++ b/tasks/flat-serialization-throughput/index.md
@@ -0,0 +1,31 @@
+A benchmark task measuring the throughput (in statements per second) of serializing a flat sequence of RDF triples or RDF quads.
+
+## Methodology
+
+### Data
+
+Flat distributions of any dataset in the `flat` category of RiverBench may be used for this task.
+
+### Workload
+
+The task consists of serializing RDF data stored in memory (as an array of RDF statements or similar) to a byte stream. To isolate the performance of the serializer itself, the following steps are taken:
+
+- The data (RDF statements) is preloaded into memory before the benchmark starts.
+- The RDF statements are stored in a data structure that is trivially iterable (e.g., an array).
+- The serialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time compilation, and other external factors.
+- The serialized data is typically not written to disk, but only temporarily stored in memory and immediately discarded. This avoids the overhead of disk I/O.
+    - Alternatively, benchmarks that aim to evaluate the performance of writing to disk, especially the impact of different disk usage patterns (e.g., sequential vs. random access), may write the serialized data to disk.
+
+### Metrics
+
+The primary metric is the throughput of the serialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements serialized divided by the total time taken to serialize them.
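+
+As an illustration, below is a minimal sketch of such a measurement using Apache Jena. The library choice, the input file name, and the N-Triples output format are illustrative assumptions; a real benchmark would also add warm-up iterations and repeated measurements.
+
+```java
+import org.apache.jena.rdf.model.Statement;
+import org.apache.jena.riot.Lang;
+import org.apache.jena.riot.RDFDataMgr;
+import org.apache.jena.riot.system.StreamRDF;
+import org.apache.jena.riot.system.StreamRDFWriter;
+
+import java.io.OutputStream;
+import java.util.List;
+
+public class SerializationThroughputSketch {
+    public static void main(String[] args) {
+        // Preload the statements into a trivially iterable structure (here: a list).
+        List<Statement> statements = RDFDataMgr.loadModel("dataset.nt").listStatements().toList();
+
+        // Serialize to a discarding stream so that disk I/O does not affect the measurement.
+        OutputStream sink = OutputStream.nullOutputStream();
+        long start = System.nanoTime();
+        StreamRDF writer = StreamRDFWriter.getWriterStream(sink, Lang.NTRIPLES);
+        writer.start();
+        statements.forEach(s -> writer.triple(s.asTriple()));
+        writer.finish();
+        double seconds = (System.nanoTime() - start) / 1e9;
+
+        System.out.printf("%d statements in %.3f s = %.0f statements/s%n",
+                statements.size(), seconds, statements.size() / seconds);
+    }
+}
+```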
+
+## Results
+
+There are no results with RiverBench available for this task yet.
+
+## Examples and references
+
+- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Compression time" and measures the time taken to serialize the entire sequence of triples.
+    - Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II (pp. 244-259). Springer International Publishing.
+    - https://doi.org/10.1007/978-3-319-11915-1_16
diff --git a/tasks/flat-serialization-throughput/metadata.ttl b/tasks/flat-serialization-throughput/metadata.ttl
new file mode 100644
index 0000000..771c132
--- /dev/null
+++ b/tasks/flat-serialization-throughput/metadata.ttl
@@ -0,0 +1,22 @@
+@prefix : .
+@prefix dcterms: .
+@prefix foaf: .
+@prefix rb: .
+@prefix rbdoc: .
+
+:task
+    # General information
+    a rb:Task ;
+    dcterms:conformsTo ;
+    dcterms:identifier "flat-serialization-throughput" ;
+    dcterms:title "Serialization throughput"@en ;
+    dcterms:description "A benchmark task measuring the throughput (in statements per second) of serializing a flat sequence of RDF triples or RDF quads."@en ;
+
+    # Authors
+    dcterms:creator [
+        foaf:name "Piotr Sowiński" ;
+        foaf:nick "Ostrzyciel" ;
+        foaf:homepage , ;
+        rbdoc:hasDocWeight 1 ;
+    ]
+.
diff --git a/tasks/metadata.ttl b/tasks/metadata.ttl
index ad79ff8..c2df9ee 100644
--- a/tasks/metadata.ttl
+++ b/tasks/metadata.ttl
@@ -26,4 +26,4 @@
     # Optional: use this property to order creators in the generated docs
     rbdoc:hasDocWeight 1 ;
   ]
-.
\ No newline at end of file
+.