From bd4409052c604363b38b75214293a0c7eb8aa01c Mon Sep 17 00:00:00 2001
From: Ostrzyciel
Date: Fri, 10 May 2024 10:47:42 +0200
Subject: [PATCH] Fix mixed-up task descriptions

---
 .../flat-deserialization-throughput/index.md | 23 ++++++++-----------
 tasks/flat-rdf-store-loading/index.md        | 22 ++++++++++--------
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/tasks/flat-deserialization-throughput/index.md b/tasks/flat-deserialization-throughput/index.md
index 4156bc1..261805a 100644
--- a/tasks/flat-deserialization-throughput/index.md
+++ b/tasks/flat-deserialization-throughput/index.md
@@ -8,20 +8,17 @@ Flat distributions of any dataset in the `flat` category of RiverBench may be us
 
 ### Workload
 
-In this task, an RDF store is set up (for example, [Apache Jena TDB2](https://jena.apache.org/documentation/tdb2/index.html) or Virtuoso) and then loaded with a flat dump of RDF statements (triples or quads).
+The task consists of deserializing serialized RDF data stored in memory to a flat sequence of RDF statements. To isolate the performance of the deserializer itself, the following steps are taken:
 
-- When comparing multiple RDF stores, identical input data (serialized in the same format) should be used for all stores.
-- The benchmark includes the time taken to deserialize the input data and insert the resulting RDF statements into the store, considering the entire process and the impact of the underlying I/O.
-- Input data may be either batched or streamed, depending on the capabilities of the RDF store being tested and the specific research questions being addressed.
-    - When using batched input, the input dataset is split into chunks, and each chunk is processed by the system separately. The metrics are typically calculated per chunk.
-    - When using streamed input, the input dataset is processed as a continuous stream of statements. The metrics are typically calculated in regular time intervals or after processing a certain number of statements.
+- The data (serialized RDF statements in the tested serialization format) is preloaded into memory before the benchmark starts.
+- The deserialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time code recompilation, and other external factors.
+- The deserialized statements are not inserted into any data structure or database, but just temporarily stored in memory and immediately discarded. This is to avoid the overhead of maintaining additional data structures.
+    - See the [`flat-rdf-store-loading`](../flat-rdf-store-loading/index.md) task for a benchmark that measures the time taken to load the deserialized data into a database.
 
 ### Metrics
 
-- Time taken to deserialize the input data and insert the resulting RDF statements into the store. From this measurement, the insertion throughput (in statements per second) can be calculated.
-- Memory usage during and after the loading process.
-- Storage space used by the RDF store during and after the loading process.
-- Total CPU time used during the loading process.
+The primary metric is the throughput of the deserialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements deserialized divided by the total time taken to deserialize them.
+
 
 ## Results
 
@@ -29,6 +26,6 @@ There are no results with RiverBench available for this task yet.
 
 ## Examples and references
 
-- Such a benchmark was performed in a paper comparing several RDF stores on IoT devices (Section 8). There, the authors measured the time taken to load the data and the memory usage.
-    - Le-Tuan, A., Hayes, C., Hauswirth, M., & Le-Phuoc, D. (2020). Pushing the Scalability of RDF Engines on IoT Edge Devices. Sensors, 20(10), 2788.
-    - https://doi.org/10.3390/s20102788
+- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Decompression time" and measures the time taken to deserialize the entire sequence of triples.
+    - Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II 13 (pp. 244-259). Springer International Publishing.
+    - https://doi.org/10.1007/978-3-319-11915-1_16
diff --git a/tasks/flat-rdf-store-loading/index.md b/tasks/flat-rdf-store-loading/index.md
index c068886..afb7989 100644
--- a/tasks/flat-rdf-store-loading/index.md
+++ b/tasks/flat-rdf-store-loading/index.md
@@ -8,16 +8,20 @@ Flat distributions of any dataset in the `flat` category of RiverBench may be us
 
 ### Workload
 
-The task consists of deserializing serialized RDF data stored in memory to a flat sequence of RDF statements. To isolate the performance of the deserializer itself, the following steps are taken:
+In this task, an RDF store is set up (for example, [Apache Jena TDB2](https://jena.apache.org/documentation/tdb2/index.html) or Virtuoso) and then loaded with a flat dump of RDF statements (triples or quads).
 
-- The data (serialized RDF statements in the tested serialization format) is preloaded into memory before the benchmark starts.
-- The deserialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time code recompilation, and other external factors.
-- The deserialized statements are not inserted into any data structure or database, but just temporarily stored in memory and immediately discarded. This is to avoid the overhead of maintaining additional data structures.
-    - See the [`flat-rdf-store-loading`](../flat-rdf-store-loading/index.md) task for a benchmark that measures the time taken to load the deserialized data into a database.
+- When comparing multiple RDF stores, identical input data (serialized in the same format) should be used for all stores.
+- The benchmark includes the time taken to deserialize the input data and insert the resulting RDF statements into the store, considering the entire process and the impact of the underlying I/O.
+- Input data may be either batched or streamed, depending on the capabilities of the RDF store being tested and the specific research questions being addressed.
+    - When using batched input, the input dataset is split into chunks, and each chunk is processed by the system separately. The metrics are typically calculated per chunk.
+    - When using streamed input, the input dataset is processed as a continuous stream of statements. The metrics are typically calculated in regular time intervals or after processing a certain number of statements.
 
 ### Metrics
 
-The primary metric is the throughput of the deserialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements deserialized divided by the total time taken to deserialize them.
+- Time taken to deserialize the input data and insert the resulting RDF statements into the store. From this measurement, the insertion throughput (in statements per second) can be calculated.
+- Memory usage during and after the loading process.
+- Storage space used by the RDF store during and after the loading process.
+- Total CPU time used during the loading process.
 
 ## Results
 
@@ -25,6 +29,6 @@ There are no results with RiverBench available for this task yet.
 
 ## Examples and references
 
-- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Decompression time" and measures the time taken to deserialize the entire sequence of triples.
-    - Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II 13 (pp. 244-259). Springer International Publishing.
-    - https://doi.org/10.1007/978-3-319-11915-1_16
+- Such a benchmark was performed in a paper comparing several RDF stores on IoT devices (Section 8). There, the authors measured the time taken to load the data and the memory usage.
+    - Le-Tuan, A., Hayes, C., Hauswirth, M., & Le-Phuoc, D. (2020). Pushing the Scalability of RDF Engines on IoT Edge Devices. Sensors, 20(10), 2788.
+    - https://doi.org/10.3390/s20102788
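
As an illustration of the flat-deserialization-throughput workload described in the patched documentation, a minimal sketch of such a benchmark harness is shown below (it is not part of the patch). It assumes Apache Jena RIOT as the parser under test and N-Triples as the input serialization; the class name, the fixed number of repetitions, and the counting sink are arbitrary choices made for the example, not something prescribed by the task definition.

```java
// Illustrative sketch only: measures flat RDF deserialization throughput.
// Assumes Apache Jena on the classpath and an N-Triples file as args[0].
import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.system.StreamRDFBase;

public class FlatDeserializationBenchmark {

    // Counts statements without storing them, so only the parser is measured,
    // as required by the task (no additional data structures are maintained).
    static final class CountingSink extends StreamRDFBase {
        long count = 0;

        @Override
        public void triple(Triple triple) {
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        // Preload the serialized data into memory before any timing starts.
        byte[] data = Files.readAllBytes(Path.of(args[0]));

        // Repeat the run several times to amortize setup and JIT compilation;
        // 10 repetitions is an arbitrary choice for this example.
        for (int run = 0; run < 10; run++) {
            CountingSink sink = new CountingSink();
            long start = System.nanoTime();
            RDFParser.create()
                    .source(new ByteArrayInputStream(data))
                    .lang(Lang.NTRIPLES)
                    .parse(sink);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("run %d: %.0f statements/s%n", run, sink.count / seconds);
        }
    }
}
```

The flat-rdf-store-loading task would instead direct the parsed statements into an RDF store (for example, a TDB2 dataset) and additionally record memory, storage, and CPU usage, as listed in its metrics.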