Fix mixed-up task descriptions #2

Merged 1 commit on May 10, 2024
23 changes: 10 additions & 13 deletions tasks/flat-deserialization-throughput/index.md
@@ -8,27 +8,24 @@ Flat distributions of any dataset in the `flat` category of RiverBench may be us

### Workload

-In this task, an RDF store is set up (for example, [Apache Jena TDB2](https://jena.apache.org/documentation/tdb2/index.html) or Virtuoso) and then loaded with a flat dump of RDF statements (triples or quads).
+The task consists of deserializing serialized RDF data stored in memory to a flat sequence of RDF statements. To isolate the performance of the deserializer itself, the following steps are taken:

-- When comparing multiple RDF stores, identical input data (serialized in the same format) should be used for all stores.
-- The benchmark includes the time taken to deserialize the input data and insert the resulting RDF statements into the store, considering the entire process and the impact of the underlying I/O.
-- Input data may be either batched or streamed, depending on the capabilities of the RDF store being tested and the specific research questions being addressed.
-- When using batched input, the input dataset is split into chunks, and each chunk is processed by the system separately. The metrics are typically calculated per chunk.
-- When using streamed input, the input dataset is processed as a continuous stream of statements. The metrics are typically calculated in regular time intervals or after processing a certain number of statements.
+- The data (serialized RDF statements in the tested serialization format) is preloaded into memory before the benchmark starts.
+- The deserialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time code recompilation, and other external factors.
+- The deserialized statements are not inserted into any data structure or database, but just temporarily stored in memory and immediately discarded. This is to avoid the overhead of maintaining additional data structures.
+- See the [`flat-rdf-store-loading`](../flat-rdf-store-loading/index.md) task for a benchmark that measures the time taken to load the deserialized data into a database.

### Metrics

-- Time taken to deserialize the input data and insert the resulting RDF statements into the store. From this measurement, the insertion throughput (in statements per second) can be calculated.
-- Memory usage during and after the loading process.
-- Storage space used by the RDF store during and after the loading process.
-- Total CPU time used during the loading process.
+The primary metric is the throughput of the deserialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements deserialized divided by the total time taken to deserialize them.


## Results

There are no results with RiverBench available for this task yet.

## Examples and references

-- Such a benchmark was performed in a paper comparing several RDF stores on IoT devices (Section 8). There, the authors measured the time taken to load the data and the memory usage.
-- Le-Tuan, A., Hayes, C., Hauswirth, M., & Le-Phuoc, D. (2020). Pushing the Scalability of RDF Engines on IoT Edge Devices. Sensors, 20(10), 2788.
-- https://doi.org/10.3390/s20102788
+- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Decompression time" and measures the time taken to deserialize the entire sequence of triples.
+- Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II 13 (pp. 244-259). Springer International Publishing.
+- https://doi.org/10.1007/978-3-319-11915-1_16
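
Note on the corrected `flat-deserialization-throughput` description: the workload and metric it describes could be sketched roughly as below. This is only an illustration under stated assumptions, not RiverBench code. `parse_statements` is a hypothetical toy stand-in for a real deserializer (it just splits N-Triples-like lines), and `measure_throughput`, `repetitions`, and `warmup` are names invented for this sketch.

```python
import time


def parse_statements(data: bytes):
    """Hypothetical toy deserializer: yields one tuple per non-empty, non-comment line."""
    for line in data.splitlines():
        line = line.strip()
        if line and not line.startswith(b"#"):
            # Split into subject, predicate, and the remainder (object plus ' .').
            yield tuple(line.split(b" ", 2))


def measure_throughput(data: bytes, repetitions: int = 5, warmup: int = 2) -> float:
    """Return deserialized statements per second over the timed repetitions.

    `data` is assumed to be fully preloaded in memory before this call.
    """
    # Untimed warm-up runs, so one-off setup costs do not skew the result.
    for _ in range(warmup):
        for _statement in parse_statements(data):
            pass

    total_statements = 0
    start = time.perf_counter()
    for _ in range(repetitions):
        for _statement in parse_statements(data):
            # The statement is counted and then discarded, never stored.
            total_statements += 1
    elapsed = time.perf_counter() - start

    # Throughput = total statements deserialized / total time taken.
    return total_statements / elapsed if elapsed > 0 else float("inf")
```

The timed loop touches no collection or store, which mirrors the requirement that deserialized statements are discarded immediately.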
22 changes: 13 additions & 9 deletions tasks/flat-rdf-store-loading/index.md
@@ -8,23 +8,27 @@ Flat distributions of any dataset in the `flat` category of RiverBench may be us

### Workload

-The task consists of deserializing serialized RDF data stored in memory to a flat sequence of RDF statements. To isolate the performance of the deserializer itself, the following steps are taken:
+In this task, an RDF store is set up (for example, [Apache Jena TDB2](https://jena.apache.org/documentation/tdb2/index.html) or Virtuoso) and then loaded with a flat dump of RDF statements (triples or quads).

-- The data (serialized RDF statements in the tested serialization format) is preloaded into memory before the benchmark starts.
-- The deserialization process is repeated multiple times to amortize the cost of the initial setup, just-in-time code recompilation, and other external factors.
-- The deserialized statements are not inserted into any data structure or database, but just temporarily stored in memory and immediately discarded. This is to avoid the overhead of maintaining additional data structures.
-- See the [`flat-rdf-store-loading`](../flat-rdf-store-loading/index.md) task for a benchmark that measures the time taken to load the deserialized data into a database.
+- When comparing multiple RDF stores, identical input data (serialized in the same format) should be used for all stores.
+- The benchmark includes the time taken to deserialize the input data and insert the resulting RDF statements into the store, considering the entire process and the impact of the underlying I/O.
+- Input data may be either batched or streamed, depending on the capabilities of the RDF store being tested and the specific research questions being addressed.
+- When using batched input, the input dataset is split into chunks, and each chunk is processed by the system separately. The metrics are typically calculated per chunk.
+- When using streamed input, the input dataset is processed as a continuous stream of statements. The metrics are typically calculated in regular time intervals or after processing a certain number of statements.

### Metrics

-The primary metric is the throughput of the deserialization process, measured in RDF statements (triples or quads) per second. This is calculated as the total number of RDF statements deserialized divided by the total time taken to deserialize them.
+- Time taken to deserialize the input data and insert the resulting RDF statements into the store. From this measurement, the insertion throughput (in statements per second) can be calculated.
+- Memory usage during and after the loading process.
+- Storage space used by the RDF store during and after the loading process.
+- Total CPU time used during the loading process.

## Results

There are no results with RiverBench available for this task yet.

## Examples and references

-- In the paper about the ERI format, a similar benchmark can be found in Section 4.3. The corresponding task in the paper is named "Decompression time" and measures the time taken to deserialize the entire sequence of triples.
-- Fernández, J. D., Llaves, A., & Corcho, O. (2014). Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II 13 (pp. 244-259). Springer International Publishing.
-- https://doi.org/10.1007/978-3-319-11915-1_16
+- Such a benchmark was performed in a paper comparing several RDF stores on IoT devices (Section 8). There, the authors measured the time taken to load the data and the memory usage.
+- Le-Tuan, A., Hayes, C., Hauswirth, M., & Le-Phuoc, D. (2020). Pushing the Scalability of RDF Engines on IoT Edge Devices. Sensors, 20(10), 2788.
+- https://doi.org/10.3390/s20102788
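
Note on the corrected `flat-rdf-store-loading` description: the batched variant of the workload, with metrics calculated per chunk, might look something like the sketch below. It rests on several assumptions: a plain Python `set` stands in for the RDF store (a real run would target something like Jena TDB2 or Virtuoso), parsing is a toy line split, and `ChunkMetrics`, `chunked`, and `load_in_batches` are hypothetical names. Memory and storage usage are not measured here.

```python
import time
from dataclasses import dataclass


@dataclass
class ChunkMetrics:
    statements: int      # statements loaded from this chunk
    wall_seconds: float  # elapsed wall-clock time for the chunk
    cpu_seconds: float   # process CPU time for the chunk


def chunked(lines, chunk_size):
    """Split the input into consecutive chunks of at most `chunk_size` lines."""
    for i in range(0, len(lines), chunk_size):
        yield lines[i:i + chunk_size]


def load_in_batches(lines, store, chunk_size):
    """Deserialize and insert each chunk, recording metrics per chunk.

    `store` is assumed to be a set acting as a stand-in for an RDF store.
    """
    metrics = []
    for chunk in chunked(lines, chunk_size):
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        # Both deserialization and insertion fall inside the timed region.
        statements = [tuple(line.split(" ", 2)) for line in chunk]
        store.update(statements)
        metrics.append(
            ChunkMetrics(
                statements=len(statements),
                wall_seconds=time.perf_counter() - wall_start,
                cpu_seconds=time.process_time() - cpu_start,
            )
        )
    return metrics
```

Per-chunk insertion throughput then follows as `m.statements / m.wall_seconds` for each recorded entry with a non-zero duration.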