From 2460466227fd753343d72e0b752f2691dfa516ab Mon Sep 17 00:00:00 2001
From: supermaxiste
Date: Tue, 25 Jun 2024 17:30:36 +0200
Subject: [PATCH] doc: Update Readme (#21)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Gabriel Nützi
---
 README.md | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 181 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index b488f47..f84e3b6 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,14 @@

 A simple Rust CLI tool to protect sensitive values in
 [RDF triples](https://en.wikipedia.org/wiki/Semantic_triple) through
-[pseudonymization](https://en.wikipedia.org/wiki/Pseudonymization).
+[pseudonymization](https://en.wikipedia.org/wiki/Pseudonymization). The goal is to offer a fast, secure and memory-efficient pseudonymization solution for any RDF graph.
+
+Note: the code is still in development and only the [N-Triples format](https://en.wikipedia.org/wiki/N-Triples) is supported as input.
+
+The tool works in two steps:
+
+ 1. Indexing to create a reference to all [rdf:type](https://www.w3.org/TR/rdf12-schema/#ch_type) instances in the graph
+ 2. Pseudonymization to encrypt or hash sensitive parts of any RDF triple in the graph via a human-readable configuration file and the previously generated index

 Table of Content

@@ -11,6 +18,8 @@ A simple Rust CLI tool to protect sensitive values in
 - [RDF Protect](#rdf-protect)
   - [Installation & Usage](#installation-usage)
     - [Usage](#usage)
+    - [Use Case](#use-case)
+    - [Example](#example)
   - [Development](#development)
     - [Requirements](#requirements)
     - [Nix](#nix)
@@ -23,11 +32,180 @@ A simple Rust CLI tool to protect sensitive values in

 ## Installation & Usage

-TODO
+The package must be compiled from source using [cargo](https://doc.rust-lang.org/cargo/getting-started/installation.html):
+
+```shell
+git clone https://github.com/sdsc-ordes/rdf-protect
+cd rdf-protect
+cargo build --release
+# executable binary located in ./target/release/rdf-protect
+```

 ### Usage

-TODO
+The general command-line interface outlines the two main steps of the tool, indexing and pseudonymization:
+
+```shell
+$ rdf-protect --help
+A tool to pseudonymize URIs and values in RDF graphs.
+
+Usage: rdf-protect
+
+Commands:
+  index   1. Pass: Create a node-to-type index from input triples
+  pseudo  2. Pass: Pseudonymize input triples
+  help    Print this message or the help of the given subcommand(s)
+
+Options:
+  -h, --help     Print help
+  -V, --version  Print version
+```
+
+Indexing only requires an RDF file as input:
+
+```shell
+$ rdf-protect index --help
+1. Pass: Create a node-to-type index from input triples
+
+Usage: rdf-protect index [OPTIONS] [INPUT]
+
+Arguments:
+  [INPUT]  File descriptor to read triples from. Defaults to `stdin` [default: -]
+
+Options:
+  -o, --output  Output file descriptor to for the node-to-type index [default: -]
+  -h, --help    Print help
+```
+
+Pseudonymization requires an RDF file, an index and a config as input:
+
+```shell
+$ rdf-protect pseudo --help
+2. Pass: Pseudonymize input triples
+
+Usage: rdf-protect pseudo [OPTIONS] --index --config [INPUT]
+
+Arguments:
+  [INPUT]  File descriptor to read input triples from. Defaults to `stdin` [default: -]
+
+Options:
+  -i, --index Index file produced by prepare-index.
Required for pseudonymization
+  -c, --config  The config file descriptor to use for defining RDF elements to pseudonymize. Format: yaml
+  -o, --output  Output file descriptor for pseudonymized triples. Defaults to `stdout` [default: -]
+  -h, --help    Print help
+```
+
+In both subcommands, the input defaults to stdin and the output to stdout, so `rdf-protect` can be piped both upstream and downstream (see next section).
+
+### Use Case
+
+The main idea behind `rdf-protect` is to integrate smoothly with other CLI tools up- and downstream via piping.
+Let us assume that we're running a SPARQL query on a large graph and we would like to pseudonymize some of the triples. This is what the flow should look like:
+
+```shell
+curl | rdf-protect pseudo -i index -c config.yaml > pseudo.nt
+```
+
+For this flow to stream data instead of loading everything into memory, we had to include an indexing step to make the streaming process consistent and easier to control. It is not as clean as having one command doing everything, but it simplifies code development.
+
+### Example
+
+There are three possible ways to pseudonymize RDF triples:
+
+1. Pseudonymize the URI of nodes with `rdf:type`.
+2. Pseudonymize values for specific subject-predicate combinations.
+3. Pseudonymize any value for a given predicate.
+
+By using all three ways together, we can take an RDF file with sensitive information:
+
+```ntriples
+ .
+ .
+ .
+ "my_account32" .
+ "secret-123" .
+ "Alice" .
+ .
+ "Bank" .
+```
+and pseudonymize the sensitive information, such as people's names and personal or secret information, while keeping the rest as is:
+
+```
+ .
+ .
+ .
+ "pp54r32" .
+ "asfnd223" .
+ "af321bbc" .
+ .
+ "Bank" .
+```
+
+The next subsections break down each of the three pseudonymization approaches to better understand how they operate.
+
+#### 1.
Pseudonymize the URI of nodes with `rdf:type`
+
+Given the following config:
+```yaml
+replace_uri_of_nodes_with_type:
+  - "http://xmlns.com/foaf/0.1/Person"
+```
+The goal is to pseudonymize all instances of `rdf:type` Person. The following input file:
+```
+ .
+```
+Would become:
+```
+ .
+```
+#### 2. Pseudonymize values for specific subject-predicate combinations
+
+Given the following config:
+
+```yaml
+replace_values_of_subject_predicate:
+  "http://xmlns.com/foaf/0.1/Person":
+    - "http://schema.org/name"
+```
+The goal is to pseudonymize names only when they are associated with a Person. The following input file:
+```
+ .
+ "Alice" .
+ .
+ "Bank" .
+```
+Would become:
+```
+ .
+ "af321bbc" .
+ .
+ "Bank" .
+```
+
+#### 3. Pseudonymize any value for a given predicate
+
+Given the following config:
+```yaml
+replace_value_of_predicate:
+  - "http://schema.org/name"
+```
+
+The goal is to pseudonymize any value associated with the name predicate.
+The following input file:
+```
+ .
+ "Alice" .
+ .
+ "Bank" .
+```
+Would become:
+```
+ .
+ "af321bbc" .
+ .
+ "38a3dd71" .
+```
+

 ## Development