This repository includes code from a Swiss National Data Stream (NDS): LUCID. The general goal of the NDS initiative is to collect clinical data across five Swiss University Hospitals and share it with researchers. In the case of LUCID, research focuses on low-value care: services that provide little or no benefit to patients. If you're interested, check the official project page.
graphdb-loader is a pipeline for loading RDF data into a GraphDB instance, handling backups and failures gracefully.
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
The workflow assumes that:
- Nextflow is installed (>=22.10.1)
- Basic UNIX tools are installed: `xargs`, `sed`, `awk`, `curl`.
- A running GraphDB server is accessible through the network.
- A directory (`graphdb_dir`) is shared between the GraphDB server and the workflow (either it is a shared volume, or they run on the same host).
- Nextflow secrets `GRAPHDB_USER` and `GRAPHDB_PASS` are set to provide GraphDB authentication credentials.
See usage instructions for more information.
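The tool requirements above can be checked before launching the pipeline. The following is a minimal sketch, not part of the pipeline itself; the function name `missing_tools` is hypothetical, and the check only verifies that each command is on `PATH`, not its version.

```shell
#!/usr/bin/env bash
# Print the name of every listed tool that is not found on PATH.
# Hypothetical helper for illustration; not shipped with the pipeline.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools listed in the requirements above.
missing=$(missing_tools nextflow xargs sed awk curl)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
fi
```

The Nextflow version, the GraphDB connection and the secrets still need to be verified separately.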
- Check whether new RDF files are present in the source directory.
- Attempt to load the files into the target GraphDB instance.
- In case of failure, retry the operation up to 3 times.
- Back up successfully imported RDF files to the backup directory.
- Log all imports, their target graph, success status and the GraphDB server response.
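The retry step can be illustrated with a generic shell loop. This is a sketch only and says nothing about how the pipeline implements retries internally; `with_retries` is a hypothetical helper, and it assumes that "up to 3 times" means one initial attempt followed by at most 3 retries.

```shell
#!/usr/bin/env bash
# Run a command, retrying it when it exits with a non-zero status.
# Usage: with_retries MAX_RETRIES command [args...]
# Assumption: MAX_RETRIES counts retries after the first attempt.
with_retries() {
    local max_retries=$1
    shift
    local attempt=0
    until "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -gt "$max_retries" ]; then
            echo "giving up after $max_retries retries: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying: $*" >&2
    done
}
```

For example, `with_retries 3 some_import_command file.ttl` would run the (placeholder) command at most four times and return non-zero only if every attempt failed.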
```mermaid
flowchart TD
    p0((Channel.fromPath))
    p1(( ))
    p2[copy_to_graphdb]
    p3(( ))
    p4(( ))
    p5[graphdb_auth]
    p6[graphdb_import_file]
    p7([map])
    p8([filter])
    p9([map])
    p10(( ))
    p11[copy_to_backup]
    p12(( ))
    p13([map])
    p14([collectFile])
    p15([view])
    p16(( ))
    p0 -->|ch_input_files| p2
    p1 -->|graphdb_dir| p2
    p2 --> p6
    p3 -->|username| p5
    p4 -->|password| p5
    p5 -->|-| p6
    p6 --> p7
    p7 -->|ch_log| p8
    p8 --> p9
    p9 -->|ch_imported_files| p11
    p10 -->|backup_dir| p11
    p11 --> p12
    p7 -->|ch_log| p13
    p13 --> p14
    p14 --> p15
    p15 --> p16
```
See usage docs for all of the available options when running the pipeline.
- Install Nextflow (`>=22.10.1`).

- Install Podman for full pipeline reproducibility (tested on 4.9.4).

- Download the pipeline and test it on a minimal dataset with a single command:

  `nextflow run sdsc-ordes/nds-lucid-graphdb-loader -profile test --outdir <OUTDIR>`

  Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile, passed with `-profile` (`test` in the example command above). You can chain multiple config profiles in a comma-separated string.

- Start running your own analysis!

  `nextflow run sdsc-ordes/nds-lucid-graphdb-loader --input_dir /data/source --graphdb_url localhost:7200 --graphdb_dir /data/graphdb-import --graphdb_repo repo --backup_dir /data/backup`
The pipeline imports files into named graphs. The named graph URI is `${base_uri}/${graph}`, where:
- `base_uri` is set to a default value for LUCID but can be set as a parameter.
- `graph` is extracted from the input filename, assuming the pattern `*_{graph}.{ext}`.

We recommend that input files start with a timestamp to ensure unique backup filenames, e.g. `{timestamp}_{graph}.{ext}`.
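The naming rule can be sketched in shell as follows. This is an illustration of the `*_{graph}.{ext}` pattern rather than the pipeline's own code: the function names are hypothetical, `https://example.org/graphs` is a placeholder and not the LUCID default `base_uri`, and the sketch assumes the graph name itself contains no underscore or dot.

```shell
#!/usr/bin/env bash
# Extract {graph} from a filename following the pattern *_{graph}.{ext}.
# Assumption: {graph} contains neither "_" nor ".".
graph_from_filename() {
    local name
    name=$(basename "$1")
    name=${name##*_}              # drop everything up to the last underscore
    printf '%s\n' "${name%%.*}"   # drop everything from the first dot onward
}

# Build the named graph URI as ${base_uri}/${graph}.
named_graph_uri() {
    local base_uri="$1"
    local file="$2"
    # Strip one trailing slash from base_uri to avoid a double slash.
    printf '%s/%s\n' "${base_uri%/}" "$(graph_from_filename "$file")"
}

# Example with a placeholder base URI and a timestamped input file.
named_graph_uri "https://example.org/graphs" "/data/source/20240101T120000_patients.ttl"
```

With these assumptions, the example call prints `https://example.org/graphs/patients`.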
graphdb-loader was originally written by Stefan Milosavljevic and Cyril Matthey-Doret.
If you would like to contribute to this pipeline, please see the contributing guidelines.
This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.