
Context

This repository includes code from a Swiss National Data Stream (NDS): LUCID. The general goal of the NDS initiative is to collect clinical data across five Swiss University Hospitals and share it with researchers. In the case of LUCID, research focuses on low-value care: services that provide little or no benefit to patients. If you're interested, check the official project page.

Introduction

graphdb-loader is a pipeline for loading RDF data into a GraphDB instance, handling backups and failures gracefully.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Requirements

The workflow assumes that:

  • Nextflow is installed (>=22.10.1)
  • Basic UNIX tools are installed: xargs, sed, awk, curl
  • A running GraphDB server is accessible through the network.
  • A directory (graphdb_dir) is shared between the GraphDB server and the workflow (either as a shared volume, or because they run on the same host).
  • Nextflow secrets GRAPHDB_USER and GRAPHDB_PASS are set to provide GraphDB authentication credentials (see the example below).

See usage instructions for more information.
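
The credentials can be stored with the Nextflow secrets CLI, for example (the values below are placeholders):

```bash
# Store GraphDB credentials as Nextflow secrets (placeholder values)
nextflow secrets set GRAPHDB_USER "admin"
nextflow secrets set GRAPHDB_PASS "change-me"
```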

Pipeline summary

  1. Check if new RDF files are present in the source directory.
  2. Attempt to load the files into the target GraphDB instance (see the example request below the diagram).
  3. In case of failure, retry the operation up to 3 times.
  4. Back up successfully imported RDF files to the backup directory.
  5. Log all imports, their target graph, success status, and the GraphDB server response.

```mermaid
flowchart TD
    p0((Channel.fromPath))
    p1(( ))
    p2[copy_to_graphdb]
    p3(( ))
    p4(( ))
    p5[graphdb_auth]
    p6[graphdb_import_file]
    p7([map])
    p8([filter])
    p9([map])
    p10(( ))
    p11[copy_to_backup]
    p12(( ))
    p13([map])
    p14([collectFile])
    p15([view])
    p16(( ))
    p0 -->|ch_input_files| p2
    p1 -->|graphdb_dir| p2
    p2 --> p6
    p3 -->|username| p5
    p4 -->|password| p5
    p5 -->|-| p6
    p6 --> p7
    p7 -->|ch_log| p8
    p8 --> p9
    p9 -->|ch_imported_files| p11
    p10 -->|backup_dir| p11
    p11 --> p12
    p7 -->|ch_log| p13
    p13 --> p14
    p14 --> p15
    p15 --> p16
```
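
As a rough illustration of what a single import amounts to, the request below loads one Turtle file into a named graph through GraphDB's RDF4J-compatible REST API. This is only a hedged sketch: the host, repository name, file name, and graph URI are example values, and the pipeline itself may use a different GraphDB endpoint (for instance server-side import from graphdb_dir).

```bash
# Sketch only: host, repository and graph URI are examples, not pipeline defaults.
# Imports one Turtle file into the named graph <https://example.org/lucid/patients>
# via GraphDB's RDF4J-compatible statements endpoint.
curl --fail -u "$GRAPHDB_USER:$GRAPHDB_PASS" \
  -X POST \
  -H "Content-Type: text/turtle" \
  --data-binary @20240101T120000_patients.ttl \
  "http://localhost:7200/repositories/repo/statements?context=%3Chttps://example.org/lucid/patients%3E"
```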

Quick Start

See usage docs for all of the available options when running the pipeline.

  1. Install Nextflow (>=22.10.1)

  2. Install Podman for full pipeline reproducibility (tested on 4.9.4).

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run sdsc-ordes/nds-lucid-graphdb-loader -profile test --outdir <OUTDIR>

    Note that some form of configuration is needed so that Nextflow knows how to fetch the required software. This is done via config profiles passed to the -profile option; the command above uses only the test profile, and you would typically add a container profile (for example -profile test,podman). Multiple config profiles can be chained in a comma-separated string.

  4. Start running your own analysis!

    nextflow run sdsc-ordes/nds-lucid-graphdb-loader --input_dir /data/source --graphdb_url localhost:7200 --graphdb_dir /data/graphdb-import --graphdb_repo repo --backup_dir /data/backup
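
The same options can also be supplied through a params file. The sketch below is equivalent to the command above; the paths and repository name are example values, and -params-file is standard Nextflow rather than anything specific to this pipeline.

```bash
# Equivalent invocation using a Nextflow params file (example values only)
cat > params.yaml <<'EOF'
input_dir: /data/source
graphdb_url: localhost:7200
graphdb_dir: /data/graphdb-import
graphdb_repo: repo
backup_dir: /data/backup
EOF
nextflow run sdsc-ordes/nds-lucid-graphdb-loader -params-file params.yaml
```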

Usage notes

The pipeline imports files into named graphs. The named graph URI is ${base_uri}/${graph} where:

  • base_uri is set to a default value for LUCID but can be set as a parameter.
  • graph is extracted from the input filename, assuming the pattern *_{graph}.{ext}.

We recommend that input files start with a timestamp to ensure unique backup filenames, e.g. {timestamp}_{graph}.{ext}. An illustration of the naming convention follows below.
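
For illustration only (the base URI below is an assumption, not the LUCID default), a filename maps to its named graph like this:

```bash
# Hypothetical example: assumes the pipeline is run with --base_uri https://example.org/lucid
file="20240101T120000_patients.ttl"                         # matches *_{graph}.{ext}
graph="$(echo "$file" | sed -E 's/^.*_([^.]+)\..*$/\1/')"   # -> patients
echo "https://example.org/lucid/${graph}"                   # -> resulting named graph URI
```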

Credits

graphdb-loader was originally written by Stefan Milosavljevic and Cyril Matthey-Doret.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.