Watch a folder to automatically ingest new data into GraphDB via its REST API

sdsc-ordes/nds-lucid-graphdb-loader


Context

This repository includes code from a Swiss National Data Stream (NDS): LUCID. The general goal of the NDS initiative is to collect clinical data across five Swiss University Hospitals and share it with researchers. In the case of LUCID, research focuses on low-value care: services that provide little or no benefit to patients. If you're interested, check the official project page.

Introduction

graphdb-loader is a pipeline for loading RDF data into a GraphDB instance, handling backups and failures gracefully.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Requirements

The workflow assumes that:

  • Nextflow is installed (>=22.10.1).
  • Basic UNIX tools are installed: xargs, sed, awk, curl.
  • A running GraphDB server is accessible over the network.
  • A directory (graphdb_dir) is shared between the GraphDB server and the workflow (either as a shared volume, or by running both on the same host).
  • The Nextflow secrets GRAPHDB_USER and GRAPHDB_PASS are set to provide GraphDB authentication credentials.

See usage instructions for more information.
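Assuming the Nextflow CLI is on your PATH, the two required secrets can be registered with Nextflow's built-in secrets manager. This is a configuration sketch; the credential values shown are placeholders:

    # Store GraphDB credentials in Nextflow's local secrets store
    # (placeholder values -- substitute your real credentials).
    nextflow secrets set GRAPHDB_USER admin
    nextflow secrets set GRAPHDB_PASS 'my-secret-password'

    # List the names of registered secrets to verify
    nextflow secrets list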

Pipeline summary

  1. Check whether new RDF files are present in the source directory.
  2. Attempt to load the files into the target GraphDB instance.
  3. In case of failure, retry the operation up to 3 times.
  4. Back up successfully imported RDF files to the backup directory.
  5. Log all imports, their target graph, success status, and the GraphDB server response.

```mermaid
flowchart TD
    p0((Channel.fromPath))
    p1(( ))
    p2[copy_to_graphdb]
    p3(( ))
    p4(( ))
    p5[graphdb_auth]
    p6[graphdb_import_file]
    p7([map])
    p8([filter])
    p9([map])
    p10(( ))
    p11[copy_to_backup]
    p12(( ))
    p13([map])
    p14([collectFile])
    p15([view])
    p16(( ))
    p0 -->|ch_input_files| p2
    p1 -->|graphdb_dir| p2
    p2 --> p6
    p3 -->|username| p5
    p4 -->|password| p5
    p5 -->|-| p6
    p6 --> p7
    p7 -->|ch_log| p8
    p8 --> p9
    p9 -->|ch_imported_files| p11
    p10 -->|backup_dir| p11
    p11 --> p12
    p7 -->|ch_log| p13
    p13 --> p14
    p14 --> p15
    p15 --> p16
```
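The retry behaviour in step 3 can be sketched as a small wrapper around the import call. This is an illustrative Python sketch, not the pipeline's actual implementation (which is a Nextflow process); `do_import` is a hypothetical stand-in for whatever performs the HTTP request against GraphDB:

```python
import time


def import_with_retry(do_import, max_attempts=3, delay=1.0):
    """Call do_import(), retrying up to max_attempts times on failure.

    do_import is any zero-argument callable that raises on failure
    and returns the server response on success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return do_import()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(delay)  # brief pause before retrying
```

A failed attempt is only fatal once all three attempts are exhausted, matching the pipeline's retry-up-to-3-times behaviour.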

Quick Start

See usage docs for all of the available options when running the pipeline.

  1. Install Nextflow (>=22.10.1)

  2. Install Podman for full pipeline reproducibility (tested on 4.9.4).

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run sdsc-ordes/nds-lucid-graphdb-loader -profile test --outdir <OUTDIR>

    Note that some form of configuration is needed so that Nextflow knows how to fetch the required software. This is usually done via a config profile (test in the example command above). You can chain multiple config profiles in a comma-separated string.

  4. Start running your own analysis!

    nextflow run sdsc-ordes/nds-lucid-graphdb-loader \
        --input_dir /data/source \
        --graphdb_url localhost:7200 \
        --graphdb_dir /data/graphdb-import \
        --graphdb_repo repo \
        --backup_dir /data/backup

Usage notes

The pipeline imports files into named graphs. The named graph URI is ${base_uri}/${graph} where:

  • base_uri is set to a default value for LUCID but can be set as a parameter.
  • graph is extracted from the input filename, assuming the pattern *_{graph}.{ext}.

We recommend that input filenames start with a timestamp to ensure unique backup filenames, e.g. {timestamp}_{graph}.{ext}.
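As an illustration, deriving the named-graph URI from such a filename could look like the following Python sketch. The helper name and the default base_uri are hypothetical; the pipeline's actual LUCID default may differ:

```python
def named_graph_uri(filename, base_uri="https://example.org/lucid"):
    """Derive the named-graph URI from a filename matching *_{graph}.{ext}.

    base_uri here is a placeholder; substitute the pipeline's real base_uri.
    """
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    graph = stem.rsplit("_", 1)[-1]    # take the part after the last "_"
    return f"{base_uri}/{graph}"
```

For example, a file named 20240101T120000_patients.ttl would be imported into the named graph ${base_uri}/patients.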

Credits

graphdb-loader was originally written by Stefan Milosavljevic and Cyril Matthey-Doret.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
