Table of contents
- Name: exomiser-pipeline-nf
- Tools: exomiser
- Version: 12.1.0
It is a fully containerised nextflow pipeline that runs exomisers on either a single sample VCF file or a trio VCF file.
The Exomiser is a tool to perform genome-wide prioritisation of genomic variants including non-coding and regulatory variants using patient phenotypes as a means of differentiating candidate genes.
To perform an analysis, Exomiser requires the patient's genome/exome in VCF format and their phenotype encoded in HPO terms. The exomiser is also capable of analysing trios/small family genomes.
The main input of the pipeline (families_file
) is a TSV file and the main output of the pipeline is an HTML file containing pathogenicity score of the called variants.
This is a TSV file that contains the following info tab separated
run_id | proband_id | hpo | vcf_path | vcf_index_path | proband_sex | mother_id | father_id |
---|---|---|---|---|---|---|---|
The vcf_path column can contain the path to either a multiVCF(trio) or a single-sample VCF.
In the case of a single-sample VCF, the last 2 columns must contain nan
as a value. An example can be found here
In the hpo column, multiple comma-separated HPO terms can be present.
This is a file needed by exomiser to run. It contains information on where to find the reference data as well as the versioning of the reference genome. An example can be found here
This is a file needed by exomiser to run. It contains placeholders in the text that get filled in by the second process of the pipeline just before running exomiser. The one used for testing can be found here
This is a parameter that defines the kind of reference data. It accepts "test" or "full".
The "full" profile points to the reference data bundle needed by exomiser (~120 GB!). A copy of such files can be found here . The reference dataset has been added as a parameter, allowing flexibility to pull the data from any resource (i.e. cloud, local storage, ftp, ...) and Nextflow will automatically take care of fetching the data without having to add anything to the pipeline itself.
The "test" profile points to some mock data used in testing.
There are other parameters that can be tweaked to personalize the behaviour of the pipeline. These are referenced in nextflow.config
Here is the list of steps performed by this pipeline.
process ped_hpo_creation
- this process produces the pedigree (PED) file needed for exomiser to run using a python script.process exomiser
- this process is where the autoconfig file for exomiser is generated and exomiser is run.
- a html and a json file containing a report on the analysis
- the autoconfig file, for reproducibility purpose
- a vcf file with the called variants that are identified as causative
The pipeline can be run like:
nextflow run main.nf --families_file 's3://lifebit-featured-datasets/pipelines/exomiser-nf/fam_file.tsv' \
--prioritisers 'hiPhivePrioritiser' \
--exomiser_data 's3://lifebit-featured-datasets/pipelines/exomiser-data-bundle' \
--application_properties 's3://lifebit-featured-datasets/pipelines/exomiser-nf/application.properties' \
--auto_config_yml 's3://lifebit-featured-datasets/pipelines/exomiser-nf/auto_config.yml'
To run the pipeline with docker
(used by default), type the following commands:
To test the pipeline on a multi-VCF:
nextflow run main.nf -profile test_full_family
or
nextflow run main.nf -profile test_full_multi_hpo
To test the pipeline on a single-sample VCF:
nextflow run main.nf -profile test_full_single_vcf
Be careful when running this, as the pipeline requires the staging of 120 GB of reference data, required by exomiser, so only that takes a while!
profile name | Run locally | Run on CloudOS | description |
---|---|---|---|
test_full_family | the data required is so big, it was tested on a c5.4xlarge EC2 machine | Successful | this test is designed to test the pipeline on a multi-VCF with trio information |
test_full_single_vcf | the data required is so big, it was tested on a c5.4xlarge EC2 machine | Successful | this test is designed to test the pipeline on a single-sample-VCF |
test_full_multi_hpo | the data required is so big, it was tested on a c5.4xlarge EC2 machine | Successful | this test is designed to test the pipeline on a multi-VCF with trio information using multiple HPO terms |