Skip to content

Commit

Permalink
Merge branch 'dev' into nf-core-template-merge-2.9
Browse files Browse the repository at this point in the history
  • Loading branch information
maxulysse committed Sep 14, 2023
2 parents 259c1c4 + c85b8a0 commit 3248729
Show file tree
Hide file tree
Showing 125 changed files with 6,545 additions and 199 deletions.
22 changes: 16 additions & 6 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,12 @@ jobs:
NXF_VER:
- "23.04.0"
- "latest-everything"
test:
- "default"
- "annotation"
- "removeduplicates"
- "skipbasecalib"
- "bamcsiindex"
steps:
- name: Check out pipeline code
uses: actions/checkout@v3
Expand All @@ -35,9 +41,13 @@ jobs:
with:
version: "${{ matrix.NXF_VER }}"

- name: Run pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: "3.x"

- name: Install dependencies
run: python -m pip install --upgrade pip pytest-workflow

- name: Run pipeline with tests settings
run: pytest --tag ${{matrix.test}} --kwdof
25 changes: 19 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,14 +3,27 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0dev - [date]
## [1.1.0dev] nfcore/rnavar

Initial release of nf-core/rnavar, created with the [nf-core](https://nf-co.re/) template.
New version with additional features and bug fixes.

### `Added`
## [1.0.0] nfcore/rnavar - 2022/06/20

First production release of the pipeline with latest software versions.

### `Fixed`
This version is based on GATK4 best-practices for RNAseq [Ref](https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels) and it includes:

### `Added`

### `Dependencies`
- Added `FastQC v0.11.9` from nf-core modules for read-level QC and summary.
- Added `STAR v2.7.9a` from nf-core modules for read alignment to reference genome.
- Added `Samtools v1.15.1` from nf-core modules for alignment statistics and QC.
- Added `GATK v4.2.6.1` from nf-core modules for alignment post-processing, variant calling and filtration.
- Added `Tabix v1.11` from nf-core modules for indexing BAM ann VCF files.
- Added `SnpEff v5.0` from nf-core modules for variant annotation.
- Added `Ensembl VEP v104.3` from nf-core modules for variant annotation.
- Added `MultiQC v1.12` from nf-core modules for QC summary report.
- Added Scatter i.e., one interval-list into many interval-files to run multiple processes in parallel.

### `Deprecated`
Thanks to everyone that contributed to this release.
Special thanks to @maxulysse and @FriederikeHanssen for your review and valuable suggestions.
28 changes: 28 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,10 +14,38 @@

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
- [STAR](https://pubmed.ncbi.nlm.nih.gov/23104886/)

> Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PubMed PMID: 23104886; PubMed Central PMCID: PMC3530905.
- [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
- [GATK](https://pubmed.ncbi.nlm.nih.gov/20644199/)

> McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508.
- [snpEff](https://pubmed.ncbi.nlm.nih.gov/22728672/)

> Cingolani P, Platts A, Wang le L, et al.: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). Apr-Jun 2012;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.
- [VEP](https://pubmed.ncbi.nlm.nih.gov/27268795/)

> McLaren W, Gil L, Hunt SE, et al.: The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4. PubMed PMID: 27268795; PubMed Central PMCID: PMC4893825.
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- [Tabix](https://pubmed.ncbi.nlm.nih.gov/21208982/)

> Heng Li, Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, Volume 27, Issue 5, 1 March 2011, Pages 718–719. doi: 10.1093/bioinformatics/btq671. PubMed PMID: 21208982; PubMed Central PMCID: PMC3042176.
- [R](https://www.R-project.org/)

> R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
Expand Down
77 changes: 46 additions & 31 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_light.png#gh-light-mode-only) ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_dark.png#gh-dark-mode-only)
# ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_light.png#gh-light-mode-only) ![nf-core-rnavar](docs/images/nf-core/rnavar_logo_dark.png#gh-dark-mode-only)

[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/rnavar/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![GitHub Actions CI Status](https://github.com/nf-core/rnavar/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/rnavar/actions?query=workflow%3A%22nf-core+CI%22)
[![GitHub Actions Linting Status](https://github.com/nf-core/rnavar/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/rnavar/actions?query=workflow%3A%22nf-core+linting%22)
[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?logo=Amazon%20AWS)](https://nf-co.re/rnavar/results)
[![Cite with Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.6669637.svg)](https://doi.org/10.5281/zenodo.6669637)

[![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
[![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
Expand All @@ -12,20 +15,38 @@

## Introduction

**nf-core/rnavar** is a bioinformatics pipeline that ...

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
**nf-core/rnavar** is a bioinformatics pipeline for RNA variant calling analysis following GATK4 best practices.

## Pipeline summary

1. Merge re-sequenced FastQ files ([`cat`](http://www.linfo.org/cat.html))
2. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
3. Align reads to reference genome ([`STAR`](https://github.com/alexdobin/STAR))
4. Sort and index alignments ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
5. Duplicate read marking ([`GATK4 MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard))
6. Splits reads that contain Ns in their cigar string ([`GATK4 SplitNCigarReads`](https://gatk.broadinstitute.org/hc/en-us/articles/4409917482651-SplitNCigarReads))
7. Estimate and correct systematic bias using base quality score recalibration ([`GATK4 BaseRecalibrator`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897206043-BaseRecalibrator), [`GATK4 ApplyBQSR`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897168667-ApplyBQSR))
8. Convert a BED file to a Picard Interval List ([`GATK4 BedToIntervalList`](https://gatk.broadinstitute.org/hc/en-us/articles/4409924780827-BedToIntervalList-Picard-))
9. Scatter one interval-list into many interval-files ([`GATK4 IntervalListTools`](https://gatk.broadinstitute.org/hc/en-us/articles/4409917392155-IntervalListTools-Picard-))
10. Call SNPs and indels ([`GATK4 HaplotypeCaller`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897180827-HaplotypeCaller))
11. Merge multiple VCF files into one VCF ([`GATK4 MergeVCFs`](https://gatk.broadinstitute.org/hc/en-us/articles/4409924817691-MergeVcfs-Picard-))
12. Index the VCF ([`Tabix`](http://www.htslib.org/doc/tabix.html))
13. Filter variant calls based on certain criteria ([`GATK4 VariantFiltration`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897204763-VariantFiltration))
14. Annotate variants ([`snpEff`](https://pcingola.github.io/SnpEff/se_introduction/), [Ensembl VEP](https://www.ensembl.org/info/docs/tools/vep/index.html))
15. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

### Summary of tools and version used in the pipeline

| Tool | Version |
| ----------- | ------- |
| FastQC | 0.11.9 |
| STAR | 2.7.9a |
| Samtools | 1.15.1 |
| GATK | 4.2.6.1 |
| Tabix | 1.11 |
| SnpEff | 5.0 |
| Ensembl VEP | 104.3 |
| MultiQC | 1.12 |

## Usage

Expand All @@ -52,14 +73,9 @@ Each row represents a fastq file (single-end) or a pair of fastq files (paired e

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/rnavar \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
```
```console
nextflow run nf-core/rnavar -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38
```

> **Warning:**
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
Expand All @@ -76,11 +92,13 @@ For more details about the output files and reports, please refer to the

## Credits

nf-core/rnavar was originally written by @praveenraj2018.
These scripts were originally written in Nextflow DSL2 for use at the [Barntumörbanken, Karolinska Institutet](https://ki.se/forskning/barntumorbanken), by Praveen Raj ([@praveenraj2018](https://github.com/praveenraj2018)) and Maxime U Garcia ([@maxulysse](https://github.com/maxulysse)).

We thank the following people for their extensive assistance in the development of this pipeline:
The pipeline is primarily maintained by Praveen Raj ([@praveenraj2018](https://github.com/praveenraj2018)) from [Barntumörbanken, Karolinska Institutet](https://ki.se/forskning/barntumorbanken) and Maxime U Garcia ([@maxulysse](https://github.com/maxulysse)) from [Seqera Labs](https://seqera/io)

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
Many thanks to other who have helped out along the way too, including (but not limited to):
[@ewels](https://github.com/ewels),
[@drpatelh](https://github.com/drpatelh).

## Contributions and Support

Expand All @@ -90,10 +108,7 @@ For further information or help, don't hesitate to get in touch on the [Slack `#

## Citations

<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
<!-- If you use nf-core/rnavar for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
If you use nf-core/rnavar for your analysis, please cite it using the following doi: [10.5281/zenodo.6669637](https://doi.org/10.5281/zenodo.6669637)

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

Expand Down
52 changes: 52 additions & 0 deletions assets/multiqc_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,55 @@ report_section_order:
order: -1002

export_plots: true

# Run only these modules
run_modules:
- custom_content
- fastqc
- star
- samtools
- picard
- gatk
- snpeff
- vep

# Order of modules
module_order:
- fastqc:
name: "FastQC (raw)"
path_filters:
- "*_val_*.zip"
- star:
name: "Read Alignment (STAR)"
- samtools:
name: "Samtools Flagstat"
- picard:
name: "GATK4 MarkDuplicates"
info: "Metrics generated either by GATK4 MarkDuplicates"
- qualimap:
name: "Qualimap"
- gatk:
name: "GATK4 BQSR"
- snpeff:
name: "SNPeff"
- vep:
name: "VEP"

extra_fn_clean_exts:
- "_val"

# Don't show % Dups in the General Stats table (we have this from Picard)
table_columns_visible:
fastqc:
percent_duplicates: False

sp:
samtools/stats:
fn: "*.aligned.bam.stats"
samtools/flagstat:
fn: "*.aligned.bam.flagstat"
picard/markdups:
fn: "*.markdup.sorted.metrics"
snpeff:
contents: "SnpEff_version"
max_filesize: 5000000
7 changes: 4 additions & 3 deletions assets/samplesheet.csv
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,fastq_1,fastq_2,strandedness
RAP1_UNINDUCED_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357073_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357074_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357075_1.fastq.gz,,reverse
7 changes: 6 additions & 1 deletion assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,13 @@
"maxLength": 0
}
]
},
"strandedness": {
"type": "string",
"errorMessage": "Strandedness must be provided and be one of 'forward', 'reverse' or 'unstranded'",
"enum": ["forward", "reverse", "unstranded"]
}
},
"required": ["sample", "fastq_1"]
"required": ["sample", "fastq_1", "strandedness"]
}
}
6 changes: 3 additions & 3 deletions bin/check_samplesheet.py
Original file line number Diff line number Diff line change
Expand Up @@ -177,15 +177,15 @@ def check_samplesheet(file_in, file_out):
Example:
This function checks that the samplesheet follows the following structure,
see also the `viral recon samplesheet`_::
see also the `rnavar test samplesheet`_::
sample,fastq_1,fastq_2
SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz
SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz
SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,
.. _viral recon samplesheet:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
.. _rnavar test samplesheet:
https://raw.githubusercontent.com/nf-core/test-datasets/rnavar/samplesheet/v1.0/samplesheet.csv
"""
required_columns = {"sample", "fastq_1", "fastq_2"}
Expand Down
2 changes: 0 additions & 2 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@

process {

// TODO nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
Expand All @@ -24,7 +23,6 @@ process {
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
Expand Down
Loading

0 comments on commit 3248729

Please sign in to comment.