Merge branch 'dev' into nf-core-template-merge-2.9

nf-core · Sep 14, 2023 · 3248729
2 parents 259c1c4 + c85b8a0
commit 3248729
Showing 125 changed files with 6,545 additions and 199 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -26,6 +26,12 @@ jobs:
         NXF_VER:
           - "23.04.0"
           - "latest-everything"
+        test:
+          - "default"
+          - "annotation"
+          - "removeduplicates"
+          - "skipbasecalib"
+          - "bamcsiindex"
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v3
@@ -35,9 +41,13 @@ jobs:
         with:
           version: "${{ matrix.NXF_VER }}"
 
-      - name: Run pipeline with test data
-        # TODO nf-core: You can customise CI pipeline run tests as required
-        # For example: adding multiple test runs with different parameters
-        # Remember that you can parallelise this by using strategy.matrix
-        run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: "3.x"
+
+      - name: Install dependencies
+        run: python -m pip install --upgrade pip pytest-workflow
+
+      - name: Run pipeline with tests settings
+        run: pytest --tag ${{matrix.test}} --kwdof
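
Review note: the new `test` axis multiplies with the existing `NXF_VER` axis, so each tag runs once per Nextflow version. The sketch below only enumerates the resulting jobs; it assumes the matrix has no axes beyond the two visible in this hunk, and the helper name `ci_jobs` is hypothetical.

```python
from itertools import product

# Values copied from the matrix in this hunk.
nxf_versions = ["23.04.0", "latest-everything"]
tests = ["default", "annotation", "removeduplicates", "skipbasecalib", "bamcsiindex"]


def ci_jobs(versions, tags):
    """Pair every Nextflow version with the pytest command for every tag."""
    return [(v, f"pytest --tag {t} --kwdof") for v, t in product(versions, tags)]


jobs = ci_jobs(nxf_versions, tests)
for version, command in jobs:
    print(f"NXF_VER={version}: {command}")
```

Under that assumption the workflow fans out into ten jobs, so a failing tag can be reproduced locally by running the same `pytest --tag <name> --kwdof` command.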
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,14 +3,27 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v1.1.0dev - [date]
+## [1.1.0dev] nfcore/rnavar
 
-Initial release of nf-core/rnavar, created with the [nf-core](https://nf-co.re/) template.
+New version with additional features and bug fixes.
 
-### `Added`
+## [1.0.0] nfcore/rnavar - 2022/06/20
+
+First production release of the pipeline with latest software versions.
 
-### `Fixed`
+This version is based on GATK4 best-practices for RNAseq [Ref](https://github.com/gatk-workflows/gatk4-rnaseq-germline-snps-indels) and it includes:
+
+### `Added`
 
-### `Dependencies`
+- Added `FastQC v0.11.9` from nf-core modules for read-level QC and summary.
+- Added `STAR v2.7.9a` from nf-core modules for read alignment to reference genome.
+- Added `Samtools v1.15.1` from nf-core modules for alignment statistics and QC.
+- Added `GATK v4.2.6.1` from nf-core modules for alignment post-processing, variant calling and filtration.
+- Added `Tabix v1.11` from nf-core modules for indexing BAM and VCF files.
+- Added `SnpEff v5.0` from nf-core modules for variant annotation.
+- Added `Ensembl VEP v104.3` from nf-core modules for variant annotation.
+- Added `MultiQC v1.12` from nf-core modules for QC summary report.
+- Added scattering of one interval-list into many interval-files to run multiple processes in parallel.
 
-### `Deprecated`
+Thanks to everyone that contributed to this release.
+Special thanks to @maxulysse and @FriederikeHanssen for your review and valuable suggestions.
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -14,10 +14,38 @@
 
   > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
 
+- [STAR](https://pubmed.ncbi.nlm.nih.gov/23104886/)
+
+  > Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635. Epub 2012 Oct 25. PubMed PMID: 23104886; PubMed Central PMCID: PMC3530905.
+
+- [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)
+
+  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.
+
+- [GATK](https://pubmed.ncbi.nlm.nih.gov/20644199/)
+
+  > McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508.
+
+- [snpEff](https://pubmed.ncbi.nlm.nih.gov/22728672/)
+
+  > Cingolani P, Platts A, Wang le L, et al.: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). Apr-Jun 2012;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.
+
+- [VEP](https://pubmed.ncbi.nlm.nih.gov/27268795/)
+
+  > McLaren W, Gil L, Hunt SE, et al.: The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4. PubMed PMID: 27268795; PubMed Central PMCID: PMC4893825.
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
   > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
+- [Tabix](https://pubmed.ncbi.nlm.nih.gov/21208982/)
+
+  > Heng Li, Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, Volume 27, Issue 5, 1 March 2011, Pages 718–719. doi: 10.1093/bioinformatics/btq671. PubMed PMID: 21208982; PubMed Central PMCID: PMC3042176.
+
+- [R](https://www.R-project.org/)
+
+  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

diff --git a/README.md b/README.md
@@ -1,6 +1,9 @@
-# ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_light.png#gh-light-mode-only) ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_dark.png#gh-dark-mode-only)
+# ![nf-core/rnavar](docs/images/nf-core-rnavar_logo_light.png#gh-light-mode-only) ![nf-core-rnavar](docs/images/nf-core/rnavar_logo_dark.png#gh-dark-mode-only)
 
-[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?labelColor=000000&logo=Amazon%20AWS)](https://nf-co.re/rnavar/results)[![Cite with Zenodo](http://img.shields.io/badge/DOI-10.5281/zenodo.XXXXXXX-1073c8?labelColor=000000)](https://doi.org/10.5281/zenodo.XXXXXXX)
+[![GitHub Actions CI Status](https://github.com/nf-core/rnavar/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/rnavar/actions?query=workflow%3A%22nf-core+CI%22)
+[![GitHub Actions Linting Status](https://github.com/nf-core/rnavar/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/rnavar/actions?query=workflow%3A%22nf-core+linting%22)
+[![AWS CI](https://img.shields.io/badge/CI%20tests-full%20size-FF9900?logo=Amazon%20AWS)](https://nf-co.re/rnavar/results)
+[![Cite with Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.6669637.svg)](https://doi.org/10.5281/zenodo.6669637)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
@@ -12,20 +15,38 @@
 
 ## Introduction
 
-**nf-core/rnavar** is a bioinformatics pipeline that ...
-
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
-
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+**nf-core/rnavar** is a bioinformatics pipeline for RNA variant calling analysis following GATK4 best practices.
+
+## Pipeline summary
+
+1. Merge re-sequenced FastQ files ([`cat`](http://www.linfo.org/cat.html))
+2. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
+3. Align reads to reference genome ([`STAR`](https://github.com/alexdobin/STAR))
+4. Sort and index alignments ([`SAMtools`](https://sourceforge.net/projects/samtools/files/samtools/))
+5. Duplicate read marking ([`GATK4 MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard))
+6. Split reads that contain Ns in their CIGAR string ([`GATK4 SplitNCigarReads`](https://gatk.broadinstitute.org/hc/en-us/articles/4409917482651-SplitNCigarReads))
+7. Estimate and correct systematic bias using base quality score recalibration ([`GATK4 BaseRecalibrator`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897206043-BaseRecalibrator), [`GATK4 ApplyBQSR`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897168667-ApplyBQSR))
+8. Convert a BED file to a Picard Interval List ([`GATK4 BedToIntervalList`](https://gatk.broadinstitute.org/hc/en-us/articles/4409924780827-BedToIntervalList-Picard-))
+9. Scatter one interval-list into many interval-files ([`GATK4 IntervalListTools`](https://gatk.broadinstitute.org/hc/en-us/articles/4409917392155-IntervalListTools-Picard-))
+10. Call SNPs and indels ([`GATK4 HaplotypeCaller`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897180827-HaplotypeCaller))
+11. Merge multiple VCF files into one VCF ([`GATK4 MergeVCFs`](https://gatk.broadinstitute.org/hc/en-us/articles/4409924817691-MergeVcfs-Picard-))
+12. Index the VCF ([`Tabix`](http://www.htslib.org/doc/tabix.html))
+13. Filter variant calls based on certain criteria ([`GATK4 VariantFiltration`](https://gatk.broadinstitute.org/hc/en-us/articles/4409897204763-VariantFiltration))
+14. Annotate variants ([`snpEff`](https://pcingola.github.io/SnpEff/se_introduction/), [Ensembl VEP](https://www.ensembl.org/info/docs/tools/vep/index.html))
+15. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
+
+### Summary of tools and versions used in the pipeline
+
+| Tool        | Version |
+| ----------- | ------- |
+| FastQC      | 0.11.9  |
+| STAR        | 2.7.9a  |
+| Samtools    | 1.15.1  |
+| GATK        | 4.2.6.1 |
+| Tabix       | 1.11    |
+| SnpEff      | 5.0     |
+| Ensembl VEP | 104.3   |
+| MultiQC     | 1.12    |
 
 ## Usage
 
@@ -52,14 +73,9 @@ Each row represents a fastq file (single-end) or a pair of fastq files (paired e
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
-```bash
-nextflow run nf-core/rnavar \
-   -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
-   --outdir <OUTDIR>
-```
+   ```console
+   nextflow run nf-core/rnavar -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38
+   ```
 
 > **Warning:**
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those
@@ -76,11 +92,13 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/rnavar was originally written by @praveenraj2018.
+These scripts were originally written in Nextflow DSL2 for use at the [Barntumörbanken, Karolinska Institutet](https://ki.se/forskning/barntumorbanken), by Praveen Raj ([@praveenraj2018](https://github.com/praveenraj2018)) and Maxime U Garcia ([@maxulysse](https://github.com/maxulysse)).
 
-We thank the following people for their extensive assistance in the development of this pipeline:
+The pipeline is primarily maintained by Praveen Raj ([@praveenraj2018](https://github.com/praveenraj2018)) from [Barntumörbanken, Karolinska Institutet](https://ki.se/forskning/barntumorbanken) and Maxime U Garcia ([@maxulysse](https://github.com/maxulysse)) from [Seqera Labs](https://seqera.io).
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+Many thanks to others who have helped out along the way too, including (but not limited to):
+[@ewels](https://github.com/ewels),
+[@drpatelh](https://github.com/drpatelh).
 
 ## Contributions and Support
 
@@ -90,10 +108,7 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 ## Citations
 
-<!-- TODO nf-core: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file. -->
-<!-- If you use  nf-core/rnavar for your analysis, please cite it using the following doi: [10.5281/zenodo.XXXXXX](https://doi.org/10.5281/zenodo.XXXXXX) -->
-
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
+If you use nf-core/rnavar for your analysis, please cite it using the following doi: [10.5281/zenodo.6669637](https://doi.org/10.5281/zenodo.6669637)
 
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -11,3 +11,55 @@ report_section_order:
     order: -1002
 
 export_plots: true
+
+# Run only these modules
+run_modules:
+  - custom_content
+  - fastqc
+  - star
+  - samtools
+  - picard
+  - gatk
+  - snpeff
+  - vep
+
+# Order of modules
+module_order:
+  - fastqc:
+      name: "FastQC (raw)"
+      path_filters:
+        - "*_val_*.zip"
+  - star:
+      name: "Read Alignment (STAR)"
+  - samtools:
+      name: "Samtools Flagstat"
+  - picard:
+      name: "GATK4 MarkDuplicates"
+      info: "Metrics generated by GATK4 MarkDuplicates"
+  - qualimap:
+      name: "Qualimap"
+  - gatk:
+      name: "GATK4 BQSR"
+  - snpeff:
+      name: "SNPeff"
+  - vep:
+      name: "VEP"
+
+extra_fn_clean_exts:
+  - "_val"
+
+# Don't show % Dups in the General Stats table (we have this from Picard)
+table_columns_visible:
+  fastqc:
+    percent_duplicates: False
+
+sp:
+  samtools/stats:
+    fn: "*.aligned.bam.stats"
+  samtools/flagstat:
+    fn: "*.aligned.bam.flagstat"
+  picard/markdups:
+    fn: "*.markdup.sorted.metrics"
+  snpeff:
+    contents: "SnpEff_version"
+    max_filesize: 5000000
diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,4 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,fastq_1,fastq_2,strandedness
+RAP1_UNINDUCED_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357073_1.fastq.gz,,reverse
+RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357074_1.fastq.gz,,reverse
+RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357075_1.fastq.gz,,reverse
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -29,8 +29,13 @@
                         "maxLength": 0
                     }
                 ]
+            },
+            "strandedness": {
+                "type": "string",
+                "errorMessage": "Strandedness must be provided and be one of 'forward', 'reverse' or 'unstranded'",
+                "enum": ["forward", "reverse", "unstranded"]
             }
         },
-        "required": ["sample", "fastq_1"]
+        "required": ["sample", "fastq_1", "strandedness"]
     }
 }
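
Review note: with `strandedness` now required and restricted to an enum, rows that omit it or misspell it should be rejected. The following is a minimal standalone sketch of that rule, not the pipeline's own validator; the function name `validate_rows` and the sample rows are hypothetical, and it treats an empty CSV cell as a missing value, which is a simplification of how JSON Schema `required` works.

```python
import csv
import io

# Allowed values and required columns as stated in the schema change above.
ALLOWED_STRANDEDNESS = {"forward", "reverse", "unstranded"}
REQUIRED_COLUMNS = ("sample", "fastq_1", "strandedness")


def validate_rows(handle):
    """Return (line_number, message) tuples for rows that break the rules."""
    errors = []
    reader = csv.DictReader(handle)
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                errors.append((line_number, f"missing required column '{column}'"))
        value = (row.get("strandedness") or "").strip()
        if value and value not in ALLOWED_STRANDEDNESS:
            errors.append((line_number, f"invalid strandedness '{value}'"))
    return errors


# Hypothetical rows: one valid, one with a bad value, one with an empty value.
sample_csv = """sample,fastq_1,fastq_2,strandedness
S1,a_1.fastq.gz,,reverse
S2,b_1.fastq.gz,b_2.fastq.gz,sideways
S3,c_1.fastq.gz,,
"""
errors = validate_rows(io.StringIO(sample_csv))
for line_number, message in errors:
    print(f"line {line_number}: {message}")
```

Samplesheets written for the previous three-column format would fail this check on every row, which may be worth calling out in the changelog.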
diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -177,15 +177,15 @@ def check_samplesheet(file_in, file_out):
 
     Example:
         This function checks that the samplesheet follows the following structure,
-        see also the `viral recon samplesheet`_::
+        see also the `rnavar test samplesheet`_::
 
             sample,fastq_1,fastq_2
             SAMPLE_PE,SAMPLE_PE_RUN1_1.fastq.gz,SAMPLE_PE_RUN1_2.fastq.gz
             SAMPLE_PE,SAMPLE_PE_RUN2_1.fastq.gz,SAMPLE_PE_RUN2_2.fastq.gz
             SAMPLE_SE,SAMPLE_SE_RUN1_1.fastq.gz,
 
-    .. _viral recon samplesheet:
-        https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
+    .. _rnavar test samplesheet:
+        https://raw.githubusercontent.com/nf-core/test-datasets/rnavar/samplesheet/v1.0/samplesheet.csv
 
     """
     required_columns = {"sample", "fastq_1", "fastq_2"}

diff --git a/conf/base.config b/conf/base.config
@@ -10,7 +10,6 @@
 
 process {
 
-    // TODO nf-core: Check the defaults for all processes
     cpus   = { check_max( 1    * task.attempt, 'cpus'   ) }
     memory = { check_max( 6.GB * task.attempt, 'memory' ) }
     time   = { check_max( 4.h  * task.attempt, 'time'   ) }
@@ -24,7 +23,6 @@ process {
     //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
     //        If possible, it would be nice to keep the same label naming convention when
     //        adding in your local modules too.
-    // TODO nf-core: Customise requirements for specific processes.
     // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
     withLabel:process_single {
         cpus   = { check_max( 1                  , 'cpus'    ) }