nf-core · nschcolnicov · Oct 4, 2024 · Oct 3, 2024 · Oct 4, 2024 · Oct 4, 2024
diff --git a/README.md b/README.md
@@ -51,7 +51,7 @@ You can find numerous talks on the nf-core events page from various topics inclu
    4. ncRNA filtration
    5. piRNA filtration
    6. Others filtration
-5. UMI barcode deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
+5. UMI barcode deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools)) <!-- TODO, isn't this done on the UMI step above? -->
 6. miRNA quantification
    - EdgeR
      1. Reads alignment against miRBase mature miRNA ([`Bowtie1`](http://bowtie-bio.sourceforge.net/index.shtml))
@@ -108,15 +108,15 @@ Now, you can run the pipeline using:
 
 ```bash
 nextflow run nf-core/smrnaseq \
-   -profile <docker/singularity/.../institute>,illumina \
+   -profile <docker/singularity/.../institute>,<protocol> \
   --input samplesheet.csv \
   --genome 'GRCh37' \
   --mirtrace_species 'hsa' \
   --outdir <OUTDIR>
 ```
 
 > [!IMPORTANT]
-> Remember to add a protocol as an additional profile (such as `illumina`, `nexttflex`, `qiaseq` or `custom`) when running with your own data. Default is `custom`. See [usage documentation](https://nf-co.re/smrnaseq/usage) for more details about these profiles.
+> Remember to add a protocol as an additional profile (such as `illumina`, `nextflex`, `qiaseq` or `cats`) when running with your own data. If no protocol is indicated via `-profile`, the pipeline will likely fail. Alternatively, to run a custom protocol, set the relevant parameters manually; an auto-detect feature is also available. See [usage documentation](https://nf-co.re/smrnaseq/usage) for more details about these profiles.
 
 > [!WARNING]
 > Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;

diff --git a/docs/output.md b/docs/output.md
@@ -12,6 +12,7 @@ The directories listed below will be created in the results directory after the
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
+- [Preprocessing](#preprocessing) - preprocessing of reference files
 - [FastQC](#fastqc) - read quality control
 - [UMI-tools extract](#umi-tools-extract) - UMI barcode extraction
 - [UMI-collapse deduplicate](#umicollapse-deduplicate) - read deduplication
@@ -29,6 +30,18 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 If `--save_intermediates` is specified, intermediate files generated by each process will be saved in the output directory.
 
+## Preprocessing
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `bowtie_index/genome`: Cleaned genome.fa fasta.
+- `untar/bowtie_index`: Uncompressed bowtie index file.
+
+</details>
+
+Preprocessing formats the reference files before they are used in the workflow. It includes [`untar`](https://www.gnu.org/software/tar/manual/) and [`bioawk`](https://github.com/lh3/bioawk). If the `bowtie_index` file provided is in gzip format, it will be processed by `untar`. The FASTA file provided will be cleaned using `bioawk`.
+
 ### FastQC
 
 <details markdown="1">
@@ -49,7 +62,7 @@ If `--save_intermediates` is specified, intermediate files generated by each pro
 <details markdown="1">
 <summary>Output files</summary>
 
-- `umitools/`
+- `umi_dedup/fastq_extracted_umi/`
   - `*.fastq.gz`: If `--save_umi_intermeds` is specified, FastQ files **after** UMI extraction will be placed in this directory.
   - `*.log`: Log file generated by the UMI-tools `extract` command.
 
@@ -79,42 +92,46 @@ FastP can automatically detect adapter sequences when not specified directly by
 <details markdown="1">
 <summary>Output files</summary>
 
-- `umi_dedup/`
-  - `*.log`: Results statistics files detailing the UMI deduplication results.
+- `umi_dedup/bam_deduplicated`
   - `*.fastq.gz`: If `--save_umi_intermeds` is specified, the deduplicated fastq.gz files **after** UMI deduplication will be placed in this directory.
   </details>
 
 [UMI-tools](https://github.com/CGATOxford/UMI-tools) deduplicates reads based on unique molecular identifiers (UMIs) to address PCR-bias. Firstly, the UMI-tools `extract` command removes the UMI barcode information from the read sequence and adds it to the read name as highlighted in the [UMI-tools extract](#umi-tools-extract) section. Umicollapse works directly on the fastq files instead of mapping the UMI data first, then deduplicating and generating fastq files again.
 
 ## Bowtie2
 
-[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases of contaminants.
+[Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is used to align the reads to user-defined databases and to build indexes for `--filter_contaminant` files.
 
 MultiQC reports the number of reads that were removed by each of the contaminant databases.
 
 ## Bowtie
 
-[Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) is used for mapping adapter trimmed reads against the mature miRNAs and miRNA precursors (hairpins) of the chosen database [miRBase](http://www.mirbase.org/) or [MirGeneDB](https://mirgenedb.org/).
+[Bowtie](http://bowtie-bio.sourceforge.net/index.shtml) is used for building the index for the fasta genome, if needed. It is also used for mapping adapter trimmed reads against the mature miRNAs and miRNA precursors (hairpins) of the chosen database [miRBase](http://www.mirbase.org/) or [MirGeneDB](https://mirgenedb.org/).
+
+**Output directory: `results/`**
 
-**Output directory: `results/samtools`**
+- `bowtie_index/`
+  - `mirna_hairpin/bowtie`: hairpin.fa bowtie index files.
+  - `mirna_mature/bowtie`: mature.fa bowtie index files.
+- `genome_quant/`
+  - `genome_quant/bam/.*bam`: The aligned BAM file results.
+  - `genome_quant/bam/.*unmapped.fastq.gz`: Unmapped reads results.
+- `mirna_quant/`
 
-- `sample_mature.bam`: The aligned BAM file of alignment against mature miRNAs
-- `sample_mature_unmapped.fq.gz`: Unmapped reads against mature miRNAs _This file will be used as input for the alignment against miRNA precursors (hairpins)_
-- `sample_mature_hairpin.bam`: The aligned BAM file of alignment against miRNA precursors (hairpins) that didn't map to the mature
-- `sample_mature_hairpin_unmapped.fq.gz`: Unmapped reads against miRNA precursors (hairpins)
-- `sample_mature_hairpin_genome.bam`: The aligned BAM file of reads that didn't map to the precursor.
+  - `mirna_quant/bam/{hairpin,mature,seqcluster}/.*bam`: The aligned BAM file results against hairpin, mature or seqcluster.
+  - `mirna_quant/bam/{hairpin,mature,seqcluster}/.*unmapped.fastq.gz`: Unmapped reads for hairpin, mature or seqcluster.
 
 If `--save_intermediates` is specified, these files will be placed in this directory.
 
 ## SAMtools
 
 [SAMtools](http://samtools.sourceforge.net/) is used for sorting and indexing the output BAM files from Bowtie. In addition, the numbers of features are counted with the `idxstats` option.
 
-**Output directory: `results/samtools/samtools_stats`**
+**Output directory: `results/{genome_quant,mirna_quant}/bam`**
 
 These files will be saved in this directory if `--save_intermediates` is specified. In any case, these stats will always be available in the MultiQC report.
 
-- `stats|idxstats|flagstat`: BAM stats for each of the files listed above.
+- `.*stats|.*idxstats|.*flagstat`: BAM stats for each of the files listed above.
 
 ![samtools](images/samtools_alignment_plot.png)
 
@@ -177,9 +194,8 @@ The files for each sample can also be visualized into a single plot in the Multi
 
 ![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
 
-:::note
-The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
-:::
+> [!NOTE]
+> The FastQC plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
 
 ### MultiQC
 
@@ -197,7 +213,8 @@ The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They m
 
 Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
 
-- Note: There may be a discrepancy in read counts number displayed in MultiQC between the original FASTQ and BAM files, this is due to secondary alignments being reported by the aligner, which can inflate the total read count number in the BAM files. [More info about this behavior can be found here](https://github.com/nf-core/smrnaseq/issues/94).
+> [!NOTE]
+> The read counts displayed in MultiQC may differ between the original FASTQ files and the BAM files. This is due to secondary alignments being reported by the aligner, which can inflate the total read count in the BAM files. [More info about this behavior can be found here](https://github.com/nf-core/smrnaseq/issues/94).
 
 ### Pipeline information
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -29,26 +29,29 @@ The parameter `--three_prime_adapter` is set to the Illumina TruSeq single index
 
 ### `mirtrace_species` or `mirgenedb_species`
 
-It should point to the 3-letter species name used by [miRBase](https://www.mirbase.org/help/genome_summary.shtml) or [MirGeneDB](https://www.mirgenedb.org/browse). Note the difference in case for the two databases.
+It should point to the 3-letter species name used by [miRBase](https://www.mirbase.org/browse) or [MirGeneDB](https://www.mirgenedb.org/browse). Note the difference in case for the two databases.
 
 ### miRNA related files
 
 Different parameters can be set for the two supported databases. By default `miRBase` will be used with the parameters below.
 
 - `mirna_gtf`: If not supplied by the user, then `mirna_gtf` will point to the latest GFF3 file in miRbase: `https://mirbase.org/download/CURRENT/genomes/${params.mirtrace_species}.gff3`
-- `mature`: points to the FASTA file of mature miRNA sequences. `https://mirbase.org/download/mature.fa`
-- `hairpin`: points to the FASTA file of precursor miRNA sequences. `https://mirbase.org/download/hairpin.fa`
+- `mature`: points to the FASTA file of mature miRNA sequences. Default: `https://mirbase.org/download/mature.fa`
+- `hairpin`: points to the FASTA file of precursor miRNA sequences. Default: `https://mirbase.org/download/hairpin.fa`
 
 If MirGeneDB should be used instead it needs to be specified using `--mirgenedb` and use the parameters below.
 
-- `mirgenedb_gff`: The data can not be downloaded automatically (URLs are created with short term tokens in it), thus the user needs to supply the gff file for either his species, or all species downloaded from `https://mirgenedb.org/download`. The total set will automatically be subsetted to the species specified with `--mirgenedb_species`.
-- `mirgenedb_mature`: points to the FASTA file of mature miRNA sequences. Download from `https://mirgenedb.org/download`.
-- `mirgenedb_hairpin`: points to the FASTA file of precursor miRNA sequences. Download from `https://mirgenedb.org/download`. Note that MirGeneDB does not have a dedicated `hairpin` file, but the `Precursor sequences` are to be used.
+- `mirgenedb_gff`: The GFF file cannot be downloaded automatically due to the presence of short-term tokens in the URLs. Therefore, the user must manually provide the GFF file, either for their species of interest or for all species, by downloading it from [MirGeneDB](https://mirgenedb.org/download). The provided dataset will be automatically filtered based on the species specified with the `--mirgenedb_species` parameter.
+- `mirgenedb_mature`: This parameter should point to the FASTA file containing mature miRNA sequences. The file can be manually downloaded from [MirGeneDB](https://mirgenedb.org/download).
+- `mirgenedb_hairpin`: This parameter should point to the FASTA file containing precursor miRNA sequences. Note that MirGeneDB does not offer a dedicated hairpin file, but the precursor sequences can be downloaded from [MirGeneDB](https://mirgenedb.org/download) and used instead.
 
 ### Genome
 
 - `fasta`: the reference genome FASTA file
-- `bt_indices`: points to the folder containing the `bowtie2` indices for the genome reference specified by `fasta`. **Note:** if the FASTA file in `fasta` is not the same file used to generate the `bowtie2` indices, then the pipeline will fail.
+- `bowtie_index`: points to the folder containing the `bowtie` indices for the genome reference specified by `fasta`.
+
+> [!NOTE]
+> If the FASTA file in `fasta` is not the same file used to generate the `bowtie` indices, then the pipeline will fail.
 
 ### Contamination filtering
 
@@ -77,9 +80,8 @@ The pipeline handles UMIs with two tools. Umicollapse to deduplicate on entire r
 --with_umi --umitools_extract_method regex --umitools_bc_pattern = '.+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)'
 ```
 
-:::note
-You will have to specify custom umitools_bc_pattern patterns if your UMI read structure is different. Please check the required capability in your UMI handling manual. It should be set in a way, that only the insert sequence of the RNA molecule is left after extraction. Please refer to the manual of the used kit for the expected read structure.
-:::
+> [!NOTE]
+> If your UMI read structure differs, you'll need to specify custom `umitools_bc_pattern` patterns. Ensure that the pattern is set so that only the insert sequence of the RNA molecule remains after extraction. For details, refer to the UMI handling manual or the documentation of the kit you're using for the expected read structure.
 
 ## Samplesheet input
 
@@ -104,7 +106,10 @@ CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz
 
 ### Full samplesheet
 
-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire. However, there is a strict requirement for the first 3 columns to match those defined in the table below.
+The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet must have at least 2 columns (`sample` and `fastq_1`). A third column (`fastq_2`) can be added if the sample is paired-end.
+
+> [!NOTE]
+> Most of the tools used can't accommodate paired-end reads, so whenever paired-end samples are used as inputs, the R1 and R2 reads are concatenated into a single FastQ file by the pipeline.
 
 A final samplesheet file consisting of single-end data and may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
 
@@ -119,10 +124,11 @@ TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz
 TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz
 ```
 
-| Column    | Description                                                                                                                                                                            |
-| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `sample`  | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). |
-| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".                                                             |
+| Column    | Description                                                                                                                                                                            | Requirement |
+| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
+| `sample`  | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | Mandatory   |
+| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".                                                             | Mandatory   |
+| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".                                                             | Optional    |
 
 An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
 
@@ -149,9 +155,8 @@ If you wish to repeatedly use the same parameters for multiple runs, rather than
 
 Pipeline settings can be provided in a `yaml` or `json` file via `-params-file <file>`.
 
-:::warning
-Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
-:::
+> [!WARNING]
+> Do not use `-c <file>` to specify parameters as this will result in errors. Custom config files specified with `-c` must only be used for [tuning process resource specifications](https://nf-co.re/docs/usage/configuration#tuning-workflow-resources), other infrastructural tweaks (such as output directories), or module arguments (args).
 
 The above pipeline run specified with a params file in yaml format:
 
@@ -199,25 +204,22 @@ The `bin` directory contains some scripts used by the pipeline which may also be
 
 To further assist in reproducbility, you can use share and re-use [parameter files](#running-the-pipeline) to repeat pipeline runs with the same settings without having to write out a command with every single parameter.
 
-:::tip
-If you wish to share such profile (such as upload as supplementary material for academic publications), make sure to NOT include cluster specific paths to files, nor institutional specific profiles.
-:::
+> [!TIP]
+> If you wish to share such a profile (such as uploading it as supplementary material for academic publications), make sure not to include cluster-specific paths to files, nor institution-specific profiles.
 
 ## Core Nextflow arguments
 
-:::note
-These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
-:::
+> [!NOTE]
+> These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
 
 ### `-profile`
 
 Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
 
 Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
 
-:::info
-We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported.
-:::
+> [!TIP]
+> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility; however, when this is not possible, Conda is also supported.
 
 The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation).
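
The samplesheet rules introduced in the `docs/usage.md` changes above (mandatory `sample` and `fastq_1` columns, an optional `fastq_2` column, and FastQ paths ending in ".fastq.gz" or ".fq.gz") can be illustrated with a small check. This is only a sketch for readers of the docs: `check_samplesheet` is a hypothetical helper written for this note, not part of the pipeline, and it assumes the pipeline's real validation may be stricter or behave differently.

```python
import csv
import io

# Column rules taken from the samplesheet table in docs/usage.md.
MANDATORY = ("sample", "fastq_1")
OPTIONAL = ("fastq_2",)
VALID_SUFFIXES = (".fastq.gz", ".fq.gz")


def check_samplesheet(text):
    """Return a list of problems found in a samplesheet given as CSV text.

    Hypothetical helper for illustration only; it is NOT the pipeline's
    own validation logic.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    # Both mandatory columns must be present in the header row.
    for column in MANDATORY:
        if column not in header:
            problems.append(f"missing mandatory column: {column}")
    if problems:
        return problems

    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if not (row["sample"] or "").strip():
            problems.append(f"line {line_no}: empty sample name")
        for column in ("fastq_1",) + OPTIONAL:
            value = (row.get(column) or "").strip()
            if not value:
                # An empty fastq_2 simply means the sample is single-end.
                if column in MANDATORY:
                    problems.append(f"line {line_no}: empty {column}")
                continue
            if not value.endswith(VALID_SUFFIXES):
                problems.append(
                    f"line {line_no}: {column} must end in .fastq.gz or .fq.gz"
                )
    return problems


if __name__ == "__main__":
    sheet = "sample,fastq_1\nCONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz\n"
    print(check_samplesheet(sheet))
```

The check works on CSV text rather than a file path so it can be tried without any FastQ files on disk; it does not test whether the listed files exist.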