Include the calculation of a single coverage and coverage depth value. #20

Merged · 4 commits · Sep 20, 2024
31 changes: 4 additions & 27 deletions .github/CONTRIBUTING.md
@@ -36,8 +36,8 @@ There are typically two types of tests that run:

### Lint tests

`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.
This pipeline follows some of the `nf-core` [guidelines](https://nf-co.re/developers/guidelines).
To enforce these, the `nf-core` team has developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.

@@ -52,9 +52,9 @@ These tests are run both with the latest available version of `Nextflow` and also

:warning: Only in the unlikely and regretful event of a release happening with a bug.

- On your own fork, make a new branch `patch` based on `upstream/master`.
- On your own fork, make a new branch `patch` based on `upstream/main`.
- Fix the bug, and bump version (X.Y.Z+1).
- A PR should be made on `master` from patch to directly address this particular bug.
- A PR should be made on `main` from patch to directly address this particular bug.

## Pipeline contribution conventions

@@ -93,26 +93,3 @@ Please use the following naming schemes, to make it easy to understand what is g

- initial process channel: `ch_output_from_<process>`
- intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`

### Nextflow version bumping

If you are using a new feature from core Nextflow, you may bump the minimum required version of Nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

### Images and figures

For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).

## GitHub Codespaces

This repo includes a devcontainer configuration which will create a GitHub Codespace for Nextflow development! This is an online developer environment that runs in your browser, complete with VSCode and a terminal.

To get started:

- Open the repo in [Codespaces](https://github.com/ebi-metagenomics/miassembler/codespaces)
- Tools installed
- nf-core
- Nextflow

Devcontainer specs:

- [DevContainer config](.devcontainer/devcontainer.json)
3 changes: 0 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -19,7 +19,4 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/ebi-metageno
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
49 changes: 46 additions & 3 deletions README.md
@@ -165,6 +165,49 @@ results

The nested structure based on ENA Study and Reads accessions was created to suit the Microbiome Informatics team’s needs. The benefit of this structure is that results from different runs of the same study won’t overwrite any results.

### Coverage

The pipeline reports coverage values for the assembly using two mechanisms: per-contig depth summaries from `jgi_summarize_bam_contig_depths`, and a custom whole-assembly coverage and coverage-depth calculation.

#### jgi_summarize_bam_contig_depths

This tool summarizes the depth of coverage for each contig from BAM files containing the mapped reads. It quantifies the extent to which contigs in an assembly are covered by these reads. The output is a tabular file, with rows representing contigs and columns displaying the summarized coverage values from the BAM files. This summary is useful for binning contigs or estimating abundance in various metagenomic datasets.

This file is generated per assembly and stored in the following location (e.g., for study `SRP115494` and run `SRR6180434`): `SRP1154/SRP115494/SRR6180/SRR6180434/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage_depth_summary.tsv.gz`

##### Example output of `jgi_summarize_bam_contig_depths`

| contigName | contigLen | totalAvgDepth | SRR6180434_sorted.bam | SRR6180434_sorted.bam-var |
| -------------------------------- | --------- | ------------- | --------------------- | ------------------------- |
| NODE_1_length_539_cov_105.072314 | 539 | 273.694 | 273.694 | 74284.7 |

###### Explanation of the Columns:

1. **contigName**: The name or identifier of the contig (e.g., `NODE_1_length_539_cov_105.072314`). This is usually derived from the assembly process and may include information such as the contig length and coverage.

2. **contigLen**: The length of the contig in base pairs (e.g., `539`).

3. **totalAvgDepth**: The average depth of coverage across the entire contig, summed over all BAM files (e.g., `273.694`). This represents the total sequencing coverage averaged across the length of the contig. For assemblies built from a single sample, this value equals that sample's average depth.

4. **SRR6180434_sorted.bam**: The average depth of coverage for the specific sample represented by this BAM file (e.g., `273.694`). This shows how well the contig is covered by reads from that sample.

5. **SRR6180434_sorted.bam-var**: The variance in the depth of coverage for the same BAM file (e.g., `74284.7`). This gives a measure of how uniform or uneven the read coverage is across the contig.
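
To make the relationship between these columns concrete, here is a minimal, illustrative sketch of how a mean depth and its variance relate for a single contig. Note that `jgi_summarize_bam_contig_depths` applies its own rules (such as trimming contig edges), so this is a conceptual aid, not a reimplementation of the tool:

```python
# Conceptual sketch only: jgi_summarize_bam_contig_depths has its own logic
# (e.g. edge trimming); these per-base depths are invented for illustration.
from statistics import mean, pvariance

per_base_depths = [250, 310, 260, 275]  # hypothetical depths along a contig

avg_depth = mean(per_base_depths)       # analogous to the per-sample depth column
depth_var = pvariance(per_base_depths)  # analogous to the "-var" column

print(f"avg={avg_depth:.3f} var={depth_var:.3f}")
```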

#### Coverage JSON

The pipeline calculates two key metrics for the entire assembly: coverage and coverage depth. Coverage is the number of assembled base pairs divided by the total number of base pairs before filtering. Coverage depth is the number of assembled base pairs divided by the total length of the assembly, provided the assembly length is greater than zero. These metrics indicate how much of the raw sequencing data the assembly captures and the average depth of coverage across the assembled contigs. The script that calculates these values is [calculate_assembly_coverage.py](bin/calculate_assembly_coverage.py).
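
As a worked example, using the single-contig assembly from the table above (the fastp total below is a hypothetical value chosen so the results match the JSON shown next; a real fastp report will differ):

```python
# Worked example based on the contig in the table above.
contig_len = 539
total_avg_depth = 273.694

assembled_base_pairs = contig_len * total_avg_depth  # ~147,521 aligned bases
assembly_length = contig_len                         # single-contig assembly

# Hypothetical fastp 'total_bases' before filtering, chosen for illustration.
total_bases_before_filtering = 3_098_855

coverage = assembled_base_pairs / total_bases_before_filtering  # ~0.04760
coverage_depth = assembled_base_pairs / assembly_length         # 273.694
```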

The pipeline creates a JSON file with the following content:

```json
{
"coverage": 0.04760503915318373,
"coverage_depth": 273.694
}
```

The file is stored at (e.g., for study `SRP115494` and run `SRR6180434`): `SRP1154/SRP115494/SRR6180/SRR6180434/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage.json`
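
Downstream tooling can consume this file directly; for example (assuming the file has been copied into the working directory):

```python
import json

# Load the per-assembly coverage report produced by the pipeline.
with open("SRR6180434_coverage.json") as handle:
    report = json.load(handle)

print(report["coverage"], report["coverage_depth"])
```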

### Top Level Reports

#### MultiQC
@@ -187,10 +230,10 @@ SRR6180434,filter_ratio_threshold_exceeded

##### Runs exclusion messages

| Exclusion Message                 | Description |
| --------------------------------- | ----------- |
| `filter_ratio_threshold_exceeded` | The maximum fraction of reads that are allowed to be filtered out. If exceeded, it flags excessive filtering. The default value is 0.9, meaning that if more than 90% of the reads are filtered out, the threshold is considered exceeded, and the run is not assembled. |
| `low_reads_count_threshold`       | The minimum number of reads required after filtering. If below this threshold, the run is flagged as having a low read count and is not assembled. |
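
For illustration, a sketch of how the filter-ratio check might be expressed (the names here are hypothetical, not the pipeline's actual variables):

```python
# Hypothetical sketch of the filter-ratio check; names are illustrative.
def exceeds_filter_ratio(reads_before: int, reads_after: int, max_ratio: float = 0.9) -> bool:
    """Return True if more than max_ratio of the input reads were filtered out."""
    filtered_fraction = 1 - (reads_after / reads_before)
    return filtered_fraction > max_ratio

# 95% of reads removed by QC -> run is excluded from assembly
print(exceeds_filter_ratio(1_000_000, 50_000))  # True
```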

#### Assembled Runs

91 changes: 91 additions & 0 deletions bin/calculate_assembly_coverage.py
@@ -0,0 +1,91 @@
#!/usr/bin/env python

import csv
from typing import Tuple
import json
import argparse
import gzip

def get_assembled_base_pairs_and_length(jgi_summarize_coverage_file_gz: str) -> Tuple[float, float]:
    """
    Computes the total number of assembled base pairs and the assembly length from a
    gzipped jgi_summarize_bam_contig_depths coverage file.

    :param jgi_summarize_coverage_file_gz: Path to the gzipped JGI summarize coverage file.
    :type jgi_summarize_coverage_file_gz: str

    :return: A tuple containing the total number of assembled base pairs and the total assembly length.
    :rtype: Tuple[float, float]

    :raises ValueError: If the 'contigLen' or 'totalAvgDepth' columns contain non-numeric values,
        or if a contig has a length of 0.
    """
assembled_base_pairs = 0
assembly_length = 0
with gzip.open(jgi_summarize_coverage_file_gz, "rt") as file_handle:
csv_reader = csv.DictReader(file_handle, delimiter="\t")
for row in csv_reader:
contig_length_str = row["contigLen"]
total_avg_depth_str = row["totalAvgDepth"]

if not contig_length_str.isnumeric():
raise ValueError(f"The column 'contigLen' has an invalid value: {contig_length_str}")

if int(contig_length_str) == 0:
raise ValueError(f"The column 'contigLen' cannot have a contig of len 0")

contig_length = float(contig_length_str)
            # float() raises a ValueError if total_avg_depth_str is not numeric
total_avg_depth = float(total_avg_depth_str)

assembled_base_pairs += contig_length * total_avg_depth
assembly_length += contig_length

return assembled_base_pairs, assembly_length


def get_total_bases_before_filtering(fastp_json: str) -> int:
"""
Retrieves the total number of bases before filtering from a Fastp JSON file.

:param fastp_json: Path to the Fastp JSON file.
:type fastp_json: str

:return: The total number of bases before filtering.
:rtype: int

:raises ValueError: If the 'total_bases' value in the Fastp JSON is not numeric.
"""
with open(fastp_json, "r") as json_handle:
fastp_json_data = json.load(json_handle)
# Get the total bases before filtering (the raw bp count)
total_bases_before_filtering = fastp_json_data.get("summary").get("before_filtering").get("total_bases")

if not isinstance(total_bases_before_filtering, int):
raise ValueError(
f"The value of 'total_bases_before_filtering' in the fastp json is not numeric: {total_bases_before_filtering}"
)

return total_bases_before_filtering


def main():
parser = argparse.ArgumentParser(
description="Calculate coverage and coverage depth from the jgi_summarize_bam_contig_depths and fastp data."
)
parser.add_argument("-j", "--jgi", type=str, required=True, help="Path to the JGI summarize coverage compressed (gz) file.")
parser.add_argument("-f", "--fastp", type=str, required=True, help="Path to the qc fastp json report file.")
parser.add_argument("-o", "--outfile", type=str, default="coverage_report.json", help="Name of the output file.")

args = parser.parse_args()

assembled_base_pairs, assembly_length = get_assembled_base_pairs_and_length(args.jgi)
total_bases_before_filtering = get_total_bases_before_filtering(args.fastp)

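    # coverage: assembled (aligned) bases as a fraction of the raw bases before filtering
    # coverage_depth: average per-base sequencing depth across the whole assembly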
coverage = assembled_base_pairs / total_bases_before_filtering
coverage_depth = assembled_base_pairs / assembly_length if assembly_length > 0 else 0

with open(args.outfile, "w") as outfile_file:
json.dump({"coverage": coverage, "coverage_depth": coverage_depth}, outfile_file)


if __name__ == "__main__":
main()
19 changes: 19 additions & 0 deletions conf/modules.config
@@ -259,6 +259,25 @@ process {
path: "${params.outdir}",
mode: params.publish_dir_mode,
pattern: "*.txt.gz",
saveAs: {
filename -> {
def output_file = new File(filename);
return "${study_reads_folder( meta )}/assembly/${meta.assembler}/${meta.assembler_version}/coverage/${meta.id}_coverage_depth_summary.tsv.gz";
}
}
]
]
}

withName: 'CALCULATE_ASSEMBLY_COVERAGE' {
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 100.MB * task.attempt, 'memory' ) }
time = { check_max( 30.m * task.attempt, 'time' ) }
publishDir = [
[
path: "${params.outdir}",
mode: params.publish_dir_mode,
pattern: "*.json",
saveAs: {
filename -> {
def output_file = new File(filename);
18 changes: 17 additions & 1 deletion conf/test.config
@@ -11,8 +11,24 @@
*/

profiles {
// Limit resources so that this can run on GitHub Actions

test {
params {
max_cpus = 4
max_memory = '8.GB'
max_time = '6.h'

samplesheet = "tests/samplesheet/test.csv"

bwamem2_reference_genomes_folder = "tests/human_phix/bwa2mem"
blast_reference_genomes_folder = "tests/human_phix/blast"
human_phix_blast_index_name = "human_phix"
human_phix_bwamem2_index_name = "human_phix"
}
}

// Limit resources so that this can run on GitHub Actions
test_ci {
params {
max_cpus = 2
max_memory = '6.GB'
42 changes: 42 additions & 0 deletions modules/local/calculate_assembly_coverage.nf
@@ -0,0 +1,42 @@
process CALCULATE_ASSEMBLY_COVERAGE {

tag "$meta.id"

label 'process_single'

container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/python:3.10' :
'quay.io/biocontainers/python:3.10' }"

input:
tuple val(meta), path(jgi_summary_tsv_gz), path(fastp_json)

output:
tuple val(meta), path("${meta.id}_coverage.json"), emit: assembly_coverage_json
val("versions.yml"), emit: versions

when:
task.ext.when == null || task.ext.when

script:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
calculate_assembly_coverage.py -j ${jgi_summary_tsv_gz} -f ${fastp_json} -o ${prefix}_coverage.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version 2>&1 | sed 's/Python //g')
END_VERSIONS
"""

stub:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
    touch ${prefix}_coverage.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version 2>&1 | sed 's/Python //g')
END_VERSIONS
"""
}
6 changes: 3 additions & 3 deletions nextflow.config
@@ -107,9 +107,6 @@ includeConfig 'conf/base.config'


profiles {
test {
includeConfig 'conf/test.config'
}
debug {
dumpHashes = true
process.beforeScript = 'echo $HOSTNAME'
@@ -203,6 +200,9 @@ profiles {
codon_slurm { includeConfig 'conf/codon_slurm.config' }
}

// Test profiles
includeConfig 'conf/test.config'

// Set default registry for Apptainer, Docker, Podman and Singularity independent of -profile
// Will not be used unless Apptainer / Docker / Podman / Singularity are enabled
// Set to your registry if you have a mirror of containers
2 changes: 1 addition & 1 deletion nf-test.config
@@ -3,5 +3,5 @@ config {
testsDir "tests"
workDir ".nf-test"
configFile "tests/nextflow.config"
profile "test,docker"
profile "test_ci,docker"
}
18 changes: 14 additions & 4 deletions subworkflows/local/assembly_coverage.nf
@@ -1,12 +1,14 @@
include { BWAMEM2_INDEX } from '../../modules/nf-core/bwamem2/index/main'
include { CALCULATE_ASSEMBLY_COVERAGE } from '../../modules/local/calculate_assembly_coverage'
include { BWAMEM2_MEM as BWAMEM2_MEM_COVERAGE } from '../../modules/ebi-metagenomics/bwamem2/mem/main'
include { BWAMEM2_INDEX } from '../../modules/nf-core/bwamem2/index/main'
include { SAMTOOLS_IDXSTATS } from '../../modules/nf-core/samtools/idxstats/main'
include { METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS } from '../../modules/nf-core/metabat2/jgisummarizebamcontigdepths/main'

workflow ASSEMBLY_COVERAGE {

take:
assembly_reads // [ val(meta), path(assembly_fasta), path(reads) ]
    fastp_json // [ val(meta), path(fastp_json) ]

main:

@@ -41,8 +43,16 @@ workflow ASSEMBLY_COVERAGE {

ch_versions = ch_versions.mix(SAMTOOLS_IDXSTATS.out.versions)

    // This process calculates a single coverage and coverage depth value for the whole assembly
    CALCULATE_ASSEMBLY_COVERAGE(
        METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth.join( fastp_json )
    )

ch_versions = ch_versions.mix(CALCULATE_ASSEMBLY_COVERAGE.out.versions)

emit:
coverage_depth = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
samtools_idxstats = SAMTOOLS_IDXSTATS.out.idxstats
versions = ch_versions
coverage_depth_summary = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
samtools_idxstats = SAMTOOLS_IDXSTATS.out.idxstats
assembly_coverage_json = CALCULATE_ASSEMBLY_COVERAGE.out.assembly_coverage_json
versions = ch_versions
}