Include the calculation of a single coverage and coverage depth value. #20

Merged · 4 commits · Sep 20, 2024
31 changes: 4 additions & 27 deletions .github/CONTRIBUTING.md
@@ -36,8 +36,8 @@ There are typically two types of tests that run:

### Lint tests

`nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to.
To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.
This pipeline follows some of the `nf-core` [guidelines](https://nf-co.re/developers/guidelines).
To enforce these, the `nf-core` team has developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command.

If any failures or warnings are encountered, please follow the listed URL for more documentation.

@@ -52,9 +52,9 @@ These tests are run both with the latest available version of `Nextflow` and also

:warning: Only in the unlikely and regretful event of a release happening with a bug.

- On your own fork, make a new branch `patch` based on `upstream/master`.
- On your own fork, make a new branch `patch` based on `upstream/main`.
- Fix the bug, and bump version (X.Y.Z+1).
- A PR should be made on `master` from patch to directly address this particular bug.
- A PR should be made on `main` from patch to directly address this particular bug.

## Pipeline contribution conventions

@@ -93,26 +93,3 @@ Please use the following naming schemes, to make it easy to understand what is g

- initial process channel: `ch_output_from_<process>`
- intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`

### Nextflow version bumping

If you are using a new feature from core Nextflow, you may bump the minimum required version of Nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

### Images and figures

For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).

## GitHub Codespaces

This repo includes a devcontainer configuration which will create a GitHub Codespace for Nextflow development! This is an online developer environment that runs in your browser, complete with VSCode and a terminal.

To get started:

- Open the repo in [Codespaces](https://github.com/ebi-metagenomics/miassembler/codespaces)
- Tools installed
- nf-core
- Nextflow

Devcontainer specs:

- [DevContainer config](.devcontainer/devcontainer.json)
3 changes: 0 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -19,7 +19,4 @@ Learn more about contributing: [CONTRIBUTING.md](https://github.com/ebi-metageno
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).
49 changes: 46 additions & 3 deletions README.md
@@ -165,6 +165,49 @@ results

The nested structure based on ENA Study and Reads accessions was created to suit the Microbiome Informatics team’s needs. The benefit of this structure is that results from different runs of the same study won’t overwrite any results.

### Coverage

The pipeline reports coverage values for the assembly using two mechanisms: per-contig depth summaries from `jgi_summarize_bam_contig_depths`, and a custom whole-assembly coverage and coverage-depth calculation.

#### jgi_summarize_bam_contig_depths

This tool summarizes the depth of coverage for each contig from BAM files containing the mapped reads. It quantifies the extent to which contigs in an assembly are covered by these reads. The output is a tabular file, with rows representing contigs and columns displaying the summarized coverage values from the BAM files. This summary is useful for binning contigs or estimating abundance in various metagenomic datasets.

This file is generated per assembly and stored in the following location (e.g., for study `SRP115494` and run `SRR6180434`): `SRP1154/SRP115494/SRR6180/SRR6180434/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage_depth_summary.tsv.gz`

##### Example output of `jgi_summarize_bam_contig_depths`

| contigName | contigLen | totalAvgDepth | SRR6180434_sorted.bam | SRR6180434_sorted.bam-var |
| -------------------------------- | --------- | ------------- | --------------------- | ------------------------- |
| NODE_1_length_539_cov_105.072314 | 539 | 273.694 | 273.694 | 74284.7 |

###### Explanation of the Columns:

1. **contigName**: The name or identifier of the contig (e.g., `NODE_1_length_539_cov_105.072314`). This is usually derived from the assembly process and may include information such as the contig length and coverage.

2. **contigLen**: The length of the contig in base pairs (e.g., `539`).

3. **totalAvgDepth**: The average depth of coverage across the entire contig, summed over all BAM files (e.g., `273.694`). This represents the total sequencing coverage averaged across the length of the contig. For assemblies built from a single sample, this value equals that sample's average depth.

4. **SRR6180434_sorted.bam**: The average depth of coverage for the specific sample represented by this BAM file (e.g., `273.694`). This shows how well the contig is covered by reads from that sample.

5. **SRR6180434_sorted.bam-var**: The variance in the depth of coverage for the same BAM file (e.g., `74284.7`). This gives a measure of how uniform or uneven the read coverage is across the contig.
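
To make the relationship between these columns concrete, here is a minimal, illustrative sketch of how a mean depth and its variance relate for a single contig. Note that `jgi_summarize_bam_contig_depths` applies its own rules (such as trimming contig edges), so this is a conceptual aid, not a reimplementation of the tool:

```python
# Conceptual sketch only: jgi_summarize_bam_contig_depths has its own logic
# (e.g. edge trimming); these per-base depths are invented for illustration.
from statistics import mean, pvariance

per_base_depths = [250, 310, 260, 275]  # hypothetical depths along a contig

avg_depth = mean(per_base_depths)       # analogous to the per-sample depth column
depth_var = pvariance(per_base_depths)  # analogous to the "-var" column

print(f"avg={avg_depth:.3f} var={depth_var:.3f}")
```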

#### Coverage JSON

The pipeline calculates two key metrics for the entire assembly: coverage and coverage depth. Coverage is the number of assembled base pairs divided by the total number of base pairs before filtering. Coverage depth is the number of assembled base pairs divided by the total length of the assembly, provided the assembly length is greater than zero. These metrics indicate how much of the raw sequencing data the assembly captures and the average depth of coverage across the assembled contigs. The script that calculates these values is [calculate_assembly_coverage.py](bin/calculate_assembly_coverage.py).
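
As a worked example, using the single-contig assembly from the table above (the fastp total below is a hypothetical value chosen so the results match the JSON shown next; a real fastp report will differ):

```python
# Worked example based on the contig in the table above.
contig_len = 539
total_avg_depth = 273.694

assembled_base_pairs = contig_len * total_avg_depth  # ~147,521 aligned bases
assembly_length = contig_len                         # single-contig assembly

# Hypothetical fastp 'total_bases' before filtering, chosen for illustration.
total_bases_before_filtering = 3_098_855

coverage = assembled_base_pairs / total_bases_before_filtering  # ~0.04760
coverage_depth = assembled_base_pairs / assembly_length         # 273.694
```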

The pipeline creates a JSON file with the following content:

```json
{
"coverage": 0.04760503915318373,
"coverage_depth": 273.694
}
```

The file is stored at (e.g., for study `SRP115494` and run `SRR6180434`): `SRP1154/SRP115494/SRR6180/SRR6180434/assembly/metaspades/3.15.5/coverage/SRR6180434_coverage.json`
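
Downstream tooling can consume this file directly; for example (assuming the file has been copied into the working directory):

```python
import json

# Load the per-assembly coverage report produced by the pipeline.
with open("SRR6180434_coverage.json") as handle:
    report = json.load(handle)

print(report["coverage"], report["coverage_depth"])
```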

### Top Level Reports

#### MultiQC
@@ -187,10 +230,10 @@ SRR6180434,filter_ratio_threshold_exceeded

##### Runs exclusion messages

| Exclusion Message                 | Description |
| --------------------------------- | ----------- |
| `filter_ratio_threshold_exceeded` | The maximum fraction of reads that are allowed to be filtered out. If exceeded, it flags excessive filtering. The default value is 0.9, meaning that if more than 90% of the reads are filtered out, the threshold is considered exceeded, and the run is not assembled. |
| `low_reads_count_threshold`       | The minimum number of reads required after filtering. If below this threshold, the run is flagged as having a low read count and is not assembled. |
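
For illustration, a sketch of how the filter-ratio check might be expressed (the names here are hypothetical, not the pipeline's actual variables):

```python
# Hypothetical sketch of the filter-ratio check; names are illustrative.
def exceeds_filter_ratio(reads_before: int, reads_after: int, max_ratio: float = 0.9) -> bool:
    """Return True if more than max_ratio of the input reads were filtered out."""
    filtered_fraction = 1 - (reads_after / reads_before)
    return filtered_fraction > max_ratio

# 95% of reads removed by QC -> run is excluded from assembly
print(exceeds_filter_ratio(1_000_000, 50_000))  # True
```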

#### Assembled Runs

91 changes: 91 additions & 0 deletions bin/calculate_assembly_coverage.py
@@ -0,0 +1,91 @@
#!/usr/bin/env python

import csv
from typing import Tuple
import json
import argparse
import gzip

def get_assembled_base_pairs_and_length(jgi_summarize_coverage_file_gz: str) -> Tuple[float, float]:
    """
    Computes the total number of assembled base pairs and the assembly length from a
    gzipped jgi_summarize_bam_contig_depths coverage file.

    :param jgi_summarize_coverage_file_gz: Path to the gzipped JGI summarize coverage file.
    :type jgi_summarize_coverage_file_gz: str

    :return: A tuple containing the total number of assembled base pairs and the total assembly length.
    :rtype: Tuple[float, float]

    :raises ValueError: If the 'contigLen' or 'totalAvgDepth' columns contain non-numeric values,
        or if a contig has a length of 0.
    """
assembled_base_pairs = 0
assembly_length = 0
with gzip.open(jgi_summarize_coverage_file_gz, "rt") as file_handle:
csv_reader = csv.DictReader(file_handle, delimiter="\t")
for row in csv_reader:
contig_length_str = row["contigLen"]
total_avg_depth_str = row["totalAvgDepth"]

if not contig_length_str.isnumeric():
raise ValueError(f"The column 'contigLen' has an invalid value: {contig_length_str}")

if int(contig_length_str) == 0:
raise ValueError(f"The column 'contigLen' cannot have a contig of len 0")

contig_length = float(contig_length_str)
            # float() raises a ValueError if total_avg_depth_str is not numeric
total_avg_depth = float(total_avg_depth_str)

assembled_base_pairs += contig_length * total_avg_depth
assembly_length += contig_length

return assembled_base_pairs, assembly_length


def get_total_bases_before_filtering(fastp_json: str) -> int:
"""
Retrieves the total number of bases before filtering from a Fastp JSON file.

:param fastp_json: Path to the Fastp JSON file.
:type fastp_json: str

:return: The total number of bases before filtering.
:rtype: int

:raises ValueError: If the 'total_bases' value in the Fastp JSON is not numeric.
"""
with open(fastp_json, "r") as json_handle:
fastp_json_data = json.load(json_handle)
# Get the total bases before filtering (the raw bp count)
total_bases_before_filtering = fastp_json_data.get("summary").get("before_filtering").get("total_bases")

if not isinstance(total_bases_before_filtering, int):
raise ValueError(
f"The value of 'total_bases_before_filtering' in the fastp json is not numeric: {total_bases_before_filtering}"
)

return total_bases_before_filtering


def main():
parser = argparse.ArgumentParser(
description="Calculate coverage and coverage depth from the jgi_summarize_bam_contig_depths and fastp data."
)
parser.add_argument("-j", "--jgi", type=str, required=True, help="Path to the JGI summarize coverage compressed (gz) file.")
parser.add_argument("-f", "--fastp", type=str, required=True, help="Path to the qc fastp json report file.")
parser.add_argument("-o", "--outfile", type=str, default="coverage_report.json", help="Name of the output file.")

args = parser.parse_args()

assembled_base_pairs, assembly_length = get_assembled_base_pairs_and_length(args.jgi)
total_bases_before_filtering = get_total_bases_before_filtering(args.fastp)

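    # coverage: assembled (aligned) bases as a fraction of the raw bases before filtering
    # coverage_depth: average per-base sequencing depth across the whole assembly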
coverage = assembled_base_pairs / total_bases_before_filtering
coverage_depth = assembled_base_pairs / assembly_length if assembly_length > 0 else 0

with open(args.outfile, "w") as outfile_file:
json.dump({"coverage": coverage, "coverage_depth": coverage_depth}, outfile_file)


if __name__ == "__main__":
main()
19 changes: 19 additions & 0 deletions conf/modules.config
@@ -259,6 +259,25 @@ process {
path: "${params.outdir}",
mode: params.publish_dir_mode,
pattern: "*.txt.gz",
saveAs: {
filename -> {
def output_file = new File(filename);
return "${study_reads_folder( meta )}/assembly/${meta.assembler}/${meta.assembler_version}/coverage/${meta.id}_coverage_depth_summary.tsv.gz";
}
}
]
]
}

withName: 'CALCULATE_ASSEMBLY_COVERAGE' {
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 100.MB * task.attempt, 'memory' ) }
time = { check_max( 30.m * task.attempt, 'time' ) }
publishDir = [
[
path: "${params.outdir}",
mode: params.publish_dir_mode,
pattern: "*.json",
saveAs: {
filename -> {
def output_file = new File(filename);
18 changes: 17 additions & 1 deletion conf/test.config
@@ -11,8 +11,24 @@
*/

profiles {
// Limit resources so that this can run on GitHub Actions

test {
params {
max_cpus = 4
max_memory = '8.GB'
max_time = '6.h'

samplesheet = "tests/samplesheet/test.csv"

bwamem2_reference_genomes_folder = "tests/human_phix/bwa2mem"
blast_reference_genomes_folder = "tests/human_phix/blast"
human_phix_blast_index_name = "human_phix"
human_phix_bwamem2_index_name = "human_phix"
}
}

// Limit resources so that this can run on GitHub Actions
test_ci {
params {
max_cpus = 2
max_memory = '6.GB'
42 changes: 42 additions & 0 deletions modules/local/calculate_assembly_coverage.nf
@@ -0,0 +1,42 @@
process CALCULATE_ASSEMBLY_COVERAGE {

tag "$meta.id"

label 'process_single'

container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/python:3.10' :
'quay.io/biocontainers/python:3.10' }"

input:
tuple val(meta), path(jgi_summary_tsv_gz), path(fastp_json)

output:
tuple val(meta), path("${meta.id}_coverage.json"), emit: assembly_coverage_json
val("versions.yml"), emit: versions

when:
task.ext.when == null || task.ext.when

script:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
calculate_assembly_coverage.py -j ${jgi_summary_tsv_gz} -f ${fastp_json} -o ${prefix}_coverage.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version 2>&1 | sed 's/Python //g')
END_VERSIONS
"""

stub:
def prefix = task.ext.prefix ?: "${meta.id}"
"""
    touch ${prefix}_coverage.json

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version 2>&1 | sed 's/Python //g')
END_VERSIONS
"""
}
6 changes: 3 additions & 3 deletions nextflow.config
@@ -107,9 +107,6 @@ includeConfig 'conf/base.config'


profiles {
test {
includeConfig 'conf/test.config'
}
debug {
dumpHashes = true
process.beforeScript = 'echo $HOSTNAME'
@@ -203,6 +200,9 @@ profiles {
codon_slurm { includeConfig 'conf/codon_slurm.config' }
}

// Test profiles
includeConfig 'conf/test.config'

// Set default registry for Apptainer, Docker, Podman and Singularity independent of -profile
// Will not be used unless Apptainer / Docker / Podman / Singularity are enabled
// Set to your registry if you have a mirror of containers
2 changes: 1 addition & 1 deletion nf-test.config
@@ -3,5 +3,5 @@ config {
testsDir "tests"
workDir ".nf-test"
configFile "tests/nextflow.config"
profile "test,docker"
profile "test_ci,docker"
}
18 changes: 14 additions & 4 deletions subworkflows/local/assembly_coverage.nf
@@ -1,12 +1,14 @@
include { BWAMEM2_INDEX } from '../../modules/nf-core/bwamem2/index/main'
include { CALCULATE_ASSEMBLY_COVERAGE } from '../../modules/local/calculate_assembly_coverage'
include { BWAMEM2_MEM as BWAMEM2_MEM_COVERAGE } from '../../modules/ebi-metagenomics/bwamem2/mem/main'
include { BWAMEM2_INDEX } from '../../modules/nf-core/bwamem2/index/main'
include { SAMTOOLS_IDXSTATS } from '../../modules/nf-core/samtools/idxstats/main'
include { METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS } from '../../modules/nf-core/metabat2/jgisummarizebamcontigdepths/main'

workflow ASSEMBLY_COVERAGE {

take:
assembly_reads // [ val(meta), path(assembly_fasta), path(reads) ]
    fastp_json // [ val(meta), path(fastp_json) ]

main:

@@ -41,8 +43,16 @@ workflow ASSEMBLY_COVERAGE {

ch_versions = ch_versions.mix(SAMTOOLS_IDXSTATS.out.versions)

    // This process calculates a single coverage and coverage depth value for the whole assembly
    CALCULATE_ASSEMBLY_COVERAGE(
        METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth.join( fastp_json )
    )

ch_versions = ch_versions.mix(CALCULATE_ASSEMBLY_COVERAGE.out.versions)

emit:
coverage_depth = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
samtools_idxstats = SAMTOOLS_IDXSTATS.out.idxstats
versions = ch_versions
coverage_depth_summary = METABAT2_JGISUMMARIZEBAMCONTIGDEPTHS.out.depth
samtools_idxstats = SAMTOOLS_IDXSTATS.out.idxstats
assembly_coverage_json = CALCULATE_ASSEMBLY_COVERAGE.out.assembly_coverage_json
versions = ch_versions
}