Merge pull request #6 from EBI-Metagenomics/feature/restructure_outputs

Feature/restructure outputs
EBI-Metagenomics · May 30, 2024 · 36bc2f0
2 parents ce57e24 + 1963f75
commit 36bc2f0
Showing 48 changed files with 1,207 additions and 476 deletions.
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -23,8 +23,11 @@ If you're not used to this workflow with git, you can start with some [docs from
 
 ## Tests
 
-You can optionally test your changes by running the pipeline locally. Then it is recommended to use the `debug` profile to
-receive warnings about process selectors and other debug info. Example: `nextflow run . -profile debug,test,docker --outdir <OUTDIR>`.
+You have the option to test your changes locally by running the pipeline. For receiving warnings about process selectors and other `debug` information, it is recommended to use the debug profile. Execute all the tests with the following command:
+
+```bash
+nf-test test --profile debug,test,docker --verbose
+```
 
 When you create a pull request with changes, [GitHub Actions](https://github.com/features/actions) will run automatic tests.
 Typically, pull-requests are only fully reviewed when these tests are passing, though of course we can help out before then.
@@ -40,7 +43,7 @@ If any failures or warnings are encountered, please follow the listed URL for mo
 
 ### Pipeline tests
 
-Each `nf-core` pipeline should be set up with a minimal set of test-data.
+Each of the Microbiome Informatics pipelines should be set up with a minimal set of test-data.
 `GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully.
 If there are any failures then the automated tests fail.
 These tests are run both with the latest available version of `Nextflow` and also the minimum required version that is stated in the pipeline code.
@@ -82,7 +85,7 @@ Once there, use `nf-core schema build` to add to `nextflow_schema.json`.
 
 Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generic with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. A nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/conf/base.config), which has the default process as a single core-process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.
 
-The process resources can be passed on to the tool dynamically within the process with the `${task.cpu}` and `${task.memory}` variables in the `script:` block.
+The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block.
 
 ### Naming schemes
 

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,36 @@
+name: nf-test CI
+on:
+  push:
+    branches:
+      - dev
+  pull_request:
+  release:
+    types: [published]
+
+env:
+  NXF_ANSI_LOG: false
+  NFTEST_VER: "0.8.4"
+
+jobs:
+  test:
+    name: Run pipeline with test data
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Check out pipeline code
+        uses: actions/checkout@v4
+
+      - uses: actions/setup-java@99b8673ff64fbf99d8d325f52d9a5bdedb8483e9 # v4
+        with:
+          distribution: "temurin"
+          java-version: "17"
+
+      - name: Setup Nextflow
+        uses: nf-core/setup-nextflow@v2
+
+      - name: Install nf-test
+        uses: nf-core/setup-nf-test@v1
+
+      - name: Run pipeline with test data
+        run: |
+          nf-test test
diff --git a/.gitignore b/.gitignore
@@ -9,5 +9,8 @@ testing*
 results/
 
 *.pyc
+.pytest_cache/
 
-assets/fetch_tool_credentials.json
+assets/fetch_tool_credentials.json
+.nf-test.log
+.nf-test/
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -1,32 +1,48 @@
+repository_type: pipeline
+template:
+  prefix: ebi-metagenomics
+  skip:
+    - ci
+    - github_badges
 lint:
   files_exist:
     - CODE_OF_CONDUCT.md
     - assets/nf-core-miassembler_logo_light.png
     - docs/images/nf-core-miassembler_logo_light.png
     - docs/images/nf-core-miassembler_logo_dark.png
+    - docs/output.md
+    - docs/usage.md
     - .github/ISSUE_TEMPLATE/config.yml
     - .github/workflows/awstest.yml
     - .github/workflows/awsfulltest.yml
     - .github/workflows/branch.yml
     - .github/workflows/ci.yml
     - .github/workflows/linting_comment.yml
     - .github/workflows/linting.yml
+    - conf/test_full.config
+    - lib/Utils.groovy
+    - lib/WorkflowMain.groovy
+    - lib/NfcoreTemplate.groovy
+    - lib/WorkflowMiassembler.groovy
+    - lib/nfcore_external_java_deps.jar
   files_unchanged:
     - CODE_OF_CONDUCT.md
     - assets/nf-core-miassembler_logo_light.png
     - docs/images/nf-core-miassembler_logo_light.png
     - docs/images/nf-core-miassembler_logo_dark.png
     - .github/ISSUE_TEMPLATE/bug_report.yml
+    - .github/CONTRIBUTING.md
+    - LICENSE
+    - docs/README.md
+    - .gitignore
   multiqc_config:
     - report_comment
-  nextflow_config:
+  nextflow_config: False
+    - params.input
+    - params.validationSchemaIgnoreParams
+    - params.custom_config_version
+    - params.custom_config_base
     - manifest.name
     - manifest.homePage
   readme:
     - nextflow_badge
-repository_type: pipeline
-template:
-  prefix: ebi-metagenomics
-  skip:
-    - ci
-    - github_badges
diff --git a/README.md b/README.md
@@ -10,6 +10,9 @@
 
 This pipeline is still in early development. It's mostly a direct port of the mi-automation assembly generation pipeline. Some of the bespoke scripts used to remove contaminated contigs or to calculate the coverage of the assembly were replaced with tools provided by the community ([SeqKit](https://doi.org/10.1371/journal.pone.0163962) and [quast](https://doi.org/10.1093/bioinformatics/btu153) respectively).
 
+> [!NOTE]
+> This pipeline uses the nf-core template with some tweaks, but it's not part of nf-core.
+
 ## Usage
 
 > [!WARNING]
@@ -23,12 +26,21 @@ nextflow run ebi-metagenomics/miassembler --help
 Input/output options
   --study_accession                  [string]  The ENA Study secondary accession
   --reads_accession                  [string]  The ENA Run primary accession
-  --assembler                        [string]  The short reads assembler (accepted: spades, metaspades, megahit) [default: metaspades for PE, megahit for SE]
+  --private_study                    [boolean] To use if the ENA study is private [default: false]
+  --assembler                        [string]  The short reads assembler (accepted: spades, metaspades, megahit) [default: metaspades]
   --reference_genome                 [string]  The genome to be used to clean the assembly, the genome will be taken from the Microbiome Informatics internal
                                                directory (accepted: chicken.fna, salmon.fna, cod.fna, pig.fna, cow.fna, mouse.fna, honeybee.fna,
-                                               rainbow_trout.fna, ...) [default: human+phiX]
-  --reference_genomes_folder         [string]  The folder with the reference genome blast indexes, defaults to the Microbiome Informatics internal directory
-                                               [default: /nfs/production/rdf/metagenomics/pipelines/prod/assembly-pipeline/blast_dbs/]
+                                               rainbow_trout.fna, rat.fna, ...)
+  --blast_reference_genomes_folder   [string]  The folder with the reference genome blast indexes, defaults to the Microbiome Informatics internal
+                                               directory.
+  --bwamem2_reference_genomes_folder [string]  The folder with the reference genome bwa-mem2 indexes, defaults to the Microbiome Informatics internal
+                                               directory.
+  --remove_human_phix                [boolean] Remove human and phiX reads pre assembly, and contigs matching those genomes. [default: true]
+  --human_phix_blast_index_name      [string]  Combined Human and phiX BLAST db. [default: human_phix]
+  --human_phix_bwamem2_index_name    [string]  Combined Human and phiX bwa-mem2 index. [default: human_phix]
+  --min_contig_length                [integer] Minimum contig length filter. [default: 500]
+  --assembly_memory                  [integer] Default memory allocated for the assembly process. [default: 100]
+  --spades_only_assembler            [boolean] Run SPAdes/metaSPAdes without the error correction step. [default: true]
   --outdir                           [string]  The output directory where the results will be saved. You have to use absolute paths to storage on Cloud
                                                infrastructure.
   --email                            [string]  Email address for completion summary.
@@ -50,7 +62,43 @@ nextflow run ebi-metagenomics/miassembler \
   --reads_accession SRR1631361
 ```
 
+## Outputs
+
+The outputs of the pipeline are organized as follows:
+
+```
+results/SRP1154
+└── SRP115494
+    └── SRR6180
+        └── SRR6180434
+            ├── assembly
+            │   └── metaspades
+            │       └── 3.15.5
+            │           ├── coverage
+            │           ├── decontamination
+            │           └── qc
+            │               ├── multiqc
+            │               └── quast
+            └── qc
+                ├── fastp
+                └── fastqc
+
+```
+
+The nested structure based on ENA Study and Reads accessions was created to suit the Microbiome Informatics team’s needs. The benefit of this structure is that results from different runs of the same study won’t overwrite any results.
+
+## Tests
+
+There is a very small test data set ready to use:
+
+```bash
+nextflow run main.nf -resume -profile test,docker
+```
+
+### End to end tests
+
 Two end-to-end tests can be launched (with megahit and metaspades) with the following command:
+
 ```bash
 pytest tests/workflows/ --verbose
 ```
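
The nested output layout described in the README's new Outputs section can be sketched in a few lines of Python. This is an illustration only, not code from the pipeline: the helper names `results_dir` and `assembly_dir` are hypothetical, and the 7-character prefix length is an assumption inferred from the single example in the tree (`SRP115494` under `SRP1154`, `SRR6180434` under `SRR6180`). The pipeline's own rule for shortening accessions may differ.

```python
from pathlib import PurePosixPath

# Assumption: the parent folder is the first 7 characters of the accession,
# inferred from the README example (SRP115494 -> SRP1154, SRR6180434 -> SRR6180).
PREFIX_LEN = 7


def results_dir(outdir: str, study_accession: str, reads_accession: str) -> PurePosixPath:
    """Build the per-run directory: <outdir>/<study prefix>/<study>/<reads prefix>/<reads>."""
    return (
        PurePosixPath(outdir)
        / study_accession[:PREFIX_LEN]
        / study_accession
        / reads_accession[:PREFIX_LEN]
        / reads_accession
    )


def assembly_dir(
    outdir: str,
    study_accession: str,
    reads_accession: str,
    assembler: str,
    assembler_version: str,
) -> PurePosixPath:
    """Build the assembler-specific directory that holds coverage, decontamination and qc."""
    run_dir = results_dir(outdir, study_accession, reads_accession)
    return run_dir / "assembly" / assembler / assembler_version


if __name__ == "__main__":
    print(results_dir("results", "SRP115494", "SRR6180434"))
    print(assembly_dir("results", "SRP115494", "SRR6180434", "metaspades", "3.15.5"))
```

Because the assembler name and version are part of the path, re-running the same reads with a different assembler or version would land in a separate folder under this scheme, which is consistent with the README's point about results not being overwritten.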
diff --git a/assets/email_template.html b/assets/email_template.html
@@ -12,7 +12,7 @@
 
 <img src="cid:nfcorepipelinelogo">
 
-<h1>ebi-metagenomics/miassembler v${version}</h1>
+<h1>ebi-metagenomics/miassembler ${version}</h1>
 <h2>Run Name: $runName</h2>
 
 <% if (!success){

diff --git a/assets/methods_description_template.yml b/assets/methods_description_template.yml
@@ -3,27 +3,21 @@ description: "Suggested text and references to use when describing pipeline usag
 section_name: "ebi-metagenomics/miassembler Methods Description"
 section_href: "https://github.com/ebi-metagenomics/miassembler"
 plot_type: "html"
-## TODO nf-core: Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
 ## You inject any metadata in the Nextflow '${workflow}' object
 data: |
   <h4>Methods</h4>
-  <p>Data was processed using ebi-metagenomics/miassembler v${workflow.manifest.version} ${doi_text} of the nf-core collection of workflows (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>), utilising reproducible software environments from the Bioconda (<a href="https://doi.org/10.1038/s41592-018-0046-7">Grüning <em>et al.</em>, 2018</a>) and Biocontainers (<a href="https://doi.org/10.1093/bioinformatics/btx192">da Veiga Leprevost <em>et al.</em>, 2017</a>) projects.</p>
+  <p>Data is processed using MGnify ebi-metagenomics/miassembler v${workflow.manifest.version} ${doi_text}. Supported assemblers are MEGAHIT, SPAdes and metaSPAdes (default). Single-end reads are assembled only using MEGAHIT and metatranscriptomic data only with SPAdes. Pipeline uses a set of custom functions and modules from nf-core collection (<a href="https://doi.org/10.1038/s41587-020-0439-x">Ewels <em>et al.</em>, 2020</a>), utilising reproducible software environments from the Bioconda (<a href="https://doi.org/10.1038/s41592-018-0046-7">Grüning <em>et al.</em>, 2018</a>) and Biocontainers (<a href="https://doi.org/10.1093/bioinformatics/btx192">da Veiga Leprevost <em>et al.</em>, 2017</a>) projects.</p>
   <p>The pipeline was executed with Nextflow v${workflow.nextflow.version} (<a href="https://doi.org/10.1038/nbt.3820">Di Tommaso <em>et al.</em>, 2017</a>) with the following command:</p>
   <pre><code>${workflow.commandLine}</code></pre>
   <p>${tool_citations}</p>
   <h4>References</h4>
   <ul>
+    <li>Richardson LJ, Allen B, Baldi G, Beracochea M, Bileschi M, Burdett T, Burgin J, Caballero-Pérez J, Cochrane G, Colwell L, Curtis T, Escobar-Zepeda A, Gurbich T, Kale V, Korobeynikov A, Raj S, Rogers AB, Sakharova E, Sanchez S, Wilkinson D and Finn RD. (2023) MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Research. doi: <a href="https://academic.oup.com/nar/article/51/D1/D753/6880769">10.1093/nar/gkac1080</a></li>
+    <li>Li, D., Liu, C-M., Luo, R., Sadakane, K., and Lam, T-W. (2015). MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. doi: <a href="https://doi.org/10.1093/bioinformatics/btv033">10.1093/bioinformatics/btv033</a></li>
+    <li>Prjibelski A., Antipov D., Meleshko D., Lapidus A., Korobeynikov A. (2020). Using SPAdes De Novo Assembler. Current Protocols. doi: <a href="https://doi.org/10.1002/cpbi.102">10.1002/cpbi.102</a></li>
     <li>Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316-319. doi: <a href="https://doi.org/10.1038/nbt.3820">10.1038/nbt.3820</a></li>
     <li>Ewels, P. A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garcia, M. U., Di Tommaso, P., & Nahnsen, S. (2020). The nf-core framework for community-curated bioinformatics pipelines. Nature Biotechnology, 38(3), 276-278. doi: <a href="https://doi.org/10.1038/s41587-020-0439-x">10.1038/s41587-020-0439-x</a></li>
     <li>Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Köster, J., & Bioconda Team. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476. doi: <a href="https://doi.org/10.1038/s41592-018-0046-7">10.1038/s41592-018-0046-7</a></li>
     <li>da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580–2582. doi: <a href="https://doi.org/10.1093/bioinformatics/btx192">10.1093/bioinformatics/btx192</a></li>
     ${tool_bibliography}
   </ul>
-  <div class="alert alert-info">
-    <h5>Notes:</h5>
-    <ul>
-      ${nodoi_text}
-      <li>The command above does not include parameters contained in any configs or profiles that may have been used. Ensure the config file is also uploaded with your publication!</li>
-      <li>You should also cite all software used within this run. Check the "Software Versions" of this report to get version information.</li>
-    </ul>
-  </div>
diff --git a/assets/mgnify_logo.png b/assets/mgnify_logo.png
diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,16 +1,17 @@
 report_comment: >
-  This report has been generated by the <a href="https://github.com/ebi-metagenomics/miassembler/tree/dev" target="_blank">ebi-metagenomics/miassembler</a>
+  This report has been generated by the <a href="https://github.com/ebi-metagenomics/miassembler/" target="_blank">ebi-metagenomics/miassembler</a>
   analysis pipeline.
+
 report_section_order:
   "ebi-metagenomics-miassembler-methods-description":
     order: -1000
-  software_versions:
-    order: -1001
   "ebi-metagenomics-miassembler-summary":
     order: -1002
 
 export_plots: true
 
+skip_versions_section: true
+
 top_modules:
   - fastqc
   - quast

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv