feat: Add resource profiles (#151)
* refactor: Change some module resource labels

Based on previous testing

* feat: Add resource profiles and change base reqs

- Also makes recombination run by default

* refactor: Change panaroo run to high

* refactor: Increase time for high

* docs: Add basic resource profiles documentation

* refactor: Increase reqs for some modules

- Decided after Enterococcus testing

* docs: Update resource requirement documentation

* refactor: Run all ann tools in large config

* docs: Bolden sentence

* fix: Change descriptions for profiles
jvfe authored Aug 22, 2023
1 parent 81a7a9b commit 349b428
Showing 20 changed files with 113 additions and 46 deletions.
12 changes: 5 additions & 7 deletions README.md
@@ -104,6 +104,7 @@ See our [roadmap](ROADMAP.md) for a full list of future development targets.
3.1. Please check [nf-core/configs](https://github.com/nf-core/configs#documentation) to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use `-profile <institute>` in your command. This will enable either `docker` or `singularity` and set the appropriate execution settings for your local compute environment.
3.2. If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead.
For input datasets larger than 100 samples, [check our resource profiles documentation](https://beiko-lab.github.io/arete/resource-profiles/) for optimal usage.
4. Start running your own analysis (ideally using `-profile docker` or `-profile singularity` for stability)!
@@ -174,9 +175,8 @@ nextflow run beiko-lab/ARETE \
--poppunk_model dbscan \
--run_recombination \
--run_gubbins \
--use_ppanggolin \
-entry annotation \
-profile docker
-profile medium,docker
```

Parameters used:
@@ -187,19 +187,17 @@ Parameters used:
- `--run_gubbins` - Run [Gubbins](https://github.com/nickjcroucher/gubbins) as part of the recombination subworkflow.
- `--use_ppanggolin` - Use [PPanGGOLiN](https://github.com/labgem/PPanGGOLiN) for calculating the pangenome. Tends to perform better on larger input sets.
- `-entry annotation` - Run annotation subworkflow and further steps (See [usage](https://beiko-lab.github.io/arete/usage/)).
- `-profile docker` - Run tools in docker containers.
- `-profile medium,docker` - Run tools in docker containers. For `-profile medium`, check our [resource requirements documentation](https://beiko-lab.github.io/arete/resource_profiles/).

### Annotation to evolutionary dynamics on 10,000 genomes

```bash
nextflow run beiko-lab/ARETE \
--input_sample_table samplesheet.csv \
--poppunk_model dbscan \
--use_ppanggolin \
--run_recombination \
--enable_subsetting \
-entry annotation \
-profile docker
-profile large,docker
```

Parameters used:
@@ -210,7 +208,7 @@ Parameters used:
- `--use_ppanggolin` - Use [PPanGGOLiN](https://github.com/labgem/PPanGGOLiN) for calculating the pangenome. Tends to perform better on larger input sets.
- `--enable_subsetting` - Enable subsetting workflow based on genome similarity (See [subsetting documentation](https://beiko-lab.github.io/arete/subsampling/))
- `-entry annotation` - Run annotation subworkflow and further steps (See [usage](https://beiko-lab.github.io/arete/usage/)).
- `-profile docker` - Run tools in docker containers.
- `-profile large,docker` - Run tools in docker containers. For `-profile large`, check our [resource requirements documentation](https://beiko-lab.github.io/arete/resource_profiles/).

## Credits <a name="credits"></a>

34 changes: 13 additions & 21 deletions conf/base.config
@@ -1,6 +1,6 @@
/*
========================================================================================
nf-core/arete Nextflow base config file
beiko-lab/ARETE Nextflow base config file
========================================================================================
A 'blank slate' config file, appropriate for general use on most high performance
compute environments. Assumes that all software is installed and available on
@@ -10,7 +10,6 @@

process {

// TODO nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
@@ -19,38 +18,31 @@ process {
maxRetries = 3
maxErrors = '-1'

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
cpus = { check_max( 1, 'cpus' ) }
memory = { check_max( 6.GB, 'memory' ) }
time = { check_max( 4.h, 'time' ) }
}
withLabel:process_low {
cpus = { check_max( 6 * task.attempt, 'cpus' ) }
memory = { check_max( 20.GB * task.attempt, 'memory' ) }
memory = { check_max( 12.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
}
withLabel:process_medium {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 48.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
cpus = { check_max( 10, 'cpus' ) }
memory = { check_max( 36.GB, 'memory' ) }
time = { check_max( 8.h, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 24 * task.attempt, 'cpus' ) }
memory = { check_max( 72.GB * task.attempt, 'memory' ) }
time = { check_max( 168.h, 'time' ) }
cpus = { check_max( 14, 'cpus' ) }
memory = { check_max( 72.GB, 'memory' ) }
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel:process_long {
time = { check_max( 20.h * task.attempt, 'time' ) }
time = { check_max( 72.h, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 240.GB, 'memory' ) }
memory = { check_max( 100.GB, 'memory' ) }
}
withLabel:error_ignore {
errorStrategy = 'ignore'
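
Every label above routes its request through `check_max()`, which clamps the value against the pipeline-wide `max_cpus`, `max_memory` and `max_time` parameters (their defaults appear further down in `nextflow.config`). As a rough sketch, with purely illustrative cap values, a run on a smaller machine could lower those ceilings from the command line:

```bash
# Illustrative caps only; check_max() clamps every process request to these limits
nextflow run beiko-lab/ARETE \
  --input_sample_table samplesheet.csv \
  --poppunk_model dbscan \
  --max_cpus 16 \
  --max_memory '64.GB' \
  --max_time '48.h' \
  -profile docker
```
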
28 changes: 28 additions & 0 deletions conf/large.config
@@ -0,0 +1,28 @@
params {
config_profile_name = 'Large profile'
config_profile_description = 'Profile for >1000 sample datasets with subsetting enabled.'

use_ppanggolin = true
use_fasttree = true
enable_subsetting = true
skip_profile_creation = true
}

process {
withLabel:process_medium {
cpus = { check_max( 18 * task.attempt, 'cpus' ) }
memory = { check_max( 36.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 24 * task.attempt, 'cpus' ) }
memory = { check_max( 72.GB * task.attempt, 'memory' ) }
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel:process_long {
time = { check_max( 72.h, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 249.GB, 'memory' ) }
}
}
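
To see exactly what the `large` profile changes once it is layered on top of `conf/base.config`, Nextflow's `config` command can print the merged settings. A minimal sketch, assuming the pipeline has already been pulled (for example with `nextflow pull beiko-lab/ARETE`):

```bash
# Print the fully resolved configuration for the large profile plus Docker
nextflow config beiko-lab/ARETE -profile large,docker
```
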
23 changes: 23 additions & 0 deletions conf/medium.config
@@ -0,0 +1,23 @@
params {
config_profile_name = 'Medium profile'
config_profile_description = 'Profile for 100-1000 sample datasets'

use_ppanggolin = true
use_fasttree = true
}

process {
withLabel:process_medium {
cpus = { check_max( 12 * task.attempt, 'cpus' ) }
memory = { check_max( 36.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 16 * task.attempt, 'cpus' ) }
memory = { check_max( 72.GB * task.attempt, 'memory' ) }
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 200.GB, 'memory' ) }
}
}
19 changes: 19 additions & 0 deletions docs/resource_profiles.md
@@ -0,0 +1,19 @@
# ARETE and dataset size

ARETE currently has three distinct profiles that adjust how the pipeline executes: the default profile (which we can call `small`), the `medium` profile and the `large` profile.

These profiles are tailored to the size and diversity of the input dataset, and they change some parameter defaults based on tests we have performed on similar-sized datasets.

If you first want to gauge the potential diversity of your dataset and already have some input assemblies, you can try the [PopPUNK entry](https://beiko-lab.github.io/arete/usage/#poppunk-entry). One of its outputs shows how many clusters, or lineages, your dataset divides into.

The sizes are:

- For the default or `small` profile, we expect datasets with 100 samples/assemblies or fewer.
It runs on the default pipeline parameters, with no changes.

- For the `medium` profile, we expect datasets with >100 and <1000 samples.
It increases the default resource requirements for most processes and also uses [PPanGGOLiN](https://github.com/labgem/PPanGGOLiN) for pangenome construction instead of [Panaroo](https://github.com/gtonkinhill/panaroo/).

- For the `large` profile, we expect datasets with >1000 samples.
It also increases default resource requirements for some processes and uses PPanGGOLiN.
Additionally, **it enables [PopPUNK subsampling](subsampling.md), with default parameters**.
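
As a minimal usage sketch, mirroring the README examples elsewhere in this commit, a resource profile is selected together with the container profile on the command line:

```bash
nextflow run beiko-lab/ARETE \
  --input_sample_table samplesheet.csv \
  --poppunk_model dbscan \
  -entry annotation \
  -profile medium,docker   # or large,docker for >1000 samples
```
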
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- Citations: CITATIONS.md
- Roadmap: ROADMAP.md
- Reference:
- Dataset Size: resource_profiles.md
- Parameters: params.md
- Subsampling: subsampling.md
repo_url: https://github.com/beiko-lab/arete
8 changes: 4 additions & 4 deletions modules/local/blast_databases.nf
@@ -1,5 +1,5 @@
process GET_CAZYDB {
label 'process_low'
label 'process_single'
label 'error_retry_delay'

output:
@@ -12,7 +12,7 @@ process GET_CAZYDB {
}

process GET_VFDB{
label 'process_low'
label 'process_single'
label 'error_retry_delay'

output:
@@ -29,7 +29,7 @@ process GET_VFDB{
}

process GET_BACMET{
label 'process_low'
label 'process_single'
label 'error_retry_delay'

output:
@@ -46,7 +46,7 @@ process GET_BACMET{
}

process GET_ICEBERG {
label 'process_low'
label 'process_single'
label 'error_retry_delay'

output:
1 change: 1 addition & 0 deletions modules/local/chunked_fasttree.nf
@@ -1,5 +1,6 @@
process CHUNKED_FASTTREE {
label 'process_high'
label 'process_long'

conda (params.enable_conda ? "bioconda::fasttree=2.1.10" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
1 change: 0 additions & 1 deletion modules/local/create_report.nf
@@ -1,6 +1,5 @@
process CREATE_REPORT {
label 'process_high'
label 'process_high_memory'

conda (params.enable_conda ? "conda-forge::pandas=1.4.3" : null)
if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
2 changes: 1 addition & 1 deletion modules/local/get_db_cache.nf
@@ -1,6 +1,6 @@
//process for acquiring cached databases
process GET_DB_CACHE {
label 'process_medium'
label 'process_single'

input:
path(dbcache)
2 changes: 1 addition & 1 deletion modules/local/get_minikraken.nf
@@ -7,7 +7,7 @@ options = initOptions(params.options)
process KRAKEN2_DB {
//publishDir 'dbcache/', mode:'copy'
tag "minikraken"
label 'process_high'
label 'process_medium'
label 'error_retry_delay'

output:
2 changes: 1 addition & 1 deletion modules/local/graphviz/gml2gv/main.nf
@@ -1,5 +1,5 @@
process GML2GV {
label 'process_low'
label 'process_single'

conda "bioconda::perl-graphviz=2.24"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
2 changes: 1 addition & 1 deletion modules/local/islandpath/main.nf
@@ -1,6 +1,6 @@
process ISLANDPATH {
tag "$meta.id"
label 'process_medium'
label 'process_low'

conda "bioconda::islandpath=1.0.6"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
2 changes: 1 addition & 1 deletion modules/local/poppunk_samplesheet.nf
@@ -1,5 +1,5 @@
process POPPUNK_MAKE_SAMPLESHEET {
label "process_low"
label "process_single"

input:
path(samplesheets)
1 change: 1 addition & 0 deletions modules/local/ppanggolin/workflow/main.nf
@@ -2,6 +2,7 @@ process PPANGGOLIN_WORKFLOW {
tag "$meta.id"
label 'process_high'
label 'process_high_memory'
label 'process_long'

conda "bioconda::ppanggolin=1.2.105"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
2 changes: 1 addition & 1 deletion modules/local/ska.nf
@@ -1,6 +1,6 @@
process SKA2 {
tag "$cluster"
label 'process_medium'
label 'process_high'

conda "bioconda::gubbins=3.3.0"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
2 changes: 1 addition & 1 deletion modules/nf-core/panaroo/run/main.nf
@@ -1,6 +1,6 @@
process PANAROO_RUN {
tag "$meta.id"
label 'process_medium'
label 'process_high'

conda (params.enable_conda ? "bioconda::panaroo=1.3.2" : null)
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
8 changes: 5 additions & 3 deletions modules/nf-core/panaroo/run/panaroo-run.diff
@@ -1,9 +1,11 @@
Changes in module 'nf-core/panaroo/run'
--- modules/nf-core/panaroo/run/main.nf
+++ modules/nf-core/panaroo/run/main.nf
@@ -2,17 +2,19 @@
@@ -1,18 +1,20 @@
process PANAROO_RUN {
tag "$meta.id"
label 'process_medium'
- label 'process_medium'
+ label 'process_high'

- conda "bioconda::panaroo=1.2.9"
+ conda (params.enable_conda ? "bioconda::panaroo=1.3.2" : null)
@@ -20,7 +22,7 @@ Changes in module 'nf-core/panaroo/run'
tuple val(meta), path("results/*") , emit: results
+ tuple val(meta), path("results/final_graph.gml") , optional: true, emit: graph_gml
tuple val(meta), path("results/core_gene_alignment.aln"), optional: true, emit: aln
+ path("results/aligned_gene_sequences/*aln.fas") , optional: true, emit: accessory_aln
+ path "results/aligned_gene_sequences/*aln.fas" , optional: true, emit: accessory_aln
path "versions.yml" , emit: versions

when:
6 changes: 4 additions & 2 deletions nextflow.config
@@ -47,7 +47,7 @@ params {
skip_kraken = false

// Recombination
run_recombination = false
run_recombination = true
run_verticall = true
run_gubbins = false

@@ -88,7 +88,7 @@ params {
// Defaults only, expecting to be overwritten
max_memory = '125.GB'
max_cpus = 72
max_time = '440.h'
max_time = '168.h'

}

@@ -154,6 +154,8 @@ profiles {
}
test { includeConfig 'conf/test.config' }
test_full { includeConfig 'conf/test_full.config' }
medium { includeConfig 'conf/medium.config' }
large { includeConfig 'conf/large.config' }
}

// Export these variables to prevent local Python/R libraries from conflicting with those in the container
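
For one-off tweaks beyond what `medium` or `large` provide, an extra config file passed with `-c` is applied on top of the selected profiles. A hedged sketch with a hypothetical override file (`my_overrides.config` and the 96 GB value are illustrative, not recommendations):

```bash
# Hypothetical override: raise memory for process_high-labelled tasks only
cat > my_overrides.config <<'EOF'
process {
    withLabel:process_high {
        memory = '96.GB'
    }
}
EOF

nextflow run beiko-lab/ARETE \
  --input_sample_table samplesheet.csv \
  --poppunk_model dbscan \
  -profile large,docker \
  -c my_overrides.config
```
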
3 changes: 2 additions & 1 deletion nextflow_schema.json
@@ -208,7 +208,8 @@
"run_recombination": {
"type": "boolean",
"description": "Run Recombination",
"fa_icon": "fas fa-tree"
"fa_icon": "fas fa-tree",
"default": true
},
"run_verticall": {
"type": "boolean",
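
Since `run_recombination` now defaults to `true`, runs that do not need the recombination subworkflow have to opt out explicitly. A minimal sketch; depending on your setup the same flag can also be set to `false` through a `-params-file`:

```bash
nextflow run beiko-lab/ARETE \
  --input_sample_table samplesheet.csv \
  --poppunk_model dbscan \
  --run_recombination false \
  -profile docker
```
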
