Commit

Merge pull request #45 from Ferlab-Ste-Justine/feat/CLIN-3616-allow-to-pass-dbsnp-option-in-genotypegvcf

feat: CLIN-3616 allow to pass dbsnp option in genotypegvcf
LysianeBouchard authored Dec 5, 2024
2 parents 82e6660 + f8615d7 commit 33db042
Showing 6 changed files with 42 additions and 4 deletions.
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -8,6 +8,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### `Added`
- [#44](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/44) Make interval file optional in GenotypeGVCFs process
- [#44](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/44) Decouple the interval file parameter from the broad
- [#45](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/45) Allow adding dbSNP IDs to the output VCF files

### `Known issues`
- The nf-core modules that we are using have a potential performance flaw. Typically, the pattern used to describe the output files also matches the input files (e.g., "*.vcf"), which can cause unnecessary file transfers. This has already proven to cause issues on fusion. One fix could be to copy the affected modules locally and apply the small change needed to fix this.
- The VEP cache version used in the CQDG environment (112) does not match the default configured VEP version (111). This issue can be avoided by overriding the Docker container of the ensemblevep process. If no project is using VEP version 111, it should not be used as the default value.


## v2.2.0-dev

@@ -21,10 +27,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- The nf-core modules that we are using have a potential performance flaw. Typically, the pattern used to describe the output files also matches the input files (e.g., "*.vcf"), which can cause unnecessary file transfers. This has already proven to cause issues on fusion. One fix could be to copy the affected modules locally and apply the small change needed to fix this.
- The VEP cache version used in the CQDG environment (112) does not match the default configured VEP version (111). This issue can be avoided by overriding the Docker container of the ensemblevep process. If no project is using VEP version 111, it should not be used as the default value.


### `Fixed`
- [#41](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/41) Fix vep url pointing to the wrong vep version in the reference data documentation.


## v2.1.0dev

### `Added`
8 changes: 8 additions & 0 deletions docs/reference_data.md
@@ -12,6 +12,12 @@ This directory should contain the following files:
- The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location will be automatically derived by appending `.fai` to the `referenceGenomeFasta` parameter.
- The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location will be automatically derived by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`.

## DBSNP reference data
The `dbsnpFile` and `dbsnpFileIndex` parameters specify the paths to a dbSNP file and its index, respectively.
If specified, dbSNP IDs will be added to the ID column of the VCF files produced by the GenotypeGVCFs step.

Both parameters are null by default. If you specify `dbsnpFile`, you must also specify `dbsnpFileIndex`.
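
As an illustration, both parameters could be set together in a Nextflow configuration file. This is only a sketch: the paths are hypothetical placeholders, and the `.tbi` extension assumes a bgzip-compressed, tabix-indexed VCF.

```groovy
// Hypothetical paths: replace them with the location of your own dbSNP files.
params {
    dbsnpFile      = "/path/to/dbsnp.vcf.gz"
    dbsnpFileIndex = "/path/to/dbsnp.vcf.gz.tbi"  // must accompany dbsnpFile
}
```

Leaving both parameters unset keeps the default behaviour, in which no dbSNP file is passed to GenotypeGVCFs.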

## Broad reference data (VQSR)
The `broad` parameter specifies the directory containing the reference data files for VQSR.
Note that the VQSR step applies only to whole genome data, so you need to specify the broad parameter only if you have whole genome data.
@@ -105,6 +111,8 @@ analysis file should contain only the `analysis` section.
| --- | --- | --- |
| `referenceGenome` | _Required_ | Path to the directory containing the reference genome data |
| `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
| `dbsnpFile` | _Optional_ | Path to dbsnp file. If specified, will be used to add ids in the ID column of output vcf files. |
| `dbsnpFileIndex` | _Optional_ | Path to dbsnp file index. Must be specified if the dbsnpFile parameter is specified. |
| `broad` | _Optional_ | Path to the directory containing Broad reference data (for VQSR) |
| `intervalsFile` | _Optional_ | Path to the file containing the genome intervals list on which to operate |
| `vepCache` | _Optional_ | Path to the vep cache data directory |
2 changes: 2 additions & 0 deletions docs/usage.md
@@ -164,6 +164,8 @@ Parameters summary
| `outdir` | _Required_ | Path to the output directory |
| `referenceGenome` | _Required_ | Path to the directory containing the reference genome data |
| `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
| `dbsnpFile` | _Optional_ | Path to dbsnp file. If specified, will be used to add ids in the ID column of output vcf files. |
| `dbsnpFileIndex` | _Optional_ | Path to dbsnp file index. Must be specified if the dbsnpFile parameter is specified. |
| `broad` | _Optional_ | Path to the directory containing Broad reference data (for VQSR) |
| `intervalsFile` | _Optional_ | Path to the file containing the genome intervals list on which to operate |
| `tools` | _Optional_ | Additional tools to run separated by commas. Supported tools are `vep` and `exomiser` |
2 changes: 2 additions & 0 deletions nextflow.config
@@ -20,6 +20,8 @@ params {
referenceGenomeFasta = null
broad = null
intervalsFile = null
dbsnpFile = null
dbsnpFileIndex = null
tools = ""
vep_cache = null
vep_cache_version = null
20 changes: 19 additions & 1 deletion nextflow_schema.json
@@ -74,12 +74,30 @@
"description": "Name of the fasta file for the genome",
"help_text": "Name of the fasta file for the genome we usually apply \"Homo_sapiens_assembly38.fasta\"",
"format": "file-path"
},
"dbsnpFile": {
"type": "string",
"description": "Path to dbsnp file.",
"help_text": "Path to dbsnp file. Will be used to add dbsnp ids in the output vcf ID column if provided.",
"format": "file-path"
},
"dbsnpFileIndex": {
"type": "string",
"description": "Path to dbsnp file index.",
"help_text": "Path to dbsnp file index. Required if specifying the dbsnpFile parameter.",
"format": "file-path"
}
},
"required": [
"referenceGenome",
"referenceGenomeFasta"
- ]
+ ],
"if": {
"required": ["dbsnpFile"]
},
"then": {
"required": ["dbsnpFileIndex"]
}
},
"institutional_config_options": {
"title": "Institutional config options",
6 changes: 4 additions & 2 deletions workflows/postprocessing.nf
@@ -140,6 +140,8 @@ workflow POSTPROCESSING {
def pathReferenceGenomeFai = file(pathReferenceGenomeFasta + ".fai")
def pathIntervalFile = params.intervalsFile? file(params.intervalsFile) : [] //The empty list is used if we don't want to use an interval file
def pathReferenceDict = file(params.referenceGenome + "/" + params.referenceGenomeFasta.substring(0,params.referenceGenomeFasta.indexOf(".")) + ".dict")
def dbsnpFile = params.dbsnpFile? file(params.dbsnpFile) : []
def dbsnpFileIndex = params.dbsnpFileIndex? file(params.dbsnpFileIndex) : []
file(params.outdir).mkdirs()

take:
@@ -180,8 +182,8 @@ workflow POSTPROCESSING {
[[:], pathReferenceGenomeFasta],
[[:], pathReferenceGenomeFai],
[[:], pathReferenceDict],
- [[:], []], //leaving empty as we don't use dbsnp
- [[:], []] //leaving empty as we don't use dbsnp
+ [[:], dbsnpFile],
+ [[:], dbsnpFileIndex]
).vcf
.join(GATK4_GENOTYPEGVCFS.out.tbi)
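
The documentation in this change states that `dbsnpFileIndex` is mandatory whenever `dbsnpFile` is set. A minimal sketch of that rule as a plain Groovy check is shown here; `checkDbsnpParams` is a hypothetical helper for illustration and is not part of this commit.

```groovy
// Hypothetical helper, not part of this commit: fail fast when a dbSNP file
// is given without its index.
def checkDbsnpParams(Map params) {
    if (params.dbsnpFile && !params.dbsnpFileIndex) {
        throw new IllegalArgumentException(
            "dbsnpFileIndex must be specified when dbsnpFile is specified")
    }
}

// Both unset (the defaults in nextflow.config): accepted.
checkDbsnpParams([dbsnpFile: null, dbsnpFileIndex: null])

// File and index both set (hypothetical file names): accepted.
checkDbsnpParams([dbsnpFile: "dbsnp.vcf.gz", dbsnpFileIndex: "dbsnp.vcf.gz.tbi"])
```

A check of this kind would surface a missing index before GenotypeGVCFs starts, rather than relying on the tool to report it at run time.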
