Merge pull request #45 from Ferlab-Ste-Justine/feat/CLIN-3616-allow-t…

…o-pass-dbsnp-option-in-genotypegvcf feat: CLIN-3616 allow to pass dbsnp option in genotypegvcf
Ferlab-Ste-Justine · Dec 5, 2024 · 33db042
2 parents 82e6660 + f8615d7
commit 33db042
Showing 6 changed files with 42 additions and 4 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### `Added`
 - [#44](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/44) Make interval file optional in GenotypeGVCFs process
 - [#44](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/44) Decouple the interval file parameter from the broad
+- [#45](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/45) Allow adding dbsnp ids to output vcf files
+
+### `Known issues`
+- The nf-core modules that we are using have a potential performance flaw. Typically, the regex used to describe the output files also match the input files (ex: "*.vcf"), which can cause unnecessary file transfers.  This has already proven to cause issues on fusion. One fix could be to transfer the whole modules to local to perform the small change necessary to fix this.
+- The VEP cache version used in the CQDG environment (112) does not match the default configured VEP version (111). This issue can be avoided by overriding the Docker container of the ensemblevep process. If no project is using VEP version 111, it should not be used as the default value.
+
 
 ## v2.2.0-dev
 
@@ -21,10 +27,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - The nf-core modules that we are using have a potential performance flaw. Typically, the regex used to describe the output files also match the input files (ex: "*.vcf"), which can cause unnecessary file transfers.  This has already proven to cause issues on fusion. One fix could be to transfer the whole modules to local to perform the small change necessary to fix this.
 - The VEP cache version used in the CQDG environment (112) does not match the default configured VEP version (111). This issue can be avoided by overriding the Docker container of the ensemblevep process. If no project is using VEP version 111, it should not be used as the default value.
 
-
 ### `Fixed`
 - [#41](https://github.com/Ferlab-Ste-Justine/Post-processing-Pipeline/pull/41) Fix vep url pointing to the wrong vep version in the reference data documentation.
 
+
 ## v2.1.0dev
 
 ### `Added`

diff --git a/docs/reference_data.md b/docs/reference_data.md
@@ -12,6 +12,12 @@ This directory should contain the following files:
 - The reference genome FASTA file index (e.g., `Homo_sapiens_assembly38.fasta.fai`). Its location will be automatically derived by appending `.fai` to the `referenceGenomeFasta` parameter.
 - The reference genome dictionary file (e.g., `Homo_sapiens_assembly38.dict`). Its location will be automatically derived by replacing the `.fasta` file extension of the `referenceGenomeFasta` parameter with `.dict`.
 
+## DBSNP reference data
+The `dbsnpFile` and `dbsnpFileIndex` parameters specify the paths to a dbsnp file and its index, respectively.
+If specified, dbsnp ids will be added in the ID column of the output vcf files in the GenotypeGVCFs step.
+
+Both parameters are null by default. Note that, if specifying `dbsnpFile`, it is mandatory to specify `dbsnpFileIndex`.
+
 ## Broad reference data (VQSR)
 The `broad` parameter specifies the directory containing the reference data files for VQSR. 
 Note that the VQSR step applies only to whole genome data, so you need to specify the broad parameter only if you have whole genome data.
@@ -105,6 +111,8 @@ analysis file should contain only the `analysis` section.
 | --- | --- | --- |
 | `referenceGenome` |  _Required_ | Path to the directory containing the reference genome data |
 | `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
+| `dbsnpFile` | _Optional_ | Path to dbsnp file. If specified, will be used to add ids in the ID column of output vcf files. |
+| `dbsnpFileIndex` | _Optional_ | Path to dbsnp file index. Must be specified if the dbsnpFile parameter is specified. |
 | `broad` | _Optional_ | Path to the directory containing Broad reference data (for VQSR) |
 | `intervalsFile` | _Optional_ | Path to the file containing the genome intervals list on which to operate |
 | `vepCache` | _Optional_ | Path to the vep cache data directory |

diff --git a/docs/usage.md b/docs/usage.md
@@ -164,6 +164,8 @@ Parameters summary
 | `outdir` | _Required_ | Path to the output directory |
 | `referenceGenome` |  _Required_ | Path to the directory containing the reference genome data |
 | `referenceGenomeFasta` | _Required_ | Filename of the reference genome .fasta file, within the specified `referenceGenome` directory |
+| `dbsnpFile` | _Optional_ | Path to dbsnp file. If specified, will be used to add ids in the ID column of output vcf files. |
+| `dbsnpFileIndex` | _Optional_ | Path to dbsnp file index. Must be specified if the dbsnpFile parameter is specified. |
 | `broad` | _Optional_ | Path to the directory containing Broad reference data (for VQSR) |
 | `intervalsFile` | _Optional_ | Path to the file containing the genome intervals list on which to operate |
 | `tools` | _Optional_ | Additional tools to run separated by commas. Supported tools are `vep` and `exomiser` |

diff --git a/nextflow.config b/nextflow.config
@@ -20,6 +20,8 @@ params {
     referenceGenomeFasta = null
     broad = null
     intervalsFile = null
+    dbsnpFile = null
+    dbsnpFileIndex = null
     tools = ""
     vep_cache = null
     vep_cache_version = null

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -74,12 +74,30 @@
           "description": "Name of the fasta file for the genome",
           "help_text": "Name of the fasta file for the genome we usually apply \"Homo_sapiens_assembly38.fasta\"",
           "format": "file-path"
+        },
+        "dbsnpFile": {
+          "type": "string",
+          "description": "Path to dbsnp file.",
+          "help_text": "Path to dbsnp file. Will be used to add dbsnp ids in the output vcf ID column if provided.",
+          "format": "file-path"
+        },
+        "dbsnpFileIndex": {
+          "type": "string",
+          "description": "Path to dbsnp file index.",
+          "help_text": "Path to dbsnp file index. Required if specifying the dbsnpFile parameter.",
+          "format": "file-path"
         }
       },
       "required": [
         "referenceGenome",
         "referenceGenomeFasta"
-      ]
+      ],
+      "if": {
+        "required": ["dbsnpFile"]
+      },
+      "then": {
+        "required": ["dbsnpFileIndex"]
+      }
     },
     "institutional_config_options": {
       "title": "Institutional config options",
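
Note on the `if`/`then` rule added to `nextflow_schema.json` above: it makes `dbsnpFileIndex` required only when `dbsnpFile` is present. The following is a minimal Python sketch of that conditional check, not code from the pipeline. The function name `missing_dbsnp_index` is hypothetical, and the sketch assumes that parameters left at their `null` default are treated as absent by the validator.

```python
def missing_dbsnp_index(params: dict) -> bool:
    """Return True when dbsnpFile is set but dbsnpFileIndex is not.

    Hypothetical helper illustrating the schema's if/then rule.
    """
    # Assumption: a parameter left at its null default counts as absent.
    provided = {key for key, value in params.items() if value is not None}

    # "if": {"required": ["dbsnpFile"]}
    if "dbsnpFile" in provided:
        # "then": {"required": ["dbsnpFileIndex"]}
        return "dbsnpFileIndex" not in provided

    # The rule does not apply when dbsnpFile is not given.
    return False
```

Under these assumptions, a run that sets only `dbsnpFile` would be rejected, while a run that sets both parameters or neither would pass. The rule is one-directional: setting only `dbsnpFileIndex` is not flagged by it.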

diff --git a/workflows/postprocessing.nf b/workflows/postprocessing.nf
@@ -140,6 +140,8 @@ workflow POSTPROCESSING {
     def pathReferenceGenomeFai = file(pathReferenceGenomeFasta + ".fai")
     def pathIntervalFile =  params.intervalsFile? file(params.intervalsFile) : [] //The empty list is used if we don't want to use an interval file
     def pathReferenceDict = file(params.referenceGenome + "/" + params.referenceGenomeFasta.substring(0,params.referenceGenomeFasta.indexOf(".")) + ".dict")
+    def dbsnpFile = params.dbsnpFile? file(params.dbsnpFile) : []
+    def dbsnpFileIndex = params.dbsnpFileIndex? file(params.dbsnpFileIndex) : []
     file(params.outdir).mkdirs()
 
     take:
@@ -180,8 +182,8 @@ workflow POSTPROCESSING {
     [[:], pathReferenceGenomeFasta],
     [[:], pathReferenceGenomeFai],
     [[:], pathReferenceDict],
-    [[:], []], //leaving empty as we don't use dbsnp
-    [[:], []]  //leaving empty as we don't use dbsnp
+    [[:], dbsnpFile],
+    [[:], dbsnpFileIndex]
     ).vcf
     .join(GATK4_GENOTYPEGVCFS.out.tbi)
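
Note on the `params.dbsnpFile? file(params.dbsnpFile) : []` pattern used in `workflows/postprocessing.nf` above: an unset parameter resolves to an empty list, which is the same placeholder the previous code passed when dbsnp was not used. The Python sketch below only illustrates that fallback; `optional_input` is a hypothetical name, and `Path` stands in for Nextflow's `file()`.

```python
from pathlib import Path


def optional_input(value):
    """Return a path for a set parameter, or an empty list placeholder.

    Hypothetical stand-in for `params.x ? file(params.x) : []`.
    """
    # Like Groovy truthiness, both None and "" select the empty placeholder.
    return Path(value) if value else []


# Both dbsnp inputs resolve independently, as in the workflow.
dbsnp_file = optional_input(None)
dbsnp_index = optional_input("")
```

With both parameters unset, the two inputs stay empty lists, so the call to GenotypeGVCFs should behave as it did before this change.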