Merge pull request #1 from Sage-Bionetworks-Workflows/bgrande/v1
Create initial version of vcf2maf workflow
Bruno Grande authored Sep 17, 2022
2 parents 51a96a4 + cbe08e0 commit f9fd807
Showing 6 changed files with 655 additions and 2 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
# Nextflow files
.nextflow*
work/
184 changes: 182 additions & 2 deletions README.md
@@ -1,2 +1,182 @@
# nf-vcf2maf
Annotate VCF files using VEP and generate MAF files using vcf2maf
# nf-vcf2maf: Annotate VCF Files and Generate MAF Files

## Purpose

The purpose of this Nextflow workflow is to annotate variants in VCF files using the Ensembl Variant Effect Predictor (VEP) and convert the annotated VCF file into the [Mutation Annotation Format](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) (MAF). Unlike VEP-annotated VCF files, MAF files are generally more useful for downstream applications given their tabular nature. For example, you can easily load them into R (_e.g._ with the [`maftools`](https://www.bioconductor.org/packages/release/bioc/html/maftools.html) package) and/or derive the input files for [cBioPortal](https://www.cbioportal.org/visualize).
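
For a quick illustration of that tabular convenience, here is a minimal Python sketch for loading a MAF file into a dataframe; it assumes the `pandas` package and a hypothetical local file named `example.maf`:

```python
import pandas as pd

# MAF files are tab-delimited; leading "#" lines (such as a version
# pragma) are comments and are skipped here.
maf = pd.read_csv("example.maf", sep="\t", comment="#", low_memory=False)

# Example: count the most frequently mutated genes
print(maf["Hugo_Symbol"].value_counts().head())
```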

**Important:** Please read the [limitations](#known-limitations) listed below.

This repository leverages a [fork of vcf2maf](https://github.com/Sage-Bionetworks-Workflows/vcf2maf) and a [custom container image](https://github.com/Sage-Bionetworks-Workflows/vcf2maf-docker).

## Quickstart

1. Prepare a CSV samplesheet according to the [format](#samplesheet) described below.

**Example:** Stored locally as `./samplesheet.csv` (all IDs below are illustrative placeholders)

```csv
synapse_id,biospecimen_id,sample_parent_id,merged_parent_id,study_id,variant_class,variant_caller,is_releasable
syn87654301,biospec_01,syn87654311,syn87654321,study_x,germline,deepvariant,true
syn87654302,biospec_02,syn87654311,syn87654321,study_x,germline,deepvariant,false
syn87654303,biospec_03,syn87654311,syn87654321,study_x,germline,deepvariant,true
syn87654304,biospec_04,syn87654312,syn87654321,study_x,germline,mutect2,false
syn87654305,biospec_05,syn87654312,syn87654321,study_x,germline,mutect2,false
syn87654306,biospec_06,syn87654312,syn87654321,study_x,germline,mutect2,false
syn87654307,biospec_07,syn87654313,syn87654322,study_y,germline,deepvariant,true
syn36245848,biospec_08,syn87654313,syn87654322,study_y,germline,deepvariant,true
```
2. Create a Nextflow secret called `SYNAPSE_AUTH_TOKEN` with a Synapse personal access token ([docs](#authentication)).
3. Prepare your parameters file. For more details, check out the [Parameters](#parameters) section. Only the `input` parameter is required.
**Example:** Stored locally as `./params.yml`
```yaml
input: "./samplesheet.csv"
maf_center: "Sage Bionetworks"
max_subpop_af: 0.0005
```
4. Launch workflow using the [Nextflow CLI](https://nextflow.io/docs/latest/cli.html#run), the [Tower CLI](https://help.tower.nf/latest/cli/), or the [Tower web UI](https://help.tower.nf/latest/launch/launchpad/).
**Example:** Launched using the Nextflow CLI with Docker enabled
```console
nextflow run sage-bionetworks-workflows/nf-vcf2maf -params-file ./params.yml -profile docker
```
5. Explore the MAF files uploaded to Synapse (using the parent IDs listed in the samplesheet).

## Authentication

This workflow takes care of transferring files to and from Synapse. Hence, it requires a secret with a personal access token for authentication. To configure Nextflow with such a token, follow these steps:
1. Generate a personal access token (PAT) on Synapse using [this dashboard](https://www.synapse.org/#!PersonalAccessTokens:). Make sure to enable the `view`, `download`, and `modify` scopes since this workflow both downloads and uploads to Synapse.
2. Create a secret called `SYNAPSE_AUTH_TOKEN` containing a Synapse personal access token using the [Nextflow CLI](https://nextflow.io/docs/latest/secrets.html) or [Nextflow Tower](https://help.tower.nf/latest/secrets/overview/).
3. (Tower only) When launching the workflow, include the `SYNAPSE_AUTH_TOKEN` as a pipeline secret from either your user or workspace secrets.
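
To sanity-check the token before launching anything, you can log in with it directly. A minimal sketch, assuming the `synapseclient` Python package is installed and the token is exported as the `SYNAPSE_AUTH_TOKEN` environment variable:

```python
import os

import synapseclient

# Logging in verifies that the token is valid before the workflow needs it.
syn = synapseclient.login(authToken=os.environ["SYNAPSE_AUTH_TOKEN"])
print("Logged in as:", syn.getUserProfile()["userName"])
```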

## Parameters

Check out the [Quickstart](#quickstart) section for example parameter values. You are encouraged to read the [limitations](#known-limitations) listed below because some parameters have not been tested with non-default values.
- **`input`**: (Required) A CSV samplesheet that lists the VCF files that should be processed. See below for the [samplesheet format](#samplesheet).
- **`max_subpop_af`**: Threshold used by vcf2maf for labeling variants with the `common_variant` filter. Specifically, the `common_variant` filter is applied to variants with an allele frequency of at least `max_subpop_af` in one or more gnomAD sub-populations ([source](https://github.com/mskcc/vcf2maf/blob/5ed414428046e71833f454d4b64da6c30362a89b/docs/vep_maf_readme.txt#L116-L120)). This filter is useful for removing false-positive somatic variants. The merged MAF files lack these common variants. Default: `0.0005`. A sketch of this thresholding logic follows this list.
- **`maf_center`**: Value used in the `Center` MAF column. Default: `"Sage Bionetworks"`.
- **`reference_fasta`**: Reference genome FASTA file used in variant calling. Default: `"s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta"`.
- **`reference_fasta_fai`**: Reference genome FASTA index (FAI) file. This shouldn't be needed in most cases since the workflow will automatically pick up on the `.fai` file alongside the `.fasta` file. Default: `"${reference_fasta}.fai"`.
- **`vep_tarball`**: A tarball (ideally compressed) of the VEP cache. Default: `"s3://sage-igenomes/vep_cache/homo_sapiens_vep_107_GRCh38.tar.gz"`.
- **`ncbi_build`**: The NCBI genome build. Passed to `--assembly` in VEP ([source](http://Jul2022.archive.ensembl.org/info/docs/tools/vep/script/vep_options.html)). Default: `"GRCh38"`.
- **`species`**: The species identifier. Passed to `--species` in VEP ([source](http://Jul2022.archive.ensembl.org/info/docs/tools/vep/script/vep_options.html)). Default: `"homo_sapiens"`.
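
As noted in the `max_subpop_af` description above, here is a minimal sketch of the `common_variant` thresholding logic. It illustrates the rule described for that parameter rather than reproducing vcf2maf's actual code, and the sub-population allele frequencies are made up for the example:

```python
# Hypothetical gnomAD sub-population allele frequencies for one variant,
# keyed by the AF columns that VEP annotates.
subpop_afs = {
    "gnomAD_AFR_AF": 0.0001,
    "gnomAD_EAS_AF": 0.0009,  # meets the threshold below
    "gnomAD_NFE_AF": 0.0,
}

max_subpop_af = 0.0005  # workflow default

# A variant is labeled `common_variant` if ANY sub-population allele
# frequency is at least max_subpop_af.
is_common = any(af >= max_subpop_af for af in subpop_afs.values())
print(is_common)  # True
```
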
## Inputs

### Samplesheet

The input samplesheet should be in comma-separated values (CSV) format and contain the following columns. **You should avoid using spaces or special characters in any of the columns.** Otherwise, you might run into job caching issues. A minimal validation sketch follows the list below.
1. **`synapse_id`**: Synapse ID of the VCF file
   - Make sure that the Synapse account associated with the personal access token has access to all listed VCF files
2. **`biospecimen_id`**: Biospecimen/sample identifier
   - This value will be used to populate the `Tumor_Sample_Barcode` MAF column
   - **Important:** This value needs to uniquely identify samples within each merged MAF file. See [below](#maf-files) for information on how MAF files are merged.
3. **`sample_parent_id`**: Synapse ID of the folder where the individual sample MAF file will be uploaded
   - Suggestion: The folder that contains the VCF file
4. **`merged_parent_id`**: The Synapse ID of the folder where the merged MAF file will be uploaded
   - Suggestion: The root folder containing the VCF files
   - **Important:** This value should be consistent across VCF files that are expected to be merged. Otherwise, you will end up with artificially split merged MAF files. See [below](#maf-files) for information on how MAF files are merged.
5. **`study_id`**: Study identifier
   - Suggestion: The Synapse ID of the project representing the study if you don’t have shorthand study IDs
6. **`variant_class`**: Whether the VCF file contains somatic or germline mutations
   - **Valid values:** `somatic` or `germline`
7. **`variant_caller`**: Name of the variant caller
8. **`is_releasable`**: Whether the VCF file should be included in the merged MAF file
   - **Valid values:** `true` or `false`
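
The workflow does not ship a samplesheet validator, but a small pre-flight check along these lines can catch schema mistakes early. This is only a sketch; the column names and valid values come from the list above:

```python
#!/usr/bin/env python3
# Sketch of a samplesheet pre-flight check (not part of the workflow).

import csv
import sys

REQUIRED = [
    "synapse_id", "biospecimen_id", "sample_parent_id", "merged_parent_id",
    "study_id", "variant_class", "variant_caller", "is_releasable",
]

with open(sys.argv[1], newline="") as f:
    reader = csv.DictReader(f)
    missing = set(REQUIRED) - set(reader.fieldnames or [])
    if missing:
        sys.exit(f"Missing columns: {sorted(missing)}")
    for num, row in enumerate(reader, start=2):
        if row["variant_class"] not in ("somatic", "germline"):
            sys.exit(f"Line {num}: invalid variant_class {row['variant_class']!r}")
        if row["is_releasable"] not in ("true", "false"):
            sys.exit(f"Line {num}: invalid is_releasable {row['is_releasable']!r}")
        if any(value and " " in value for value in row.values()):
            sys.exit(f"Line {num}: values should not contain spaces")

print("Samplesheet looks OK")
```
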
## Outputs

### MAF Files

- Individual sample MAF files
  - Unfiltered (_i.e._ includes all variants regardless of their FILTER status, including those that weren’t deemed high-confidence by the variant caller)
  - File naming: `${biospecimen_id}-${variant_class}-${variant_caller}.maf`
- Merged study MAF files (one for every combination of `study_id`, `variant_class`, and `variant_caller`; see the grouping sketch below)
  - Filtered (_i.e._ restricted to “releasable” samples and to variants where `FILTER == 'PASS'`, which excludes variants flagged as `common_variant` because their allele frequency in one or more gnomAD sub-populations meets or exceeds `max_subpop_af`)
  - File naming: `${study_id}-${variant_class}-${variant_caller}.merged.maf`
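
To make the grouping concrete, here is a sketch of how merge groups could be derived from the samplesheet. It mirrors the description above (grouping releasable samples by `study_id`, `variant_class`, and `variant_caller`) and is not the workflow's actual Nextflow code:

```python
import csv
from collections import defaultdict

# Each (study_id, variant_class, variant_caller) group of releasable
# samples becomes one merged MAF.
groups = defaultdict(list)
with open("samplesheet.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["is_releasable"] == "true":
            key = (row["study_id"], row["variant_class"], row["variant_caller"])
            groups[key].append(row["synapse_id"])

for (study_id, variant_class, variant_caller), members in groups.items():
    print(f"{study_id}-{variant_class}-{variant_caller}.merged.maf: {len(members)} sample(s)")
```
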
## Known Limitations

- This workflow has only been tested with the following parameters:
  - `vep_tarball`: Ensembl VEP 107
  - `species`: `homo_sapiens`
  - `ncbi_build`: `GRCh38`
  - `reference_fasta`: GATK FASTA file

## Benchmarks

### Setup

The benchmarks below were performed on an EC2 instance with the following setup:

```console
# Install tmux for long-running commands
sudo yum install -y tmux

# Install and set up Nextflow
sudo yum install -y java
(cd .local/bin && wget -qO- https://get.nextflow.io | bash)
echo 'export NXF_ENABLE_SECRETS=true' >> ~/.bashrc
source ~/.bashrc
nextflow secrets put -n SYNAPSE_AUTH_TOKEN -v "<synapse-pat>"
mkdir -p $HOME/.nextflow/
echo 'aws.client.anonymous = true' >> $HOME/.nextflow/config

# Download and extract the Ensembl VEP cache
mkdir -p $HOME/ref/ $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-107/variation/indexed_vep_cache/homo_sapiens_vep_107_GRCh38.tar.gz $HOME/ref/
tar -zvxf $HOME/ref/homo_sapiens_vep_107_GRCh38.tar.gz -C $HOME/.vep/

# Download the reference genome FASTA file
mkdir -p $HOME/ref/fasta/
aws --no-sign-request s3 sync s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/ $HOME/ref/fasta/

# Stage reference files in memory (/dev/shm)
mkdir -p /dev/shm/vep/ /dev/shm/fasta/
sudo mount -o remount,size=25G /dev/shm  # Increase /dev/shm size
rsync -avhP $HOME/.vep/ /dev/shm/vep/
rsync -avhP $HOME/ref/fasta/ /dev/shm/fasta/
```

### Preparing the Ensembl VEP cache

#### Outside of Nextflow

To determine the most efficient way of preparing the VEP cache for vcf2maf, I tried different permutations of downloading the tarball or extracted folder from Ensembl or S3. Here are the individual results:

- **Download tarball using rsync from Ensembl:** 10 min 23 sec
- **Download tarball using AWS CLI from S3:** 3 min 14 sec
- **Extract tarball using `tar` locally:** 6 min 11 sec
- **Download extracted folder using AWS CLI from S3:** 4 min 5 sec

Based on the above results, here are some estimated runtimes (each estimate sums the relevant measured steps; for example, 10 min 23 sec for the Ensembl download plus 6 min 11 sec for local extraction gives 16 min 34 sec):

- **Download tarball from Ensembl and extract locally:** 16 min 34 sec
- **Download tarball from S3 and extract locally:** 9 min 25 sec
- **Download extracted folder from S3:** 4 min 5 sec

Based on the above estimates, downloading the extracted folder from S3 seems like the most efficient method, followed by downloading the tarball from S3 and extracting locally.

#### Within Nextflow

After benchmarking different methods for downloading the VEP cache, I performed various tests within a Nextflow run. Note that SHM refers to files being in shared memory (_i.e._ `/dev/shm`).

- **Baseline (all reference files in SHM):** 3 min 43 sec
- **FASTA in S3 and VEP folder in SHM:** 4 min 9 sec
- **FASTA in S3 and VEP folder in non-SHM:** 3 min 43 sec
- **FASTA in S3 and VEP folder in S3:** Over 17 min 7 sec[^1]
- **FASTA in S3 and VEP tarball in non-SHM:** 8 min 39 sec
- **FASTA and VEP tarball in S3:** 8 min 38 sec

The above results show that, among the fully portable options (where all reference files live in S3), downloading the VEP tarball from S3 was the most efficient. While ~10 minutes is a long time to spend on preparing reference files, it's trivial compared to the actual runtimes of vcf2maf, which can reach 4-5 hours. The benefit is portability, including the ability to run this workflow on Tower.

[^1]: While this was expected to be the most efficient method of downloading the VEP cache, I had to kill the job because it was taking so long. Perhaps the AWS Java SDK isn't as efficient as the AWS CLI for downloading a folder in S3 recursively.
18 changes: 18 additions & 0 deletions bin/filter_maf.py
@@ -0,0 +1,18 @@
#!/usr/bin/env python3
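# Keep only variants whose FILTER column is exactly "PASS".
# Usage: filter_maf.py <input.maf> <output.maf>
# Note: the parenthesized multi-line `with` statement below requires Python 3.10+.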

import csv
import sys

input_maf = sys.argv[1]
output_maf = sys.argv[2]

with (
open(input_maf, newline='') as infile,
open(output_maf, "w", newline='') as outfile
):
reader = csv.DictReader(infile, delimiter='\t')
writer = csv.DictWriter(outfile, reader.fieldnames, delimiter='\t')
writer.writeheader()
for row in reader:
if row['FILTER'] == 'PASS':
writer.writerow(row)
153 changes: 153 additions & 0 deletions bin/merge_mafs.py
@@ -0,0 +1,153 @@
#!/usr/bin/env python3

# Copied from the genome-nexus/annotation-tools GitHub repository:
# https://github.com/genome-nexus/annotation-tools/blob/2e315735/merge_mafs.py

# Copyright (c) 2020 Memorial Sloan Kettering Cancer Center
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
import sys
import argparse

# Script for merging MAFs
# Creates one aggregate header taken from column names across MAFs
# Writes out rows while mapping to header
# NAs and equivalents replaced with ''

TUMOR_SAMPLE_BARCODE_COLUMN = "Tumor_Sample_Barcode"
HUGO_SYMBOL_COLUMN = "Hugo_Symbol"
ENTREZ_GENE_ID_COLUMN = "Entrez_Gene_Id"

NULL_OR_MISSING_VALUES = ["", "NA", "N/A", None]

def process_datum(val):
""" Cleans up datum. """
try:
vfixed = val.strip()
except AttributeError:
vfixed = ""
if vfixed in NULL_OR_MISSING_VALUES:
return ""
return vfixed

def get_file_header(data_file):
"""
Returns header from MAF
Assumes file is tab-delimited
Assumes header is first non-commented line
"""
header = []
with open(data_file, "r") as header_source:
for line in header_source:
if not line.startswith("#"):
header = list(map(str.strip, line.rstrip().split("\t")))
break
if not header:
print("Could not extract header from mutation data file: %s - Exiting..." % (filename))
sys.exit(2)
return header

def merge_headers(data_filenames):
"""
Generates a merged header from all input files.
* Also ensures that "Hugo_Symbol" and "Entrez_Gene_Id"
are the first 2 columns in the merged MAF header.
"""
merged_header = [HUGO_SYMBOL_COLUMN, ENTREZ_GENE_ID_COLUMN]
for fname in data_filenames:
columns_to_add = [column for column in get_file_header(fname) if column not in merged_header]
merged_header.extend(columns_to_add)
return merged_header

def merge_input_mafs(input_mafs, output_maf_filename):
""" Generates a merged MAF given a list of input MAFs. """
merged_header = merge_headers(input_mafs)

rows_to_write = []
for input_maf in input_mafs:
print('Loading data from ' + input_maf)
header_processed = False
file_header = get_file_header(input_maf)
with open(input_maf, "r") as maf:
for line in maf:
if line.startswith("#"):
continue
if not header_processed:
header_processed = True
continue
# map row values to current header columns
mapped_row = dict(zip(file_header, list(map(lambda x: process_datum(x), line.split("\t")))))
sample_id = mapped_row[TUMOR_SAMPLE_BARCODE_COLUMN]

# full record for the merged MAF (if mapped_row does not contain a column in merged_header, it is blank)
normalized_row = list(map(lambda x: mapped_row.get(x, ""), merged_header))
rows_to_write.append("\t".join(normalized_row))

with open(output_maf_filename, "w") as output_maf:
output_maf.write("\t".join(merged_header))
output_maf.write("\n")
output_maf.write("\n".join(rows_to_write))
output_maf.write("\n")
print('Merged MAF written to: %s' % (output_maf_filename))

def verify_input_mafs(input_mafs):
for maf in input_mafs:
if not os.path.isfile(maf):
print("Could not find MAF by filename: %s" % (maf))
            sys.exit(1)

def usage(parser):
parser.print_help()
sys.exit(2)

if __name__ == '__main__':
parser = argparse.ArgumentParser()

parser.add_argument("-i", "--input-mafs-list", action = "store", dest = "input_mafs_list", help = "comma-delimited list of MAFs to merge")
parser.add_argument("-d", "--input-mafs-directory", action = "store", dest = "input_mafs_directory", help = "directory containing all MAFs to merge")
parser.add_argument("-o", "--output-maf", action = "store", dest = "output_maf", help = "output filename for merged MAF [REQUIRED]")

args = parser.parse_args()
input_mafs_list = args.input_mafs_list
input_mafs_directory = args.input_mafs_directory
output_maf = args.output_maf

if not output_maf:
print("Missing required argument: -o | --output-maf")
usage(parser)

if (input_mafs_list and input_mafs_directory) or (not input_mafs_list and not input_mafs_directory):
print("Please choose only one of the following options when running script: --input-mafs-list | --input-mafs-directory")
usage(parser)

if input_mafs_list:
input_mafs = list(map(str.strip, input_mafs_list.split(",")))
else:
input_mafs = []
for maf in os.listdir(input_mafs_directory):
full_maf_path = os.path.join(input_mafs_directory, maf)
if os.path.isfile(full_maf_path):
input_mafs.append(full_maf_path)
else:
print("Skipping non-file element in directory... '%s'" % (full_maf_path))

verify_input_mafs(input_mafs)
merge_input_mafs(input_mafs, output_maf)