Merge pull request #1 from Sage-Bionetworks-Workflows/bgrande/v1
Create initial version of vcf2maf workflow
Showing 6 changed files with 655 additions and 2 deletions.
**`.gitignore`** (new file)

```
# Nextflow files
.nextflow*
work/
```
**`README.md`** (replaces the previous two-line README)

# nf-vcf2maf: Annotate VCF Files and Generate MAF Files

## Purpose

The purpose of this Nextflow workflow is to annotate variants in VCF files using the Ensembl Variant Effect Predictor (VEP) and convert the annotated VCF file into the [Mutation Annotation Format](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) (MAF). Unlike VEP-annotated VCF files, MAF files are generally more useful for downstream applications given their tabular nature. For example, you can easily load them into R (_e.g._ with the [`maftools`](https://www.bioconductor.org/packages/release/bioc/html/maftools.html) package) and/or derive the input files for [cBioPortal](https://www.cbioportal.org/visualize).
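
Because MAF files are plain tab-delimited tables, they are also easy to inspect outside of R. Here is a minimal sketch using pandas (the filename is hypothetical):

```python
import pandas as pd

# Load a MAF as a tab-separated table; comment="#" skips the
# leading '#version' line that MAF files typically start with
maf = pd.read_csv("example.maf", sep="\t", comment="#", low_memory=False)

# Count variants per sample via the Tumor_Sample_Barcode column
print(maf["Tumor_Sample_Barcode"].value_counts())
```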

**Important:** Please read the [limitations](#known-limitations) listed below.

This repository leverages a [fork of vcf2maf](https://github.com/Sage-Bionetworks-Workflows/vcf2maf) and a [custom container image](https://github.com/Sage-Bionetworks-Workflows/vcf2maf-docker).

## Quickstart

1. Prepare a CSV samplesheet according to the [format](#samplesheet) described below.

   **Example:** Stored locally as `./samplesheet.csv`

   ```csv
   synapse_id ,sample_parent_id ,merged_parent_id ,study_id ,variant_class ,variant_caller ,is_releasable
   syn87654301 ,syn87654311 ,syn87654321 ,study_x ,germline ,deepvariant ,true
   syn87654302 ,syn87654311 ,syn87654321 ,study_x ,germline ,deepvariant ,false
   syn87654303 ,syn87654311 ,syn87654321 ,study_x ,germline ,deepvariant ,true
   syn87654304 ,syn87654312 ,syn87654321 ,study_x ,germline ,mutect2 ,false
   syn87654305 ,syn87654312 ,syn87654321 ,study_x ,germline ,mutect2 ,false
   syn87654306 ,syn87654312 ,syn87654321 ,study_x ,germline ,mutect2 ,false
   syn87654307 ,syn87654313 ,syn87654322 ,study_y ,germline ,deepvariant ,true
   syn36245848 ,syn87654313 ,syn87654322 ,study_y ,germline ,deepvariant ,true
   ```
2. Create a Nextflow secret called `SYNAPSE_AUTH_TOKEN` with a Synapse personal access token ([docs](#authentication)).
3. Prepare your parameters file. For more details, check out the [Parameters](#parameters) section. Only the `input` parameter is required.

   **Example:** Stored locally as `./params.yml`

   ```yaml
   input: "./samplesheet.csv"
   maf_center: "Sage Bionetworks"
   max_subpop_af: 0.0005
   ```
4. Launch the workflow using the [Nextflow CLI](https://nextflow.io/docs/latest/cli.html#run), the [Tower CLI](https://help.tower.nf/latest/cli/), or the [Tower web UI](https://help.tower.nf/latest/launch/launchpad/).

   **Example:** Launched using the Nextflow CLI with Docker enabled

   ```console
   nextflow run sage-bionetworks-workflows/nf-vcf2maf -params-file ./params.yml -profile docker
   ```
5. Explore the MAF files uploaded to Synapse (using the parent IDs listed in the samplesheet).

## Authentication

This workflow takes care of transferring files to and from Synapse. Hence, it requires a secret containing a personal access token for authentication. To configure Nextflow with such a token, follow these steps:

1. Generate a personal access token (PAT) on Synapse using [this dashboard](https://www.synapse.org/#!PersonalAccessTokens:). Make sure to enable the `view`, `download`, and `modify` scopes since this workflow both downloads and uploads to Synapse.
2. Create a secret called `SYNAPSE_AUTH_TOKEN` containing the Synapse personal access token using the [Nextflow CLI](https://nextflow.io/docs/latest/secrets.html) or [Nextflow Tower](https://help.tower.nf/latest/secrets/overview/) (an example follows this list).
3. (Tower only) When launching the workflow, include the `SYNAPSE_AUTH_TOKEN` as a pipeline secret from either your user or workspace secrets.
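
For example, step 2 can be done with the Nextflow CLI, using the same command that appears in the [benchmark setup](#setup) below (note that the setup also exports `NXF_ENABLE_SECRETS=true`, which older Nextflow versions may require):

```console
nextflow secrets put -n SYNAPSE_AUTH_TOKEN -v "<synapse-pat>"
```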

## Parameters

Check out the [Quickstart](#quickstart) section for example parameter values. You are encouraged to read the [limitations](#known-limitations) listed below because some parameters have not been tested with non-default values. A consolidated example covering every parameter appears after this list.

- **`input`**: (Required) A CSV samplesheet that lists the VCF files that should be processed. See below for the [samplesheet format](#samplesheet).
- **`max_subpop_af`**: Threshold used by vcf2maf for labeling variants with the `common_variant` filter. Specifically, the `common_variant` filter is applied to variants with an allele frequency of at least `max_subpop_af` in one or more gnomAD sub-populations ([source](https://github.com/mskcc/vcf2maf/blob/5ed414428046e71833f454d4b64da6c30362a89b/docs/vep_maf_readme.txt#L116-L120)). This filter is useful for removing false-positive somatic variants. The merged MAF files lack these common variants. Default: `0.0005`.
- **`maf_center`**: Value used in the `Center` MAF column. Default: `"Sage Bionetworks"`.
- **`reference_fasta`**: Reference genome FASTA file used in variant calling. Default: `"s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta"`.
- **`reference_fasta_fai`**: Reference genome FASTA index (FAI) file. This shouldn't be needed in most cases since the workflow will automatically pick up the `.fai` file alongside the `.fasta` file. Default: `"${reference_fasta}.fai"`.
- **`vep_tarball`**: A tarball (ideally compressed) of the VEP cache. Default: `"s3://sage-igenomes/vep_cache/homo_sapiens_vep_107_GRCh38.tar.gz"`.
- **`ncbi_build`**: The NCBI genome build. Passed to `--assembly` in VEP ([source](http://Jul2022.archive.ensembl.org/info/docs/tools/vep/script/vep_options.html)). Default: `"GRCh38"`.
- **`species`**: The species identifier. Passed to `--species` in VEP ([source](http://Jul2022.archive.ensembl.org/info/docs/tools/vep/script/vep_options.html)). Default: `"homo_sapiens"`.
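
Putting these together, here is a sketch of a params file that spells out every parameter. The values are the defaults listed above, except that `input` (which has no default) points at a local samplesheet and `reference_fasta_fai` is shown with `${reference_fasta}` expanded:

```yaml
input: "./samplesheet.csv"
max_subpop_af: 0.0005
maf_center: "Sage Bionetworks"
reference_fasta: "s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta"
reference_fasta_fai: "s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38.fasta.fai"
vep_tarball: "s3://sage-igenomes/vep_cache/homo_sapiens_vep_107_GRCh38.tar.gz"
ncbi_build: "GRCh38"
species: "homo_sapiens"
```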

## Inputs

### Samplesheet

The input samplesheet should be in comma-separated values (CSV) format and contain the following columns. **You should avoid using spaces or special characters in any of the columns.** Otherwise, you might run into job caching issues. A full example covering every column is sketched after this list.

1. **`synapse_id`**: Synapse ID of the VCF file
   - Make sure that the Synapse account associated with the personal access token has access to all listed VCF files
2. **`biospecimen_id`**: Biospecimen/sample identifier
   - This value will be used to populate the `Tumor_Sample_Barcode` MAF column
   - **Important:** This value needs to uniquely identify samples within each merged MAF file. See [below](#maf-files) for information on how MAF files are merged.
3. **`sample_parent_id`**: Synapse ID of the folder where the individual sample MAF file will be uploaded
   - Suggestion: The folder that contains the VCF file
4. **`merged_parent_id`**: The Synapse ID of the folder where the merged MAF file will be uploaded
   - Suggestion: The root folder containing the VCF files
   - **Important:** This value should be consistent across VCF files that are expected to be merged. Otherwise, you will end up with artificially split merged MAF files. See [below](#maf-files) for information on how MAF files are merged.
5. **`study_id`**: Study identifier
   - Suggestion: The Synapse ID of the project representing the study if you don’t have shorthand study IDs
6. **`variant_class`**: Whether the VCF file contains somatic or germline mutations
   - **Valid values:** `somatic` or `germline`
7. **`variant_caller`**: Name of the variant caller
8. **`is_releasable`**: Whether the VCF file should be included in the merged MAF file
   - **Valid values:** `true` or `false`
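
For illustration, a samplesheet with all eight columns might look like the sketch below. Note that the Quickstart example above omits the `biospecimen_id` column; the IDs here are hypothetical:

```csv
synapse_id,biospecimen_id,sample_parent_id,merged_parent_id,study_id,variant_class,variant_caller,is_releasable
syn87654301,sample_a,syn87654311,syn87654321,study_x,germline,deepvariant,true
syn87654302,sample_b,syn87654311,syn87654321,study_x,germline,deepvariant,false
```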

## Outputs

### MAF Files

- Individual sample MAF files
  - Unfiltered (_i.e._ includes all variants regardless of their `FILTER` status, including those that weren’t deemed high-confidence by the variant caller)
  - File naming: `${biospecimen_id}-${variant_class}-${variant_caller}.maf`
- Merged study MAF files (one for every combination of `study_id`, `variant_class`, and `variant_caller`)
  - Filtered (_i.e._ restricted to “releasable” samples and variants where `FILTER == 'PASS'`, which excludes those labeled `common_variant` due to `any(gnomAD_*_AF) >= 0.0005`)
  - File naming: `${study_id}-${variant_class}-${variant_caller}.merged.maf`

## Known Limitations

- This workflow has only been tested with the following parameters:
  - `vep_tarball`: Ensembl VEP 107
  - `species`: `homo_sapiens`
  - `ncbi_build`: `GRCh38`
  - `reference_fasta`: GATK FASTA file

## Benchmarks

### Setup

The benchmarks below were performed on an EC2 instance with the following setup:

```console
# Install tmux for long-running commands
sudo yum install -y tmux

# Install and set up Nextflow
sudo yum install -y java
(cd .local/bin && wget -qO- https://get.nextflow.io | bash)
echo 'export NXF_ENABLE_SECRETS=true' >> ~/.bashrc
source ~/.bashrc
nextflow secrets put -n SYNAPSE_AUTH_TOKEN -v "<synapse-pat>"
mkdir -p $HOME/.nextflow/
echo 'aws.client.anonymous = true' >> $HOME/.nextflow/config

# Download and extract the Ensembl VEP cache
mkdir -p $HOME/ref/ $HOME/.vep/
rsync -avr --progress rsync://ftp.ensembl.org/ensembl/pub/release-107/variation/indexed_vep_cache/homo_sapiens_vep_107_GRCh38.tar.gz $HOME/ref/
tar -zvxf $HOME/ref/homo_sapiens_vep_107_GRCh38.tar.gz -C $HOME/.vep/

# Download the reference genome FASTA file
mkdir -p $HOME/ref/fasta/
aws --no-sign-request s3 sync s3://sage-igenomes/igenomes/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/ $HOME/ref/fasta/

# Stage reference files in memory
mkdir -p /dev/shm/vep/ /dev/shm/fasta/
sudo mount -o remount,size=25G /dev/shm  # Increase /dev/shm size
rsync -avhP $HOME/.vep/ /dev/shm/vep/
rsync -avhP $HOME/ref/fasta/ /dev/shm/fasta/
```

### Preparing the Ensembl VEP cache

#### Outside of Nextflow

To determine the most efficient way of preparing the VEP cache for vcf2maf, I tried different permutations of downloading the tarball or the extracted folder from Ensembl or S3. Here are the individual results:

- **Download tarball using rsync from Ensembl:** 10 min 23 sec
- **Download tarball using AWS CLI from S3:** 3 min 14 sec
- **Extract tarball using `tar` locally:** 6 min 11 sec
- **Download extracted folder using AWS CLI from S3:** 4 min 5 sec

Based on the above results, here are some estimated runtimes:

- **Download tarball from Ensembl and extract locally:** 16 min 34 sec
- **Download tarball from S3 and extract locally:** 9 min 25 sec
- **Download extracted folder from S3:** 4 min 5 sec

Based on these estimates, downloading the extracted folder from S3 seems like the most efficient method, followed by downloading the tarball from S3 and extracting locally.

#### Within Nextflow

After benchmarking different methods for downloading the VEP cache, I performed various tests within a Nextflow run. Note that SHM refers to files being in shared memory (_i.e._ `/dev/shm`).

- **Baseline (all reference files in SHM):** 3 min 43 sec
- **FASTA in S3 and VEP folder in SHM:** 4 min 9 sec
- **FASTA in S3 and VEP folder in non-SHM:** 3 min 43 sec
- **FASTA in S3 and VEP folder in S3:** Over 17 min 7 sec[^1]
- **FASTA in S3 and VEP tarball in non-SHM:** 8 min 39 sec
- **FASTA and VEP tarball in S3:** 8 min 38 sec

The above results showed that downloading the tarball was the most efficient of the S3-based methods. While ~10 minutes is a long time to spend on preparing reference files, it's trivial compared to the actual runtimes of vcf2maf, which can reach 4-5 hours. The benefit is portability, including the ability to run this workflow on Tower.

[^1]: While this was expected to be the most efficient method of downloading the VEP cache, I had to kill the job because it was taking so long. Perhaps the AWS Java SDK isn't as efficient as the AWS CLI for recursively downloading a folder from S3.

**MAF filter script** (new file)
```python
#!/usr/bin/env python3

# Keep only the variants that passed all filters (FILTER == 'PASS')
# Usage: <this script> <input_maf> <output_maf>

import csv
import sys

input_maf = sys.argv[1]
output_maf = sys.argv[2]

# Parenthesized multi-item 'with' statements require Python 3.10+
with (
    open(input_maf, newline='') as infile,
    open(output_maf, "w", newline='') as outfile
):
    reader = csv.DictReader(infile, delimiter='\t')
    writer = csv.DictWriter(outfile, reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        # Drop any variant flagged by the caller or by vcf2maf
        # (e.g. common_variant); keep only FILTER == 'PASS'
        if row['FILTER'] == 'PASS':
            writer.writerow(row)
```
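
The script's filename isn't visible in this view; assuming it were saved as `filter_maf.py` (a hypothetical name), it would be invoked along these lines:

```console
python3 filter_maf.py annotated.maf annotated.passed.maf
```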

**MAF merging script** (new file)
```python
#!/usr/bin/env python3

# Copied from the genome-nexus/annotation-tools GitHub repository:
# https://github.com/genome-nexus/annotation-tools/blob/2e315735/merge_mafs.py

# Copyright (c) 2020 Memorial Sloan Kettering Cancer Center
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

import os
import sys
import argparse

# Script for merging MAFs
# Creates one aggregate header taken from column names across MAFs
# Writes out rows while mapping to header
# NAs and equivalents replaced with ''

TUMOR_SAMPLE_BARCODE_COLUMN = "Tumor_Sample_Barcode"
HUGO_SYMBOL_COLUMN = "Hugo_Symbol"
ENTREZ_GENE_ID_COLUMN = "Entrez_Gene_Id"

NULL_OR_MISSING_VALUES = ["", "NA", "N/A", None]

def process_datum(val):
    """ Cleans up datum. """
    try:
        vfixed = val.strip()
    except AttributeError:
        vfixed = ""
    if vfixed in NULL_OR_MISSING_VALUES:
        return ""
    return vfixed

def get_file_header(data_file):
    """
    Returns header from MAF
    Assumes file is tab-delimited
    Assumes header is first non-commented line
    """
    header = []
    with open(data_file, "r") as header_source:
        for line in header_source:
            if not line.startswith("#"):
                header = list(map(str.strip, line.rstrip().split("\t")))
                break
    if not header:
        print("Could not extract header from mutation data file: %s - Exiting..." % (data_file))
        sys.exit(2)
    return header

def merge_headers(data_filenames):
    """
    Generates a merged header from all input files.
    * Also ensures that "Hugo_Symbol" and "Entrez_Gene_Id"
      are the first 2 columns in the merged MAF header.
    """
    merged_header = [HUGO_SYMBOL_COLUMN, ENTREZ_GENE_ID_COLUMN]
    for fname in data_filenames:
        columns_to_add = [column for column in get_file_header(fname) if column not in merged_header]
        merged_header.extend(columns_to_add)
    return merged_header

def merge_input_mafs(input_mafs, output_maf_filename):
    """ Generates a merged MAF given a list of input MAFs. """
    merged_header = merge_headers(input_mafs)

    rows_to_write = []
    for input_maf in input_mafs:
        print('Loading data from ' + input_maf)
        header_processed = False
        file_header = get_file_header(input_maf)
        with open(input_maf, "r") as maf:
            for line in maf:
                if line.startswith("#"):
                    continue
                if not header_processed:
                    header_processed = True
                    continue
                # map row values to current header columns
                mapped_row = dict(zip(file_header, list(map(lambda x: process_datum(x), line.split("\t")))))
                sample_id = mapped_row[TUMOR_SAMPLE_BARCODE_COLUMN]  # unused below; retained from upstream

                # full record for the merged MAF (if mapped_row does not contain a column in merged_header, it is blank)
                normalized_row = list(map(lambda x: mapped_row.get(x, ""), merged_header))
                rows_to_write.append("\t".join(normalized_row))

    with open(output_maf_filename, "w") as output_maf:
        output_maf.write("\t".join(merged_header))
        output_maf.write("\n")
        output_maf.write("\n".join(rows_to_write))
        output_maf.write("\n")
    print('Merged MAF written to: %s' % (output_maf_filename))

def verify_input_mafs(input_mafs):
    for maf in input_mafs:
        if not os.path.isfile(maf):
            print("Could not find MAF by filename: %s" % (maf))
            sys.exit(1)

def usage(parser):
    parser.print_help()
    sys.exit(2)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument("-i", "--input-mafs-list", action = "store", dest = "input_mafs_list", help = "comma-delimited list of MAFs to merge")
    parser.add_argument("-d", "--input-mafs-directory", action = "store", dest = "input_mafs_directory", help = "directory containing all MAFs to merge")
    parser.add_argument("-o", "--output-maf", action = "store", dest = "output_maf", help = "output filename for merged MAF [REQUIRED]")

    args = parser.parse_args()
    input_mafs_list = args.input_mafs_list
    input_mafs_directory = args.input_mafs_directory
    output_maf = args.output_maf

    if not output_maf:
        print("Missing required argument: -o | --output-maf")
        usage(parser)

    if (input_mafs_list and input_mafs_directory) or (not input_mafs_list and not input_mafs_directory):
        print("Please choose only one of the following options when running script: --input-mafs-list | --input-mafs-directory")
        usage(parser)

    if input_mafs_list:
        input_mafs = list(map(str.strip, input_mafs_list.split(",")))
    else:
        input_mafs = []
        for maf in os.listdir(input_mafs_directory):
            full_maf_path = os.path.join(input_mafs_directory, maf)
            if os.path.isfile(full_maf_path):
                input_mafs.append(full_maf_path)
            else:
                print("Skipping non-file element in directory... '%s'" % (full_maf_path))

    verify_input_mafs(input_mafs)
    merge_input_mafs(input_mafs, output_maf)
```
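
Per the `argparse` options above, the script accepts either a comma-delimited list of MAFs (`-i`) or a directory of MAFs (`-d`), plus a required output path (`-o`). A hypothetical invocation, assuming the file keeps its upstream name `merge_mafs.py`:

```console
# Merge an explicit list of MAFs
python3 merge_mafs.py -i sample1.maf,sample2.maf -o study_x.merged.maf

# Alternatively, merge every MAF file found in a directory
python3 merge_mafs.py -d mafs/ -o study_x.merged.maf
```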