Merge pull request #57 from kids-first/feature/mb-add-openpedcan_v14

🚀 Add OpenPedCan v15
kids-first · Mar 14, 2024 · 6b5a3bb · 6b5a3bb
2 parents 028564c + 3874f3d
commit 6b5a3bb
Show file tree

Hide file tree

Showing 11 changed files with 441 additions and 184 deletions.
diff --git a/COLLABORATIONS/openTARGETS/README.md b/COLLABORATIONS/openTARGETS/README.md
@@ -13,18 +13,20 @@ Genomic data generally obtained as such:
  - Copy number: tsv file with copy number, ploidy, and GISTIC-style information in maf-like format (each call is a row)
  - RNA expression: tpm values from rsem stored an `.rds` object
  - RNA fusion: annoFuse output
-For example, for v12, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/:
+For example, for v14, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v14/:
 ```
-consensus_wgs_plus_cnvkit_wxs.tsv.gz
+consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only.tsv.gz
 fusion-dgd.tsv.gz
 fusion-putative-oncogenic.tsv
 gene-expression-rsem-tpm-collapsed.rds
-tcga-gene-expression-rsem-tpm-collapsed.rds
+tcga_gene-expression-rsem-tpm-collapsed.rds
+gtex_gene-expression-rsem-tpm-collapsed.rds
 snv-consensus-plus-hotspots.maf.tsv.gz
+snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz
 ```
 
 ### Prep work
-The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to supress duplicate RNA libraries from different sequencing methods.
+The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to suppress duplicate RNA libraries from different sequencing methods.
 Since we are not handling `Methylation` yet, it is recommended those entries be removed ahead of time.
 To create the histologies file, recommended method is to:
 1. `docker pull pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest` OR if you have R installed locally, ensure the following libraries are installed:
@@ -38,12 +40,12 @@ To create the histologies file, recommended method is to:
 1. Pull the OpenPedCan repo (warning, it's 12GB ): https://github.com/PediatricOpenTargets/OpenPedCan-analysis, or just download the script from `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R`
 1. Export from D3b Warehouse the latest existing cBio IDs to use for population. Ensure that the output is csv double-quoted. Currently that can be obtained using the sql command:
     ```sql
-
+    with custom as (
     select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
     from prod_cbio.aml_sd_pet7q6f2_2018_cbio_sample
     union
     select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
-    from prod_cbio.aml_sd_z6mwd3h0_2018_cbio_sample
+    from prod_cbio.bllnos_sd_z6mwd3h0_2018_cbio_sample
     union
     select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
     from prod_cbio.x01_fy16_nbl_maris_cbio_sample
@@ -68,13 +70,18 @@ To create the histologies file, recommended method is to:
 	  union
     select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
     from prod_cbio.chdm_phs001643_2018_cbio_sample
+    union
+    select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
+    from prod_cbio.pbta_mioncoseq_cbio_sample
+    )
+    select * from custom
 
     ```
-1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons
+1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons`
 
 ### Run as standalone
 1. Download from https://github.com/PediatricOpenTargets/OpenPedCan-analysis the `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R` or run from repo if you have it
-1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histolgies-file.tsv -n path-to-cbio-names.csv -b Methylation`
+1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histologies-file.tsv -n path-to-cbio-names.csv -b 'Methylation,Phospho-Proteomics,Whole Cell Proteomics,miRNA-Seq'`
 OR
 ### Run in repo
 1. Either run an interactive docker or using your local R, and ensure to mount a volume that will have the repo and whatever input histologies file you end up using, i.e. `docker run -it --mount type=bind,source=/home/ubuntu,target=/WORK pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest /bin/bash`
@@ -86,7 +93,11 @@ TCGA data are kept in a seprate matrix from everything else. We need to merge th
 ```sh
 Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene-expression-rsem-tpm-collapsed.rds --second_file tcga-gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_expression_common_merge.rds
 ```
-
+UPDATE: GTEx is also in a seprate matrix, so run again currently to make the "final" merge before conversion
+```sh
+Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene_tcga_expression_common_merge.rds --second_file gtex_gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_gtex_expression_common_merge.rds
+```
+```
 
 ### File Transformation
 It's recommended to put datasheets in a dir called `datasheets`, downloaded files in it's own dir (in v12 it's `GF_INPUTS`) and the rest of the processed outputs into it's own dir (`study_build` for v12) to keep things sane and also be able to leverage existing study build script in `scripts/organize_upload_packages.py`
@@ -140,7 +151,7 @@ optional arguments:
 ```
 _NOTE_ for v11 input, I ran the following command `zcat snv-dgd.maf.tsv.gz | perl -e '$skip = <>; $skip= <>; while(<>){print $_;}' | gzip -c >> snv-consensus-plus-hotspots.maf.tsv.gz` to add DGD data
 
-_NOTE_ for v12 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/add_dgd_maf_to_openpedcan.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v12_header.txt -c openpedcan_v12.maf -t ../bs_id_sample_map.txt -m ../GF_INPUTS/snv-dgd.maf.tsv.gz` to add DGD data, which is more robust - however, there are data issues with DGD, so it was left out
+_NOTE_ for v15 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/append_maf_to_existing.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt -c openpedcan_v15.maf -t ../INPUT_PREP/bs_id_sample_map.txt -m ../INPUTS/snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz` to add tumor-only data, which is more robust
 
 Example run:
 `python3 COLLABORATIONS/openTARGETS/rename_filter_maf.py -m bs_id_sample_map.txt -v snv-consensus-plus-hotspots.maf.tsv.gz -s 1 -n openpedcan_v12`
@@ -189,7 +200,7 @@ Options:
 		Show this help message and exit
 ```
 Example run:
-`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v11 --computeZscore R 2> rna_convert.errs`
+`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_gtex_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v15 --computeZscore R 2> rna_convert.errs`
 
 #### 5. scripts/convert_fusion_as_sv.py
 

diff --git a/.../openTARGETS/add_dgd_maf_to_openpedcan.py → ...ONS/openTARGETS/append_maf_to_existing.py b/.../openTARGETS/add_dgd_maf_to_openpedcan.py → ...ONS/openTARGETS/append_maf_to_existing.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Helper script to append DGD data to an existing merged maf file.
+Helper script to append a maf to an existing  maf file.
 Uses filter_entry to filter out undesired calls like in other mafs
 """
 import sys
@@ -82,6 +82,15 @@
     v_idx = m_header.index("Variant_Classification")
     h_idx = m_header.index("Hugo_Symbol")
 
+    # need to also pop entrez ID if exists, as process_maf_entry() will do that to data
+    try:
+        m_header.pop(m_header.index("Entrez_Gene_Id"))
+    except Exception as e:
+        print(e, file=sys.stderr)
+    # bug fix for OpenPedCan, position will be one less after process_maf_entry
+    n_ref_ct_idx = m_header.index('n_ref_count')
+    n_alt_ct_idx = m_header.index('n_alt_count')    
+
     for i in range(len(m_header)):
         if m_header[i] in h_dict:
             h_dict[m_header[i]] = i
@@ -90,16 +99,25 @@
         to_print = []
         datum = process_maf_entry(data, maf_exc, v_idx, h_idx, tid_idx, eid_idx, bs_cbio_key)
         # Set tumor barcode to cBio ID
-        if datum:
-            for item in header:
-                if h_dict[item] != None:
-                    to_print.append(datum[h_dict[item]])
-                else:
-                    to_print.append("")
-            print("\t".join(to_print), file=append_maf)
-        else:
-            skipped += 1
+        try:
+            if datum:
+                for item in header:
+                    if h_dict[item] != None:
+                        to_print.append(datum[h_dict[item]])
+                    else:
+                        to_print.append("")
+                # bug fix for maf format in OpenPedCan
+                for i in [n_ref_ct_idx, n_alt_ct_idx]:
+                    if to_print[i] == "NA":
+                        to_print[i] = ""
+                print("\t".join(to_print), file=append_maf)
+            else:
+                skipped += 1
+        except Exception as e:
+            print (e)
+            pdb.set_trace()
+            hold = 1
     sys.stderr.write("Processed " + maf_fn + "\n")
-    sys.stderr.write("Skipped " + str(skipped) + " entries meeting exlusion criteria\n")
+    sys.stderr.write("Skipped " + str(skipped) + " entries meeting exclusion criteria\n")
 
 append_maf.close()
diff --git a/COLLABORATIONS/openTARGETS/clinical_to_datasheets.py b/COLLABORATIONS/openTARGETS/clinical_to_datasheets.py
@@ -1,6 +1,5 @@
 import sys
 import argparse
-import json
 import math
 import re
 import pdb
@@ -153,17 +152,17 @@ def build_header(header_list, entry):
 
 # track some sample info so that somatic events can be collapsed
 samp_dict = {}
-s_type = header.index('sample_type')
-comp = header.index('composition')
-bs_id = header.index('Kids_First_Biospecimen_ID')
-exp = header.index('experimental_strategy')
+s_type_index = header.index('sample_type')
+comp_index = header.index('composition')
+bs_id_index = header.index('Kids_First_Biospecimen_ID')
+exp_index = header.index('experimental_strategy')
 
 # experimental strategy may be too vague, use to tell if RNA
 # related to recent addition of DGD samples, need gene matrix file
-rna_lib = header.index('RNA_library')
-cohort = header.index('cohort')
+rna_lib_index = header.index('RNA_library')
+cohort_index = header.index('cohort')
 a_idx = header.index('aliquot_id')
-cbio_id = header.index('formatted_sample_id')
+cbio_id_index = header.index('formatted_sample_id')
 data_gene = open('data_gene_matrix_CHOP.txt', 'w')
 data_gene.write('SAMPLE_ID\tmutations\n')
 
@@ -173,20 +172,20 @@ def build_header(header_list, entry):
 
 for data in clin_data:
     info = data.rstrip('\n').split('\t')
-    if info[s_type] == "Normal" and info[exp] != 'RNA-Seq':
+    if info[s_type_index] == "Normal" and info[exp_index] != 'RNA-Seq':
         continue
-    if info[bs_id] in blacklist_dict:
-        sys.stderr.write('Skipping output of ' + info[bs_id] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id]] + '\n')
+    if info[bs_id_index] in blacklist_dict:
+        sys.stderr.write('Skipping output of ' + info[bs_id_index] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id_index]] + '\n')
         continue
     # adjust exp value if targeted sequencing
-    if info[exp] == 'Targeted Sequencing':
-        if info[rna_lib] != 'NA':
-            info[exp] = 'RNA-Seq'
+    if info[exp_index] == 'Targeted Sequencing':
+        if info[rna_lib_index] != 'NA':
+            info[exp_index] = 'RNA-Seq'
         # if DGD DNA, add to gene matrix
-        elif info[cohort] == 'DGD':
+        elif info[cohort_index] == 'DGD':
             # parse aliquot for panel type, i.e. ET_242MFKXW_DGD_STNGS_93
-            test = re.match('.*_DGD_(\w+)_\d+', info[a_idx])
-            data_gene.write(info[cbio_id] + '\tCHOP-' + test.group(1) + '\n')
+            test = re.match(r'.*_DGD_(\w+)_\d+', info[a_idx])
+            data_gene.write(info[cbio_id_index] + '\tCHOP-' + test.group(1) + '\n')
 
     pt_id = info[header.index("Kids_First_Participant_ID")]
     if pt_id in pt_id_dict:
@@ -214,12 +213,19 @@ def build_header(header_list, entry):
                         # age_at_last_known = float(value)
                         # age_at_last_known = str(math.floor(age_at_last_known/365.25))
                         value = str(math.floor(float(value)/365.25))
-                elif header[i] == "PFS_days" and value != "NA":
+                elif header[i] == "EFS_days" and value != "NA":
                     value = str(math.floor(float(value)/30.5))
                     # d_free_mos = value
             elif header[i] == "tumor_descriptor":
                 if value in tumor_descriptor_dict:
                     value = tumor_descriptor_dict[value]
+            elif header[i] == "EFS_event_type":
+                if value == "Not Applicable":
+                    value = "0:No Event"
+                elif value == "Not Reported":
+                    value = "NA"
+                else:
+                    value = "1:" + value
             # replace status with NA if value not acceptable
             elif header[i] == "OS_status":
                 if value not in ["LIVING", "DECEASED", "NA"]:
@@ -238,11 +244,11 @@ def build_header(header_list, entry):
     if samp_id not in samp_dict:
         samp_dict[samp_id] = sample_to_print
         id_mapping[samp_id] = []
-    id_mapping[samp_id].append(info[bs_id])
-    if info[exp] == "RNA-Seq":
-        bs_type[info[bs_id]] = "RNA"
+    id_mapping[samp_id].append(info[bs_id_index])
+    if info[exp_index] == "RNA-Seq":
+        bs_type[info[bs_id_index]] = "RNA"
     else:
-        bs_type[info[bs_id]] = "DNA"
+        bs_type[info[bs_id_index]] = "DNA"
 # cycle through sample IDs to see if there's matched DNA/RNA and if one can be made
 check = {}
 for samp_id in id_mapping:
@@ -254,10 +260,20 @@ def build_header(header_list, entry):
         spec = id_mapping[samp_id][0] + ";" + id_mapping[samp_id][1]
         if bs_type[id_mapping[samp_id][0]] == "RNA":
             spec = id_mapping[samp_id][1] + ";" + id_mapping[samp_id][0]
-        samp_dict[samp_id][0] = spec
+        samp_dict[samp_id][1] = spec
     elif len(id_mapping[samp_id]) > 2:
-        # QC check, only one or two biospec per sample ID
+        # QC check, only one or two biospec per sample ID, unless it's new DGD RNA + separate fusion biospecimen
         sys.stderr.write("Saw more than two biospecimens for " + samp_id + ": " + ",".join(id_mapping[samp_id]) + "\n")
+        if "DGD" in samp_id:
+            # If two RNA types and is DGD, throw a note to check
+            check_type = {"DNA": [], "RNA": []}
+            for bs_id in id_mapping[samp_id]:
+                check_type[bs_type[bs_id]].append(bs_id)
+            if len(check_type["DNA"]) == 1 and len(check_type["RNA"]) == 2:
+                spec = ";".join(check_type["DNA"] + check_type["RNA"])
+                samp_dict[samp_id][1] = spec
+                sys.stderr.write("Could be a DGD fusion + bulk RNA, may be ok\n")
+
         # exit(1)
     else:
         # skip cell line re-matching
@@ -272,9 +288,9 @@ def build_header(header_list, entry):
 mapping_file = open("bs_id_sample_map.txt", "w")
 mapping_file.write("BS_ID\tSample Type\tCbio ID\n")
 for samp_id in id_mapping:
-    for bs_id in id_mapping[samp_id]:
+    for bs_id_index in id_mapping[samp_id]:
         try:
-            mapping_file.write("\t".join([bs_id, bs_type[bs_id], samp_id]) + "\n")
+            mapping_file.write("\t".join([bs_id_index, bs_type[bs_id_index], samp_id]) + "\n")
         except Exception as e:
             sys.stderr.write(str(e) + "\n")
             pdb.set_trace()

diff --git a/COLLABORATIONS/openTARGETS/cnv_to_tables.py b/COLLABORATIONS/openTARGETS/cnv_to_tables.py
@@ -52,12 +52,13 @@ def collate_data(cnv_fn):
                 ploidy = data[p_idx]
                 cn = data[c_idx]
                 try:
-                    gistic = qual_to_gistic[data[s_idx]]
+                    qual = data[s_idx]
+                    if qual != "NA":
+                        qual = qual.lower()
+                    gistic = qual_to_gistic[qual]
                 except Exception as e:
-                    sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line.decode())
+                    sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line)
                     continue
-                    # pdb.set_trace()
-                    # hold=1
                 if samp_id not in ploidy_dict:
                     ploidy_dict[samp_id] = ploidy
                 if gene not in cn_dict:

diff --git a/COLLABORATIONS/openTARGETS/header_desc.tsv b/COLLABORATIONS/openTARGETS/header_desc.tsv
@@ -3,7 +3,8 @@ germline_sex_estimate			0	1	STRING	1	Female
 cancer_predispositions			0	1	STRING	2	None documented	
 OS_days	OS_MONTHS	Overall survival in months since initial diagnosis	0	1	NUMBER	3	NA	just convert to months
 OS_status	OS_STATUS	Overall patient survival status	0	1	STRING	4	LIVING	
-PFS_days	PFS_MONTHS	Progression free (months) since initial treatment	0	1	NUMBER	5	Not Reported	just convert to months
+EFS_days	EFS_MONTHS	Event free (months) since initial treatment	0	1	NUMBER	5	Not Reported	just convert to months
+EFS_event_type	EFS_STATUS	Event free status (months) since initial treatment	0	1	STRING	5	Not Reported	
 ethnicity	ETHNICITY		0	1	STRING	6	Not Hispanic or Latino	
 race	RACE		0	1	STRING	7	White	
 primary_site	TUMOR_SITE		0	1	STRING	8		
@@ -18,7 +19,8 @@ harmonized_diagnosis	CANCER_TYPE_DETAILED		1	0	STRING	10	Adenoma
 primary_site	TUMOR_TISSUE_SITE		1	0	STRING	9	Suprasellar/Hypothalamic/Pituitary	
 tumor_descriptor	TUMOR_TYPE		1	0	STRING	8	Initial CNS Tumor	
 composition	SAMPLE_TYPE		1	0	STRING	7	Solid Tissue	
-cohort	COHORT	Source study cohort name	1	0	STRING	6		
+cohort	COHORT	Source study cohort name	1	0	STRING	6	
+sub_cohort	SUB_COHORT	Source study sub-cohort name	1	0	STRING	6	
 CNS_region			1	0	STRING	5	Suprasellar	
 tumor_ploidy			1	0	NUMBER	4	3	
 tumor_fraction			1	0	NUMBER	3	0.476369391	

diff --git a/COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt b/COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt
@@ -0,0 +1,2 @@
+#version 2.4
+Hugo_Symbol	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_Position	End_Position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_File	Sequencer	Tumor_Sample_UUID	Matched_Norm_Sample_UUID	HGVSc	HGVSp	HGVSp_Short	Transcript_ID	Exon_Number	t_depth	t_ref_count	t_alt_count	n_depth	n_ref_count	n_alt_count	all_effects	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	ALLELE_NUM	DISTANCE	STRAND_VEP	SYMBOL	SYMBOL_SOURCE	HGNC_ID	BIOTYPE	CANONICAL	CCDS	ENSP	SWISSPROT	TREMBL	UNIPARC	RefSeq	SIFT	PolyPhen	EXON	INTRON	DOMAINS	AF	AFR_AF	AMR_AF	ASN_AF	EAS_AF	EUR_AF	SAS_AF	AA_AF	EA_AF	CLIN_SIG	SOMATIC	PUBMED	MOTIF_NAME	MOTIF_POS	HIGH_INF_POS	MOTIF_SCORE_CHANGE	IMPACT	PICK	VARIANT_CLASS	TSL	HGVS_OFFSET	PHENO	MINIMISED	GENE_PHENO	FILTER	flanking_bps	vcf_id	vcf_qual	gnomAD_AF	gnomAD_AFR_AF	gnomAD_AMR_AF	gnomAD_ASJ_AF	gnomAD_EAS_AF	gnomAD_FIN_AF	gnomAD_NFE_AF	gnomAD_OTH_AF	gnomAD_SAS_AF	HGVSg	vcf_pos	gnomad_3_1_1_AC	gnomad_3_1_1_AN	gnomad_3_1_1_AF	gnomad_3_1_1_nhomalt	gnomad_3_1_1_AC_popmax	gnomad_3_1_1_AN_popmax	gnomad_3_1_1_AF_popmax	gnomad_3_1_1_nhomalt_popmax	gnomad_3_1_1_AC_controls_and_biobanks	gnomad_3_1_1_AN_controls_and_biobanks	gnomad_3_1_1_AF_controls_and_biobanks	gnomad_3_1_1_AF_non_cancer	gnomad_3_1_1_primate_ai_score	gnomad_3_1_1_splice_ai_consequence	MQ	MQ0	CAL	HotSpotAllele
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		#version 2.4
		Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED GENE_PHENO FILTER flanking_bps vcf_id vcf_qual gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF HGVSg vcf_pos gnomad_3_1_1_AC gnomad_3_1_1_AN gnomad_3_1_1_AF gnomad_3_1_1_nhomalt gnomad_3_1_1_AC_popmax gnomad_3_1_1_AN_popmax gnomad_3_1_1_AF_popmax gnomad_3_1_1_nhomalt_popmax gnomad_3_1_1_AC_controls_and_biobanks gnomad_3_1_1_AN_controls_and_biobanks gnomad_3_1_1_AF_controls_and_biobanks gnomad_3_1_1_AF_non_cancer gnomad_3_1_1_primate_ai_score gnomad_3_1_1_splice_ai_consequence MQ MQ0 CAL HotSpotAllele