Skip to content

Commit

Permalink
Merge pull request #57 from kids-first/feature/mb-add-openpedcan_v14
Browse files Browse the repository at this point in the history
🚀 Add OpenPedCan v15
  • Loading branch information
migbro authored Mar 14, 2024
2 parents 028564c + 3874f3d commit 6b5a3bb
Show file tree
Hide file tree
Showing 11 changed files with 441 additions and 184 deletions.
33 changes: 22 additions & 11 deletions COLLABORATIONS/openTARGETS/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,18 +13,20 @@ Genomic data generally obtained as such:
- Copy number: tsv file with copy number, ploidy, and GISTIC-style information in maf-like format (each call is a row)
- RNA expression: tpm values from rsem stored an `.rds` object
- RNA fusion: annoFuse output
For example, for v12, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/:
For example, for v14, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v14/:
```
consensus_wgs_plus_cnvkit_wxs.tsv.gz
consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only.tsv.gz
fusion-dgd.tsv.gz
fusion-putative-oncogenic.tsv
gene-expression-rsem-tpm-collapsed.rds
tcga-gene-expression-rsem-tpm-collapsed.rds
tcga_gene-expression-rsem-tpm-collapsed.rds
gtex_gene-expression-rsem-tpm-collapsed.rds
snv-consensus-plus-hotspots.maf.tsv.gz
snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz
```

### Prep work
The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to supress duplicate RNA libraries from different sequencing methods.
The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to suppress duplicate RNA libraries from different sequencing methods.
Since we are not handling `Methylation` yet, it is recommended those entries be removed ahead of time.
To create the histologies file, recommended method is to:
1. `docker pull pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest` OR if you have R installed locally, ensure the following libraries are installed:
Expand All @@ -38,12 +40,12 @@ To create the histologies file, recommended method is to:
1. Pull the OpenPedCan repo (warning, it's 12GB ): https://github.com/PediatricOpenTargets/OpenPedCan-analysis, or just download the script from `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R`
1. Export from D3b Warehouse the latest existing cBio IDs to use for population. Ensure that the output is csv double-quoted. Currently that can be obtained using the sql command:
```sql
with custom as (
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
from prod_cbio.aml_sd_pet7q6f2_2018_cbio_sample
union
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
from prod_cbio.aml_sd_z6mwd3h0_2018_cbio_sample
from prod_cbio.bllnos_sd_z6mwd3h0_2018_cbio_sample
union
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
from prod_cbio.x01_fy16_nbl_maris_cbio_sample
Expand All @@ -68,13 +70,18 @@ To create the histologies file, recommended method is to:
union
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
from prod_cbio.chdm_phs001643_2018_cbio_sample
union
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
from prod_cbio.pbta_mioncoseq_cbio_sample
)
select * from custom
```
1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons
1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons`
### Run as standalone
1. Download from https://github.com/PediatricOpenTargets/OpenPedCan-analysis the `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R` or run from repo if you have it
1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histolgies-file.tsv -n path-to-cbio-names.csv -b Methylation`
1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histologies-file.tsv -n path-to-cbio-names.csv -b 'Methylation,Phospho-Proteomics,Whole Cell Proteomics,miRNA-Seq'`
OR
### Run in repo
1. Either run an interactive docker or using your local R, and ensure to mount a volume that will have the repo and whatever input histologies file you end up using, i.e. `docker run -it --mount type=bind,source=/home/ubuntu,target=/WORK pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest /bin/bash`
Expand All @@ -86,7 +93,11 @@ TCGA data are kept in a seprate matrix from everything else. We need to merge th
```sh
Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene-expression-rsem-tpm-collapsed.rds --second_file tcga-gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_expression_common_merge.rds
```
UPDATE: GTEx is also in a seprate matrix, so run again currently to make the "final" merge before conversion
```sh
Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene_tcga_expression_common_merge.rds --second_file gtex_gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_gtex_expression_common_merge.rds
```
```
### File Transformation
It's recommended to put datasheets in a dir called `datasheets`, downloaded files in it's own dir (in v12 it's `GF_INPUTS`) and the rest of the processed outputs into it's own dir (`study_build` for v12) to keep things sane and also be able to leverage existing study build script in `scripts/organize_upload_packages.py`
Expand Down Expand Up @@ -140,7 +151,7 @@ optional arguments:
```
_NOTE_ for v11 input, I ran the following command `zcat snv-dgd.maf.tsv.gz | perl -e '$skip = <>; $skip= <>; while(<>){print $_;}' | gzip -c >> snv-consensus-plus-hotspots.maf.tsv.gz` to add DGD data
_NOTE_ for v12 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/add_dgd_maf_to_openpedcan.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v12_header.txt -c openpedcan_v12.maf -t ../bs_id_sample_map.txt -m ../GF_INPUTS/snv-dgd.maf.tsv.gz` to add DGD data, which is more robust - however, there are data issues with DGD, so it was left out
_NOTE_ for v15 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/append_maf_to_existing.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt -c openpedcan_v15.maf -t ../INPUT_PREP/bs_id_sample_map.txt -m ../INPUTS/snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz` to add tumor-only data, which is more robust
Example run:
`python3 COLLABORATIONS/openTARGETS/rename_filter_maf.py -m bs_id_sample_map.txt -v snv-consensus-plus-hotspots.maf.tsv.gz -s 1 -n openpedcan_v12`
Expand Down Expand Up @@ -189,7 +200,7 @@ Options:
Show this help message and exit
```
Example run:
`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v11 --computeZscore R 2> rna_convert.errs`
`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_gtex_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v15 --computeZscore R 2> rna_convert.errs`
#### 5. scripts/convert_fusion_as_sv.py
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env python3
"""
Helper script to append DGD data to an existing merged maf file.
Helper script to append a maf to an existing maf file.
Uses filter_entry to filter out undesired calls like in other mafs
"""
import sys
Expand Down Expand Up @@ -82,6 +82,15 @@
v_idx = m_header.index("Variant_Classification")
h_idx = m_header.index("Hugo_Symbol")

# need to also pop entrez ID if exists, as process_maf_entry() will do that to data
try:
m_header.pop(m_header.index("Entrez_Gene_Id"))
except Exception as e:
print(e, file=sys.stderr)
# bug fix for OpenPedCan, position will be one less after process_maf_entry
n_ref_ct_idx = m_header.index('n_ref_count')
n_alt_ct_idx = m_header.index('n_alt_count')

for i in range(len(m_header)):
if m_header[i] in h_dict:
h_dict[m_header[i]] = i
Expand All @@ -90,16 +99,25 @@
to_print = []
datum = process_maf_entry(data, maf_exc, v_idx, h_idx, tid_idx, eid_idx, bs_cbio_key)
# Set tumor barcode to cBio ID
if datum:
for item in header:
if h_dict[item] != None:
to_print.append(datum[h_dict[item]])
else:
to_print.append("")
print("\t".join(to_print), file=append_maf)
else:
skipped += 1
try:
if datum:
for item in header:
if h_dict[item] != None:
to_print.append(datum[h_dict[item]])
else:
to_print.append("")
# bug fix for maf format in OpenPedCan
for i in [n_ref_ct_idx, n_alt_ct_idx]:
if to_print[i] == "NA":
to_print[i] = ""
print("\t".join(to_print), file=append_maf)
else:
skipped += 1
except Exception as e:
print (e)
pdb.set_trace()
hold = 1
sys.stderr.write("Processed " + maf_fn + "\n")
sys.stderr.write("Skipped " + str(skipped) + " entries meeting exlusion criteria\n")
sys.stderr.write("Skipped " + str(skipped) + " entries meeting exclusion criteria\n")

append_maf.close()
68 changes: 42 additions & 26 deletions COLLABORATIONS/openTARGETS/clinical_to_datasheets.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
import sys
import argparse
import json
import math
import re
import pdb
Expand Down Expand Up @@ -153,17 +152,17 @@ def build_header(header_list, entry):

# track some sample info so that somatic events can be collapsed
samp_dict = {}
s_type = header.index('sample_type')
comp = header.index('composition')
bs_id = header.index('Kids_First_Biospecimen_ID')
exp = header.index('experimental_strategy')
s_type_index = header.index('sample_type')
comp_index = header.index('composition')
bs_id_index = header.index('Kids_First_Biospecimen_ID')
exp_index = header.index('experimental_strategy')

# experimental strategy may be too vague, use to tell if RNA
# related to recent addition of DGD samples, need gene matrix file
rna_lib = header.index('RNA_library')
cohort = header.index('cohort')
rna_lib_index = header.index('RNA_library')
cohort_index = header.index('cohort')
a_idx = header.index('aliquot_id')
cbio_id = header.index('formatted_sample_id')
cbio_id_index = header.index('formatted_sample_id')
data_gene = open('data_gene_matrix_CHOP.txt', 'w')
data_gene.write('SAMPLE_ID\tmutations\n')

Expand All @@ -173,20 +172,20 @@ def build_header(header_list, entry):

for data in clin_data:
info = data.rstrip('\n').split('\t')
if info[s_type] == "Normal" and info[exp] != 'RNA-Seq':
if info[s_type_index] == "Normal" and info[exp_index] != 'RNA-Seq':
continue
if info[bs_id] in blacklist_dict:
sys.stderr.write('Skipping output of ' + info[bs_id] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id]] + '\n')
if info[bs_id_index] in blacklist_dict:
sys.stderr.write('Skipping output of ' + info[bs_id_index] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id_index]] + '\n')
continue
# adjust exp value if targeted sequencing
if info[exp] == 'Targeted Sequencing':
if info[rna_lib] != 'NA':
info[exp] = 'RNA-Seq'
if info[exp_index] == 'Targeted Sequencing':
if info[rna_lib_index] != 'NA':
info[exp_index] = 'RNA-Seq'
# if DGD DNA, add to gene matrix
elif info[cohort] == 'DGD':
elif info[cohort_index] == 'DGD':
# parse aliquot for panel type, i.e. ET_242MFKXW_DGD_STNGS_93
test = re.match('.*_DGD_(\w+)_\d+', info[a_idx])
data_gene.write(info[cbio_id] + '\tCHOP-' + test.group(1) + '\n')
test = re.match(r'.*_DGD_(\w+)_\d+', info[a_idx])
data_gene.write(info[cbio_id_index] + '\tCHOP-' + test.group(1) + '\n')

pt_id = info[header.index("Kids_First_Participant_ID")]
if pt_id in pt_id_dict:
Expand Down Expand Up @@ -214,12 +213,19 @@ def build_header(header_list, entry):
# age_at_last_known = float(value)
# age_at_last_known = str(math.floor(age_at_last_known/365.25))
value = str(math.floor(float(value)/365.25))
elif header[i] == "PFS_days" and value != "NA":
elif header[i] == "EFS_days" and value != "NA":
value = str(math.floor(float(value)/30.5))
# d_free_mos = value
elif header[i] == "tumor_descriptor":
if value in tumor_descriptor_dict:
value = tumor_descriptor_dict[value]
elif header[i] == "EFS_event_type":
if value == "Not Applicable":
value = "0:No Event"
elif value == "Not Reported":
value = "NA"
else:
value = "1:" + value
# replace status with NA if value not acceptable
elif header[i] == "OS_status":
if value not in ["LIVING", "DECEASED", "NA"]:
Expand All @@ -238,11 +244,11 @@ def build_header(header_list, entry):
if samp_id not in samp_dict:
samp_dict[samp_id] = sample_to_print
id_mapping[samp_id] = []
id_mapping[samp_id].append(info[bs_id])
if info[exp] == "RNA-Seq":
bs_type[info[bs_id]] = "RNA"
id_mapping[samp_id].append(info[bs_id_index])
if info[exp_index] == "RNA-Seq":
bs_type[info[bs_id_index]] = "RNA"
else:
bs_type[info[bs_id]] = "DNA"
bs_type[info[bs_id_index]] = "DNA"
# cycle through sample IDs to see if there's matched DNA/RNA and if one can be made
check = {}
for samp_id in id_mapping:
Expand All @@ -254,10 +260,20 @@ def build_header(header_list, entry):
spec = id_mapping[samp_id][0] + ";" + id_mapping[samp_id][1]
if bs_type[id_mapping[samp_id][0]] == "RNA":
spec = id_mapping[samp_id][1] + ";" + id_mapping[samp_id][0]
samp_dict[samp_id][0] = spec
samp_dict[samp_id][1] = spec
elif len(id_mapping[samp_id]) > 2:
# QC check, only one or two biospec per sample ID
# QC check, only one or two biospec per sample ID, unless it's new DGD RNA + separate fusion biospecimen
sys.stderr.write("Saw more than two biospecimens for " + samp_id + ": " + ",".join(id_mapping[samp_id]) + "\n")
if "DGD" in samp_id:
# If two RNA types and is DGD, throw a note to check
check_type = {"DNA": [], "RNA": []}
for bs_id in id_mapping[samp_id]:
check_type[bs_type[bs_id]].append(bs_id)
if len(check_type["DNA"]) == 1 and len(check_type["RNA"]) == 2:
spec = ";".join(check_type["DNA"] + check_type["RNA"])
samp_dict[samp_id][1] = spec
sys.stderr.write("Could be a DGD fusion + bulk RNA, may be ok\n")

# exit(1)
else:
# skip cell line re-matching
Expand All @@ -272,9 +288,9 @@ def build_header(header_list, entry):
mapping_file = open("bs_id_sample_map.txt", "w")
mapping_file.write("BS_ID\tSample Type\tCbio ID\n")
for samp_id in id_mapping:
for bs_id in id_mapping[samp_id]:
for bs_id_index in id_mapping[samp_id]:
try:
mapping_file.write("\t".join([bs_id, bs_type[bs_id], samp_id]) + "\n")
mapping_file.write("\t".join([bs_id_index, bs_type[bs_id_index], samp_id]) + "\n")
except Exception as e:
sys.stderr.write(str(e) + "\n")
pdb.set_trace()
Expand Down
9 changes: 5 additions & 4 deletions COLLABORATIONS/openTARGETS/cnv_to_tables.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,12 +52,13 @@ def collate_data(cnv_fn):
ploidy = data[p_idx]
cn = data[c_idx]
try:
gistic = qual_to_gistic[data[s_idx]]
qual = data[s_idx]
if qual != "NA":
qual = qual.lower()
gistic = qual_to_gistic[qual]
except Exception as e:
sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line.decode())
sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line)
continue
# pdb.set_trace()
# hold=1
if samp_id not in ploidy_dict:
ploidy_dict[samp_id] = ploidy
if gene not in cn_dict:
Expand Down
6 changes: 4 additions & 2 deletions COLLABORATIONS/openTARGETS/header_desc.tsv
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,8 @@ germline_sex_estimate 0 1 STRING 1 Female
cancer_predispositions 0 1 STRING 2 None documented
OS_days OS_MONTHS Overall survival in months since initial diagnosis 0 1 NUMBER 3 NA just convert to months
OS_status OS_STATUS Overall patient survival status 0 1 STRING 4 LIVING
PFS_days PFS_MONTHS Progression free (months) since initial treatment 0 1 NUMBER 5 Not Reported just convert to months
EFS_days EFS_MONTHS Event free (months) since initial treatment 0 1 NUMBER 5 Not Reported just convert to months
EFS_event_type EFS_STATUS Event free status (months) since initial treatment 0 1 STRING 5 Not Reported
ethnicity ETHNICITY 0 1 STRING 6 Not Hispanic or Latino
race RACE 0 1 STRING 7 White
primary_site TUMOR_SITE 0 1 STRING 8
Expand All @@ -18,7 +19,8 @@ harmonized_diagnosis CANCER_TYPE_DETAILED 1 0 STRING 10 Adenoma
primary_site TUMOR_TISSUE_SITE 1 0 STRING 9 Suprasellar/Hypothalamic/Pituitary
tumor_descriptor TUMOR_TYPE 1 0 STRING 8 Initial CNS Tumor
composition SAMPLE_TYPE 1 0 STRING 7 Solid Tissue
cohort COHORT Source study cohort name 1 0 STRING 6
cohort COHORT Source study cohort name 1 0 STRING 6
sub_cohort SUB_COHORT Source study sub-cohort name 1 0 STRING 6
CNS_region 1 0 STRING 5 Suprasellar
tumor_ploidy 1 0 NUMBER 4 3
tumor_fraction 1 0 NUMBER 3 0.476369391
Expand Down
2 changes: 2 additions & 0 deletions COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
#version 2.4
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED GENE_PHENO FILTER flanking_bps vcf_id vcf_qual gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF HGVSg vcf_pos gnomad_3_1_1_AC gnomad_3_1_1_AN gnomad_3_1_1_AF gnomad_3_1_1_nhomalt gnomad_3_1_1_AC_popmax gnomad_3_1_1_AN_popmax gnomad_3_1_1_AF_popmax gnomad_3_1_1_nhomalt_popmax gnomad_3_1_1_AC_controls_and_biobanks gnomad_3_1_1_AN_controls_and_biobanks gnomad_3_1_1_AF_controls_and_biobanks gnomad_3_1_1_AF_non_cancer gnomad_3_1_1_primate_ai_score gnomad_3_1_1_splice_ai_consequence MQ MQ0 CAL HotSpotAllele
Loading

0 comments on commit 6b5a3bb

Please sign in to comment.