diff --git a/README.md b/README.md
index 781afa8..3bbf4db 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,14 @@
-# Borzoi Model Evaluation & Analyses
-This repository contains shell scripts, notebooks, commands, etc. related to the analyses performed in the [Borzoi manuscript](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1). These analyses invoke functionality from both the [borzoi repository](https://github.com/calico/borzoi.git) and the [baskerville repository](https://github.com/calico/baskerville.git). Visit those links for general install instructions.
+# Borzoi Model Training & Evaluation
+
+This repository contains shell scripts, notebooks, commands, etc. related to the analyses performed in the [Borzoi paper](https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1), including data processing, model training, and evaluation. These analyses invoke functionality from the [borzoi](https://github.com/calico/borzoi.git), [baskerville](https://github.com/calico/baskerville.git), and [westminster](https://github.com/calico/westminster.git) repositories. Visit those links for general install instructions.
+
+*Tip*: When executing .sh scripts found in this directory structure, we recommend first navigating in the terminal to the 'borzoi/examples' directory from the [borzoi repository](https://github.com/calico/borzoi), since all file paths are relative to this root directory.
+
+For example, assuming *borzoi-paper* and *borzoi* are cloned to your home folder, issue commands of the form:
+```sh
+conda activate
+cd ~/borzoi/examples
+. ~/borzoi-paper/analysis//.sh
+```
 
 Contact *drk (at) @calicolabs.com* or *jlinder (at) @calicolabs.com* for questions.
diff --git a/analysis/README.md b/analysis/README.md
new file mode 100644
index 0000000..30691ba
--- /dev/null
+++ b/analysis/README.md
@@ -0,0 +1,30 @@
+## Analyses
+
+This directory contains model evaluation scripts and other downstream analyses.
+
+*Notes*:
+- Run the script 'setup_data.sh' to organize the multi-fold hg38 and mm10 data folders, which are required in order to run some evaluations. The hg38 and mm10 data must first be downloaded from the Borzoi training data bucket [here](https://storage.googleapis.com/borzoi-paper/data/) (GCP).
+- Some scripts require the QTL data, which can be downloaded [here](https://storage.googleapis.com/borzoi-paper/qtl/) (GCP).
+
+
+As an example, to evaluate the model on gene-level test set predictions, issue the following commands:
+```sh
+conda activate borzoi_py310
+cd ~/borzoi/examples
+. ~/borzoi-paper/analysis/setup_data.sh
+. ~/borzoi-paper/analysis/test_expression/testg.sh
+```
+
+As another example, to evaluate the model on sQTL variant effect predictions, issue these commands:
+```sh
+conda activate borzoi_py310
+cd ~/borzoi/examples
+. ~/borzoi-paper/analysis/sqtl/bench_sqtl.sh
+```
+
+The examples assume that you have
+- installed a conda environment named 'borzoi_py310',
+- cloned the 'borzoi' and 'borzoi-paper' repositories to your home folder,
+- downloaded the borzoi training data to '~/borzoi/examples/data',
+- downloaded the QTL data to '~/borzoi/examples/data/qtl_cat',
+- and configured the borzoi repository ([instructions](https://github.com/calico/borzoi?tab=readme-ov-file#installation)).
diff --git a/analysis/crispr/flowfish/run_gradients_flowfish.sh b/analysis/crispr/flowfish/run_gradients_flowfish.sh
old mode 100644
new mode 100755
index 44a61ce..2c4b887
--- a/analysis/crispr/flowfish/run_gradients_flowfish.sh
+++ b/analysis/crispr/flowfish/run_gradients_flowfish.sh
@@ -1,3 +1,3 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_k562_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_k562.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o saved_models/flowfish_k562 -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t targets_k562.txt params_pred.json saved_models flowfish/crispr_genes.gtf
diff --git a/analysis/crispr/flowfish/run_gradients_flowfish_miborzoi_ablations.sh b/analysis/crispr/flowfish/run_gradients_flowfish_miborzoi_ablations.sh
old mode 100644
new mode 100755
index a0af06f..a6d2b98
--- a/analysis/crispr/flowfish/run_gradients_flowfish_miborzoi_ablations.sh
+++ b/analysis/crispr/flowfish/run_gradients_flowfish_miborzoi_ablations.sh
@@ -1,19 +1,19 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_k562_all_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_all/targets_k562_subset.txt /home/jlinder/mini_borzois_v2/k562_all/params_pred.json /home/jlinder/mini_borzois_v2/k562_all /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_k562_all -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_all/targets_k562_subset.txt mini_borzois_v2/k562_all/params_pred.json mini_borzois_v2/k562_all flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_k562_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_k562_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt mini_borzois_v2/k562_dnase_atac_rna/params_pred.json mini_borzois_v2/k562_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_k562_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_k562_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt mini_borzois_v2/k562_rna/params_pred.json mini_borzois_v2/k562_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_baseline_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/baseline/targets_subset.txt /home/jlinder/mini_borzois_v2/baseline/params_pred.json /home/jlinder/mini_borzois_v2/baseline /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_baseline -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/baseline/targets_subset.txt mini_borzois_v2/baseline/params_pred.json mini_borzois_v2/baseline flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_human_all_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/human_all/targets_subset.txt /home/jlinder/mini_borzois_v2/human_all/params_pred.json /home/jlinder/mini_borzois_v2/human_all /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_human_all -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/human_all/targets_subset.txt mini_borzois_v2/human_all/params_pred.json mini_borzois_v2/human_all flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_human_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/human_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_human_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/human_dnase_atac_rna/params_pred.json mini_borzois_v2/human_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_multisp_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_multisp_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json mini_borzois_v2/multispecies_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_multisp_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_multisp_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt mini_borzois_v2/multispecies_rna/params_pred.json mini_borzois_v2/multispecies_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o flowfish_miborzoi_multisp_no_unet_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_no_unet/targets_subset.txt /home/jlinder/mini_borzois_v2/multispecies_no_unet/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_no_unet /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/flowfish_miborzoi_multisp_no_unet -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_no_unet/targets_subset.txt mini_borzois_v2/multispecies_no_unet/params_pred.json mini_borzois_v2/multispecies_no_unet flowfish/crispr_genes.gtf
diff --git a/analysis/crispr/flowfish/run_ism_shuffle_flowfish.sh b/analysis/crispr/flowfish/run_ism_shuffle_flowfish.sh
old mode 100644
new mode 100755
index cbe082a..377ed2c
--- a/analysis/crispr/flowfish/run_ism_shuffle_flowfish.sh
+++ b/analysis/crispr/flowfish/run_ism_shuffle_flowfish.sh
@@ -1,3 +1,3 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_k562_ism_shuffle_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 16 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/borzoi_v2/targets_k562.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o saved_models/flowfish_k562_ism_shuffle -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 16 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t targets_k562.txt params_pred.json saved_models flowfish/crispr_genes.gtf
diff --git a/analysis/crispr/flowfish/run_ism_shuffle_flowfish_miborzoi_ablations.sh b/analysis/crispr/flowfish/run_ism_shuffle_flowfish_miborzoi_ablations.sh
old mode 100644
new mode 100755
index e433ab4..ea6a0d9
--- a/analysis/crispr/flowfish/run_ism_shuffle_flowfish_miborzoi_ablations.sh
+++ b/analysis/crispr/flowfish/run_ism_shuffle_flowfish_miborzoi_ablations.sh
@@ -1,19 +1,19 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_k562_all_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_all/targets_k562_subset.txt /home/jlinder/mini_borzois_v2/k562_all/params_pred.json /home/jlinder/mini_borzois_v2/k562_all /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_k562_all_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/k562_all/targets_k562_subset.txt mini_borzois_v2/k562_all/params_pred.json mini_borzois_v2/k562_all flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_k562_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_k562_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt mini_borzois_v2/k562_dnase_atac_rna/params_pred.json mini_borzois_v2/k562_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_k562_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_k562_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt mini_borzois_v2/k562_rna/params_pred.json mini_borzois_v2/k562_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_baseline_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/baseline/targets_subset.txt /home/jlinder/mini_borzois_v2/baseline/params_pred.json /home/jlinder/mini_borzois_v2/baseline /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_baseline_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/baseline/targets_subset.txt mini_borzois_v2/baseline/params_pred.json mini_borzois_v2/baseline flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_human_all_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/human_all/targets_subset.txt /home/jlinder/mini_borzois_v2/human_all/params_pred.json /home/jlinder/mini_borzois_v2/human_all /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_human_all_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/human_all/targets_subset.txt mini_borzois_v2/human_all/params_pred.json mini_borzois_v2/human_all flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_human_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/human_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_human_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/human_dnase_atac_rna/params_pred.json mini_borzois_v2/human_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_multisp_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_multisp_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json mini_borzois_v2/multispecies_dnase_atac_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_multisp_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_rna /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_multisp_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt mini_borzois_v2/multispecies_rna/params_pred.json mini_borzois_v2/multispecies_rna flowfish/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o flowfish_miborzoi_multisp_no_unet_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/flowfish/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_no_unet/targets_subset.txt /home/jlinder/mini_borzois_v2/multispecies_no_unet/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_no_unet /home/jlinder/flowfish/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/flowfish_miborzoi_multisp_no_unet_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file flowfish/crispr_table.tsv -t mini_borzois_v2/multispecies_no_unet/targets_subset.txt mini_borzois_v2/multispecies_no_unet/params_pred.json mini_borzois_v2/multispecies_no_unet flowfish/crispr_genes.gtf
diff --git a/analysis/crispr/gasperini/run_gradients_gasperini.sh b/analysis/crispr/gasperini/run_gradients_gasperini.sh
new file mode 100755
index 0000000..5616608
--- /dev/null
+++ b/analysis/crispr/gasperini/run_gradients_gasperini.sh
@@ -0,0 +1,3 @@
+#!/bin/sh
+
+borzoi_satg_gene.py -o saved_models/gasperini_k562 -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t targets_k562.txt params_pred.json saved_models gasperini/crispr_genes.gtf
diff --git a/analysis/crispr/gasperini/run_gradients_gasperini_borzoi.sh b/analysis/crispr/gasperini/run_gradients_gasperini_borzoi.sh
deleted file mode 100644
index c5f7f64..0000000
--- a/analysis/crispr/gasperini/run_gradients_gasperini_borzoi.sh
+++ /dev/null
@@ -1,3 +0,0 @@
-#!/bin/sh
-
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_k562_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_k562.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gasperini/crispr_genes.gtf
diff --git a/analysis/crispr/gasperini/run_gradients_gasperini_miborzoi_ablations.sh b/analysis/crispr/gasperini/run_gradients_gasperini_miborzoi_ablations.sh
old mode 100644
new mode 100755
index 36e1442..3dbbe62
--- a/analysis/crispr/gasperini/run_gradients_gasperini_miborzoi_ablations.sh
+++ b/analysis/crispr/gasperini/run_gradients_gasperini_miborzoi_ablations.sh
@@ -1,19 +1,19 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_k562_all_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_all/targets_k562_subset.txt /home/jlinder/mini_borzois_v2/k562_all/params_pred.json /home/jlinder/mini_borzois_v2/k562_all /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_k562_all -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_all/targets_k562_subset.txt mini_borzois_v2/k562_all/params_pred.json mini_borzois_v2/k562_all gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_k562_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_k562_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt mini_borzois_v2/k562_dnase_atac_rna/params_pred.json mini_borzois_v2/k562_dnase_atac_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_k562_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_k562_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt mini_borzois_v2/k562_rna/params_pred.json mini_borzois_v2/k562_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_baseline_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/baseline/targets_subset.txt /home/jlinder/mini_borzois_v2/baseline/params_pred.json /home/jlinder/mini_borzois_v2/baseline /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_baseline -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/baseline/targets_subset.txt mini_borzois_v2/baseline/params_pred.json mini_borzois_v2/baseline gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_human_all_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/human_all/targets_subset.txt /home/jlinder/mini_borzois_v2/human_all/params_pred.json /home/jlinder/mini_borzois_v2/human_all /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_human_all -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/human_all/targets_subset.txt mini_borzois_v2/human_all/params_pred.json mini_borzois_v2/human_all gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_human_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/human_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_human_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/human_dnase_atac_rna/params_pred.json mini_borzois_v2/human_dnase_atac_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_multisp_dnase_atac_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_multisp_dnase_atac_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json mini_borzois_v2/multispecies_dnase_atac_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_multisp_rna_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_multisp_rna -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt mini_borzois_v2/multispecies_rna/params_pred.json mini_borzois_v2/multispecies_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gasperini_miborzoi_multisp_no_unet_undo_clip -f 0,1 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/mini_borzois_v2/multispecies_no_unet/targets_subset.txt /home/jlinder/mini_borzois_v2/multispecies_no_unet/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_no_unet /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene.py -o mini_borzois_v2/gasperini_miborzoi_multisp_no_unet -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t mini_borzois_v2/multispecies_no_unet/targets_subset.txt mini_borzois_v2/multispecies_no_unet/params_pred.json mini_borzois_v2/multispecies_no_unet gasperini/crispr_genes.gtf
diff --git a/analysis/crispr/gasperini/run_ism_shuffle_gasperini.sh b/analysis/crispr/gasperini/run_ism_shuffle_gasperini.sh
old mode 100644
new mode 100755
index d45fd30..8c5feac
--- a/analysis/crispr/gasperini/run_ism_shuffle_gasperini.sh
+++ b/analysis/crispr/gasperini/run_ism_shuffle_gasperini.sh
@@ -1,3 +1,3 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_k562_ism_shuffle_undo_clip -f 2,3 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 16 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/borzoi_v2/targets_k562.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o saved_models/gasperini_k562_ism_shuffle -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 16 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t targets_k562.txt params_pred.json saved_models gasperini/crispr_genes.gtf
diff --git a/analysis/crispr/gasperini/run_ism_shuffle_gasperini_miborzoi_ablations.sh b/analysis/crispr/gasperini/run_ism_shuffle_gasperini_miborzoi_ablations.sh
old mode 100644
new mode 100755
index 9d056c0..9e6ea82
--- a/analysis/crispr/gasperini/run_ism_shuffle_gasperini_miborzoi_ablations.sh
+++ b/analysis/crispr/gasperini/run_ism_shuffle_gasperini_miborzoi_ablations.sh
@@ -1,19 +1,19 @@
 #!/bin/sh
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_k562_all_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_all/targets_k562_subset.txt /home/jlinder/mini_borzois_v2/k562_all/params_pred.json /home/jlinder/mini_borzois_v2/k562_all /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_k562_all_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/k562_all/targets_k562_subset.txt mini_borzois_v2/k562_all/params_pred.json mini_borzois_v2/k562_all gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_k562_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_k562_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/k562_dnase_atac_rna/targets_k562_dnase_atac_rna_subset.txt mini_borzois_v2/k562_dnase_atac_rna/params_pred.json mini_borzois_v2/k562_dnase_atac_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_k562_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt /home/jlinder/mini_borzois_v2/k562_rna/params_pred.json /home/jlinder/mini_borzois_v2/k562_rna /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_k562_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/k562_rna/targets_k562_rna_subset.txt mini_borzois_v2/k562_rna/params_pred.json mini_borzois_v2/k562_rna gasperini/crispr_genes.gtf
 
-python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_baseline_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/baseline/targets_subset.txt /home/jlinder/mini_borzois_v2/baseline/params_pred.json /home/jlinder/mini_borzois_v2/baseline /home/jlinder/gasperini/crispr_genes.gtf
+borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_baseline_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048
--n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/baseline/targets_subset.txt mini_borzois_v2/baseline/params_pred.json mini_borzois_v2/baseline gasperini/crispr_genes.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_human_all_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/human_all/targets_subset.txt /home/jlinder/mini_borzois_v2/human_all/params_pred.json /home/jlinder/mini_borzois_v2/human_all /home/jlinder/gasperini/crispr_genes.gtf +borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_human_all_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/human_all/targets_subset.txt mini_borzois_v2/human_all/params_pred.json mini_borzois_v2/human_all gasperini/crispr_genes.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_human_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/human_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/human_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf +borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_human_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 
--track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/human_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/human_dnase_atac_rna/params_pred.json mini_borzois_v2/human_dnase_atac_rna gasperini/crispr_genes.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_multisp_dnase_atac_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_dnase_atac_rna /home/jlinder/gasperini/crispr_genes.gtf +borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_multisp_dnase_atac_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/multispecies_dnase_atac_rna/targets_human_dnase_atac_rna_subset.txt mini_borzois_v2/multispecies_dnase_atac_rna/params_pred.json mini_borzois_v2/multispecies_dnase_atac_rna gasperini/crispr_genes.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_multisp_rna_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt 
/home/jlinder/mini_borzois_v2/multispecies_rna/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_rna /home/jlinder/gasperini/crispr_genes.gtf +borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_multisp_rna_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/multispecies_rna/targets_human_rna_subset.txt mini_borzois_v2/multispecies_rna/params_pred.json mini_borzois_v2/multispecies_rna gasperini/crispr_genes.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_crispr_ism_shuffle.py -o gasperini_miborzoi_multisp_no_unet_undo_ism_shuffle_clip -f 0,1 --rc 1 --shifts 0 --span 0 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --mononuc_shuffle 0 --dinuc_shuffle 1 --crispr_file /home/jlinder/gasperini/crispr_table.tsv -t /home/jlinder/mini_borzois_v2/multispecies_no_unet/targets_subset.txt /home/jlinder/mini_borzois_v2/multispecies_no_unet/params_pred.json /home/jlinder/mini_borzois_v2/multispecies_no_unet /home/jlinder/gasperini/crispr_genes.gtf +borzoi_satg_gene_crispr_ism_shuffle.py -o mini_borzois_v2/gasperini_miborzoi_multisp_no_unet_ism_shuffle -f 0,1 -c 0 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 10 --ism_size 1 --window_size 2048 --n_samples 32 --dinuc_shuffle --crispr_file gasperini/crispr_table.tsv -t mini_borzois_v2/multispecies_no_unet/targets_subset.txt mini_borzois_v2/multispecies_no_unet/params_pred.json mini_borzois_v2/multispecies_no_unet gasperini/crispr_genes.gtf diff --git a/analysis/crispr/gradients_aggregate_folds.ipynb b/analysis/crispr/gradients_aggregate_reps.ipynb similarity index 51% rename from analysis/crispr/gradients_aggregate_folds.ipynb rename to analysis/crispr/gradients_aggregate_reps.ipynb index 
174f799..92c107d 100644 --- a/analysis/crispr/gradients_aggregate_folds.ipynb +++ b/analysis/crispr/gradients_aggregate_reps.ipynb @@ -19,52 +19,72 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "d4988000", "metadata": {}, "outputs": [], "source": [ "#Load scores and auxiliary data, compute mean over folds, and save new scores\n", "\n", - "fold_index = [0, 1, 2, 3]\n", + "#Specify dir with score files\n", + "grad_dir = '../../../borzoi/examples/saved_models/flowfish'\n", + "\n", + "fold_index = [3]\n", + "cross_index = [0, 1]\n", "\n", "#Initialize HDF5\n", - "scores_h5 = h5py.File('scores_mean.h5', 'w')\n", + "scores_h5 = h5py.File(grad_dir + '/scores_mean.h5', 'w')\n", "\n", "seqs = None\n", "grads = None\n", + "preds = None\n", "genes = None\n", "chrs = None\n", "starts = None\n", "ends = None\n", "strands = None\n", "\n", - "#Loop over folds\n", - "for fold_i, fold_ix in enumerate(fold_index) :\n", - " \n", - " #Load score file\n", - " score_file = h5py.File('scores_f' + str(fold_ix) + 'c0.h5', 'r')\n", - "\n", - " if fold_i == 0 :\n", - " seqs = score_file['seqs'][()]\n", - " grads = score_file['grads'][()]\n", - " genes = score_file['gene'][()]\n", - " chrs = score_file['chr'][()]\n", - " starts = score_file['start'][()]\n", - " ends = score_file['end'][()]\n", - " strands = score_file['strand'][()]\n", - " else :\n", - " grads += score_file['grads'][()]\n", - " \n", - " #Collect garbage\n", - " gc.collect()\n", - "\n", - "grads /= float(len(fold_index))\n", + "rep_i = 0\n", + "\n", + "#Loop over folds and crosses\n", + "for fi in fold_index :\n", + " for ci in cross_index :\n", + "\n", + " print(\"Aggregating over replicate 'f\" + str(fi) + \"c\" + str(ci) + \"'\")\n", + "\n", + " score_file = h5py.File(grad_dir + '/scores_f' + str(fi) + 'c' + str(ci) + '.h5', 'r')\n", + "\n", + " if rep_i == 0 :\n", + " seqs = score_file['seqs'][()]\n", + " grads = score_file['grads'][()]\n", + " if 'preds' in score_file :\n", + " 
preds = score_file['preds'][()]\n", + " genes = score_file['gene'][()]\n", + " chrs = score_file['chr'][()]\n", + " starts = score_file['start'][()]\n", + " ends = score_file['end'][()]\n", + " strands = score_file['strand'][()]\n", + " else :\n", + " grads += score_file['grads'][()]\n", + " if 'preds' in score_file :\n", + " preds += score_file['preds'][()]\n", + "\n", + " #Collect garbage\n", + " gc.collect()\n", + " \n", + " rep_i += 1\n", + "\n", + "#Normalize by number of replicates\n", + "grads /= (float(len(fold_index)) * float(len(cross_index)))\n", + "\n", + "if preds is not None :\n", + " preds /= (float(len(fold_index)) * float(len(cross_index)))\n", "\n", "#Re-save datasets in h5\n", "scores_h5.create_dataset('seqs', data=np.array(seqs, dtype='bool'))\n", "scores_h5.create_dataset('grads', data=np.array(grads, dtype='float16'))\n", - "\n", + "if preds is not None :\n", + " scores_h5.create_dataset('preds', data=np.array(preds, dtype='float16'))\n", "scores_h5.create_dataset('gene', data=np.array(genes, dtype='S'))\n", "scores_h5.create_dataset('chr', data=np.array(chrs, dtype='S'))\n", "scores_h5.create_dataset('start', data=np.array(starts))\n", @@ -100,7 +120,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.13" + "version": "3.8.15" } }, "nbformat": 4, diff --git a/analysis/eqtl/bench_eqtl.sh b/analysis/eqtl/bench_eqtl.sh deleted file mode 100644 index cde91f5..0000000 --- a/analysis/eqtl/bench_eqtl.sh +++ /dev/null @@ -1,3 +0,0 @@ -#!/bin/sh - -westminster_gtex_folds.py -d 0 -e tf210 -g ~/seqnn/data/gtex_fine/susie_pip90 --max_proc 24 --msl 12 --name "gtex" -p 96 -o gtexu -q geforce --rc --stats SAD,logSAD,D2,logD2 -t v9/hg38/targets.txt -u params.json train diff --git a/analysis/eqtl/bench_eqtl_sad.sh b/analysis/eqtl/bench_eqtl_sad.sh new file mode 100755 index 0000000..52f5771 --- /dev/null +++ b/analysis/eqtl/bench_eqtl_sad.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_bench_gtex_folds_sad.py -d 0 -e 
borzoi_py310 -g data/gtex_fine/susie_pip90 --susie data/gtex_fine/tissues_susie --max_proc 24 --msl 12 -p 96 -o gtexu -q rtx4090 --f_list 3 -c 4 --rc --stats SAD,logSAD,D2,logD2 -t targets_human.txt -u params.json saved_models diff --git a/analysis/eqtl/bench_eqtl_sed.sh b/analysis/eqtl/bench_eqtl_sed.sh new file mode 100755 index 0000000..9e74bd1 --- /dev/null +++ b/analysis/eqtl/bench_eqtl_sed.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_bench_gtex_folds_sed.py -d 0 -e borzoi_py310 --gtex data/gtex_fine/susie_pip90 --susie data/gtex_fine/tissues_susie -p 8 -o gtexug -q rtx4090 --f_list 3 -c 4 --rc --stats SED,logSED -t targets_gtex.txt -u params.json saved_models diff --git a/analysis/gtex_motifs/explore_grads_liver_CFHR2.ipynb b/analysis/gtex_motifs/explore_grads_liver_CFHR2.ipynb new file mode 100644 index 0000000..5b0e3a8 --- /dev/null +++ b/analysis/gtex_motifs/explore_grads_liver_CFHR2.ipynb @@ -0,0 +1,502 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "7030e9ad", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import h5py\n", + "\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from scipy.stats import spearmanr, pearsonr\n", + "\n", + "from scipy.ndimage import gaussian_filter1d\n", + "\n", + "import seaborn as sns\n", + "\n", + "import matplotlib.cm as cm\n", + "import matplotlib.colors as colors\n", + "\n", + "import matplotlib as mpl\n", + "from matplotlib.text import TextPath\n", + "from matplotlib.patches import PathPatch, Rectangle\n", + "from matplotlib.font_manager import FontProperties\n", + "from matplotlib import gridspec\n", + "from matplotlib.ticker import FormatStrFormatter\n", + "\n", + "#Helper function to draw a letter at a given position\n", + "def dna_letter_at(letter, x, y, yscale=1, ax=None, color=None, alpha=1.0):\n", + "\n", + " #Define letter heights and colors\n", + " fp = FontProperties(family=\"DejaVu 
Sans\", weight=\"bold\")\n", + " globscale = 1.35\n", + " LETTERS = {\t\"T\" : TextPath((-0.305, 0), \"T\", size=1, prop=fp),\n", + " \"G\" : TextPath((-0.384, 0), \"G\", size=1, prop=fp),\n", + " \"A\" : TextPath((-0.35, 0), \"A\", size=1, prop=fp),\n", + " \"C\" : TextPath((-0.366, 0), \"C\", size=1, prop=fp),\n", + " \"UP\" : TextPath((-0.488, 0), '$\\\\Uparrow$', size=1, prop=fp),\n", + " \"DN\" : TextPath((-0.488, 0), '$\\\\Downarrow$', size=1, prop=fp),\n", + " \"(\" : TextPath((-0.25, 0), \"(\", size=1, prop=fp),\n", + " \".\" : TextPath((-0.125, 0), \"-\", size=1, prop=fp),\n", + " \")\" : TextPath((-0.1, 0), \")\", size=1, prop=fp)}\n", + " COLOR_SCHEME = {'G': 'orange',#'orange', \n", + " 'A': 'green',#'red', \n", + " 'C': 'blue',#'blue', \n", + " 'T': 'red',#'darkgreen',\n", + " 'UP': 'green', \n", + " 'DN': 'red',\n", + " '(': 'black',\n", + " '.': 'black', \n", + " ')': 'black'}\n", + "\n", + "\n", + " text = LETTERS[letter]\n", + "\n", + " #Choose color\n", + " chosen_color = COLOR_SCHEME[letter]\n", + " if color is not None :\n", + " chosen_color = color\n", + "\n", + " #Draw letter onto axis\n", + " t = mpl.transforms.Affine2D().scale(1*globscale, yscale*globscale) + \\\n", + " mpl.transforms.Affine2D().translate(x,y) + ax.transData\n", + " p = PathPatch(text, lw=0, fc=chosen_color, alpha=alpha, transform=t)\n", + " if ax != None:\n", + " ax.add_artist(p)\n", + " \n", + " return p\n", + "\n", + "#Function to plot sequence logo\n", + "def plot_seq_scores(importance_scores, figsize=(16, 2), plot_y_ticks=True, y_min=None, y_max=None, save_figs=False, fig_name=\"default\") :\n", + "\n", + " importance_scores = importance_scores.T\n", + "\n", + " fig = plt.figure(figsize=figsize)\n", + " \n", + " ref_seq = \"\"\n", + " \n", + " #Loop over reference sequence letters\n", + " for j in range(importance_scores.shape[1]) :\n", + " argmax_nt = np.argmax(np.abs(importance_scores[:, j]))\n", + " \n", + " if argmax_nt == 0 :\n", + " ref_seq += \"A\"\n", + " elif 
argmax_nt == 1 :\n", + " ref_seq += \"C\"\n", + " elif argmax_nt == 2 :\n", + " ref_seq += \"G\"\n", + " elif argmax_nt == 3 :\n", + " ref_seq += \"T\"\n", + "\n", + " ax = plt.gca()\n", + " \n", + " #Loop over reference sequence letters and draw\n", + " for i in range(0, len(ref_seq)) :\n", + " mutability_score = np.sum(importance_scores[:, i])\n", + " color = None\n", + " dna_letter_at(ref_seq[i], i + 0.5, 0, mutability_score, ax, color=color)\n", + " \n", + " plt.sca(ax)\n", + " plt.xticks([], [])\n", + " plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.3f'))\n", + " \n", + " plt.xlim((0, len(ref_seq)))\n", + " \n", + " #plt.axis('off')\n", + " \n", + " if plot_y_ticks :\n", + " plt.yticks(fontsize=12)\n", + " else :\n", + " plt.yticks([], [])\n", + " \n", + " #Set axis limits\n", + " if y_min is not None and y_max is not None :\n", + " plt.ylim(y_min, y_max)\n", + " elif y_min is not None :\n", + " plt.ylim(y_min)\n", + " else :\n", + " plt.ylim(\n", + " np.min(importance_scores) - 0.1 * np.max(np.abs(importance_scores)),\n", + " np.max(importance_scores) + 0.1 * np.max(np.abs(importance_scores))\n", + " )\n", + " \n", + " plt.axhline(y=0., color='black', linestyle='-', linewidth=1)\n", + "\n", + " #for axis in fig.axes :\n", + " # axis.get_xaxis().set_visible(False)\n", + " # axis.get_yaxis().set_visible(False)\n", + "\n", + " plt.tight_layout()\n", + "\n", + " if save_figs :\n", + " plt.savefig(fig_name + \".png\", transparent=True, dpi=300)\n", + " plt.savefig(fig_name + \".eps\")\n", + "\n", + " plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a3c3eb2c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "len(gene_df) = 1\n", + "len(tissue_genes) = 1\n" + ] + } + ], + "source": [ + "#Load gene dataframe and select tissue\n", + "\n", + "tissue = 'liver'\n", + "\n", + "gene_df = pd.read_csv(\"/home/jlinder/seqnn/data/diff_expr/gtex_diff_expr_log2fc_5k.csv\", 
sep='\\t')\n", + "gene_df = gene_df.query(\"tissue == '\" + str(tissue) + \"'\").copy().reset_index(drop=True)\n", + "gene_df = gene_df.drop(columns=['Unnamed: 0'])\n", + "\n", + "#Select CFHR2 example gene\n", + "gene_df = gene_df.query(\"gene_base == 'ENSG00000080910'\").copy().reset_index(drop=True)\n", + "\n", + "print(\"len(gene_df) = \" + str(len(gene_df)))\n", + "\n", + "#Get list of gene for tissue\n", + "tissue_genes = gene_df['gene_base'].values.tolist()\n", + "\n", + "print(\"len(tissue_genes) = \" + str(len(tissue_genes)))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "3bcaea3d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scores_hyp.shape = (1, 1, 524288, 4)\n", + "scores.shape = (1, 1, 524288, 4)\n" + ] + }, + { + "data": { + "text/plain": [ + "0" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Load scores for the selected set of GTEx tissues (grad)\n", + "\n", + "import gc\n", + "\n", + "seqs = None\n", + "strands = None\n", + "chrs = None\n", + "starts = None\n", + "ends = None\n", + "genes = None\n", + "\n", + "all_scores_hyp = []\n", + "all_scores = []\n", + "\n", + "gtex_tissues = ['liver']\n", + "\n", + "#Load score file\n", + "score_file = h5py.File('../../../borzoi/examples/saved_models/gtex_CFHR2/scores_f3c0.h5', 'r')\n", + "\n", + "#Get scores and onehots\n", + "scores = score_file['grads'][()][..., 0]\n", + "seqs = score_file['seqs'][()]\n", + "\n", + "#Get auxiliary information\n", + "strands = score_file['strand'][()]\n", + "strands = np.array([strands[j].decode() for j in range(strands.shape[0])])\n", + "\n", + "chrs = score_file['chr'][()]\n", + "chrs = np.array([chrs[j].decode() for j in range(chrs.shape[0])])\n", + "\n", + "starts = np.array(score_file['start'][()])\n", + "ends = np.array(score_file['end'][()])\n", + "\n", + "genes = score_file['gene'][()]\n", + "genes = 
np.array([genes[j].decode().split(\".\")[0] for j in range(genes.shape[0])])\n", + "\n", + "gene_dict = {gene : gene_i for gene_i, gene in enumerate(genes.tolist())}\n", + "\n", + "#Get index of rows to keep\n", + "keep_index = []\n", + "for tissue_gene in tissue_genes :\n", + " keep_index.append(gene_dict[tissue_gene])\n", + "\n", + "#Filter/sub-select data\n", + "scores = scores[keep_index, ...]\n", + "seqs = seqs[keep_index, ...]\n", + "strands = strands[keep_index]\n", + "chrs = chrs[keep_index]\n", + "starts = starts[keep_index]\n", + "ends = ends[keep_index]\n", + "genes = genes[keep_index]\n", + "\n", + "#Append hypothetical scores\n", + "all_scores_hyp.append(scores[None, ...])\n", + "\n", + "#Append input-gated scores\n", + "all_scores.append((scores * seqs)[None, ...])\n", + "\n", + "#Collect garbage\n", + "gc.collect()\n", + "\n", + "#Collect final scores\n", + "scores_hyp = np.concatenate(all_scores_hyp, axis=0)\n", + "scores = np.concatenate(all_scores, axis=0)\n", + "\n", + "print(\"scores_hyp.shape = \" + str(scores_hyp.shape))\n", + "print(\"scores.shape = \" + str(scores.shape))\n", + "\n", + "score_file = None\n", + "\n", + "#Collect garbage\n", + "gc.collect()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "955bf762", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-- Example = 0 --\n", + " - ENSG00000080910(+)\n", + " - chr1:196692638-197216926\n", + " -- min_val = -1.719\n", + " -- max_val = 3.385\n", + " - (Gradient score profiles per tissue) - \n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + " - (Attribution at position of Max positive differential saliency) -\n", + " - max_pos (rel) = 251085\n", + " - max_pos (abs) = 196943723\n", + " - chr1:196943627-196943819\n", + " - y_min = -1.78648438\n", + " - y_max = 3.45445312\n", + "liver\n" + ] + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxYAAABZCAYAAACjWLKDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAuiUlEQVR4nO3dd3Qc1fnw8e8W7ap3y6q2XOQGNu4F9wYGh2YIeflBgFASIBAChEAIENMJxBBKANMCJIbQAomBYMANG5tiywYbN7kXSVZfSSuttt33j2ut2kpa1ZXt53OOzu5o7szcbTP3mdsMSimFEEIIIYQQQnSAMdgZEEIIIYQQQhz/JLAQQgghhBBCdJgEFkIIIYQQQogOk8BCCCGEEEII0WESWAghhBBCCCE6TAILIYQQQgghRIdJYCGEEEIIIYToMHN7N/R6veTm5hIVFYXBYOjMPAkhhBBCCCF6AKUUFRUVpKamYjS2XCfR7sAiNzeXjIyM9m4uhBBCCCGEOE4cOnSI9PT0FtO0O7CIioryHSQ6Orq9uxFCCHESmDcPLrkErrgi2DkRQgjRFuXl5WRkZPjK/i1pd2BR2/wpOjpaAgshhBDNKi+Hb7+F6Gi46aZg50YIIUR7BNL1QTpvCyGE6FKrV4PHA19/DUoFOzdCCCG6igQWQgghutTGjfrRZtMBhhBCiBOTBBZCCCG61P79wc6BEEKI7iCBhRBCiC518GCwcyCEEKI7SGAhhBCi7XL/B5X7A0paUtK1WRFCCNEzSGAhhBCibbwe+PpK+O76gJKXlcHgwRAX16W5EkIIEWTtHm5WCCHESapwLTgKIG8ZOIogNLHF5GVlcM89kJfXPdkTQggRHBJYCCGEaJv8z449UVCxs8XAwuvV81hMnAhFRd2TPSGEEMEhgYUQQoi2Kd8ReNJyPXdFv36QktKFeRJCCBF0ElgIIYRom8q9ASctLdWPSUlgMOg/IYQQJyYJLIQQQrRNTeBtmsrLISwMzHK1EUKIE56c6oUQQrRNTREMvhnc9laTut0QHd0NeRJCCBF0ElgIIYQInLsKPA7IWACmcPDWtJzcDVFRjf5Zvz2UUp2fRyGEEEEhgYUQQojA1TaDih0OITFg29ZicpcLIiO7IV9CCCGCTgILIYRoq8Y9kE+mu+41xWAKBcux2e6ih7aY3O0Gi6Ub8iWEECLoZOZtIYToBi+8AH/6U7Bz0QncFXVBBYDR1CSJx+vhlexXKLQX4naDqWkSIYQQJyAJLIQQootVV8O998KDD0JubmDbLFq3iO+OfOdbfnHji2wt2NpFOWwDrwss8S0mKa8p55ql17CtcBtudztGhKodl1bGphVCiOOKBBZCBMkPR3/g0bWPBjsbohusWAGFhXoW6p07A9tm0fpFfHXoK9/yPSvvYfX+1V2UwzbwusAS22KSMkeZ71FqLIQQ4uQhgYUQQfLN4W94eM3Dwc6G6AYbN7Z9mzJHma+ArpRqsBxUXhcYrS0msdXYfI9+ayxq+6ScTH1ThBDiJCCBhRBBUuYoo8JZgdvrDnZWRFvVLxgr1bDpjp/mO/v3g9UKaWmB7d7pcVL
MS1vG3P5zefvHtxmc0LDhu9FgZETvEfx13l854x9nNN0fcOXIK1nw9gKeOPOJgF7+tGn1In6MTEyfyL7SfYxJGYPZGgfjnvWlDQWG9x5OpCWS0Smj9d3AyP51HaQbyYzNxO114z7WeS4zNrPV/Fw9+mquHn11QHnvav3i+vme943p26Q2YnzaeKpcVQxJHEJCeM+YYWha32mtJzpm3z5YcOxGWlpaF2WoLRr91mYdWMvC1Qv5seBHXj//9SBmrOsZDLpTe31r18J77+nJ+saOhZ8O+ylT/z6VOybfEZxMdkTZVkiaCqG9umVwjfCQcIb1GsbT3zzN7H6zff8/Xsb1GJUyir2le1m2ZxmXDb+Mc8+Et9/WQx+feWbT9FkJWWTFZ9Enpo/fWmqjwcjinywmryKPPjF98Hpb7rCfHJnMH1f8EYAJaRNg3F9g8x26v0PSTGbNuocnntB9RywWmDz52IbtHIjgzIFncufyO3F5XMy4aEbrGzQ6Ti/g5XNeJjY01n+tcSOjU0YzKX0SI5NHYjIGNnJBW+bbOGPAGTz77bN4ldfvdW9qn6kcLj9MSXUJw3v3gHl+2iIkCk5/w7c4CD163nvvQXw8TJpEg3N59VuKiHpdSExGExPSJrCreBcT0if4ff/nDZjHXSvu4qyBZ2E0GJmcMZmnvnmKWf30UOFWqx7Nyi9zFNQUwil/gIwLGvbn8mNo4lCSIpJ49qxnuXfVvQxKGITZaCY2NBaP19P095Q6H364G4bfD8rLoN4jcHvd5FfmM6XPFE5NOhW7U49CeXqGHqVkWK9hDOs1TG9vDAGPA9IvgITx3BJt5KmndYuBs86Cn//cwhkDzmDxxsXsu3kf7NoLh/8L456DxNOZGJpBYVUhEZYIxqeN52M+bvH11ercUaHqjyVoMDSK8xWn9DqF3IpcjlQc4ZSkU/xv33g/rTGaAw4qgq7R+zG011AemvVQ3cW78b2RRoYnDWdT/iYK7YWM6t+Hq6/WVaWnnqo7cU5Kn8RL2S8xJmWMrpqu/34mjIfy7XqytOJvwX6QcanjeP371zmt92kNm2sdMzJ5JO9te4+shCxdsG9kUMIgtt6wlTMGnNGut2NyxmSe2/AckzMm+10/IW0Cizcu1hebVmTGZhIeEs7KK1aSEJZApCWwAQV6iv5x/RmXOo5lly2jf1zT4CkuLI5ISyRLdy1lfNr45nfUQ4frKyvTncV6jEa/tfFp4/nuyHdkxmb2uKYq3aF/f/j97+G3v9UX0gnpE1h6yVJ+MyHYs5i3g6sMIvpB4br2jfLTDrP7zeb97e8zu//s1hP3MEaDkdMzTmfpzqXMyJyByQT/93/wwAO6c3pjWfFZZMZmMj9rPlnx/pvIzcicwSXDLwFg2DD46CP9//37m6ZNiazrkZ0SlQJJU+CMr2D2Chh+D488ooPdf/7T/4hqbRUfFs/A+IGcnnF6iwN8tGR2/9mMSR3TesJjnj37Wa4ZfU27jtWaWf1msfrAaub2n+t3/dS+U3ll0yuMSxvnPxBqT7kriCZOhIce0k3bgQbn8T59YMsW3WypdhCD1soZ8wbOIzsvm3kD9YiCMzNn8uaWN5mZOdN/Buq/X+nnwb43dJOljbf4Jg1tTlp0Gi6PC6PByNDEoQyIH0DviN4sv3w5A+L9RC99LoLRT+qBKML072RS+iRe2PACU/pMwWgwMjplNC9mv+gLLBpInAy7F+v5RSr3Y7EauP12PeLXz3+ukzw06yFWXrGS1KhUOOWPuiP6h2mw/XFiQmNICEvg/e3vMz61hXJHI507f2srhZpTkk5hR9EOTEYTp/TyE1j00EJRV2rLhXtE7xE8v+F5hiQOwWAw8NRTevhDiwX69oVJUZP4/sPvuWDIBXqDxu/n5Lfhx4f0LMVJ0xibOpY/rfoTt0y8penBgFHJo3jgywd8P7jONrnPZO5ddS+T+/j/wU9Mn8jfvvsbt026ze/6+iIsEUSERGAxWQKqrehp+sX241D
5IY6UH6FfbD+/aSakT2DJD0tYsmBJN+eu42rvM/RUIaYQHp79MGlRx6pTWqk9PBmcOdDP7erjgatCjx5jP6BHN+kGl5x6CSaDqeWgvwf7zYTfMKXPFOLC4pqubNTyYFB5LntK97C3dC8jk0e2uu+bb9Z3lpcsgZtu0gFLfcmRyUxIm0B8WDzJEclNtrda9Rw5nWn55cs7d4dBFG2NZvWVq+kb09fv+rGpY6l0VjIlw0+UCCfUuW3iRAgJ0SOBORzw7be6nLFw9cK6wKLRze+zss6i6PYi4sPiAZg/aD7Z+dnN3yRoUq56Bw69q0egM7RepB6cOJiPcz5mWK9hJEUkUeGsYGfRTr83FAHdnzT1LN/ilD5T+HDHh77XMzljMp/v/dx/YDHwl1C0Dr6YDinzoN+lTZKkRaeRFn3suhfaC2Y0rJWYlD6J179/vU19Hzs3sGhFenQ6xdXFAPSN7SsX7zYanjSc9YfWc+3oup6vw+vVbA6MH8jDsx7m/CHn+99B6jz9d8w4UwwhxpBmawRGJo/kUPmhgC4e7TE5YzIrr1jZ7PFnZM7g2tHXMrWPn85xfvSN7cvag2ubDywa16b1IHFhcdS4a9hZvLNBs6j6zhl0DnanXd/VO87ExekhAHuyG8bdULfQw74fog1MVn2HrtcUmPIORDce37jzZSVk8cdpf+zy43SVGZkzmJE5I6C0KVEplFSXsLN4Jz89pekkaI0NHqwHCNi+Xc9d0GR/kSkM7TWUgXED8bRyx/ekFEA5yW+h8hiLyULZHWV+Wx2caIxGPUnr88/XDRIwre801l+9nlHJ/gvGRoOxQfPigfED29YcNvVM/RegoYlD+TjnY54880lAt7b48sCXDIhrrr1VQ9eNvY4FQxf4bgJcPfpqRqeMJinCz+BD5jCY8nbAefPnhnE3MLXv1AbTEbSm279pgxIG4fEeO3nIxbtNhvYaisloYkTvEc2m+cPUPwS8v7iwOKr+WNXs+uG9h3PNqGuarxLsIKvZ2uLFLD06nRfPeTHg/WXGZvLVoa+arZ7v6TJjM1lzcE2zNUgXn3IxF59ysd91Pd2gQfqEv2CBLmC0NhO1EO0WEqP7WURk6D/RMX6u07Xn2kEJg/xs0FRCgv9mVaBrLArsBURZojg16dSO5PTE1AnlpPY2+ToeZWbCn/9ct2wxWZiYPrHuH11d7mx8A7PR8rBew/j75r/7+kH0j+vPqgOruHnCzf63byQ8JLzBzdP06HTSo9O76tUwKmUUo1JGUV5eHvA23T7z9o3jbuSm8Td192FPCBaThQO/PcBVo67qmgM0ujMSag7lpXNf0sOf9USN8psZk8lXB79qvsaihf4rPUG/uH5syN3QbFOo49mcOfDyy7o/0LvvBjs34oQW3geK1uuJxGpK9CSFolMNShiEyWAiObJp06W2SolK4WjlUQrsBQ36WwhxIhqTMoahiUN9A+YMiBvAjwU/1tVY9PBySiC6vcaitkOXaJ/UqNTWE52kMmMzKa4urgssenDTJ3/6x/bH6XE239byOHb++brd66pVOrgQosskTtL9K9b+DCpyYNbnYG5mil7RLqOSR1FSXdIp+0qJTOGo/ShR1qjjspmnEA00Lms0Wp7Zbybbfr3Nt9w/rj8KdUJd90/8RnfixNXoB1vbN+F4vePfL64fsaGx/jtQHufMZvjsMz06VELPGC1XnKhCEyF5Dhx6DzDokQNFp7pzyp3cOaVzelTHhcVRXlNOfmV+p9SACHE8GZ0ymtn9ZpMRc+I02zTouTjarry8nJiYmICm9xaiO+RW5PLMN89wz/R7CA85/u5Qur1uatw1RFgiWk8sut9xUvMlgJpi+P4u6D0L+vqZaVgET+MhTZUi86+ZHLUfpfzOckJMIcHJlxCiWW0p88utHHHCSI1K5ZE5jwQ7G+1mNpoxW+QnKUSHWRNg/OJg50IEKDkyGbvLLkGFECcAKcUIIYQQImhSolKockkneyFOBBJYCCGEEKJ
7+GlOmByRjN1pD0JmhBCdTQILIYQQQgTN42c8jtvrDnY2hBCdQAILIYQQQgRNpCUy2FkQQnSSbp8gTwghjjv150RpPKqNEEIIIQAJLIQQQgghhBCdQAILIYQQQgghRIdJYCGEEEIIIYTosHZ33q6dsLu8vLzTMiOEED2SzdZwWc57QgghThK1ZX3lZ7joxtodWFRUVACQkZHR3l0IIYQQQgghjgMVFRXExMS0mMagAgk//PB6veTm5hIVFYVBRkkRQgghhBDihKOUoqKigtTUVIzGlntRtDuwEEIIIYQQQoha0nlbCCGEEEII0WESWAghhBBCCCE6TAILIYQQQgghRIdJYCGEEEIIIYToMAkshBBCCCGEEB0mgYUQQgghhBCiwySwEEIIIYQQQnSYBBZCCCGEEEKIDpPAQgghhBBCCNFhElgIIYQQQgghOkwCCyGEEEIIIUSHSWAhhBBCCCGE6LD/D+1t7S0lZaOrAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--------------------\n", + "\n" + ] + } + ], + "source": [ + "#Enumerate and visualize attributions; liver example CFHR2\n", + "\n", + "save_index = []\n", + "\n", + "#Visualization parameters\n", + "logo_width = 192\n", + "\n", + "top_n = 1\n", + "\n", + "use_gaussian = True\n", + "min_padding = 65536\n", + "gaussian_sigma = 8\n", + "local_window = 1024\n", + "\n", + "main_tissue_ix = 0\n", + "\n", + "tissue_colors = ['darkblue']\n", + "\n", + "#Loop over examples\n", + "for example_ix in range(top_n) :\n", + " \n", + " print(\"-- Example = \" + str(example_ix)+ \" --\")\n", + " \n", + " print(\" - \" + tissue_genes[example_ix] + \"(\" + str(strands[example_ix]) + \")\")\n", + " print(\" - \" + chrs[example_ix] + \":\" + str(starts[example_ix]) + \"-\" + str(ends[example_ix]))\n", + "\n", + " #Grad analysis\n", + " \n", + " #Calculate min and max scores globally (for scales)\n", + " min_val = np.min(scores[:, example_ix, ...])\n", + " max_val = np.max(scores[:, example_ix, ...])\n", + " \n", + " print(\" -- min_val = \" + str(round(min_val, 4)))\n", + " print(\" -- max_val = \" + str(round(max_val, 4)))\n", + " \n", + " max_abs_val = max(np.abs(min_val), np.abs(max_val))\n", + "\n", + " min_val -= 0.1 * max_abs_val\n", + " max_val += 0.1 * max_abs_val\n", + "\n", + " print(\" - (Gradient score profiles per tissue) - \")\n", + " \n", + " #Gradient profiles across input sequence\n", + " f, ax = plt.subplots(len(gtex_tissues), 1, figsize=(8, len(gtex_tissues) * 1.5))\n", + " \n", + " if len(gtex_tissues) == 1 :\n", + " ax = [ax]\n", + "\n", + " #Loop over tissues\n", + " for tissue_ix in range(len(gtex_tissues)) :\n", + "\n", + " #Get tissue scores\n", + " score = scores[tissue_ix, example_ix, ...]\n", + "\n", + " l1 = ax[tissue_ix].plot(np.arange(seqs.shape[1]), np.sum(score, axis=-1), linewidth=1, linestyle='-', 
color=tissue_colors[tissue_ix], label=gtex_tissues[tissue_ix])\n", + " \n", + " plt.sca(ax[tissue_ix])\n", + " \n", + " plt.xlim(0, seqs.shape[1])\n", + " plt.ylim(min_val, max_val)\n", + " \n", + " plt.legend(handles=[l1[0]], fontsize=8)\n", + " \n", + " plt.yticks([], [])\n", + " plt.xticks([], [])\n", + " \n", + " plt.sca(ax[0])\n", + " plt.title(\"Gradient Saliency for gene = '\" + tissue_genes[example_ix] + \"' (\" + str(strands[example_ix]) + \")\", fontsize=8)\n", + " \n", + " plt.sca(ax[len(gtex_tissues)-1])\n", + " plt.xlabel(chrs[example_ix] + \":\" + str(starts[example_ix]) + \"-\" + str(ends[example_ix]), fontsize=8)\n", + " \n", + " plt.sca(plt.gca())\n", + " plt.tight_layout()\n", + " \n", + " plt.show()\n", + "\n", + " #Apply gaussian filter\n", + " smooth_score = np.sum(scores[main_tissue_ix, example_ix, ...], axis=-1)\n", + " if use_gaussian :\n", + " smooth_score = gaussian_filter1d(smooth_score.astype('float32'), sigma=gaussian_sigma, truncate=2).astype('float16')\n", + " \n", + " #Calculate min/max positions and (differential) values\n", + " max_pos = np.argmax(smooth_score[min_padding:-min_padding]) + min_padding\n", + "\n", + " print(\" - (Attribution at position of Max positive differential saliency) -\")\n", + "\n", + " print(\" - max_pos (rel) = \" + str(max_pos))\n", + " print(\" - max_pos (abs) = \" + str(starts[example_ix] + max_pos))\n", + " \n", + " #Visualize contribution scores\n", + " plot_start = max_pos - logo_width // 2\n", + " plot_end = max_pos + logo_width // 2\n", + " \n", + " print(\" - \" + chrs[example_ix] + \":\" + str(starts[example_ix] + max_pos - logo_width // 2) + \"-\" + str(starts[example_ix] + max_pos + logo_width // 2))\n", + "\n", + " #Logo min/max value across tissues\n", + " min_logo_val = np.min(scores[:, example_ix, plot_start:plot_end, :])\n", + " max_logo_val = np.max(scores[:, example_ix, plot_start:plot_end, :])\n", + "\n", + " max_abs_logo_val = max(np.abs(min_logo_val), np.abs(max_logo_val))\n", + "\n", 
+ " min_logo_val -= 0.02 * max_abs_logo_val\n", + " max_logo_val += 0.02 * max_abs_logo_val\n", + "\n", + " print(\" - y_min = \" + str(round(min_logo_val, 8)))\n", + " print(\" - y_max = \" + str(round(max_logo_val, 8)))\n", + "\n", + " #Loop over tissues\n", + " for tissue_ix in range(len(gtex_tissues)) :\n", + " print(gtex_tissues[tissue_ix])\n", + "\n", + " #Get tissue-specific scores\n", + " score = scores[tissue_ix, example_ix, plot_start:plot_end, :]\n", + "\n", + " #Plot scores as sequence logo\n", + " plot_seq_scores(\n", + " score,\n", + " y_min=min_logo_val,\n", + " y_max=max_logo_val,\n", + " figsize=(8, 1),\n", + " plot_y_ticks=False,\n", + " )\n", + " \n", + " print(\"--------------------\")\n", + " print(\"\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67a3cf9d", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/analysis/gtex_motifs/explore_polya_grads_CD99.ipynb b/analysis/gtex_motifs/explore_polya_grads_CD99.ipynb new file mode 100644 index 0000000..2d813a5 --- /dev/null +++ b/analysis/gtex_motifs/explore_polya_grads_CD99.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "7030e9ad", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import h5py\n", + "\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from scipy.stats import spearmanr, pearsonr\n", + "\n", + "import seaborn as sns\n", + "\n", + "import matplotlib.cm as cm\n", + 
"import matplotlib.colors as colors\n", + "\n", + "import matplotlib as mpl\n", + "from matplotlib.text import TextPath\n", + "from matplotlib.patches import PathPatch, Rectangle\n", + "from matplotlib.font_manager import FontProperties\n", + "from matplotlib import gridspec\n", + "from matplotlib.ticker import FormatStrFormatter\n", + "\n", + "#Helper function to draw a letter at a given position\n", + "def dna_letter_at(letter, x, y, yscale=1, ax=None, color=None, alpha=1.0):\n", + "\n", + " fp = FontProperties(family=\"DejaVu Sans\", weight=\"bold\")\n", + " globscale = 1.35\n", + " LETTERS = {\t\"T\" : TextPath((-0.305, 0), \"T\", size=1, prop=fp),\n", + " \"G\" : TextPath((-0.384, 0), \"G\", size=1, prop=fp),\n", + " \"A\" : TextPath((-0.35, 0), \"A\", size=1, prop=fp),\n", + " \"C\" : TextPath((-0.366, 0), \"C\", size=1, prop=fp),\n", + " \"UP\" : TextPath((-0.488, 0), '$\\\\Uparrow$', size=1, prop=fp),\n", + " \"DN\" : TextPath((-0.488, 0), '$\\\\Downarrow$', size=1, prop=fp),\n", + " \"(\" : TextPath((-0.25, 0), \"(\", size=1, prop=fp),\n", + " \".\" : TextPath((-0.125, 0), \"-\", size=1, prop=fp),\n", + " \")\" : TextPath((-0.1, 0), \")\", size=1, prop=fp)}\n", + " COLOR_SCHEME = {'G': 'orange',#'orange', \n", + " 'A': 'green',#'red', \n", + " 'C': 'blue',#'blue', \n", + " 'T': 'red',#'darkgreen',\n", + " 'UP': 'green', \n", + " 'DN': 'red',\n", + " '(': 'black',\n", + " '.': 'black', \n", + " ')': 'black'}\n", + "\n", + "\n", + " text = LETTERS[letter]\n", + "\n", + " chosen_color = COLOR_SCHEME[letter]\n", + " if color is not None :\n", + " chosen_color = color\n", + "\n", + " t = mpl.transforms.Affine2D().scale(1*globscale, yscale*globscale) + \\\n", + " mpl.transforms.Affine2D().translate(x,y) + ax.transData\n", + " p = PathPatch(text, lw=0, fc=chosen_color, alpha=alpha, transform=t)\n", + " if ax != None:\n", + " ax.add_artist(p)\n", + " return p\n", + "\n", + "#Function to plot sequence logo\n", + "def plot_seq_scores(importance_scores, figsize=(16, 
2), plot_y_ticks=True, y_min=None, y_max=None, save_figs=False, fig_name=\"default\") :\n", + "\n", + " importance_scores = importance_scores.T\n", + "\n", + " fig = plt.figure(figsize=figsize)\n", + " \n", + " ref_seq = \"\"\n", + " for j in range(importance_scores.shape[1]) :\n", + " argmax_nt = np.argmax(np.abs(importance_scores[:, j]))\n", + " \n", + " if argmax_nt == 0 :\n", + " ref_seq += \"A\"\n", + " elif argmax_nt == 1 :\n", + " ref_seq += \"C\"\n", + " elif argmax_nt == 2 :\n", + " ref_seq += \"G\"\n", + " elif argmax_nt == 3 :\n", + " ref_seq += \"T\"\n", + "\n", + " ax = plt.gca()\n", + " \n", + " for i in range(0, len(ref_seq)) :\n", + " mutability_score = np.sum(importance_scores[:, i])\n", + " color = None\n", + " dna_letter_at(ref_seq[i], i + 0.5, 0, mutability_score, ax, color=color)\n", + " \n", + " plt.sca(ax)\n", + " plt.xticks([], [])\n", + " plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.3f'))\n", + " \n", + " plt.xlim((0, len(ref_seq)))\n", + " \n", + " #plt.axis('off')\n", + " \n", + " if plot_y_ticks :\n", + " plt.yticks(fontsize=12)\n", + " else :\n", + " plt.yticks([], [])\n", + " \n", + " if y_min is not None and y_max is not None :\n", + " plt.ylim(y_min, y_max)\n", + " elif y_min is not None :\n", + " plt.ylim(y_min)\n", + " else :\n", + " plt.ylim(\n", + " np.min(importance_scores) - 0.1 * np.max(np.abs(importance_scores)),\n", + " np.max(importance_scores) + 0.1 * np.max(np.abs(importance_scores))\n", + " )\n", + " \n", + " plt.axhline(y=0., color='black', linestyle='-', linewidth=1)\n", + "\n", + " #for axis in fig.axes :\n", + " # axis.get_xaxis().set_visible(False)\n", + " # axis.get_yaxis().set_visible(False)\n", + "\n", + " plt.tight_layout()\n", + "\n", + " if save_figs :\n", + " plt.savefig(fig_name + \".png\", transparent=True, dpi=300)\n", + " plt.savefig(fig_name + \".eps\")\n", + "\n", + " plt.show()\n", + "\n", + "#Function to visualize a pair of sequence logos\n", + "def 
visualize_input_gradient_pair(att_grad_wt, att_grad_mut, plot_start=0, plot_end=100, save_figs=False, fig_name='') :\n", + "\n", + " scores_wt = att_grad_wt[plot_start:plot_end, :]\n", + " scores_mut = att_grad_mut[plot_start:plot_end, :]\n", + "\n", + " y_min = min(np.min(scores_wt), np.min(scores_mut))\n", + " y_max = max(np.max(scores_wt), np.max(scores_mut))\n", + "\n", + " y_max_abs = max(np.abs(y_min), np.abs(y_max))\n", + "\n", + " y_min = y_min - 0.05 * y_max_abs\n", + " y_max = y_max + 0.05 * y_max_abs\n", + "\n", + " if np.sum(scores_mut) != 0. :\n", + " print(\"--- WT ---\")\n", + " \n", + " plot_seq_scores(\n", + " scores_wt, y_min=y_min, y_max=y_max,\n", + " figsize=(8, 1),\n", + " plot_y_ticks=False,\n", + " save_figs=save_figs,\n", + " fig_name=fig_name + '_wt',\n", + " )\n", + "\n", + " if np.sum(scores_mut) != 0. :\n", + " \n", + " print(\"--- Mut ---\")\n", + " plot_seq_scores(\n", + " scores_mut, y_min=y_min, y_max=y_max,\n", + " figsize=(8, 1),\n", + " plot_y_ticks=False,\n", + " save_figs=save_figs,\n", + " fig_name=fig_name + '_mut',\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "534495a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scores.shape = (1, 524288, 4)\n" + ] + } + ], + "source": [ + "#Load scores\n", + "\n", + "score_file = h5py.File('../../../borzoi/examples/saved_models/gtex_CD99/scores_f3c0.h5', 'r')\n", + "\n", + "scores = score_file['grads'][()][:, :, :, 0]\n", + "seqs = score_file['seqs'][()][:]\n", + "genes = score_file['gene'][()][:]\n", + "genes = np.array([genes[j].decode() for j in range(genes.shape[0])])\n", + "strands = score_file['strand'][()][:]\n", + "strands = np.array([strands[j].decode() for j in range(strands.shape[0])])\n", + "\n", + "#Input-gate the scores\n", + "scores = scores * seqs\n", + "\n", + "print(\"scores.shape = \" + str(scores.shape))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": 
"4dcb8667", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-- 0 (+) --\n", + " - gene_id = 'ENSG00000002586.20\n" + ] + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxYAAABZCAYAAACjWLKDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAUXUlEQVR4nO3df5RUdf3H8ded2cXdWXZXlF2VdkWE5deCISs/DCIsKsUICiyTtBBT05NpHis7St8lKY5ZYaQeUZODlT8KqLDyB5QawRISpLASC4rtsuIuIvv758x8/3hz984su8syM+uy9HycM2dmPnN/fO7nfu6P9+dz7x0nHA6HBQAAAABx8PV2BgAAAAD0fQQWAAAAAOJGYAEAAAAgbgQWAAAAAOJGYAEAAAAgbgQWAAAAAOJGYAEAAAAgbkmxjhgKhVReXq709HQ5jpPIPAEAAAA4CYTDYdXU1GjQoEHy+bruk4g5sCgvL1dubm6sowMAAADoI0pLS5WTk9PlMDEHFunp6W0zycjIiHUyAAAAAE5S1dXVys3NbTv370rMgYV7+VNGRgaBBQAAOEZdXZ3OO+88SdL+/fuV9swzUr9+qpszJzo9La33MgmgW7pz60PMgQUAAMDxHDp0yPty7bX2XlsbnQ7glMBToQAAAADEjR4LAAAA9BnBYFAtLS29nY1TTnJysvx+f1zTILAAAABAn1BbW6uysjKFw+Hezsopx3Ec5eTkqH///jFPg8ACAAAAJ71gMKiysjIFAgFlZWXxP2oJFA6HVVlZqbKyMuXl5cXcc0FgAQAAgJNeS0uLwuGwsrKylJqa2tvZOeVkZWVp//79amlpIbAAAAAnF5/Pp4suuqjt8/HSge6gp6JnJKJcCSwAAECPSE1N1datW7udDqBvo5kAAAAAiFFLS4sKCws1cuRI5efn68ILL9ScOXO0Y8eOuKftOI5qa2slSePGjVNDQ0Nc01u2bJkqKirizldn6LEAAAAAYrRgwQLV1tZq8+bNGjBggCRp3bp12rVrl8aNGxc1bDAYjPn+hUQEKsuWLdOMGTOUnZ0d97Q6QmABAAB6RH19vUaPHi1JKi4uVqCz9ECgkykAXaivl3bv7tl5jBwpdVE/S0pKtHbtWpWWlrYFFZI0a9YsSdLKlSv11FNPKTs7W8XFxVq+fLk2b96sJ598Uq2trUpOTtby5cs1adIkSdKaNWv0ve99TwMGDNDMmTOj5uU4jmpqatS/f3+VlJTo1ltvVUVFhZqbm3XDDTfopptuahtu6dKlWrNmjSoqKrRo0SItWLBAixcvVnl5uebNm6eUlBStXLnymMAnXgQWAACgR4TDYb399tttn4+XDpyQ3bulgoKence2bdL48Z3+vH37dg0bNkxnnHFGp8Ns3LhR27dvV15eniRp2LBh+ta3viVJKioq0sKFC7Vz505VVFToa1/7mjZt2qQRI0bo3nvv7XB6wWBQV111lZ544gmNHDlS9fX1mjx5siZPnqzxR/OakpKiLVu26I033tDEiRN19dVXa9GiRfrlL3+p3/3udxozZkysJdIlAgsAAAD0PSNH2ol/T8/jOCKfprRv3z7NnTtXDQ0NmjZtmqZMmaKpU6e2BRWSBSNLlizRe++9p6SkJBUXF6u5uVlFRUUaP368RowYIUm6/vrr9Z3vfOeY+f3nP//Rrl27dOWVV7al1dTUqLi4uC2wmD9/viRp1KhRSkpK0sGDB5WTkxNbGZwAAgsAAAD0PYFAl70JH4QLL7xQJSUlev/99zVgwAANHTpUO3bs0MqVK/Xss
89KUtQ/WTc3N2vu3Ll66aWXVFBQoOrqamVmZqq5ubnbvXfhcFgDBw7s8p6LlJSUts9+v1+tra2xLeAJ4qlQAAAAQAzy8vI0e/ZsLVy4UEeOHGlLr6ur63D4xsZGtbS0KDc3V5K0fPnytt8uvvhibd++XXv27JEkPfroox1OY8SIEQoEAlq1alVb2t69e3X48OHj5jcjI0NVVVXHHS5WBBYAAABAjFauXKmxY8dq0qRJGj16tKZMmaL169frjjvuOGbYjIwMLV68WBMnTtS0adN02mmntf2WnZ2tFStWaNasWfrIRz7S6Z9HJiUlad26dXrmmWd0wQUXKD8/X9ddd123HkV7yy23aMGCBRo3blxCnjLVnhOO8a4pt+umqqpKGRkZic4XAADo4+rq6touA6mtrVXa0c91tbXR6WlpvZZH9B2NjY166623NGTIkKhLfZAYnZXviZzzc48FAADoEY7jtD1WNvIG187SAfRtBBYAAKBHBAIB7dq1q9vpAPo27rEAAABAn8F/n/SMRJQrPRYAAAA46SUnJ8txHFVWViorK4vL6BIoHA6rsrJSjuMoOTk55ukQWAAAgB5RX1+vCRMmSJK2bt2qQGfpgUAnUwA8fr9fOTk5Kisr0/79+3s7O6ccx3GUk5Mjv98f8zQILAAAQI8Ih8MqLi5u+3y8dOB4+vfvr7y8PLW0tPR2Vk45ycnJcQUVEoEFAADoCatXS8OG9XYucAry+/1xnwCjZxBYAACAxJs3T0pP7+1cAPgA8VQoAADQM2pqejsHAD5ABBYAAAAA4kZgAQAAACBu3GMBAAB6hCNp8ODB9jniPwccx+kwHUDfRmABAAB6REDq8P8GAoEA/0MAnIK4FAoAAABA3AgsAAAAAMSNwAIAAPSIBkkTJkzQhAkT1NDQ4KU3NHSYDqBv4x4LAADQI0KSXn31VfscCnnpoVCH6QD6NnosAAAAAMSNwAIAAABA3AgsAAAAAMSNwAIAAABA3AgsAAAAAMSNp0IBAIAeM3DgwBNKB9B3EVgAAID4/f3v0pQpks+7GCJNUmVl5TGDpqWldZgOoG/jUigAABCfhx+Wpk2T7r+/t3MCoBcRWAAAgPjcc4+9b97cu/kA0KsILAAAQHzcf8/2+6OSGyRNnz5d06dPV0NDg5fe0NBhOoC+jXssACAeO3dKw4dL/fr1dk6A7tu6VTr9dCkvLzHTCwbt3b2/orRUkhSS9PLLL9vniMukQqGQl+4GJa4XX5QOH5a++MVj5/Puu1JlpTRmTGLyfaprbpb+9jfp05/u7ZzgfwQ9FgAQq6YmaexY6etfT8z0Nm2SHEeqqkrM9E4FBw9KLS32+cgRK59vf1sKh6WNG3s1a8dVVCQVF/d2LsyWLdLbb3vfJ060gDhWd98tbdjgfW9qsvdw2E78zz332HHuvNP73L+/9/mVV6SsLGnHDjsR/tSnpCuvtGm1N3asvdA9S5dKl14q7dkjvfxyx2Xa06qrpZOtVyoUsoALCeeEw7HVsurqamVmZqrqoYeUkZpqrRSOYz+6k1y/3g6Q8+bZSnQcbxhJqqmxin755Udz49i41dVSRoY3rDtu5LTbZ9udfzhsr8h57dgh5eZKZ55pvzU22gH8E5+QVq2SrrhCSkmJzoPjSK++Kg0aJJ19tv32la/Ye2GhVF5u0xs1qvNCilzWztJDITs4ZmZKp53m7eg3bJAmT5YCgehxDxywnfPjj0t1dTZeezU10v791mI0daqUnh5dXu3Xg/vbtm12490NN9j3qVOjf3ccK7snn5QWLPDK2v29rMze6+tt2JEjvWUsLbV8dVReket2507pvvuk735Xys+3ZbzpJukXv/AORI4jXX21NGSI7awOHrTfjxyx9TJ5svTCC9KkSdJzz1mrV+Q8amqsDJctk+66y8ZZtUpqbZWSkmyabhm0d++9Una21bemJpvOnDnSoUPShAnWypafL91+u9WrFSuix589W5o1S7ruOvt+ySU23pAhU
mqqjb9tm/TrX0sf/agN98gjUkGB9Oyz0r593rQGDbK8u77xDemf/7QTiFisWhW9Db3yirR3r3T99dL8+dHDTpokJSdb3T3vPCknxzvJGzVKeuMNadw42/Yi3X+/1dmvflW65RZrLX38cenWW6X335duu0265hrpnXdsnqGQ9Ic/SGvXSjfeaMOfdZatp+JiafFi6YEHpJtvtjy4ddDN4803e9uzFF3vH3nElvGaa6Tp021fNWCA1fE//clOXkaO9MrDbYltarL9RUmJzb8jhYWWzxtvtJbVG2+UfvMby9P48VZ/T9SXvmTbXqRLL5W+8AXp2muj0z/2Mdu3xiM3V/rBD+zzpk1Wly+7zMrdPUGdPl166aXo8ebMse1927aOp3vttVa2P/+59NZb0o9+5J1wPvCAtaL//vdWl884Q9q+Pb7liNftt0s/+UnipvfEE976j1z2vDyrU911ySXS3Lm2HS5aZPU2kQFp++0pRnWS3BCiVvaUqK7STwoDB9o+XbIb0s880/ZBJ2LCBOsV6o7LLpP+8pfotMsvt/1Qfr7lJ97tuSsFBZ1vr5EWLpRef922zcsus310dbUdo1133mmXw7n33Fx1le37XLNmSevWdT2f7GzbL993n5c2e7YdC+66y86P3Ht5HnzQzpNKS20bWLLE0j/+cdt/P/jg8ZdrwADL16pV0tChdpwdMsS2rT17bD8bDkvPP2/DP/ywHa8jz4Hanw91ll5SIv30p968H3vMjnd//KOV46pV0l//ar8VFkqDB9vx0u+3Y+X//Z/05pud1y/3GN7e44/b8vzrX9Hzl6Tvf186/3z7vHatBfnvvivNmGH74C9/WdWSMiVVVVUpIyOjy+KMP7CQ1PUsgD6ks40SAHDC+mRgASSS25jrON4lg33MiQQW8d9jceiQ9S60v0YyHLa0cNgirc5a70MhryWwo14Nd1rSsa2O7rChUPT8Ozo5jJyGu3J9vuj5t9e+pXPvXmud9ftt/HDYWk47G/dE0iOXOzL/7Zc5HLYWU/d67s7Kyi37pCSvtdUtq47y4E7HnXZn5eKWXeQNepEbS/uyj1yuYNDy01FviaulxXoUMjK8+bs9CZHTb22178nJ3npyl83ni153HS1nVZXNw73Eol+/Y6cTDEqrV1tL4tix9ltysi175DCOE91j5vdbT8LZZ9v03Xk2NlrLitvzk5wcvcNxpxMMShUV1krmDuPWt6YmaylautRaLNxlam21YYNBa0FqbrYyS072eqwOHPB639xlCQatVyglxb67fD4bv77elqepyX5varKeNXd5Kyuttae11XqJCgqkD33I2ybd+YRC1hMUCERvP01N1jM4dap9b2nx8uXWMbec3f2Iu/7r6qyVauxYr5zfe09KS7P1mZRk47Svjy43j+42Etlb45ZB5HYX2RLllvmbb0of/rBdFnLbbbZ87jL06+ddHiLZ8qemet/d5WlosDz4fF7PqWTlOXGiLWd6utdL9sADdplIVpZNIynJq4dFRdYr4i5bOGzrcP9+y2d6upeX+nrbfyclWatUUpJUW2t1b/hwr+zc18aNVufcerFkifTNb9q4hw5Zq2R6uvVw+f1Wx6urrafRrVtuft1tvbnZ6pNbNyL3T+781661lrP+/aWZM72W2yuukH77Wyvn11+3no65cy2fCxZYr2d2tlfPamqkX/1KWrPGWvsvukj6zGcsPzU1Vs6OY8uTkmLj+P2Wd7eOtLbaMqWledt6fr70s59ZuVdWWs/hCy/Ysj30kJXZ6NE2nUDA1mNlpc3v/POlp56yXifHsfm0tNjrzTdt23j0Ues5GjrU8hUKWZm5L5/P20e++KL1FL7zjo0/c6a3vm+/3cpl9Gj7PniwNGyYXQ5SWWk9nWPG2H6xvFy64AJbhgMHvN7rAwest2r4cCvDO+6wHqyDB6WVK5UQK1ZYHZKsJbWw0K4WWLjQ6v7TT3vDfvKT1nrs89m6dre9sjJrgf/c57zt9PTTb
fsLBm09BALSa6/ZstTWeuuzpcX2aT6f9ZYVFHjb59KlVmeys20+br1++GHp4outN/bcc706fOSIbafBoPSPf9g8hw+3utzaai/HsXyFQjb/hgYr98xMq5f9+nn5qquzfVx6uuWhosLq6znnWL1++mlbR489ZstbUmIt4YWFtt7dy/I2bLDW/MxMy4N7/PP5rCwqK209795tveY1NdaKX1Bgw+/bZ3XH3UdOnGiXJ+7Z4/VYhMO2LCkp3n4mPd3e3f3Tli02vYIC26ekptowzc3esd09xvh89nKPC7t32/CDB3v58PnsPK211XocQiH77Pfbcrn7v/vus5b/OXNsn+Cu+8ZGKwv3+OT327mCe7VLUlL0sVs69hje/lyuI3V1ViaZmVY+7vHEXbbnn7ee7aws+x4M2rK4eXP3m62t3vGxqMi2U3ddNjRY2d1zj83nhz/06lhrq5WdO77f783bcayuuec5kvVe7Npldb8b4u+x6Eb0AgA4Cbz+evQll31NYaGdEEh24v/cc9JnP+v9/tprFkAVF3d9mSoSb9w46d//tpOV8vK2eyy61WNx111Ku+ceuxQ2P987ITt82E4qkRgbNlgAHHl/C9ANJ3LOT2ABAOgbmpqsl2L+/M5bA9E7/vtfuw/Mvd7+6Pqpk5R9dJAKRQQW55yj7HfesfTaWqWlRVwM1f6eSgC9isACAAD0nrFjrQciNbXjJwLt2eP1nLU/DRkyxC7fI7AATgoncs7P/1gAAIDEan/fZXud3Z8o2VNxXnklsfkB8IEgsAAAAInVUW9DUZE9DlyKflhEe0OG2AtAn0NgAQAAEutoj0VjKKS5M2ZIwaBWn3WW3OeeNTqO5h79vLqxUSmRT0QD0GcRWAAAgMQ6GlgEJf15/Xr77D7eW1Lw9NP1Z/dzH322P4BjdfIHDgAAADHq6FKoyP9xAXBKIrAAAACJ9eMf23vk/1DwnxTAKY/AAgAAJNacOdLq1fanbAD+Z3CPBQAASLzPf16qq4tOu/tu6cCB3skPgB5HYAEAAD4Yixfbe/uAA8ApIebAwv3D7urq6oRlBgAAnDrqIgKI6urqtidAdZYO4OTjnuuHO3ooQztOuDtDdaCsrEy5ubmxjAoAAACgDyktLVVOTk6Xw8QcWIRCIZWXlys9PV2O48SUQQAAAAAnr3A4rJqaGg0aNEg+X9fPfYo5sAAAAAAAF4+bBQAAABA3AgsAAAAAcSOwAAAAABA3AgsAAAAAcSOwAAAAABA3AgsAAAAAcSOwAAAAABA3AgsAAAAAcSOwAAAAABA3AgsAAAAAcSOwAAAAABA3AgsAAAAAcft/j8SIsC+/oBgAAAAASUVORK5CYII=\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Visualize polya-centric gradient for gene(s)\n", + "\n", + "#Find position of max saliency\n", + "max_poses = np.argmax(np.sum(scores, axis=-1), axis=-1)\n", + "\n", + "#Loop over genes\n", + "for example_ix in range(scores.shape[0]) :\n", + " \n", + " #Get max pos\n", + " max_pos = max_poses[example_ix]\n", + " \n", + " #Only visualize genes that are not extremely long\n", + " if max_pos >= 150000 and max_pos < seqs.shape[1] - 150000 :\n", + " \n", + " print(\"-- \" + str(example_ix) + \" (\" + str(strands[example_ix]) + \") --\")\n", + " print(\" - gene_id = '\" + str(genes[example_ix]))\n", + "\n", + " #Plot scores\n", + " f = plt.figure(figsize=(8, 1))\n", + "\n", + " #Annotate 4kb window\n", + " plot_start = max_pos - 2000\n", + " plot_end = max_pos + 6 + 2000\n", + "\n", + " l1 = plt.plot(np.arange(seqs.shape[1]), np.sum(scores[example_ix, ...], axis=-1), linewidth=1, linestyle='-', color='red', label='Gradient')\n", + "\n", + " plt.axvline(x=plot_start, color='black', linestyle='--')\n", + " plt.axvline(x=plot_end, color='black', linestyle='--')\n", + "\n", + " plt.xlim(0, seqs.shape[1])\n", + " \n", + " plt.legend(handles=[l1[0]], fontsize=8)\n", + " \n", + " plt.yticks([], [])\n", + " plt.xticks([], [])\n", + "\n", + " plt.tight_layout()\n", + "\n", + " plt.show()\n", + " \n", + " #Visualize contribution scores\n", + " plot_start = max_pos - 100\n", + " plot_end = max_pos + 6 + 100\n", + " \n", + " #Rev-comp scores if gene is on minus strand\n", + " if strands[example_ix] == '-' :\n", + " plot_end = seqs.shape[1] - (max_pos - 100)\n", + " plot_start = seqs.shape[1] - (max_pos + 6 + 100)\n", + " \n", + " #Plot sequence logo\n", + " visualize_input_gradient_pair(\n", + " scores[example_ix, :, :] if strands[example_ix] == '+' else scores[example_ix, ::-1, ::-1],\n", + " np.zeros(scores[example_ix, ...].shape),\n", + " plot_start=plot_start,\n", + " plot_end=plot_end,\n", + 
" save_figs=False,\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d7aefe0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/analysis/gtex_motifs/explore_splice_grads_GCFC2.ipynb b/analysis/gtex_motifs/explore_splice_grads_GCFC2.ipynb new file mode 100644 index 0000000..4296cf6 --- /dev/null +++ b/analysis/gtex_motifs/explore_splice_grads_GCFC2.ipynb @@ -0,0 +1,329 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "7030e9ad", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import h5py\n", + "\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from scipy.stats import spearmanr, pearsonr\n", + "\n", + "import seaborn as sns\n", + "\n", + "import matplotlib.cm as cm\n", + "import matplotlib.colors as colors\n", + "\n", + "import matplotlib as mpl\n", + "from matplotlib.text import TextPath\n", + "from matplotlib.patches import PathPatch, Rectangle\n", + "from matplotlib.font_manager import FontProperties\n", + "from matplotlib import gridspec\n", + "from matplotlib.ticker import FormatStrFormatter\n", + "\n", + "#Helper function to draw a letter at a given position\n", + "def dna_letter_at(letter, x, y, yscale=1, ax=None, color=None, alpha=1.0):\n", + "\n", + " fp = FontProperties(family=\"DejaVu Sans\", weight=\"bold\")\n", + " globscale = 1.35\n", + " LETTERS = {\t\"T\" : TextPath((-0.305, 0), \"T\", size=1, prop=fp),\n", + " \"G\" : TextPath((-0.384, 0), 
\"G\", size=1, prop=fp),\n", + " \"A\" : TextPath((-0.35, 0), \"A\", size=1, prop=fp),\n", + " \"C\" : TextPath((-0.366, 0), \"C\", size=1, prop=fp),\n", + " \"UP\" : TextPath((-0.488, 0), '$\\\\Uparrow$', size=1, prop=fp),\n", + " \"DN\" : TextPath((-0.488, 0), '$\\\\Downarrow$', size=1, prop=fp),\n", + " \"(\" : TextPath((-0.25, 0), \"(\", size=1, prop=fp),\n", + " \".\" : TextPath((-0.125, 0), \"-\", size=1, prop=fp),\n", + " \")\" : TextPath((-0.1, 0), \")\", size=1, prop=fp)}\n", + " COLOR_SCHEME = {'G': 'orange',#'orange', \n", + " 'A': 'green',#'red', \n", + " 'C': 'blue',#'blue', \n", + " 'T': 'red',#'darkgreen',\n", + " 'UP': 'green', \n", + " 'DN': 'red',\n", + " '(': 'black',\n", + " '.': 'black', \n", + " ')': 'black'}\n", + "\n", + "\n", + " text = LETTERS[letter]\n", + "\n", + " chosen_color = COLOR_SCHEME[letter]\n", + " if color is not None :\n", + " chosen_color = color\n", + "\n", + " t = mpl.transforms.Affine2D().scale(1*globscale, yscale*globscale) + \\\n", + " mpl.transforms.Affine2D().translate(x,y) + ax.transData\n", + " p = PathPatch(text, lw=0, fc=chosen_color, alpha=alpha, transform=t)\n", + " if ax != None:\n", + " ax.add_artist(p)\n", + " return p\n", + "\n", + "#Function to plot sequence logo\n", + "def plot_seq_scores(importance_scores, figsize=(16, 2), plot_y_ticks=True, y_min=None, y_max=None, save_figs=False, fig_name=\"default\") :\n", + "\n", + " importance_scores = importance_scores.T\n", + "\n", + " fig = plt.figure(figsize=figsize)\n", + " \n", + " ref_seq = \"\"\n", + " for j in range(importance_scores.shape[1]) :\n", + " argmax_nt = np.argmax(np.abs(importance_scores[:, j]))\n", + " \n", + " if argmax_nt == 0 :\n", + " ref_seq += \"A\"\n", + " elif argmax_nt == 1 :\n", + " ref_seq += \"C\"\n", + " elif argmax_nt == 2 :\n", + " ref_seq += \"G\"\n", + " elif argmax_nt == 3 :\n", + " ref_seq += \"T\"\n", + "\n", + " ax = plt.gca()\n", + " \n", + " for i in range(0, len(ref_seq)) :\n", + " mutability_score = 
np.sum(importance_scores[:, i])\n", + " color = None\n", + " dna_letter_at(ref_seq[i], i + 0.5, 0, mutability_score, ax, color=color)\n", + " \n", + " plt.sca(ax)\n", + " plt.xticks([], [])\n", + " plt.gca().yaxis.set_major_formatter(FormatStrFormatter('%.3f'))\n", + " \n", + " plt.xlim((0, len(ref_seq)))\n", + " \n", + " #plt.axis('off')\n", + " \n", + " if plot_y_ticks :\n", + " plt.yticks(fontsize=12)\n", + " else :\n", + " plt.yticks([], [])\n", + " \n", + " if y_min is not None and y_max is not None :\n", + " plt.ylim(y_min, y_max)\n", + " elif y_min is not None :\n", + " plt.ylim(y_min)\n", + " else :\n", + " plt.ylim(\n", + " np.min(importance_scores) - 0.1 * np.max(np.abs(importance_scores)),\n", + " np.max(importance_scores) + 0.1 * np.max(np.abs(importance_scores))\n", + " )\n", + " \n", + " plt.axhline(y=0., color='black', linestyle='-', linewidth=1)\n", + "\n", + " #for axis in fig.axes :\n", + " # axis.get_xaxis().set_visible(False)\n", + " # axis.get_yaxis().set_visible(False)\n", + "\n", + " plt.tight_layout()\n", + "\n", + " if save_figs :\n", + " plt.savefig(fig_name + \".png\", transparent=True, dpi=300)\n", + " plt.savefig(fig_name + \".eps\")\n", + "\n", + " plt.show()\n", + "\n", + "#Function to visualize a pair of sequence logos\n", + "def visualize_input_gradient_pair(att_grad_wt, att_grad_mut, plot_start=0, plot_end=100, save_figs=False, fig_name='') :\n", + "\n", + " scores_wt = att_grad_wt[plot_start:plot_end, :]\n", + " scores_mut = att_grad_mut[plot_start:plot_end, :]\n", + "\n", + " y_min = min(np.min(scores_wt), np.min(scores_mut))\n", + " y_max = max(np.max(scores_wt), np.max(scores_mut))\n", + "\n", + " y_max_abs = max(np.abs(y_min), np.abs(y_max))\n", + "\n", + " y_min = y_min - 0.05 * y_max_abs\n", + " y_max = y_max + 0.05 * y_max_abs\n", + "\n", + " if np.sum(scores_mut) != 0. 
:\n", + " print(\"--- WT ---\")\n", + " \n", + " plot_seq_scores(\n", + " scores_wt, y_min=y_min, y_max=y_max,\n", + " figsize=(8, 1),\n", + " plot_y_ticks=False,\n", + " save_figs=save_figs,\n", + " fig_name=fig_name + '_wt',\n", + " )\n", + "\n", + " if np.sum(scores_mut) != 0. :\n", + " \n", + " print(\"--- Mut ---\")\n", + " plot_seq_scores(\n", + " scores_mut, y_min=y_min, y_max=y_max,\n", + " figsize=(8, 1),\n", + " plot_y_ticks=False,\n", + " save_figs=save_figs,\n", + " fig_name=fig_name + '_mut',\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "534495a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "scores.shape = (1, 524288, 4)\n" + ] + } + ], + "source": [ + "#Load scores\n", + "\n", + "score_file = h5py.File('../../../borzoi/examples/saved_models/gtex_GCFC2/scores_f3c0.h5', 'r')\n", + "\n", + "scores = score_file['grads'][()][:, :, :, 0]\n", + "seqs = score_file['seqs'][()][:]\n", + "genes = score_file['gene'][()][:]\n", + "genes = np.array([genes[j].decode() for j in range(genes.shape[0])])\n", + "strands = score_file['strand'][()][:]\n", + "strands = np.array([strands[j].decode() for j in range(strands.shape[0])])\n", + "\n", + "#Input-gate the scores\n", + "scores = scores * seqs\n", + "\n", + "print(\"scores.shape = \" + str(scores.shape))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "fd114809", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-- 0 (-) --\n", + " - gene_id = 'ENSG00000005436.14\n" + ] + }, + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAxYAAABZCAYAAACjWLKDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAVi0lEQVR4nO3de3BU5f3H8c8SUiEJQdRgocmolRQKarkUQXDQqq2jqMhQa4tTLUO9o6WodawIir8qo61F4/RqNd7wjlYYbJFWrArhIgElEQlqlBgkgUAum4Qku8/vj8eTs5vsJtlLspvwfs3snLPPOec5z7ns7vPd5znneIwxRgAAAAAQg36JLgAAAACA3o/AAgAAAEDMCCwAAAAAxIzAAgAAAEDMCCwAAAAAxIzAAgAAAEDMCCwAAAAAxKx/tAv6/X6Vl5dr0KBB8ng88SwTAAAAgCRgjFFtba2GDx+ufv06bpOIOrAoLy9XTk5OtIsDAAAA6CX27Nmj7OzsDueJOrAYNGhQ60oyMzOjzQYAAABAkqqpqVFOTk5r3b8jUQcWTvenzMxMAgsAQNx4vV6deOKJkqTS0lKlP/OMNGSIvNOnB6enpyeukABwhOnKpQ9RBxYAAHSX/fv3u2+uu84O6+qC0wEASYW7QgEAAACIGS0WAAAA6DV8Pp+am5sTXYw+JzU1VSkpKTHlQWABAACAXqGurk5lZWUyxiS6KH2Ox+NRdna2MjIyos6DwAIAAABJz+fzqaysTGlpacrKyuI5anFkjFFlZaXKysqUm5sbdcsFgQUAAACSXnNzs4wxysrK0sCBAxNdnD4nKytLpaWlam5uJrAAAPQN/fr10/e///3W8c7SARxZaKnoHvHYrwQWAICkMnDgQG3evLnL6QCA5MBfPgAAAECUmpubdc8992jUqFEaM2aMxo0bp0svvVTbtm2LOW+Px6O6ujpJ0tixY9XQ0BBTfsuWLVNFRUXM5QqHFgsAAAAgSnPmzFFdXZ02bNigIUOGSJJWrlypoqIijR07Nmhen88X9fUL8QhUli1bpvPOO09Dhw6NOa9QCCwAAEmlvr5eo0ePliQVFxcrLVx6WlqYHAAcEerrpZ07u3cdo0ZJHXzXlJSU6NVXX9WePXtagwpJuvjiiyVJ+fn5ev755zV06FAVFxcrLy9PGzZs0HPPPaeWlhalpqYqLy9PkyZNkiStWLFCv/3tbzVkyBBdeOGFQevyeDyqra1VRkaGSkpKNH/+fFVUVKipqUnXXnutbrjhhtb5li5dqhUrVqiiokKLFi3SnDlztGTJEpWXl+vHP/6xBgwYoPz8/HaBT6wILAAAScUYo88//7x1vLN0AEeonTulCRO6dx3vvy+NHx92cmFhoUaMGKFjjjkm7DzvvvuuCgsLlZubK0kaMWKEFixYIEkqKCjQ3LlztWPHDlVUVOjqq6/W+vXrNXLkSD3wwAMh8/P5fJo9e7aefvppjRo1SvX19Zo8ebImT56s8V+XdcCAAdq4caM++ugjnX766fr5z3+uRYsW6fHHH9fLL7+sU045Jdo90iECCwAAAPQ+o0bZin93r6MTgXdT+uSTTzRr1iw1NDRo2rRpmjp1qs4888zWoEKywcjvfvc7HThwQP3791dxcbGamppUUFCg8ePHa+TIkZKka665Rrfffnu79X388ccqKirST3/609a02tpaFRcXtwYWV1xxhSTpu9/9rvr376+vvvpK2dnZ0e2DCBBYAAAAoPdJS+uwNaEnjBs3TiUlJTp48KCGDBmik08+Wdu2bVN+fr5WrVolSUFPsm5qatKsWbO0bt06TZgwQTU1NRo8eLCampq63BJrjNFxxx3X4TUXAwYMaB1PSUlRS0tLdBsYIe4KBQAAAEQhNzdXM2bM0Ny5c3Xo0KHWdK/XG3L+xsZGNTc3KycnR5KUl5fXOu2MM85QYWGhdu3aJUl67LHHQuYxcuRIpaWl6amnnmpN2717t6qqqjotb2ZmpqqrqzudL1o
EFgAAAECU8vPzdeqpp2rSpEkaPXq0pk6dqrVr1+q2225rN29mZqaWLFmi008/XdOmTdNRRx3VOm3o0KH629/+posvvlhTpkwJ+yDQ/v37a+XKlXrxxRd12mmnacyYMfrlL3/ZpVvR3nzzzZozZ47Gjh0bl7tMteUxUV4B5zTdVFdXKzMzM97lAgAcobxeb2vXgbq6OqV/Pe6tqwtOT09PWBkB9LzGxkZ99tlnOumkk4K6+iA+wu3fSOr8XGMBAEgqHo+n9baygRdFhksHACQHAgugpxUUSFlZ0sknJ7okQFJKS0tTUVFRl9MBAMmBwALoaWecYYfchx8AgIjxHJvuEY/9SmABAACApJeamiqPx6PKykplZWXRJTKOjDGqrKyUx+NRampq1PkQWAAAkkp9fb0mTpwoSdq8ebPSwqWnpYXJAUBflJKSouzsbJWVlam0tDTRxelzPB6PsrOzlZKSEnUeBBYAgKRijFFxcXHreGfpAI4cGRkZys3NVXNzc6KL0uekpqbGFFRIBBYAAADoRVJSUmKuAKN78IA8AAAAADEjsAAAAAAQMwILAAAAADEjsAAAAAAQMy7eBgAkFY/HoxNOOKF1vLN0AEByILAAACSVtLS0kPeoD5cOAEgOdIUCAAAAEDMCCwBActm1K9ElAABEgcACSJS33kp0CdDX+HzS0qW9u2K+YYMaRo7UxNxcTZw4UQ0NDa2TGhoaNHHixHbpYXm9Ek/nBYAeQ2ABJMo55yS6BH3P1q3S3XcnuhSJc/fd0h13SCNHJrok0du1S35JW3bv1pYtW+R/6aXWSf5Nm7Rlyxab7vd3nldGhnThhd1XVgBAEAKLI8XatfbfTPR99fXSmDFH5j+1EyZI99yT6FIkztatiS5B7JYuDX5/1VXu+NlnR57f2rUxFQcA0HUEFvH2r39JL7+c6FIE++gj6Yc/lH72s+SqbOblSa+9luhS9Kye2N4zzpCKi6Ubbuj+dSG59OY/D/74R+mdd6SqqsiXLSqSDhwIP72xMXR6c7OUny8ZE/k6AQDtEFjE2wUXSJddluhSBPv0Uzt86SVp2LDElsVhjHTzzdLMme2n/ehH0qxZ7dO93u4vV3dZskR67LHQ2xtvlZV2WFsb33xLSqT33gtOe/11yeOJz7pKS6VLLpGammLPKxl98on02Wfdu459+9zx116Tqqvbz1NWJhUU2Ar86tU2LdR8PW3BAmnaNKmiIvJlTzlFmjw58uUefVSaM0d6++3IlwUAtENg0RX79tnKU6J+fP1+W+mK1kUXueMd/avXkzr6Z/XNN6UVK4LTVq+2/aV37+7ecnVFY2PkLT+LF0tXX90+/ZJLurZ8U5N08KANEu+6q+N59+61w1WrIitjKMZIhw/b8e98RzrzzODpM2bY4Y4dduj322UaG20FNhI33iitXNl3KnnXXy+9+KL7fsQI6dvf7tqymzbZ7xxnv3ZV4D/zM2dKkya57z/+2J4/OTm2VesHP5CmT5fWrJGOPtr+6x+otlb6+9+D06K9GPr55+327NxpP8tOC8GhQ/b7taYm8jxffz34/e7dNpgK1foQrkXCuQA8GQIrAOgLTJSqq6uNJFNdXe0mNjUZ4/O57/1+Y5Yvt+mOL74wprnZDnfvNmbzZvu+udmYvXuNOXw49Ar9fmNKSuzQ73fTAqcHDnfsMObYY425/353nsZGY+67z5j6evve5zPm7beN2bfPmK1b3WULC42RjLnpJruM/VmyL0d5uTFlZXbcmfaXv7jjX35pp0vGXHGFnd/vt9u4b58x771nX2vW2G0uLbXzzphh1+9sp9/v5vnnPxvz4YfuPm5psfvt/vuNue46O8+DD9rtk4xZujS47M7rr381ZssWY6ZMMaaiwt3urVvtsdq+3a7L2U/Ll7vb3txsTFWVMffea8vh9xszfLgxv/61MUVFNr/Ac6HtcamoMOaqq+yxd8rT0GCn+3zBZV6+3Jh//9seMyfttdfceevq2udfXGz
LH3huOOV2lnOO/fbtNs8rr7TDiy6y6T6fPSaHD9v0hQuNqay0+3vhQrcsr7/u5uf3G5Ofb9PffNOYQ4eMeeSR0Ps/3Ovss+3wssuMufZaYx5+2Jht2+z+bjvvsGH2XHD238GDdn94vcHzHThgP2sNDcZ8/rk9n7/80k6rqQn9WXKWra11x194wR0vLHTPV+c1a1bw+3PPtcPHH7flvPFGY044wZiNG+0x3rLFmCeftPOcd57dzsDli4qMeeUVY2691Zh//CN42vLlwZ97p+w+nzuP32+P+bp19v28eXaegwfd5f77X3c7neO+YYM9Dz791JivvrLfAXv32vn27rXLVlXZ89vvt+eEMcYMGmTM6tX2s71pk93fTln277f5Ou/XrXPLXVRkz9mHHrLTsrKM+eST9sd7//7g8WHD3PdTpkR2nrV9TZvmjjufk507jcnNtWlPPWXPocBlvvENeywDz5sPPrDT/vMfY/75Tzv+1lt2HzrLDR/uni/O91SIV51k9PWrrqP0W24x5je/aZ/HqlXt0x5+2Jj//c+W+fBhu/4TT7TT/u//bHpDg/0MFRQYc/nl7nHas8eYJ56wx76y0pa/qso9lyorjXnjDXss33nHft+vWWPPo8pKm++779p8nO8tv9+eCxddZMyyZXYf79jR/rurM1VV9nNtjD0fzznHmF/8wv6+BM5TUhK8nN9vyxKoosL9PESipcWWYeNGY9avt+872o5w+bddZtcuY555xv5O9+9vl/vww9B5V1cH/+Y4+Tm/L85ntSuam+1xa7ueUPUT57smlKYm+53z3nvGfPZZ8P7u7Dg7v0OB87e02PO2s+MTrh7VttydaWx06wGRqKqyv4HOPvd67f4MpazMnnN+v1tXjPR4OcJtU1NT+3O9t4n0eyFQ289FFELW+cPwGGNMNAFJTU2NBg8erGpJmXELcwAARzqvpBO/Hi+VlN5JOrrZgAFSerptdWr7qqiwLZXo+9LTpYED3fcejx063W+zstxzoicNGhS+O67HY/9aiMSxx9phRz08jjnG3X7ns+Dsh860LdOQIbZHgmPgQNtLIfPr2vXhw/amLOFkZEipqcF5xMvX+6LG79fggwdVXV2tzMyOa/39418KSd/7nrR9e7dkHTd5ebZJf8GCRJcEABAgXVKon+hw6ehmixa5laFQr8WL47u++fOlZcs6nicrq+sVuVBGj7Y3uUgGzrWPN94oLVzYtWXGjw++C9zixfY6yki36aSTunbt17hx0k9+YsedSrEz3LlTev996YorbFq4bTjtNNvlMtKbTEyYIJ1/vnTffe2nTZwoXX65VF4uPfRQ8PYsWGArxnfeGbzMvffabrebNrXPb/Zs6dRT7fgdd9jh4MH2Zij332/fX3mlPX8CPwOSPR+d8zY93XYdzciw3WIffNCmX3ON7bKakiI98YRN+9Wvgm+TftdddtnMTJu3z+duw/Tpdt0+n91eSbr1Vhv8P/GE7Z78wANuXnPmuOuZN89eVyZJN91k68GOjAyprs6W5bjj3C7Xt91mh42NXb+Ve080i4TU0BB5c2ukbrvNHvJo1NS4442NxuTlhW9Kck4tZ94vvrDjBw8aM26c7VbQ2bY6XZ5WrAif/y23RLwZIX8GjLFdZCRjVq7sPA+v1zbzh+J0SYrEoUN2OG+eMX/4Q/vl16835rnngtP8frcbS6yam+3559izJ/y8Tz5pu8Q4Tfs5OcFdKxz19W6XqgcftGkPP9y1rig7dhhz9dVul5SPP7bdLmbPdvNvu8zAgbZLzIwZ9hgENhu3PdahdLTNgbZutV2zAvN1jld9vT2309OD15mRETqvxsbgLnErVtj5H3nEpj/7bMfldroWhjsXA8voaLtvAn3wge3u012c7lXGGDN0aHA3rc7Mndv5MQwU7txatMh2XQvsqnH77Xaa19t5voHLOXl2pYuC32+7HQUqKLCfpd273bQ773S7EEb62rHD5vHSS/a90yXq6adt+vz
57rzvvBO6nOXlxkyf3r67x4QJwft+//7u+b1qabFdx/oKp2tfZzZvjrlrBnqRggL7vYNeK5I6f+ICi75k1SpbAesuV11lf+TWrIl82aam0IEFord9uzEvvtg+vb7eVm6c/s7GdF45WrfOnbeqyl5/EOoHt+1yHVVynHmOPz76bQzlkUds+cJZs8autysVi1D8fmP+9Ce373oonVXu5s2z16okm40b7b754IOuze9cl9DR/g60bZsx3/qWe+ynTAm+hiYePv88uv7WXbF2bWRBRSgVFfbYO39A1NRE/53n9dprkQAAPXyNRRf6WyFGBw7YZs5ly6T+UfReu+ce6YUX7PMsJLfZDt1v/37bZPnNb4ae3tVj8eqrUmGhbY79xjds/85wnH6f559vn6vSk+rqbJMqYrN+vTR1qu2rnJXV9eXmzpUef9w+E2L+/G4rXrdwzltJDY8+qgvmzZMkvSHJ6dXdIOkCSTrrLL3xxhsaGNjfuy2/33Y3kPjOA4AYRFLn755rLBBfxx7r9ouLxuLF9hXww40ectxx8cln5szIn4Fx9NHxWXckCCriY8qU6CrDZ51lA4ujjop/mXqQ3+eTc9PhwMuC/ZJNf/tt+Tu7YLjf13dTT02Ne/kAAKHxHIsjybPP2ifbom+79lo7vP76xJYDPc+pbPf2PxHidZehFSsifxYIACBqBBZHktmz2z/gDD3n+ONDP1E83gYMCB7iyOG0cvTGwKK83L1LSby6Ls2caR/sCADoEQQWQE+59FLpscfc92vXds96nH7lkd7SD72f809/v1741T5smHvtUGCLhXPLRwBA0uuFvz5AL1Raav+NDez7fu653bMu5wL/lpbuyR/JqzcHFpLb0hIYWNx+uzteV9ez5QEARKSX/voAvcwJJ/TcRaS0WBy5evs1FhdeKJ19tnTZZW5aNHfCAwAkBN/YQF8zb559IuukSYkuCXrarFnSk0/ap6/2RoMHS2+9JXm9SktLCzlLuHQAQOLxHAugJzU0SGlp9ja0lZWJLg3QOzgtMDyPAgB6XCR1frpCAYlw6qmJLgEAAEBcEVgAidBb+8ADAACEQWABAEgqjY2Nmj59uqZPn67GxsZO0wEAyYGLtwEAScXn82n16tWt452lAwCSAy0WQE866ijppJOkhQsTXRIAAIC4osUC6En9+kmffproUgAAAMQdLRYAAAAAYkaLBQAguf3+99L69YkuBQCgE7RYAACS2y23SK+8kuhSAAA6EXWLhfPA7pqamrgVBgAAr9fbOl5TU9N6B6hw6QCA7uPU9Z26f0c8pitzhVBWVqacnJxoFgUAAADQi+zZs0fZ2dkdzhN1YOH3+1VeXq5BgwbJw1OEAQAAgD7HGKPa2loNHz5c/fp1fBVF1IEFAAAAADi4eBsAAABAzAgsAAAAAMSMwAIAAABAzAgsAAAAAMSMwAIAAABAzAgsAAAAAMSMwAIAAABAzAgsAAAAAMSMwAIAAABAzAgsAAAAAMSMwAIAAABAzAgsAAAAAMTs/wEbSMf9OrtM2wAAAABJRU5ErkJggg==\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxYAAABZCAYAAACjWLKDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/NK7nSAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAoO0lEQVR4nO3dd3hUVfrA8e+k90IKSSAUgQBKpKsgRSxYVl0B18qu7vrT1d0V7KwrKrurrmVt6IoFVKQLSjdA6CA9hYQQICQhpCckmWRSZpKZub8/TkiB9IQM5f08T56UuffOeydTznvPOe/RaZqmIYQQQgghhBDtYGfrAIQQQgghhBCXPkkshBBCCCGEEO0miYUQQgghhBCi3SSxEEIIIYQQQrSbJBZCCCGEEEKIdpPEQgghhBBCCNFuklgIIYQQQggh2s2hrTtarVaysrLw9PREp9N1ZExCCCGEEEKIi4CmaRgMBkJCQrCza7pPos2JRVZWFqGhoW3dXQghhBBCCHGJSE9Pp3v37k1u0+bEwtPTs+ZOvLy82noYIYQQQgghxEWqpKSE0NDQmrZ/U9qcWJwd/uTl5SWJhRBCCCGEEJexlkx9aHNiIYQQQpzn3A8eTbNNHEIIITqdVIUSQgghhBBCtJskFkIIIYQQQoh2k8RCCCGEEEII0W6SWAghhBBCCCHaTRILIYQQQgghRLtJYiGEEEIIIYRoN0kshBBCCCGEEO0miYUQQgghhBCi3SSxEEII0T7lWbBzElgqaxfE0zRZHE8IIa4wklgIIYRon8zVkLEKMtfaOhIhhBA2JImFEEKI9smoTiiyI2wbhxBCCJuSxEIIIUT7FB5U301nbBuHEEIIm5LEQgghRNtVlUhCIYQQApDEQgghRHuUZ9o6AiGEEBcJSSyEEEK0XWWh+u7ka9s4hBBC2JwkFkIIIdrOVAg6O7gzFhy9bR2NEEIIG5LEQgghRNtVFoLvMHDvAUETbR2NEEIIG5LEQgghRNtVFoJnP/VzwCjbxiKEEMKmJLEQQgjRdqZCcAlUP7v3tm0sQgghbEoSCyGEEG1nKa+dW6HT2TYWIYQQNiWJhRBCiLazVoGTj62jEEIIcRGQxEIIIUTbaWZw9LJ1FEIIIS4CklgIIYRoO2sV2Dk1uUlEUgTFxuJOCkgIIYStSGIhhBCidSwm2DoRsjaoHgudQ6ObWjUrv1n8GxLyEzoxQCGEELbQ+KeBEEII0ZCU7yAnEgwnIWA02DlAVgSYzkDv39fb1GAyoKGhN+ptE6sQQohOIz0WQgghWid9pfpekQnW6h6L1AVw9P3zNi0yFgFIYiGEEFcASSyEEEK0jv5w7c+aGXT2jW9anVBIYiGEEJc/GQolhBCiWQaTAWcHZ5ywgjEXRi+CjNXqRs3a6H6SWAghxJVDeiyEEEI0a9jXw1h6ZCmUpYODO/R4EAbNBDtH1WvRCL1RT4BbAEUVRZ0YrRBCCFuQxEIIIUSz8svyyS/LV/MqPPqAnT34hKv5FdamE4se3j2kx0IIIa4AklgIYUua1mSjTIiLQaWlkmJTMfnl+WAuB9eQ2hvP9lh4X9PgvjWJhUnfOcEKIYSwGUksxCXlrZ1vsSR+ia3D6BhWM2y7A1aFQskJADYlb6LEVGLjwISo70z5GQDyyvLAagIn39obdQ5qkbx+zzS4r/RYCCHElUMSC3FJWXN8DTvTdjZ8o6apBnoTE0kvKplrIWcTGHMgdxsVVRXcvvB2DmYetHVkHULT4MsvYWcj/y5x6cgry6O7V3fVY2ExqpW2K3Ihdzvo7MBsaHRfSSyEEOLKIYlFZzm9AjaNhpytto7kkpZlyCLTkNnwjUf+Bev6w56pAFRUVVBeVd6J0bVS3k7wDIP+zwHUnFdGSYYNg+o4ixbBM8/ApEm2jk
S0V35ZPv39+qs5FhYj2DurBfK2TABrJVQWN7rvuYnF3vS93Dz/5k6KXAghRGeSxKI9Mn+B6BehIqfp7UwFsPf3cGYvnJjdObFdhixWCyWmErIMWeffaC6HhP+oK6nVNfaf3/g8j/78aCdH2QpFMTDodRj+MYTcRUZJBjp0jSdOF4GiInj6adiwofltFy2CsDDo2rVlxy4vB6OxffGJCyO/vDqxKM9XPRSaVnujvQtUNZ1YhHiGYDCpXo2E/AS2ndpW87sQQojLh00Ti7jcuJqxu5ec4kTYeQ8c+wiOfdj0tnk7wdELxq0C70GdEt7lKLcslzC/MDXO+1yFUeDkA1Py4ZrXAEjVp5JalNrwwaxVcGoJlBy/cAE3pzwTugxXP7uHklmSycCAgRd1j8Uzz8BXX8Hf/970dpoGBw/C0qWwZk3zx01JgX79oG9fMFyi7c2TJ2HJEjBfhnPx88vyCfUOVT2Ads6ql+IsRx8wNf4+rjfq8XHxwapZ0TSNU/pTAKQVp13YoIUQQnQ6myYWt/xwCwvjFtoyhLbLWAW+w2D8enDv1fS2hYeg9x+g+29h0BudEd3lQ7OCqRBQw6BCPFU1mipLFZqmsfTIUgrKC6D4CATfoRK4Xo8AkFqUSm5ZLpqmgU5X+wWw9w+w5xHYMNImpwWAuRRcg2p+zSjJYHjw8Noei3Nj7iiaBtmbQB9f/2/NqKqC9eth/nz4wx+a3ra0FCoqYPBglSw0Z/ZscHBQvRulpc1vf9Goftzy8uC66+CRR1RycbnJL8/Hz9UPJ3snKnV2agJ3wBhVdtbJG0pTGt1Xb9TzcuTLGCoNlFeVk1acRn+//qTpJbEQQojLjc0SizPlZzhTfobE/ERbhdA+BQeg/3TodldNNZT7f7yfNcfV5Vmj2UhkciRVlip1Ne9sKUZ7J1tFfOmxmGDLLfCTH0S9QGZJJsEewQS6B5JTmkNOaQ4P//Qw205tU2O86yR4FquFCnMFAW4BFFYU1j+uqRDSf4bRi6H/sx0a8okTMGgQXHttC668m8vA3g1WdIGYV8g0ZDI8eHhtj8XZxn4LGv2tEvsKbLsdfhkM5VkQ8wosc4Z9fwRg2ZFlzI2ee95ux49DQIBKKp5/vum7MBggKAjsWvgOs2ULzJsHu3aBl1drT8hG6jxuK1dC796wcSMEBjawrcUEuTvU4nKXoPyyfPzd/PFz9aO40qSGd3r0Amc/cPRW74dV+gb3LTIW8UvSLxjNRvRGPWn6NG4MvVF6LK5ElkqwWmwdhRDiArJZYpGYn8h13a4j8Ux1YpH0JazuDdvuslVIrWMqUBNvK7LBdIYSUwk/J/7MlpQtAOzL2MfEhROJz4uHqhJ1JT3uTfjVxmP+O7qReiFlrFaLcY1dCR69yDRkEuwZTLBnMFmGLOJy4/Bw8iAuN05VpXFwh/h/wo8eZBoyCfEMoZdPL1L1qfUb6YWHwH8U9HoYBr/doSH/61/Qpw/ceWcLrrzbO6khWWdPtySDIUFDyDZkd2hM9VhMcPJruDkSJmwEw0lImgPDZ4OTHwDfRH/D7P3nzwUqLKztfWiuE6W8HFxcWh5WejqMGgVubuDu3vL9Otp7u98jfE548xvqj9R73Pbvh6eegokT4fbbG9h+5yTYfgf8cmkOhcwrz+NfO/9FSlEKheZKqKgzz8nJR/VgbJ3Y4L51q0HpjXrSitO4ofsNNUOixBUi6StY7gWrQsB4iQ6BFkI0y3aJxZlEbrvqNlKKUlTjPPbvMPS/EHTrhb3jymLI2gD6hNbvW3doirkMHFxhx2/hwNPE5sRyfffricmJAeBg5kGCPYJV6VCdA2gW1YjTxwEw5+Ac1h5f25Fn1rzjs2GFL6zq0fwbe6UeEj+Eox9AZVGnhHee/F3Q768Qeh/0n0aWIYsDmQfIKMkg05BJXG4cd4fdrZI3e1fVuMEONAupRan09O5JT++e6jlWl+kMeFylngNHP4CqjhvUv2ULfP45vPeeumKfVJDETd
/fVDPhvKKqgqKK6sfTwUMlRP3+CqjEIjo7mrKqMiotlY3dRQ29XjX2W6U0Bdy6q9dZ8G1qonuvqdDvaRj6AZqmkVSYRHlVORVVFQAk5CVQWFFIWRm4urbsbtzc1FCoRlXkwPbfwJq+aCkLqahoOqGYtX0WvT/trYa1XUC/pv/KKf0pNbyuKbnb6j1uBQXQs2cj2+rjoTgeJufCbXs6POb9GfvJKW2mgEQ75ZflE5sTS7GpmByrAxiSVMEEUHMsAEqTz9vvbMGFs86Un8FsNdO3S1/psbiSmMvh8Kswfg3cML/jh3cKIS4aNu2xGOg/EC9nL0oyN6uGTsid0PPBC9dVWlUKG0dC+k8Qpyb4nig4QWRyZMv2r3vV29FTHa/rBABismO4J+weUopSsGpWDmYd5PEhj3Mw66DqrTCdgcH/VmFYqvj3zn/zwZ4POvwUG2UqUOVY74qD23aDvTPrTqzjum+uo6yyDIDMkszacc8774OyNLBzgLI0EvMTuen7mzp3XLTpDLiHwpG3YN8fyTRkEnEygticWDJLMonPi2fKwCnE58arx9iYC+Gvg5MfKUUpJBclczj38PkTuM/+H4ti1LCgJiratIamqapJ3bur33U6iDgZQUJ+Qs1z7B9b/kHY52GYrWY1hMRwErwGAKrc7AubXmi88lUdn38ON9wA992nxvc3qyJH3ZepAJwDoDBaJZiFMeDeE1IXwe77OaU/RTfPbgwKHERcbhwGk4ER34zg/V/fx8vr/OFda4+v5e+ba2dyx+bEklyYjJcX5OSApc5LOTYnlpc3vYzFalE9S5794dad6HyuxsGh/qRno9nInvQ9aighsCV1C3Y6O9X7BPyc+DO7T+9uwYm3gLUKNA1N0zice5hJAyZxIPNAg5tmG7Krhzfm13vcmsx3ihMh8CZ1UcOQBOamMq7WKTGVMP778fx3z3877JgNyS/Pr/k5vcqqHrPE/6qLQi6Nl/0qMZUQ6B7IkilLGNtjLMmFyRSUF/C3iL/Zdo7FOfOXKqoq2JW2q+b51hobN8KMGbWV0jae3MiqY6s6MNjLgOGkupgTcKOaV2bnfMHuqspSxfu/vn/Bn1+VlkpmbZ91/oUrqP/8kiRKVNM0jRc3vsjW1Mt72YHOTyyqX2SJZxKZsXkGqfpUsvUnVMMwcy2s7gHGtg8FyciAadPghRdg/371JvPQiof4MeFHVXfde5Aa/jLwZbCYeCXyFR5b9ViDlYZyS3OZMH9CzfCmely6qgnDoVMAiMmJQW/U42TvRHJhMoeyDjG+53iVWLj3gPy9NVf29mXsY0yPMeSV5VFs7JhGbbMMyeA1EDSzSjDKM5l/eD52OjsiTkYAcP/y+7lr8V1YK4tVz8rQ98ElCBw8WHJkCaWVpSw/urxz4gVw8FTDyNx7QsEhMksy8XXxxdPJs6bHYuPJjWSXZlPh3BXyf61JGlKKUojOjmbX6V3nv/E7+0PZaTXJu8sIAL6O+prfLf8d1urF9fak7yEhr+FerZWJKxn73djangdLJVQWoUPDyQlMptptI1MieXn0y0SmRKJpGhEnI7i+2/XsSd8D3ler9U20KqqsFnJKc7DTqZdkZsn5JWejs6PZm74XgH/+Uy08t2MH+Ps38zgefg0O/BlSF0BJouqBcu2m/rd2jmoSuUsA5GwlOjsaHxcf3J3cic6OJjIlkkkDJrE+aT2BgZCQAFZr7Tm+v+d9Vh1bRVJBEkazkcnLJvPEmidwdwdPT4iKUvsAfH7gc1YeW6nmxBTFQO+pcHo56OPw94ekJLWd1QpLjyzlzkV3su7EOsoqy8gyZPHwoIfZmrqV0spSXo58mec3NjPRoyGaVTVySpLAWKiG7xz4M+z/P07pTxHoHsjIkJHsz9x/3q4x2TFcNfsqvo76Wg27q/O4+fvD6dO12xpMBkI/DuXfO/6teit19uo1tecRcHRrstGRV5bH7QtvZ/up7Q2ewlNrn+L+H+8HVCP2gWseqHkNAzXHLC
hQVbymTYPVq9VN86Ln8UvSLw0eNyY7hplbZzbYW1ZQXsCah9Yw9dqp5FUUgWsIxL8J5elqFW7H6okxdg719tMb9QR7BPPQoIcIDwwnuSgZk8XE0fyjth0Kdc78paVHlnL3krtZe2JtqxqF+/bByy/DQw+Br69qOLyw6QVmbJ6ByWxqct/LrvFZngH7n4JD09Swp7rnNmKw6lUuOQ6bRqnhqBfIxuSNfHbgM76K+qrDjz0/dj6bkjcBEJEUwfex36v3g3NpWr3nmNFsZPz34/km6psOj+lyszNtZ81jfLmJzYllXdI6Ptr7ka1DuaBs1mNx7MwxBgYMpJtnN45XVKpyoaH31zT02mrGDFWd5aOPYPhwWH50OW6Obry18y0sVrP64CtPh11TyCtM5HjBcZ4c9iSL4xefd6z/HfwfQ4OG8s7ud86/I//RcPR9SF8BqMTigz0fkKpPZUfaDlL1qdyx6A4S8hIw+gyFtCVw8GlAvfEVVhRib2fPltQtaJqG3qhXV7Hbw1oFZ/ap9TWKz5kU79xFjYt28gVDMmWGFHac2sHkgZNZcXQFh3MO4+HkwbDgYWw/vVs1hiwmSJ6HlrOVZQnLeHLYkyxLWNa+GFvDeyCkr4QgNXY705BJ3DNxLJy8kIySDE4UnGBD8gY0TeOo1VM1VrfdDlV6UvWpPHf9c7w5/s2aq9w1ugyHM7+qIXFmA0azkU/2fYKrgyvrTqzjdPFppm+YztSVU6m0VFJaWco3Ud+QXJiMpmn8Z/d/GBkyUk1wTngHDv1NxZm7jYEDVXnV06ehoKiK7ae2k3gmkc0pmzmcexiLZiHYI1hd0fQdptY12f8kOaYyBvoPxPKGhT8P//N5JWfLKsv40+o/8WzEs+SW5mE0goeH+sy2s6PpBsrJb2DUfDXky94NSlPVat8evVXSm/WLuqLu4EZUdhQRJyNYHL+YqOwo1hxfg7ezt5oA3+UkmqYmbj//PJwuPk1eWR739r+XpUeWsvrYau7oewcmi4mThUlcdx387ndw771gMpuIOBnB9Ounsyh+kRqXb8xVC61lrGHUKHj9dZg1C3JzYV7MPD678zPmxczj1/RfKa0sJTIlkq2pW/k58Wdu7X0rDnYOHMk7Uv9c9fGw7wk1Z+vUOeWZrGbVqMndrobZnVoALoEQ9jdw9md/2jYOZR1i2oZp7MvYB8CWlC1EZ0cDKvn8cOKHzIuZh+Y5oN7jNnIkzJ0L27ZBZCQsiFvA1PCpLIpfRJVbTzizB4JuA59wNYG7iYn5Xxz8ggF+A3hn1/nvO2n6NOJy46gwVxCTHcO6pHW4OLigN+pJKkiqt+0nn0C3bqri1j33qF7i+YfnM2PzjAYXjpyxeQbHC46z4PCCen8/exX/nv73MKr7KLVI3tkyyaDeU32HqJ/r/h2VWHi7eAPg7eLN6eLTTBk4hc2/34xVs9YMt7O1b2O/rXm+ndsozM2FlSth1arzhx4ePgzjx8PQoXD99eqCRDfPbowIGVGbwDWWNJxzP5ekuu878f8Ev5EQOklN6q97bgm5UHwUPPtB98nq79YqVUjD0kwC1tI4qi2IW8Cb49/kx4Qfay4UdYRjZ44xN2YuL216ifKqchbGL+S1sa+16H4Wxy8mPDCcT/Z/0v7P+SZYrBbmx84nNif2gt3HhVRiKuHZiGf5x5Z/XNi5hh2tuQsD1bcvjl/M1PCpxOfFNz/c9hLm0PwmTYuNjcXDw6NV+1Ts/5XStFLeG/ceqxJXsfxwMt01E8zuqxo8ngng0pLxHefz8YEff4SyMvDz03gr9i3C/MIw5hv5dF8iN2XuhjRHyDazqHAOznnORBmjSCpMYkDlALad2oamadze93bmrp/LjaE3kpieyJJNS+jv31/dSXQ0VPSDoxkQ/wGmbveRfiydQ78/xIqjK1gWuYwJzhN4bexrzNw2kyXRaQzOCYKU5eA/mp/Tf+b2PrcTYAlgwYYFrN+xnrjcOJzsnfhg4ge4OLRi1mtdxz
5VE4IDxkHuj9B/Wu1tmgZZXeDbO6A8ncj01QQZgjgad5RtJ7dhybRQXFyMycHEu0mf4uM/Gr67E8ylnMg6hDHdyK+6X8lMzWTl1pWcKDgBwJCgIQS4B7Q4xKoqNalZp1Pj9Z2b6hGvuBr2zoRfu4HXADKSMsg6nkVxQTGxMbH01fXl+zu/5+O9H7Ny7x50paMgNRLcexFXGMfMcTPxc/RjXtw8oqNV45Cz30uug+9/A3aOrC78El22jiJjEbMOz6KHdw98q3zJL8/n9UWvcyjrEA72Drxb8C7TbphGTlIOaWVprMhawYSw/th1GQrJO6B8IRMn+vDgg2BvD+/9EE2v8l70MfbBs9CTd5a9Q2hFKPa59ixPWs6j93yB7rQzWEzEV5rxKPAgOjoaLUtj/5n99KvsVxPzdzHf4ZTnhJujGy98/zwTJ77IXXdBSAg88QT4RkWpTDoqqvYcz/6uHwIrplVfOd8EVbfCF0PUNtdPhKPZEOsO9m5sq9zGNyO+wc/VjxmbZ1BsKubBqx+kj7EPX66dwx13PMrs2XDNNeD403y8C71JS0zjUNYhAt0DcXdyx2w08+7ydxk79q+sW6dek5+t3opXoRe79+9me9p29t00DaclfwKvfqBzZPToaKZPh59+gt5D0jh55CRLC5YSlR3F3Py5THCewDCfYXy450NOJpzE380fnVHHBys+YPoN02ufM+kroTAbgivh2Kswrn/tbZZKOJoFuipIngt+10PWr1AUDicWs0Y3nOk9pjOuxzj+b+3/Mct+FiuPrURv1PP6uNdZuW0lp4NPk5+Xzw9xeYQfq33c+gyJJjERbr4ZZs3SWGj8mAH+A6AQPti2jTvyneDj7mAuBs+jte9vZ/9XZ5/y5gq+Xvc1N4beSPzpeH7w/4EViSswmo2M6TGG5MJkdIU6yuzLmLloJoeyD/HwoIfpZ+rHnLVzmHrt1JrjenioBrG/v5q/siR3Ji5WFxzLHXl98es8Gl5bSCI+N56jcUe5MfRG3o56m3BrOA7VvQ/5Zfm4F7gTHR1NyakSEjMSiQ4bBalrwOdaSMwG/WBI3Qk+A2rPKTqag5kH0bI0oqOjKT1VyvGM44zuMRpfvS/+xf5E7Iygl2+vBl/+Vs3K14e+Jqcsh17evXhsyGPoWntl/+xroDHR0ZwqOlXv+bah+wYCPQJrbn/ySXj8cRg5Up3ap5+q1eSdndU8qrfeUr1zbm4Q4/ERxnIj+Q75fHbiM3pW9Kz9PxtOqmp0AWNU71Xvx9Q8vbO3H54JgeOh/DR49IOu4+rFn1+WT7GpmO5e3dv+GYEq270haQMaGg9c8wCezp6tf9zOqvu+c2oZxC9T55e2CMZcVXtuAMXD1We8uRwyPKH0HTX8OW8bhM9q8/nUiI7GYDKwbc82zBlmrDlW5q6fy4iQ9l2sPGvm1pm4Wl0pKy9jxvwZ7Di6A7JAl62rvZ8GHjctKop3f3qXgf4DqTpTxYc/fchtfW5r9H6qLFWcKDiBnc6OAf4DWvWcn71/Nqf0pzhZeJIv7/6ypjz7pWJe9Dxc8l1wdXDlxfkv8tLol+rdnpCXQEFFAUHuQYT5h9koykbUfS9v4HlgjTrEksgl3Bh6I356Pz76+SOmXD2lk4Nsu9JW1IHXaW2cDVlSUoK3t3dbdhVCCCGEEEJcQoqLi/FqpiZ8u3ssvvtuB25uHnTvDi4pH0DaUgB+MfzM/oSe/POf6mJ5m4aQVmd9j/z0CI+GP0pGSQa5ZbmYrWb83fzZmbaTZSPuwb6yCPxuUFUnbt3W3lOyqYySDB5c/iAju42ksKKQHyb9QEWFutLv7g7v7/kPVdYqdp/ezfz75hPsGWzrkJm1fRbR2dFUVFXwwqB7ubNiLwz7BBLegj5Pge+ghq/kaBq3/HAL8+6dx/qk9bg6uPLEsCdscxJCiFpmI1RUD8dz9AUXvwt7f9XvDw89BH/9K1x1FXTp0oIqZC29sn4lqH4s4nPjeXvX20weOJmlR5
Yye+xzdD/8vCqEYecMY1o+nPXNN9Vh770XjEY1vO7aa9UwzOuvb6bHuRX+/W/o3x8eeEBVk3N1rT0fSo5D7D/gui8hajqLTn5FaronM2dCcTG05/rm/w78jxJTCbtP72buXQuZer8vzz2nCklMmtQx53au+Hg1VPu779R8ss2bYc8emDBB9UD361e77Zkz8NhjamHSdhtef5hivd6mKgPsmgxDP4bUb8G5K9jZQ7d7IG839H2i9hhRUZC6GI59CECk/wPsLTfxxvg3eCXyFUY4/p4vZoXz2GNqTaIpU6j3On33XVi+XJUj/8+inXy8/2PCuoQRlR3FxqkbsbezrxdmjiEHT2dP3J3OKRt44gs1nC7kLoibCWOWYbaY0el02NvZk5CXwOpjqwnvGo5j6j3s3AnvvKPaUo6Oatii2WpmXM9xjBunijA09H4zf74anvvKK6qMuptbB/wvWuCll9Rr7LbbVBva15eax/HAAfjiC5gzR1WHDD7bDKy+fcoU9doNDVXx7syIZE/6Hm656hbSdo3h1Cl47TX1+inWTvPO7newWC3MHvsCrkfegBGzIeo5sru8xsN/uZZXXoGyslLefXd8i2Jvd49FvexlfbhahbU8C8vIeXy7fgIVFar2/V1tWZ5CpwNN4w8r/8CCODXu9+PbP2bywMk8vupxXhz1Ir9xM6syqsM+VFVXurbsxC9m478fTx/fPowIGcFfRv6l3m0pRSncPP9mxvQYw8LJF8eq5WuOr2H9ifXE5MSwcsoCukU9od6QAPo+peZ1NGLoV0NrxoOu+N2KS6prUAjRQarf693c1Id3i/ep61Kdp9AR6jwWFZXleP7HEx8XH4xmIyX3zcKu8BCM/AK23KQqA7ZQWRmsXQslJWqNll69Oj50gEcfhbvvhocfrr4QaVfnf5u6CHK2QM+HIOFt/r7hFwJD3Hjhhfbfr96oZ+KCiUwZOIWxdjN49VVVEONCW7UK0tLUULoHH1R/M5vB4ZxLvZoGf/mLuqjo66sagxdEwUGIfhFGfA5bb4UxP6oKgqZ8VagmtDrLqn6dsnOSKi9t58wurwn8KzmGb+/9lgdWPMB1KStxMQfx3nt1jn92PyAsTJ2HnR3cNaWAfp/14/Ehj3Mk7wibft+KSdt7H1NJhc5ezdcL+2ujm2qaeh5nZakEdsKE+re/8IKqeNitG7z4ohraeNbMmerxf/XVlofWEa6+GtatUxdZalQ/jl99BYmJKtGv+Xsd7m4aBkPDC9T+/e9qEde6r5/SylIc7BxwqSyEzWNh/Dowl7JgzSC27nTlu+8aafM3ot09FjUslWA4Dvckw+FXsbeHJ5/smEMPChzE9Ounc+zMMQYFDqKHdw+2PlanXJdLsFr51bNf4we5hDw57Ele3fIq/514fgnJq3yvYsPUDXR1b7zEY2eb2Gci0yKmEeQRRLcu/eHWXaoMpb1Lk0kFwED/gYwIHkFkSiQDAwZ2UsRCiItC3Q9EnY6wwRrr16sP1a5dm7k6eCUnEueq81i4AmF+Yfww6QeejXgWu9JkVc5dH9/qw7q7q4pXF9rrr6vqZStWwNNPw211/7dFsaqS4eC3YNjHPNrVjQceUI3E8HB1Rb+tfFx8OPCkKi29davqiekM9913/t/OTSpAvTzmzLng4agS5OWnwfsatd6OZlU9Fo0pOgyjvoecrQRbnNicspken/QAYHCxFwMbWQe0okIVNnnkEdVrAH4EewYT4BbAmB5jWhdz90nqedH3KVXdsAk6nep1a8xHH6neI01T8yPrevpplfDu36/+b48/3row22roUFi0CCZPVkllSAg1r/MbboD//lfFY7HALee8F970G5W89eun9vfxqb3t0UdVz2B2NgwapF4/Hk7VT3yHEJVUpi0DBzdKywarnpJW6rjEoiJLPTndQzvskGddE3ANJwpOkFeWx6DABp6x/tepr8vE1Gun1k7CbMAA/wGdGE3zXBxceGP8G7UTxXQ6cGvZpLEB/g
MIcAsgMiWSvl36XsAohRAXnXM+EOcfVlfUQA2XEG0zOGgwS+KXMKTrEKjSqypNJzu+/GpHGTAANjV2sdp3CPT7G+y6H9x7En7jYg4cgNRUtV9HCQtTo3WKi9XV67PrEV0RPHqpoi8RQwANbo4E1yaGWVdkQJeRkLOVIJfaq9eOdo4E+LpS1Miaunq96hVwrJMHjO4+mjmH5vDdb79rXcyh94FnH9AfUdXI2qmhq/ugnge7drVjSH8bffqpKiv/1lsqiahr8GD43/9UGfHbGqgD8OOPsHBh7RD6usLD4cABSEmBgQ1dy+0yvKa6X680WFr9PlxW1vLYOy6xMBvUir6lp1TvQZ+OGys/KHAQb+96m8KKQoI8gjrsuKLj/Gnon9q03wD/AWxO2YyzgzNO9k4dHJUQ4lIyeDBERDS/nWja0KChzN4/m9fHvQ7mfaqka/gsMLeidXAx6f839VXN01PN9ehI3bvDb3+rGr5Dh6rG5BVl9A9QmgL27uDaxIgIc7laY0qzQMZKPMKexdPJkz8O+SM/Jf7EyME6XntNlbguKFDlmM8qKzt/HsONPW7k+8Pfc0P3G1ofs0+4+uoEnb3UjL8/fPZZ47dPnKi+GuLuDn/+c+P7enqq99rm3HorTJ+upm7079/89md13DoW5jJVI9+Yq1aX7UA9fXpyvOA4vXx6dehxhe0N9B9IxMmIi64XRgghLlVDgoaQachkSNAQcA5Uw5S9wsC3Ba2JK9hXX6nJ1Rs32joSG/G4qumkAlRbz8ENLEZVNhkIdA/kqeFPEeQRxJ13gp+fujIeFUVti1ynw71f8HlzqB4Jf4SM5zPOn5wtbM7RURUWeOklNVyxpTquxwIADRy9oest4NSlQ48c5hfGNQHXdOgxhe2F+YWRZcjikUGP2DoUIYS4LFzX7Treufkdru16LZADB54G955qLY3Bb9k6vIta7962juBiVz0R295ZLWrscRXBnsHE58UT5BGEoyNs367mwISEAC/W7ulLEZmZUFkJTtUDFJzsnejqcfHMGRX1+furOSYlJS3fp+MSCwcP1UXmPQBu2dy+Y50zoQ9NY8mUJbg6NFALTFzSnB2cGdV9FMOCh9k6FCGEuCz4uPjw6tjqMjZBE9Xn874/Qq9Hm95RiIbUbZN5BcAib3D0hLFqAH6Qxxzic+PrDVUPOTvNss48Kheg9wD49ls1p+GppzohdtHpOi6xcPRUVQUuEBkGdfna/afdtg5BCCEuTw6uMHEv5O1QK10L0R5VgKVclfd3UouIBLkHcST/COGBzc93mDgRnnlGzQOQxOLy1HFzLFxDoLIIDMlQngHG/LYfS9PqfwkhhBCibVz8occUNTZeiNY6t03m0QfytoOpEPRHCPIIOq/HojFvvAHjxnXccgTi4tNxPRZ2juB9Nez+HVgqYMKVOvtJCCGEEOIy5TsU9j+p1qkaOYcgjyBO6U8R7NFEidpq/v6dswihsJ2O67EAtQhPUQyUHAM6uTaXEEIIIYS4sELuUqtyG06Azp5gz2A0NFkOQAAdXRXqmn9Azma1oI3blbS6jBBCCCHEFaDXo5C3E9AgYAxB1sMAklgIoKMTCydfuDOmQw8phBBCCCEuEjodXP91za+9fHrxxNAnCPEMaWIncaXo4HUshBBCCCHElaKLaxfm3jvX1mGIi0SbEwutulpTSWtWzRBCCCGEEEJcMs629bUWVGptc2JhMBgACA0NbeshhBBCCCGEEJcAg8GAt7d3k9votJakHw2wWq1kZWXh6emJTicVoIQQQgghhLjcaJqGwWAgJCQEO7umC8q2ObEQQgghhBBCiLM6dh0LIYQQQgghxBVJEgshhBBCCCFEu0liIYQQQgghhGg3SSyEEEIIIYQQ7SaJhRBCCCGEEKLdJLEQQgghhBBCtJskFkIIIYQQQoh2k8RCCCGEEEII0W6SWAghhBBCCCHaTRILIYQQQgghRLtJYiGEEEIIIYRoN0kshBBCCCGEEO32/x
0B1BN2CAoMAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#Visualize splice-centric gradient for gene(s)\n", + "\n", + "#Find position of max saliency\n", + "max_poses = np.argmax(np.sum(scores, axis=-1), axis=-1)\n", + "\n", + "#Loop over genes\n", + "for example_ix in range(scores.shape[0]) :\n", + " \n", + " #Get max pos\n", + " max_pos = max_poses[example_ix]\n", + " \n", + " #Only visualize genes that are not extremely long\n", + " if max_pos >= 150000 and max_pos < seqs.shape[1] - 150000 :\n", + " \n", + " print(\"-- \" + str(example_ix) + \" (\" + str(strands[example_ix]) + \") --\")\n", + " print(\" - gene_id = '\" + str(genes[example_ix]))\n", + "\n", + " #Plot scores\n", + " f = plt.figure(figsize=(8, 1))\n", + "\n", + " #Annotate 4kb window\n", + " plot_start = max_pos - 2000\n", + " plot_end = max_pos + 6 + 2000\n", + "\n", + " l1 = plt.plot(np.arange(seqs.shape[1]), np.sum(scores[example_ix, ...], axis=-1), linewidth=1, linestyle='-', color='red', label='Gradient')\n", + "\n", + " plt.axvline(x=plot_start, color='black', linestyle='--')\n", + " plt.axvline(x=plot_end, color='black', linestyle='--')\n", + "\n", + " plt.xlim(0, seqs.shape[1])\n", + " \n", + " plt.legend(handles=[l1[0]], fontsize=8)\n", + " \n", + " plt.yticks([], [])\n", + " plt.xticks([], [])\n", + "\n", + " plt.tight_layout()\n", + "\n", + " plt.show()\n", + " \n", + " #Visualize contribution scores\n", + " plot_start = max_pos - 100\n", + " plot_end = max_pos + 6 + 100\n", + " \n", + " #Rev-comp scores if gene is on minus strand\n", + " if strands[example_ix] == '-' :\n", + " plot_end = seqs.shape[1] - (max_pos - 100)\n", + " plot_start = seqs.shape[1] - (max_pos + 6 + 100)\n", + " \n", + " #Plot sequence logo\n", + " visualize_input_gradient_pair(\n", + " scores[example_ix, :, :] if strands[example_ix] == '+' else scores[example_ix, ::-1, ::-1],\n", + " np.zeros(scores[example_ix, ...].shape),\n", + " plot_start=plot_start,\n", + " plot_end=plot_end,\n", 
+ " save_figs=False,\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d7aefe0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/analysis/gtex_motifs/gradients_aggregate_reps.ipynb b/analysis/gtex_motifs/gradients_aggregate_reps.ipynb new file mode 100644 index 0000000..3e6a8c7 --- /dev/null +++ b/analysis/gtex_motifs/gradients_aggregate_reps.ipynb @@ -0,0 +1,137 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "7c5f93db", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "import os\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "import h5py\n", + "\n", + "import gc\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "d4988000", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Aggregating over replicate 'f3c0'\n", + "Aggregating over replicate 'f3c1'\n" + ] + } + ], + "source": [ + "#Load scores and auxiliary data, compute mean over folds, and save new scores\n", + "\n", + "#Specify dir with score files\n", + "grad_dir = '../../../borzoi/examples/saved_models/gtex_CFHR2'\n", + "\n", + "fold_index = [3]\n", + "cross_index = [0, 1]\n", + "\n", + "#Initialize HDF5\n", + "scores_h5 = h5py.File(grad_dir + '/scores_mean.h5', 'w')\n", + "\n", + "seqs = None\n", + "grads = None\n", + "preds = None\n", + "genes = None\n", + "chrs = None\n", + "starts = None\n", + "ends = None\n", + "strands = None\n", + "\n", + "rep_i = 0\n", + "\n", + "#Loop over folds and 
crosses\n", + "for fi in fold_index :\n", + " for ci in cross_index :\n", + "\n", + " print(\"Aggregating over replicate 'f\" + str(fi) + \"c\" + str(ci) + \"'\")\n", + "\n", + " score_file = h5py.File(grad_dir + '/scores_f' + str(fi) + 'c' + str(ci) + '.h5', 'r')\n", + "\n", + " if rep_i == 0 :\n", + " seqs = score_file['seqs'][()]\n", + " grads = score_file['grads'][()]\n", + " if 'preds' in score_file :\n", + " preds = score_file['preds'][()]\n", + " genes = score_file['gene'][()]\n", + " chrs = score_file['chr'][()]\n", + " starts = score_file['start'][()]\n", + " ends = score_file['end'][()]\n", + " strands = score_file['strand'][()]\n", + " else :\n", + " grads += score_file['grads'][()]\n", + " if 'preds' in score_file :\n", + " preds += score_file['preds'][()]\n", + "\n", + " #Collect garbage\n", + " gc.collect()\n", + " \n", + " rep_i += 1\n", + "\n", + "#Normalize by number of replicates\n", + "grads /= (float(len(fold_index)) * float(len(cross_index)))\n", + "\n", + "if preds is not None :\n", + " preds /= (float(len(fold_index)) * float(len(cross_index)))\n", + "\n", + "#Re-save datasets in h5\n", + "scores_h5.create_dataset('seqs', data=np.array(seqs, dtype='bool'))\n", + "scores_h5.create_dataset('grads', data=np.array(grads, dtype='float16'))\n", + "if preds is not None :\n", + " scores_h5.create_dataset('preds', data=np.array(preds, dtype='float16'))\n", + "scores_h5.create_dataset('gene', data=np.array(genes, dtype='S'))\n", + "scores_h5.create_dataset('chr', data=np.array(chrs, dtype='S'))\n", + "scores_h5.create_dataset('start', data=np.array(starts))\n", + "scores_h5.create_dataset('end', data=np.array(ends))\n", + "scores_h5.create_dataset('strand', data=np.array(strands, dtype='S'))\n", + "\n", + "#Close h5\n", + "scores_h5.close()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "959fec3a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 
(ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/analysis/gtex_motifs/run_gradients_diff_expr_log2fc.sh b/analysis/gtex_motifs/run_gradients_diff_expr_log2fc.sh old mode 100644 new mode 100755 index aebbca0..bc3b14e --- a/analysis/gtex_motifs/run_gradients_diff_expr_log2fc.sh +++ b/analysis/gtex_motifs/run_gradients_diff_expr_log2fc.sh @@ -1,11 +1,11 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_muscle_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_gtex_muscle.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_muscle_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_muscle.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_blood_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_gtex_blood.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_blood_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_blood.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_liver_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 
--smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_gtex_liver.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_liver_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_liver.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_esophagus_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_gtex_esophagus.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_esophagus_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_esophagus.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_brain_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_gtex_brain.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_brain_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_brain.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf diff --git a/analysis/gtex_motifs/run_gradients_diff_expr_log2fc_k562.sh b/analysis/gtex_motifs/run_gradients_diff_expr_log2fc_k562.sh old mode 100644 new mode 100755 index 7428c5a..6e57e4f --- a/analysis/gtex_motifs/run_gradients_diff_expr_log2fc_k562.sh +++ b/analysis/gtex_motifs/run_gradients_diff_expr_log2fc_k562.sh @@ -1,3 +1,3 @@ #!/bin/sh -python 
/home/jlinder/basenji/bin/borzoi_satg_gene_gpu.py -o gtex_k562_log2fc_undo_clip -f 0,1,2,3 --rc 1 --shifts 0 --span 0 --smoothgrad 0 --clip_soft 384.0 -t /home/jlinder/borzoi_v2/targets_k562.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene.py -o saved_models/gtex_k562_log2fc -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.3 --track_transform 0.75 --clip_soft 384.0 -t targets_k562.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf diff --git a/analysis/gtex_motifs/run_gradients_expr_CFHR2.sh b/analysis/gtex_motifs/run_gradients_expr_CFHR2.sh new file mode 100644 index 0000000..200da30 --- /dev/null +++ b/analysis/gtex_motifs/run_gradients_expr_CFHR2.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_satg_gene.py -o saved_models/gtex_CFHR2 -f 3 -c 0 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex_liver.txt params_pred.json saved_models CFHR2_example.gtf diff --git a/analysis/gtex_motifs/run_gradients_polya.sh b/analysis/gtex_motifs/run_gradients_polya.sh new file mode 100644 index 0000000..271a097 --- /dev/null +++ b/analysis/gtex_motifs/run_gradients_polya.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_satg_polya.py -o saved_models/gtex_polya -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex.txt params_pred.json saved_models gasperini/crispr_genes.gtf diff --git a/analysis/gtex_motifs/run_gradients_polya_CD99.sh b/analysis/gtex_motifs/run_gradients_polya_CD99.sh new file mode 100644 index 0000000..43e5c9b --- /dev/null +++ b/analysis/gtex_motifs/run_gradients_polya_CD99.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_satg_polya.py -o saved_models/gtex_CD99 -f 3 -c 0 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex.txt params_pred.json saved_models CD99_example.gtf diff --git 
a/analysis/gtex_motifs/run_gradients_splice.sh b/analysis/gtex_motifs/run_gradients_splice.sh new file mode 100644 index 0000000..75f45c6 --- /dev/null +++ b/analysis/gtex_motifs/run_gradients_splice.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_satg_splice.py -o saved_models/gtex_splice -f 3 -c 0,1,2,3 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex.txt params_pred.json saved_models gasperini/crispr_genes.gtf diff --git a/analysis/gtex_motifs/run_gradients_splice_GCFC2.sh b/analysis/gtex_motifs/run_gradients_splice_GCFC2.sh new file mode 100644 index 0000000..37fdf07 --- /dev/null +++ b/analysis/gtex_motifs/run_gradients_splice_GCFC2.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_satg_splice.py -o saved_models/gtex_GCFC2 -f 3 -c 0 --rc --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 -t targets_gtex.txt params_pred.json saved_models GCFC2_example.gtf diff --git a/analysis/gtex_motifs/run_ism_diff_expr_log2fc.sh b/analysis/gtex_motifs/run_ism_diff_expr_log2fc.sh old mode 100644 new mode 100755 index be38190..b4cadf4 --- a/analysis/gtex_motifs/run_ism_diff_expr_log2fc.sh +++ b/analysis/gtex_motifs/run_ism_diff_expr_log2fc.sh @@ -1,11 +1,11 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_focused_ism.py -o gtex_blood_log2fc_ism_aggr -f 0 --rc 1 --shifts 0 --span 0 --tissue blood --main_tissue_ix 0 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files gtex_blood_log2fc_undo_clip/scores_f0c0.h5,gtex_brain_log2fc_undo_clip/scores_f0c0.h5,gtex_liver_log2fc_undo_clip/scores_f0c0.h5,gtex_muscle_log2fc_undo_clip/scores_f0c0.h5,gtex_esophagus_log2fc_undo_clip/scores_f0c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t /home/jlinder/borzoi_v2/targets_gtex_5_tissues.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 
/home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene_focused_ism.py -o saved_models/gtex_blood_log2fc_ism_aggr -f 3 -c 0 --rc --tissue blood --main_tissue_ix 0 --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files saved_models/gtex_blood_log2fc/scores_f3c0.h5,saved_models/gtex_brain_log2fc/scores_f3c0.h5,saved_models/gtex_liver_log2fc/scores_f3c0.h5,saved_models/gtex_muscle_log2fc/scores_f3c0.h5,saved_models/gtex_esophagus_log2fc/scores_f3c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t targets_gtex_5_tissues.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_focused_ism.py -o gtex_brain_log2fc_ism_aggr -f 0 --rc 1 --shifts 0 --span 0 --tissue brain_cortex --main_tissue_ix 1 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files gtex_blood_log2fc_undo_clip/scores_f0c0.h5,gtex_brain_log2fc_undo_clip/scores_f0c0.h5,gtex_liver_log2fc_undo_clip/scores_f0c0.h5,gtex_muscle_log2fc_undo_clip/scores_f0c0.h5,gtex_esophagus_log2fc_undo_clip/scores_f0c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t /home/jlinder/borzoi_v2/targets_gtex_5_tissues.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene_focused_ism.py -o saved_models/gtex_brain_log2fc_ism_aggr -f 3 -c 0 --rc --tissue brain_cortex --main_tissue_ix 1 --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files 
saved_models/gtex_blood_log2fc/scores_f3c0.h5,saved_models/gtex_brain_log2fc/scores_f3c0.h5,saved_models/gtex_liver_log2fc/scores_f3c0.h5,saved_models/gtex_muscle_log2fc/scores_f3c0.h5,saved_models/gtex_esophagus_log2fc/scores_f3c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t targets_gtex_5_tissues.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_focused_ism.py -o gtex_liver_log2fc_ism_aggr -f 0 --rc 1 --shifts 0 --span 0 --tissue liver --main_tissue_ix 2 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files gtex_blood_log2fc_undo_clip/scores_f0c0.h5,gtex_brain_log2fc_undo_clip/scores_f0c0.h5,gtex_liver_log2fc_undo_clip/scores_f0c0.h5,gtex_muscle_log2fc_undo_clip/scores_f0c0.h5,gtex_esophagus_log2fc_undo_clip/scores_f0c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t /home/jlinder/borzoi_v2/targets_gtex_5_tissues.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene_focused_ism.py -o saved_models/gtex_liver_log2fc_ism_aggr -f 3 -c 0 --rc --tissue liver --main_tissue_ix 2 --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files saved_models/gtex_blood_log2fc/scores_f3c0.h5,saved_models/gtex_brain_log2fc/scores_f3c0.h5,saved_models/gtex_liver_log2fc/scores_f3c0.h5,saved_models/gtex_muscle_log2fc/scores_f3c0.h5,saved_models/gtex_esophagus_log2fc/scores_f3c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t targets_gtex_5_tissues.txt params_pred.json 
saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_focused_ism.py -o gtex_muscle_log2fc_ism_aggr -f 0 --rc 1 --shifts 0 --span 0 --tissue muscle --main_tissue_ix 3 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files gtex_blood_log2fc_undo_clip/scores_f0c0.h5,gtex_brain_log2fc_undo_clip/scores_f0c0.h5,gtex_liver_log2fc_undo_clip/scores_f0c0.h5,gtex_muscle_log2fc_undo_clip/scores_f0c0.h5,gtex_esophagus_log2fc_undo_clip/scores_f0c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t /home/jlinder/borzoi_v2/targets_gtex_5_tissues.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene_focused_ism.py -o saved_models/gtex_muscle_log2fc_ism_aggr -f 3 -c 0 --rc --tissue muscle --main_tissue_ix 3 --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files saved_models/gtex_blood_log2fc/scores_f3c0.h5,saved_models/gtex_brain_log2fc/scores_f3c0.h5,saved_models/gtex_liver_log2fc/scores_f3c0.h5,saved_models/gtex_muscle_log2fc/scores_f3c0.h5,saved_models/gtex_esophagus_log2fc/scores_f3c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t targets_gtex_5_tissues.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf -python /home/jlinder/basenji/bin/borzoi_satg_gene_gpu_focused_ism.py -o gtex_esophagus_log2fc_ism_aggr -f 0 --rc 1 --shifts 0 --span 0 --tissue esophagus_muscularis --main_tissue_ix 4 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files 
gtex_blood_log2fc_undo_clip/scores_f0c0.h5,gtex_brain_log2fc_undo_clip/scores_f0c0.h5,gtex_liver_log2fc_undo_clip/scores_f0c0.h5,gtex_muscle_log2fc_undo_clip/scores_f0c0.h5,gtex_esophagus_log2fc_undo_clip/scores_f0c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t /home/jlinder/borzoi_v2/targets_gtex_5_tissues.txt /home/jlinder/borzoi_v2/params_pred.json /home/jlinder/borzoi_v2 /home/jlinder/gtex_diff_expr/gtex_diff_expr_log2fc_5k.gtf +borzoi_satg_gene_focused_ism.py -o saved_models/gtex_esophagus_log2fc_ism_aggr -f 3 -c 0 --rc --tissue esophagus_muscularis --main_tissue_ix 4 --untransform_old --track_scale 0.01 --track_transform 0.75 --clip_soft 384.0 --aggregate_tracks 3 --tissue_files saved_models/gtex_blood_log2fc/scores_f3c0.h5,saved_models/gtex_brain_log2fc/scores_f3c0.h5,saved_models/gtex_liver_log2fc/scores_f3c0.h5,saved_models/gtex_muscle_log2fc/scores_f3c0.h5,saved_models/gtex_esophagus_log2fc/scores_f3c0.h5 --tissues blood,brain_cortex,liver,muscle,esophagus_muscularis --gene_file diff_expr/gtex_diff_expr_log2fc_5k.csv --max_n_genes 200 --ism_size 192 --gaussian_sigma 8 -t targets_gtex_5_tissues.txt params_pred.json saved_models diff_expr/gtex_diff_expr_log2fc_5k.gtf diff --git a/analysis/ipaqtl/bench_ipaqtl.sh b/analysis/ipaqtl/bench_ipaqtl.sh old mode 100644 new mode 100755 index 5dbb336..31bf829 --- a/analysis/ipaqtl/bench_ipaqtl.sh +++ b/analysis/ipaqtl/bench_ipaqtl.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_bench_ipaqtl_folds.py --name "5/18-johannes-ipaqtl-v2" -r -d 0 -e tf210 --rc --msl 12 --max_proc 8 -q geforce --stats COVR -t targets_gtex.txt params_pred.json test_ipaqtl +borzoi_bench_ipaqtl_folds.py -r --vcf data/qtl_cat/ipaqtl_pip90ea -d 0 -e borzoi_py310 --rc -u --msl 12 --max_proc 8 -q rtx4090 --f_list 3 -c 4 --stats COVR -t targets_gtex.txt params_pred.json saved_models \ 
No newline at end of file diff --git a/analysis/paqtl/bench_paqtl.sh b/analysis/paqtl/bench_paqtl.sh old mode 100644 new mode 100755 index 694c3db..e8f93e2 --- a/analysis/paqtl/bench_paqtl.sh +++ b/analysis/paqtl/bench_paqtl.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_bench_paqtl_folds.py --name "5/16-johannes-paqtl-v2" -r -d 0 -e tf210 --rc --msl 12 --max_proc 8 -q geforce --stats COVR -t targets_gtex.txt params_pred.json test_paqtl +borzoi_bench_paqtl_folds.py -r --vcf data/qtl_cat/paqtl_pip90ea -d 0 -e borzoi_py310 --rc -u --msl 12 --max_proc 8 -q rtx4090 --f_list 3 -c 4 --stats COVR -t targets_gtex.txt params_pred.json saved_models diff --git a/analysis/satmut/bench_satmut_sad.sh b/analysis/satmut/bench_satmut_sad.sh new file mode 100755 index 0000000..241ad69 --- /dev/null +++ b/analysis/satmut/bench_satmut_sad.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_sad_folds.py -r --vcf data/satmutmpra/satmutmpra_v1.vcf -o satmut_sad_4k -d 0 -e borzoi_py310 --rc -q rtx4090 --f_list 3 -c 4 --stats SAD,logSAD,SADlog -u -t targets_human.txt params_pred_4k.json saved_models diff --git a/analysis/satmut/bench_satmut_sad_4k.sh b/analysis/satmut/bench_satmut_sad_4k.sh deleted file mode 100644 index 5f9a3ca..0000000 --- a/analysis/satmut/bench_satmut_sad_4k.sh +++ /dev/null @@ -1,3 +0,0 @@ -#!/bin/sh - -python /home/jlinder/basenji/bin/basenji_sad_folds.py --name "10/13-johannes-sad-v2" -r --vcf /home/jlinder/seqnn/data/satmutmpra/satmutmpra_v1.vcf -o snp_sad_4k -d 0 -e tf28 --rc -q geforce --stats SAD,logSAD,SADlog -u -t targets_human.txt params_pred_4k.json test_satmut diff --git a/analysis/satmut/bench_satmut_sed.sh b/analysis/satmut/bench_satmut_sed.sh old mode 100644 new mode 100755 index 2dc36f8..e4352c0 --- a/analysis/satmut/bench_satmut_sed.sh +++ b/analysis/satmut/bench_satmut_sed.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_sed_folds.py --name "10/13-johannes-sed-v2" -r --vcf 
/home/jlinder/seqnn/data/satmutmpra/satmutmpra_v1.vcf -o snp_sed -d 0 -e tf28 --rc -q geforce --stats SED,logSED,D2,logD2 -u -t targets_human_rna.txt params_pred2.json test_satmut +borzoi_sed_folds.py -r --vcf data/satmutmpra/satmutmpra_v1.vcf -o satmut_sed -d 0 -e borzoi_py310 --rc -q rtx4090 --f_list 3 -c 4 --stats SED,logSED,D2,logD2 -u -t targets_rna.txt params_pred2.json saved_models diff --git a/analysis/satmut/params_pred2.json b/analysis/satmut/params_pred2.json new file mode 100644 index 0000000..6ec84bd --- /dev/null +++ b/analysis/satmut/params_pred2.json @@ -0,0 +1,88 @@ +{ + "train": { + "batch_size": 2, + "shuffle_buffer": 256, + "optimizer": "adam", + "learning_rate": 0.00006, + "loss": "poisson_mn", + "total_weight": 0.2, + "warmup_steps": 20000, + "global_clipnorm": 0.15, + "adam_beta1": 0.9, + "adam_beta2": 0.999, + "patience": 30, + "train_epochs_min": 130, + "train_epochs_max": 180 + }, + "model": { + "verbose": false, + "seq_length": 524288, + "augment_rc": true, + "augment_shift": 3, + "activation": "gelu", + "norm_type": "batch-sync", + "bn_momentum": 0.9, + "kernel_initializer": "lecun_normal", + "l2_scale": 2.0e-8, + "trunk": [ + { + "name": "conv_dna", + "filters": 512, + "kernel_size": 15, + "norm_type": null, + "activation": "linear", + "pool_size": 2 + }, + { + "name": "res_tower", + "filters_init": 608, + "filters_end": 1536, + "divisible_by": 32, + "kernel_size": 5, + "num_convs": 1, + "pool_size": 2, + "repeat": 6 + }, + { + "name": "transformer_tower", + "key_size": 64, + "heads": 8, + "num_position_features": 32, + "dropout": 0.2, + "mha_l2_scale": 1.0e-8, + "l2_scale": 1.0e-8, + "kernel_initializer": "he_normal", + "repeat": 8 + }, + { + "name": "unet_conv", + "kernel_size": 3, + "upsample_conv": true + }, + { + "name": "unet_conv", + "kernel_size": 3, + "upsample_conv": true + }, + { + "name": "Cropping1D", + "cropping": 1024 + }, + { + "name": "conv_nac", + "filters": 1920, + "dropout": 0.1 + } + ], + "head_human": { + "name": 
"final", + "units": 7611, + "activation": "softplus" + }, + "head_mouse": { + "name": "final", + "units": 2608, + "activation": "softplus" + } + } +} diff --git a/analysis/satmut/params_pred_4k.json b/analysis/satmut/params_pred_4k.json new file mode 100644 index 0000000..b2b9b38 --- /dev/null +++ b/analysis/satmut/params_pred_4k.json @@ -0,0 +1,87 @@ +{ + "train": { + "batch_size": 1, + "shuffle_buffer": 256, + "optimizer": "adam", + "learning_rate": 0.00006, + "loss": "poisson_mn", + "total_weight": 0.2, + "warmup_steps": 20000, + "global_clipnorm": 0.15, + "adam_beta1": 0.9, + "adam_beta2": 0.999, + "patience": 30, + "train_epochs_min": 130, + "train_epochs_max": 180 + }, + "model": { + "seq_length": 524288, + "augment_rc": true, + "augment_shift": 3, + "activation": "gelu", + "norm_type": "batch-sync", + "bn_momentum": 0.9, + "kernel_initializer": "lecun_normal", + "l2_scale": 2.0e-8, + "trunk": [ + { + "name": "conv_dna", + "filters": 512, + "kernel_size": 15, + "norm_type": null, + "activation": "linear", + "pool_size": 2 + }, + { + "name": "res_tower", + "filters_init": 608, + "filters_end": 1536, + "divisible_by": 32, + "kernel_size": 5, + "num_convs": 1, + "pool_size": 2, + "repeat": 6 + }, + { + "name": "transformer_tower", + "key_size": 64, + "heads": 8, + "num_position_features": 32, + "dropout": 0.2, + "mha_l2_scale": 1.0e-8, + "l2_scale": 1.0e-8, + "kernel_initializer": "he_normal", + "repeat": 8 + }, + { + "name": "unet_conv", + "kernel_size": 3, + "upsample_conv": true + }, + { + "name": "unet_conv", + "kernel_size": 3, + "upsample_conv": true + }, + { + "name": "Cropping1D", + "cropping": 8128 + }, + { + "name": "conv_nac", + "filters": 1920, + "dropout": 0.1 + } + ], + "head_human": { + "name": "final", + "units": 7611, + "activation": "softplus" + }, + "head_mouse": { + "name": "final", + "units": 2608, + "activation": "softplus" + } + } +} diff --git a/analysis/setup_data.sh b/analysis/setup_data.sh new file mode 100755 index 0000000..002607c 
--- /dev/null +++ b/analysis/setup_data.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +westminster_train_folds.py -e borzoi_py310 --f_list 3 -c 4 --identical_crosses -q standard -r --setup -o saved_models params.json data/hg38 data/mm10 diff --git a/analysis/sqtl/bench_sqtl.sh b/analysis/sqtl/bench_sqtl.sh old mode 100644 new mode 100755 index d294aab..7caac6d --- a/analysis/sqtl/bench_sqtl.sh +++ b/analysis/sqtl/bench_sqtl.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_bench_sqtl_folds.py --name "5/9-johannes-sqtl-v2" -r --span --vcf /home/jlinder/seqnn/data/qtl_cat/sqtl_pip90ea -o sqtl_span -d 0 -e tf210 --rc --msl 4 --max_proc 12 -q p100 --stats D2,JS,D0 -t targets_rna.txt params_pred2.json test_sqtl +borzoi_bench_sqtl_folds.py -r --span --no_untransform --vcf data/qtl_cat/sqtl_pip90ea -o sqtl_span -d 0 -e borzoi_py310 --rc -u --msl 4 --max_proc 12 -q rtx4090 --f_list 3 -c 4 --stats nDi -t targets_rna.txt params_pred.json saved_models diff --git a/analysis/test_apa/test_apa.sh b/analysis/test_apa/test_apa.sh old mode 100644 new mode 100755 index 44e5a2e..cb023ea --- a/analysis/test_apa/test_apa.sh +++ b/analysis/test_apa/test_apa.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji_latest/basenji/bin/borzoi_test_apa_folds_polaydb.py --name "12/06-polyadb" -g polyadb_human_v3.csv.gz -d 0 -e basenji_py310 -q rtx4090 -f 3 -c 4 --rc -o test_apa -t targets_gtex.txt params.json /scratch3/drk/seqnn/data/v9/hg38 +borzoi_test_apa_folds.py -d 0 -e borzoi_py310 -q rtx4090 --f_list 3 -c 4 --rc -u -o saved_models -t targets_gtex.txt params.json data/hg38 diff --git a/analysis/test_expression/test.sh b/analysis/test_expression/test.sh old mode 100644 new mode 100755 index 9484777..7bb7467 --- a/analysis/test_expression/test.sh +++ b/analysis/test_expression/test.sh @@ -1,3 +1,3 @@ #!/bin/sh -basenji_train_folds.py -r --name "9/9" -e tf210 -c 4 -f 4 -o train -q titan --rc --shifts "0,1" --spec_step 32 params.json v9/hg38 v9/mm10 +westminster_eval_folds.py --name 
"9/9" -c 4 --f_list 3 -o saved_models -q geforce --rc params.json data/hg38 data/mm10 diff --git a/analysis/test_expression/testg.sh b/analysis/test_expression/testg.sh old mode 100644 new mode 100755 index 092efe4..f790290 --- a/analysis/test_expression/testg.sh +++ b/analysis/test_expression/testg.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji_latest/basenji/bin/borzoi_test_genes_folds.py -e basenji_py310 -q rtx4090 -f 3 -c 4 -t targets_rna.txt -d 0 --rc --no_unclip -o train -s testg params.json /scratch3/drk/seqnn/data/v9/hg38 +borzoi_test_genes_folds.py -e borzoi_py310 -q rtx4090 --f_list 3 -c 4 -t targets_rna.txt -d 0 --rc -u --no_unclip -o saved_models -s testg params.json data/hg38 diff --git a/analysis/test_expression/testg_pseudo.sh b/analysis/test_expression/testg_pseudo.sh new file mode 100755 index 0000000..af6d533 --- /dev/null +++ b/analysis/test_expression/testg_pseudo.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_test_genes_folds.py -e borzoi_py310 -q rtx4090 --f_list 3 -c 4 -t targets_rna.txt -d 0 --rc -u --no_unclip --pseudo_qtl 0.05 -o saved_models -s testg_pseudo params.json data/hg38 diff --git a/analysis/test_expression/testgs.sh b/analysis/test_expression/testgs.sh new file mode 100755 index 0000000..5278a80 --- /dev/null +++ b/analysis/test_expression/testgs.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +borzoi_test_genes_folds.py -e borzoi_py310 -q rtx4090 --f_list 3 -c 4 -t targets_rna.txt -d 0 --rc --span -u --no_unclip --store_span -o saved_models -s testgs params.json data/hg38 diff --git a/analysis/test_tss/test_tss.sh b/analysis/test_tss/test_tss.sh old mode 100644 new mode 100755 index f0bdd06..18a5e4b --- a/analysis/test_tss/test_tss.sh +++ b/analysis/test_tss/test_tss.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji_latest/basenji/bin/borzoi_test_tss_folds_gencode.py --name "11/20-tssmax" -g /home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_tss2.bed -d 0 -e basenji_py310 -q rtx4090 -f 3 -c 4 --rc --windowcov 9 
--maxcov -o test_tss -t targets_gtex.txt params.json /scratch3/drk/seqnn/data/v9/hg38 +borzoi_test_tss_folds.py -d 0 -e borzoi_py310 -q rtx4090 --f_list 3 -c 4 --rc -u --windowcov 9 --maxcov -o saved_models -t targets_gtex.txt params.json data/hg38 diff --git a/analysis/trip/bench_trip.sh b/analysis/trip/bench_trip.sh old mode 100644 new mode 100755 index a559b63..3a024e7 --- a/analysis/trip/bench_trip.sh +++ b/analysis/trip/bench_trip.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_bench_trip_folds.py --name "7/10-johannes-trip-v2" -r -d 0 -e tf210 --rc --max_proc 8 -q geforce -t targets_k562.txt params_pred.json test_trip /home/drk/hic/data/trip/promoters.xlsx /home/drk/hic/data/trip/Dataset_S2_TRIP.tsv +borzoi_bench_trip_folds.py -r -d 0 -e borzoi_py310 --rc --max_proc 8 -q rtx4090 --f_list 3 -c 4 -t targets_k562.txt params_pred.json saved_models trip/promoters.xlsx trip/Dataset_S2_TRIP.tsv diff --git a/analysis/trip/bench_trip_reporter.sh b/analysis/trip/bench_trip_reporter.sh old mode 100644 new mode 100755 index 1877f29..962a7b8 --- a/analysis/trip/bench_trip_reporter.sh +++ b/analysis/trip/bench_trip_reporter.sh @@ -1,3 +1,3 @@ #!/bin/sh -python /home/jlinder/basenji/bin/borzoi_bench_trip_folds.py --name "7/10-johannes-trip-v2" -r -d 0 -e tf210 -o trip_reporter --reporter --rc --max_proc 8 -q geforce -t targets_k562.txt params_pred.json test_trip /home/drk/hic/data/trip/promoters.xlsx /home/drk/hic/data/trip/Dataset_S2_TRIP.tsv +borzoi_bench_trip_folds.py -r -d 0 -e borzoi_py310 -o trip_reporter --reporter --rc --max_proc 8 -q rtx4090 --f_list 3 -c 4 -t targets_k562.txt params_pred.json saved_models trip/promoters.xlsx trip/Dataset_S2_TRIP.tsv diff --git a/data/qtl/README.md b/data/qtl/README.md new file mode 100755 index 0000000..af90d88 --- /dev/null +++ b/data/qtl/README.md @@ -0,0 +1,34 @@ +## QTL data processing + +The scripts in this folder are used to extract fine-mapped causal sQTLs, paQTLs and ipaQTLs from the results of the 
eQTL Catalogue, and to construct distance- and expression-matched negative SNPs.
+ +*Notes*: +- The pipeline requires the GTEx v8 (median) TPM matrix, which can be downloaded [here](https://storage.googleapis.com/adult-gtex/bulk-gex/v8/rna-seq/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz). +- The fine-mapped eQTLs analyzed in the manuscript were curated from [this paper](https://doi.org/10.1038/s41467-021-23134-8). +
+ +As a prerequisite to generating any of the QTL datasets, run the following scripts (in order): +1. download_finemap.py +2. download_sumstat.py +3. merge_finemapping_tables.py +4. make_expression_tables.py +
+ +To prepare the sQTL dataset, run these scripts: +1. sqtl_make_positive_sets.py +2. sqtl_make_negative_sets.py +
+ +To prepare the paQTL dataset, run these scripts: +1. paqtl_make_positive_sets.py +2. paqtl_make_negative_sets.py +
+ +To prepare the ipaQTL dataset, run these scripts: +1. ipaqtl_make_positive_sets.py +2. ipaqtl_make_negative_sets.py +
+ +Finally, to generate the QTL VCF files, run this script: +1. make_vcfs.py +
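The run order above can be sketched as a small driver script. This is only a minimal sketch and not part of the repository: the `run_step` helper and the `DRY_RUN` variable are hypothetical names, and it assumes the scripts are invoked with `python` from within the `data/qtl` folder. By default it only prints the commands it would run.

```sh
#!/bin/sh
# Hypothetical driver (not part of the repository): walks through the QTL
# scripts in the order listed in this README. Assumes the current directory
# is data/qtl and that `python` resolves to the intended environment.
set -e

# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "python $1"
  else
    python "$1"
  fi
}

# Prerequisites first, then the per-QTL positive/negative sets, then the VCFs.
for script in \
  download_finemap.py download_sumstat.py \
  merge_finemapping_tables.py make_expression_tables.py \
  sqtl_make_positive_sets.py sqtl_make_negative_sets.py \
  paqtl_make_positive_sets.py paqtl_make_negative_sets.py \
  ipaqtl_make_positive_sets.py ipaqtl_make_negative_sets.py \
  make_vcfs.py
do
  run_step "$script"
done
```

Because of `set -e`, a failing step stops the run, so later steps never see incomplete intermediate files.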
diff --git a/data/qtl/download_finemap.py b/data/qtl/download_finemap.py new file mode 100755 index 0000000..558e3ef --- /dev/null +++ b/data/qtl/download_finemap.py @@ -0,0 +1,62 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import pandas as pd + +import util + +''' +download_finemap.py + +Download QTL Catalogue fine-mapping results. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + # read remote table + samples_df = pd.read_csv('https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/master/tabix/tabix_ftp_paths.tsv', sep='\t') + + # filter GTEx (for now) + samples_df = samples_df[samples_df.study == 'GTEx'] + + + ################################################ + # txrevise for splicing / polyA / TSS QTLs + + os.makedirs('txrev', exist_ok=True) + txrev_df = samples_df[samples_df.quant_method == 'txrev'] + + jobs = [] + for all_ftp_path in txrev_df.ftp_path: + # ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/Alasoo_2018/txrev/Alasoo_2018_txrev_macrophage_IFNg+Salmonella.all.tsv.gz + # ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/credible_sets//Alasoo_2018_txrev_macrophage_IFNg+Salmonella.purity_filtered.txt.gz + + all_ftp_file = all_ftp_path.split('/')[-1] + fine_ftp_file = all_ftp_file.replace('all.tsv', 'purity_filtered.txt') + + fine_ftp_path = 'ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/credible_sets/' + fine_ftp_path += fine_ftp_file + + local_path = 'txrev/%s' % fine_ftp_file + if not os.path.isfile(local_path): + cmd = 'curl -o %s %s' % (local_path, fine_ftp_path) + jobs.append(cmd) + + util.exec_par(jobs, 4, verbose=True) + # print('\n'.join(jobs)) + + 
+################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/download_sumstat.py b/data/qtl/download_sumstat.py new file mode 100755 index 0000000..ca402df --- /dev/null +++ b/data/qtl/download_sumstat.py @@ -0,0 +1,56 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import pandas as pd + +import util + +''' +download_sumstat.py + +Download QTL Catalogue sumstats. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + # read remote table + samples_df = pd.read_csv('https://raw.githubusercontent.com/eQTL-Catalogue/eQTL-Catalogue-resources/master/tabix/tabix_ftp_paths.tsv', sep='\t') + + # filter GTEx (for now) + samples_df = samples_df[samples_df.study == 'GTEx'] + + + ################################################ + # ge for sumstat (we want SNPs and possibly also base expression) + + os.makedirs('ge', exist_ok=True) + txrev_df = samples_df[samples_df.quant_method == 'ge'] + + jobs = [] + for all_ftp_path in txrev_df.ftp_path: + # ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/Alasoo_2018/txrev/Alasoo_2018_txrev_macrophage_IFNg+Salmonella.all.tsv.gz + + local_path = 'ge/%s' % all_ftp_path.split("/")[-1] + + if not os.path.isfile(local_path): + cmd = 'curl -o %s %s' % (local_path, all_ftp_path) + jobs.append(cmd) + + util.exec_par(jobs, 4, verbose=True) + # print('\n'.join(jobs)) + + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() 
diff --git a/data/qtl/ipaqtl_make_negative_sets.py b/data/qtl/ipaqtl_make_negative_sets.py new file mode 100755 index 0000000..3f4d49d --- /dev/null +++ b/data/qtl/ipaqtl_make_negative_sets.py @@ -0,0 +1,196 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +ipaqtl_make_negative_sets.py + +Build tables with negative (non-causal) SNPs for ipaQTLs. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + apa_file = 'polyadb_intron.bed' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + finemap_file = 'txrev/GTEx_txrev_finemapped_merged.csv.gz' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', +
'thyroid', + 'uterus', + 'vagina', + ] + + #Compile negative SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load summary stats and extract unique set of SNPs + vcf_df = pd.read_csv("ge/GTEx_ge_" + tissue_name + ".all.tsv.gz", sep='\t', compression='gzip', usecols=['chromosome', 'position', 'ref', 'alt']).drop_duplicates(subset=['chromosome', 'position', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + + #Only keep SNPs (no indels) + vcf_df = vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & (vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "." + + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Store intermediate SNPs + #vcf_df.to_csv("ge/GTEx_snps_" + tissue_name + ".bed.gz", sep='\t', index=False, header=False) + + #Load polyadenylation site annotation + apa_df = pd.read_csv(apa_file, sep='\t', names=['Chromosome', 'Start', 'End', 'pas_id', 'feat1', 'Strand']) + apa_df['Start'] += 1 + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = 
gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + apa_pr = pr.PyRanges(apa_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + apa_gtf_pr = apa_pr.join(gtf_pr, strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + apa_gtf_df = apa_gtf_pr.df[['Chromosome', 'Start', 'End', 'pas_id', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id']].copy().reset_index(drop=True) + + apa_gtf_df['Start'] -= max_distance + apa_gtf_df['End'] += max_distance + + #Join vcf against polyadenylation annotation + apa_gtf_pr = pr.PyRanges(apa_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_apa_pr = vcf_gtf_pr.join(apa_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be same as the gene_id of the polyA site + vcf_apa_df = vcf_apa_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'pas_id', 'Strand_b', 'Start_b']] + + #PolyA site position + vcf_apa_df['Start_b'] += max_distance + vcf_apa_df = vcf_apa_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'pas_pos', 'Strand_b' : 'Strand'}) + + #Distance to polyA site + vcf_apa_df['distance'] = np.abs(vcf_apa_df['Pos'] - vcf_apa_df['pas_pos']) + + #Choose unique SNPs by shortest distance to polyA site + vcf_apa_df = vcf_apa_df.sort_values(by='distance', ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_apa_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'pas_id', 'strand', 'pas_pos', 'distance']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + 
#Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("ge/GTEx_snps_" + tissue_name + "_intronic_polya_filtered.bed.gz", sep='\t', index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_intronic_polya_filtered.bed.gz", sep='\t', compression='gzip') + + #Create variant identifier + vcf_df_filtered['variant'] = vcf_df_filtered['chrom'] + "_" + vcf_df_filtered['pos'].astype(str) + "_" + vcf_df_filtered['ref'] + "_" + vcf_df_filtered['alt'] + + #Load merged fine-mapping dataframe + finemap_df = pd.read_csv(finemap_file, sep='\t')[['variant', 'pip']] + + #Join against fine-mapping dataframe + neg_df = vcf_df_filtered.join(finemap_df.set_index('variant'), on='variant', how='left') + neg_df.loc[neg_df['pip'].isnull(), 'pip'] = 0. + + #Only keep SNPs with PIP < cutoff + neg_df = neg_df.query("pip < " + str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of negative SNPs + neg_df.to_csv("ge/GTEx_snps_" + tissue_name + "_intronic_polya_negatives.bed.gz", sep='\t', index=False) + + print("len(neg_df) = " + str(len(neg_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/ipaqtl_make_positive_sets.py b/data/qtl/ipaqtl_make_positive_sets.py new file mode 100755 index 0000000..f1afb7b --- /dev/null +++ b/data/qtl/ipaqtl_make_positive_sets.py @@ -0,0 +1,191 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +ipaqtl_make_positive_sets.py + +Build tables with positive (causal) SNPs for intronic paQTLs. 
+''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + apa_file = 'polyadb_intron.bed' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Compile positive SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load fine-mapping table + vcf_df = pd.read_csv("txrev/GTEx_txrev_" + tissue_name + ".purity_filtered.txt.gz", sep='\t', usecols=['chromosome', 'position', 'ref', 'alt', 'variant', 'pip', 'molecular_trait_id'], low_memory=False) + + #Only keep SNPs (no indels) + vcf_df = vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & 
(vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + #Only keep SNPs associated with polyadenylation events (literal substring match, not a regex) + vcf_df = vcf_df.loc[vcf_df['molecular_trait_id'].str.contains(".downstream.", regex=False)].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "." + + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand', 'variant', 'pip', 'molecular_trait_id']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Load polyadenylation site annotation + apa_df = pd.read_csv(apa_file, sep='\t', names=['Chromosome', 'Start', 'End', 'pas_id', 'feat1', 'Strand']) + apa_df['Start'] += 1 + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + apa_pr = pr.PyRanges(apa_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + apa_gtf_pr = apa_pr.join(gtf_pr, strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + apa_gtf_df = apa_gtf_pr.df[['Chromosome', 'Start', 'End', 'pas_id', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id', 
'variant', 'pip', 'molecular_trait_id']].copy().reset_index(drop=True) + + apa_gtf_df['Start'] -= max_distance + apa_gtf_df['End'] += max_distance + + #Join vcf against polyadenylation annotation + apa_gtf_pr = pr.PyRanges(apa_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_apa_pr = vcf_gtf_pr.join(apa_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be same as the gene_id of the polyA site + vcf_apa_df = vcf_apa_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'pas_id', 'Strand_b', 'Start_b', 'variant', 'pip', 'molecular_trait_id']] + + #Force gene_id of SNP to be same as the gene_id of the finemapped molecular trait + vcf_apa_df['molecular_trait_gene_id'] = vcf_apa_df['molecular_trait_id'].apply(lambda x: x.split(".")[0]) + vcf_apa_df = vcf_apa_df.query("gene_id == molecular_trait_gene_id").copy().reset_index(drop=True) + + #PolyA site position + vcf_apa_df['Start_b'] += max_distance + vcf_apa_df = vcf_apa_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'pas_pos', 'Strand_b' : 'Strand'}) + + #Distance to polyA site + vcf_apa_df['distance'] = np.abs(vcf_apa_df['Pos'] - vcf_apa_df['pas_pos']) + + #Choose unique SNPs by shortest distance to polyA site (and inverse PIP for tie-breaking) + vcf_apa_df['pip_inv'] = 1. 
- vcf_apa_df['pip'] + + vcf_apa_df = vcf_apa_df.sort_values(by=['distance', 'pip_inv'], ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_apa_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'pas_id', 'strand', 'pas_pos', 'distance', 'variant', 'pip', 'molecular_trait_id']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + #Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("txrev/GTEx_snps_" + tissue_name + "_intronic_polya_finemapped_filtered.bed.gz", sep='\t', index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_intronic_polya_finemapped_filtered.bed.gz", sep='\t', compression='gzip') + + #Only keep SNPs with PIP > cutoff + pos_df = vcf_df_filtered.query("pip > " + str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of positive SNPs + pos_df.to_csv("txrev/GTEx_snps_" + tissue_name + "_intronic_polya_positives.bed.gz", sep='\t', index=False) + + print("len(pos_df) = " + str(len(pos_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/ipaqtl_vcfs.py b/data/qtl/ipaqtl_vcfs.py new file mode 100755 index 0000000..773c45e --- /dev/null +++ b/data/qtl/ipaqtl_vcfs.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python +from optparse import OptionParser +import os +import pdb +import time + +import numpy as np +import pandas as pd +import pyranges as pr +from tqdm import tqdm + +''' +ipaqtl_vcfs.py + +Generate positive and negative intronic paQTL sets from the QTL catalog 
txrevise. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options]' + parser = OptionParser(usage) + parser.add_option('--neg_pip', dest='neg_pip', + default=0.01, type='float', + help='PIP upper limit for negative examples. [Default: %default]') + parser.add_option('--pos_pip', dest='pos_pip', + default=0.9, type='float', + help='PIP lower limit for positive examples. [Default: %default]') + parser.add_option('--match_gene', dest='match_gene', + default=0, type='int', + help='Try finding negative in same gene as positive. [Default: %default]') + parser.add_option('--match_allele', dest='match_allele', + default=0, type='int', + help='Try finding negative with same ref and alt alleles. [Default: %default]') + parser.add_option('-o', dest='out_prefix', + default='qtlcat_ipaqtl') + (options,args) = parser.parse_args() + + tissue_name = options.out_prefix.split('txrev_')[1] + + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort_protein.gtf' + + # read variant table + qtlcat_df_neg = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_intronic_polya_negatives.bed.gz", sep='\t') + qtlcat_df_pos = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_intronic_polya_positives.bed.gz", sep='\t') + + # read TPM bin table and construct lookup dictionaries + tpm_df = pd.read_csv('ge/GTEx_ge_' + tissue_name + "_tpms.csv", sep='\t')[['gene_id', 'tpm', 'bin_index', 'bin_index_l', 'bin_index_r']] + gene_to_tpm_dict = tpm_df.set_index('gene_id').to_dict(orient='index') + + # filter on SNPs with genes in TPM bin dict + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + + #Load gene 
span annotation (protein-coding/categorized only) + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['id_str']) + gtf_genes = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]).unique().tolist() + + # filter on SNPs with genes in GTF file + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + + bin_to_genes_dict = {} + for _, row in tpm_df.iterrows() : + + if row['bin_index'] not in bin_to_genes_dict : + bin_to_genes_dict[row['bin_index']] = [] + + bin_to_genes_dict[row['bin_index']].append(row['gene_id']) + + for sample_bin in bin_to_genes_dict : + bin_to_genes_dict[sample_bin] = set(bin_to_genes_dict[sample_bin]) + + # split molecular trait id and filter for polyadenylation (for positives) + qtlcat_df_pos['gene'] = [mti.split('.')[0] for mti in qtlcat_df_pos.molecular_trait_id] + qtlcat_df_pos['event'] = [mti.split('.')[2] for mti in qtlcat_df_pos.molecular_trait_id] + + qtlcat_df_pos = qtlcat_df_pos[qtlcat_df_pos.event == 'downstream'] + qtlcat_df_pos = qtlcat_df_pos.rename(columns={'distance' : 'pas_dist'}) + + qtlcat_df_neg['molecular_trait_id'] = qtlcat_df_neg['gene_id'] + "." 
+ "grp_0.downstream.negative" + qtlcat_df_neg['gene'] = qtlcat_df_neg['gene_id'] + qtlcat_df_neg['event'] = 'downstream' + qtlcat_df_neg = qtlcat_df_neg.rename(columns={'distance' : 'pas_dist'}) + + paqtl_df = pd.concat([qtlcat_df_neg, qtlcat_df_pos]).copy().reset_index(drop=True) + + # determine positive variants + paqtl_pos_df = paqtl_df[paqtl_df.pip >= options.pos_pip] + paqtl_neg_df = paqtl_df[paqtl_df.pip < options.neg_pip] + pos_variants = set(paqtl_pos_df.variant) + + neg_gene_and_allele_variants = 0 + neg_gene_variants = 0 + + neg_expr_and_allele_variants = 0 + neg_expr_variants = 0 + + unmatched_variants = 0 + + # choose negative variants + neg_variants = set() + neg_dict = {} + for pvariant in tqdm(pos_variants): + paqtl_this_df = paqtl_pos_df[paqtl_pos_df.variant == pvariant] + + neg_found = False + + # optionally prefer negative from positive's gene set + if options.match_gene == 1 and options.match_allele == 1 : + pgenes = set(paqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if neg_found : + neg_gene_and_allele_variants += 1 + + if not neg_found and options.match_gene == 1 : + pgenes = set(paqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if neg_found : + neg_gene_variants += 1 + + if not neg_found and options.match_allele == 1 : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if not neg_found and 
gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if neg_found : + neg_expr_and_allele_variants += 1 + + if not neg_found : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if neg_found : + neg_expr_variants += 1 + + if not neg_found : + print("[Warning] Could not find a matching negative for '" + pvariant + "'") + unmatched_variants += 1 + + print('%d positive variants' % len(pos_variants)) + print('%d negative variants' % len(neg_variants)) + print(' - %d gene-matched negatives with same alleles' % neg_gene_and_allele_variants) + print(' - %d gene-matched negatives ' % neg_gene_variants) + print(' - %d expr-matched negatives with same alleles' % neg_expr_and_allele_variants) + print(' - %d expr-matched negatives ' % neg_expr_variants) + print(' - %d unmatched negatives ' % unmatched_variants) + + pos_dict = {pv: pv for pv in pos_variants} + + # write VCFs + 
write_vcf('%s_pos.vcf' % options.out_prefix, paqtl_df, pos_variants, pos_dict) + write_vcf('%s_neg.vcf' % options.out_prefix, paqtl_df, neg_variants, neg_dict) + +def find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, match_allele) : + + gene_mask = np.array([gene in pgenes for gene in paqtl_neg_df.gene]) + paqtl_neg_gene_df = paqtl_neg_df[gene_mask] + + # match PAS distance + this_dist = paqtl_this_df.iloc[0].pas_dist + dist_cmp = np.abs(paqtl_neg_gene_df.pas_dist - this_dist) + dist_cmp_unique = np.sort(np.unique(dist_cmp.values)) + + this_ref = paqtl_this_df.iloc[0].ref + this_alt = paqtl_this_df.iloc[0].alt + + for ni_unique in dist_cmp_unique: + + paqtl_neg_gene_dist_df = paqtl_neg_gene_df.loc[dist_cmp == ni_unique] + + shuffle_index = np.arange(len(paqtl_neg_gene_dist_df), dtype='int32') + np.random.shuffle(shuffle_index) + + for npaqtl_i in range(len(paqtl_neg_gene_dist_df)) : + npaqtl = paqtl_neg_gene_dist_df.iloc[shuffle_index[npaqtl_i]] + + if not match_allele or (npaqtl.ref == this_ref and npaqtl.alt == this_alt): + if npaqtl.variant not in neg_variants and npaqtl.variant not in pos_variants: + + neg_variants.add(npaqtl.variant) + neg_dict[npaqtl.variant] = paqtl_this_df.iloc[0].variant + + return True + + return False + +def write_vcf(vcf_file, df, variants_write, variants_dict): + vcf_open = open(vcf_file, 'w') + print('##fileformat=VCFv4.2', file=vcf_open) + print('##INFO=<ID=MT,Number=1,Type=String,Description="Molecular trait id">', + file=vcf_open) + print('##INFO=<ID=PD,Number=1,Type=Integer,Description="PolyA distance">', + file=vcf_open) + print('##INFO=<ID=PI,Number=1,Type=String,Description="Positive SNP id">', + file=vcf_open) + cols = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'] + print('\t'.join(cols), file=vcf_open) + + variants_written = set() + + for v in df.itertuples(): + if v.variant in variants_write and v.variant not in variants_written: + cols = [v.chrom, str(v.pos), v.variant, v.ref, v.alt, '.', '.'] + cols += ['MT=%s;PD=%d;PI=%s' % (v.molecular_trait_id, v.pas_dist, variants_dict[v.variant])] + print('\t'.join(cols), file=vcf_open) + 
variants_written.add(v.variant) + + vcf_open.close() + + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/make_expression_tables.py b/data/qtl/make_expression_tables.py new file mode 100755 index 0000000..ddc2a63 --- /dev/null +++ b/data/qtl/make_expression_tables.py @@ -0,0 +1,181 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +import matplotlib.pyplot as plt + +''' +make_expression_tables.py + +Construct TPM buckets to sample genes from. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Define tissue column-to-file mapping + tissue_dict = { + 'Adipose - Subcutaneous' : 'adipose_subcutaneous', + 'Adipose - Visceral (Omentum)' : 'adipose_visceral', + 'Adrenal Gland' : 'adrenal_gland', + 'Artery - Aorta' : 'artery_aorta', + 'Artery - Coronary' : 'artery_coronary', + 'Artery - Tibial' : 'artery_tibial', + 'Whole Blood' : 'blood', + 'Brain - Amygdala' : 'brain_amygdala', + 'Brain - Anterior cingulate cortex (BA24)' : 'brain_anterior_cingulate_cortex', + 'Brain - Caudate (basal ganglia)' : 'brain_caudate', + 'Brain - Cerebellar Hemisphere' : 'brain_cerebellar_hemisphere', + 'Brain - Cerebellum' : 'brain_cerebellum', + 'Brain - Cortex' : 'brain_cortex', + 'Brain - Frontal Cortex (BA9)' : 'brain_frontal_cortex', + 'Brain - Hippocampus' : 'brain_hippocampus', + 'Brain - Hypothalamus' : 'brain_hypothalamus', + 'Brain - Nucleus accumbens (basal ganglia)' : 'brain_nucleus_accumbens', + 'Brain - Putamen (basal 
ganglia)' : 'brain_putamen', + 'Brain - Spinal cord (cervical c-1)' : 'brain_spinal_cord', + 'Brain - Substantia nigra' : 'brain_substantia_nigra', + 'Breast - Mammary Tissue' : 'breast', + 'Colon - Sigmoid' : 'colon_sigmoid', + 'Colon - Transverse' : 'colon_transverse', + 'Esophagus - Gastroesophageal Junction' : 'esophagus_gej', + 'Esophagus - Mucosa' : 'esophagus_mucosa', + 'Esophagus - Muscularis' : 'esophagus_muscularis', + 'Cells - Cultured fibroblasts' : 'fibroblast', + 'Heart - Atrial Appendage' : 'heart_atrial_appendage', + 'Heart - Left Ventricle' : 'heart_left_ventricle', + 'Kidney - Cortex' : 'kidney_cortex', + 'Cells - EBV-transformed lymphocytes' : 'LCL', + 'Liver' : 'liver', + 'Lung' : 'lung', + 'Minor Salivary Gland' : 'minor_salivary_gland', + 'Muscle - Skeletal' : 'muscle', + 'Nerve - Tibial' : 'nerve_tibial', + 'Ovary' : 'ovary', + 'Pancreas' : 'pancreas', + 'Pituitary' : 'pituitary', + 'Prostate' : 'prostate', + 'Skin - Not Sun Exposed (Suprapubic)' : 'skin_not_sun_exposed', + 'Skin - Sun Exposed (Lower leg)' : 'skin_sun_exposed', + 'Small Intestine - Terminal Ileum' : 'small_intestine', + 'Spleen' : 'spleen', + 'Stomach' : 'stomach', + 'Testis' : 'testis', + 'Thyroid' : 'thyroid', + 'Uterus' : 'uterus', + 'Vagina' : 'vagina', + } + + for tissue_name in tissue_dict : + + #Load TPM matrix + tpm_df = pd.read_csv("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz", sep='\t', compression='gzip', skiprows=2) + + save_name = tissue_dict[tissue_name] + + print("-- " + save_name + " --") + + #Clean dataframe + tpm_df['gene_id'] = tpm_df['Name'].apply(lambda x: x.split(".")[0]) + + tpm_df = tpm_df.drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + tpm_df['tpm'] = tpm_df[tissue_name] + tpm_df = tpm_df[['gene_id', 'tpm']] + + #Get non-zero TPM entries + tpm_df_zero = tpm_df.loc[tpm_df['tpm'] == 0].copy().reset_index(drop=True) + tpm_df_nonzero = tpm_df.loc[tpm_df['tpm'] > 0].copy().reset_index(drop=True) + + 
tpm_df_zero['tpm_log2'] = 0. + tpm_df_nonzero['tpm_log2'] = np.log2(tpm_df_nonzero['tpm']) + + #Clip at extremes + min_q = 0.0075 + max_q = 0.9925 + + #Log2 fold change bin sizes + bin_size = 0.4 + bin_offset = 0.15 + + min_tpm_log2 = np.quantile(tpm_df_nonzero['tpm_log2'], q=min_q) + max_tpm_log2 = np.quantile(tpm_df_nonzero['tpm_log2'], q=max_q) + + tpm_df_nonzero.loc[tpm_df_nonzero['tpm_log2'] < min_tpm_log2, 'tpm_log2'] = min_tpm_log2 + tpm_df_nonzero.loc[tpm_df_nonzero['tpm_log2'] > max_tpm_log2, 'tpm_log2'] = max_tpm_log2 + + tpm_log2 = tpm_df_nonzero['tpm_log2'].values + + n_bins = int((max_tpm_log2 - min_tpm_log2) / bin_size) + + #Get sample bins + sample_bins = np.linspace(min_tpm_log2, max_tpm_log2, n_bins+1) + + #Map values to bins + bin_index = np.digitize(tpm_log2, sample_bins[1:], right=True) + bin_index_l = np.digitize(tpm_log2 - bin_offset, sample_bins[1:], right=True) + bin_index_r = np.digitize(tpm_log2 + bin_offset, sample_bins[1:], right=True) + + tpm_df_zero['bin_index_l'] = -1 * np.ones(len(tpm_df_zero), dtype='int32') + tpm_df_zero['bin_index'] = -1 * np.ones(len(tpm_df_zero), dtype='int32') + tpm_df_zero['bin_index_r'] = -1 * np.ones(len(tpm_df_zero), dtype='int32') + + tpm_df_nonzero['bin_index_l'] = bin_index_l + tpm_df_nonzero['bin_index'] = bin_index + tpm_df_nonzero['bin_index_r'] = bin_index_r + + tpm_df = pd.concat([tpm_df_zero, tpm_df_nonzero]).copy().reset_index(drop=True) + + tpm_df = tpm_df.sort_values(by='gene_id', ascending=True).copy().reset_index(drop=True) + + #Save dataframe + tpm_df.to_csv('ge/GTEx_ge_' + save_name + "_tpms.csv", sep='\t', index=False) + + #Visualize TPM sample bins + tpm_df_filtered = tpm_df.loc[tpm_df['tpm'] > 0.] 
+ + f = plt.figure(figsize=(4, 3)) + + plt.hist(tpm_df_filtered['bin_index'].values, bins=np.unique(tpm_df_filtered['bin_index'].values)) + + plt.xlim(0, np.max(tpm_df_filtered['bin_index'].values)) + + plt.xticks(fontsize=8) + plt.yticks(fontsize=8) + + plt.xlabel("Sample bin (FC < " + str(round(2**(bin_size+2*bin_offset), 2)) + ")", fontsize=8) + plt.ylabel("# of genes", fontsize=8) + + plt.title("TPM sample bins (" + save_name + ")", fontsize=8) + + plt.tight_layout() + + plt.savefig('ge/GTEx_ge_' + save_name + "_tpms.png", transparent=False, dpi=300) + + plt.close() + + #Check and warn in case of low-support bins + _, bin_support = np.unique(tpm_df_filtered['bin_index'].values, return_counts=True) + + if np.any(bin_support < 100) : + print("[Warning] Fewer than 100 genes in some of the TPM sample bins (min = " + str(int(np.min(bin_support))) + ").") + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/make_vcfs.py b/data/qtl/make_vcfs.py new file mode 100755 index 0000000..aa251d0 --- /dev/null +++ b/data/qtl/make_vcfs.py @@ -0,0 +1,112 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import glob +import os + +import pandas as pd + +import util + +''' +make_vcfs.py + +Generate positive and negative QTL VCFs per tissue and merge them across tissues. 
+''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + pip = 0.2 + match_gene = 0 + match_allele = 1 + + ################################################ + # intronic polyA QTLs + + out_dir = 'ipaqtl_pip%d%s%s' % (pip*100, 'g' if match_gene == 1 else 'e', 'a' if match_allele else '') + os.makedirs(out_dir, exist_ok=True) + + jobs = [] + for table_file in glob.glob('txrev/*.txt.gz'): + out_prefix = table_file.replace('txrev/', '%s/' % out_dir) + out_prefix = out_prefix.replace('.purity_filtered.txt.gz', '') + cmd = './ipaqtl_vcfs.py --neg_pip 0.01 --pos_pip %f --match_gene %d --match_allele %d -o %s' % (pip, match_gene, match_allele, out_prefix) + jobs.append(cmd) + util.exec_par(jobs, 6, verbose=True) + + # merge study/tissue variants + mpos_vcf_file = '%s/pos_merge.vcf' % out_dir + mneg_vcf_file = '%s/neg_merge.vcf' % out_dir + merge_variants(mpos_vcf_file, '%s/*_pos.vcf' % out_dir) + merge_variants(mneg_vcf_file, '%s/*_neg.vcf' % out_dir) + + + ################################################ + # polyA QTLs + + out_dir = 'paqtl_pip%d%s%s' % (pip*100, 'g' if match_gene == 1 else 'e', 'a' if match_allele else '') + os.makedirs(out_dir, exist_ok=True) + + jobs = [] + for table_file in glob.glob('txrev/*.txt.gz'): + out_prefix = table_file.replace('txrev/', '%s/' % out_dir) + out_prefix = out_prefix.replace('.purity_filtered.txt.gz', '') + cmd = './paqtl_vcfs.py --neg_pip 0.01 --pos_pip %f --match_gene %d --match_allele %d -o %s' % (pip, match_gene, match_allele, out_prefix) + jobs.append(cmd) + util.exec_par(jobs, 6, verbose=True) + + # merge study/tissue variants + mpos_vcf_file = '%s/pos_merge.vcf' % out_dir + mneg_vcf_file = '%s/neg_merge.vcf' % out_dir + merge_variants(mpos_vcf_file, 
'%s/*_pos.vcf' % out_dir) + merge_variants(mneg_vcf_file, '%s/*_neg.vcf' % out_dir) + + ################################################ + # splicing QTLs + + out_dir = 'sqtl_pip%d%s%s' % (pip*100, 'g' if match_gene == 1 else 'e', 'a' if match_allele else '') + os.makedirs(out_dir, exist_ok=True) + + jobs = [] + for table_file in glob.glob('txrev/*.txt.gz'): + out_prefix = table_file.replace('txrev/', '%s/' % out_dir) + out_prefix = out_prefix.replace('.purity_filtered.txt.gz', '') + cmd = './sqtl_vcfs.py --neg_pip 0.01 --pos_pip %f --match_gene %d --match_allele %d -o %s' % (pip, match_gene, match_allele, out_prefix) + jobs.append(cmd) + util.exec_par(jobs, 6, verbose=True) + + # merge study/tissue variants + mpos_vcf_file = '%s/pos_merge.vcf' % out_dir + mneg_vcf_file = '%s/neg_merge.vcf' % out_dir + merge_variants(mpos_vcf_file, '%s/*_pos.vcf' % out_dir) + merge_variants(mneg_vcf_file, '%s/*_neg.vcf' % out_dir) + + +def merge_variants(merge_vcf_file, vcf_glob): + with open(merge_vcf_file, 'w') as merge_vcf_open: + vcf0_file = list(glob.glob(vcf_glob))[0] + for line in open(vcf0_file): + if line[0] == '#': + print(line, end='', file=merge_vcf_open) + + merged_variants = set() + for vcf_file in glob.glob(vcf_glob): + for line in open(vcf_file): + if not line.startswith('#'): + variant = line.split()[2] + if variant not in merged_variants: + print(line, file=merge_vcf_open, end='') + merged_variants.add(variant) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/merge_finemapping_tables.py b/data/qtl/merge_finemapping_tables.py new file mode 100755 index 0000000..ac4fa7d --- /dev/null +++ b/data/qtl/merge_finemapping_tables.py @@ -0,0 +1,102 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +''' 
+merge_finemapping_tables.py + +Merge fine-mapping tables of QTL credible sets. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Load and merge fine-mapping results + dfs = [] + for tissue_name in tissue_names : + + print("-- " + tissue_name + " --") + + df = pd.read_csv("txrev/GTEx_txrev_" + tissue_name + ".purity_filtered.txt.gz", sep='\t', usecols=['chromosome', 'position', 'ref', 'alt', 'variant', 'pip'], low_memory=False) + dfs.append(df.sort_values(by='pip', ascending=False).drop_duplicates(subset=['variant'], keep='first').copy().reset_index(drop=True)) + + df = pd.concat(dfs).sort_values(by='pip', ascending=False).drop_duplicates(subset=['variant'], keep='first').copy().reset_index(drop=True) + + 
df['chromosome'] = "chr" + df['chromosome'].astype(str) + df = df.rename(columns={'chromosome' : 'chrom', 'position' : 'pos'}) + + print("len(df) = " + str(len(df))) + + #Save union of dataframes + df.to_csv("txrev/GTEx_txrev_finemapped_merged.csv.gz", sep='\t', index=False) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/paqtl_make_negative_sets.py b/data/qtl/paqtl_make_negative_sets.py new file mode 100755 index 0000000..a5da60d --- /dev/null +++ b/data/qtl/paqtl_make_negative_sets.py @@ -0,0 +1,196 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +paqtl_make_negative_sets.py + +Build tables with negative (non-causal) SNPs for paQTLs. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + apa_file = '/home/drk/common/data/genomes/hg38/genes/polyadb/polyadb_exon3.bed' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + finemap_file = 'txrev/GTEx_txrev_finemapped_merged.csv.gz' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 
'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Compile negative SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load summary stats and extract unique set of SNPs + vcf_df = pd.read_csv("ge/GTEx_ge_" + tissue_name + ".all.tsv.gz", sep='\t', compression='gzip', usecols=['chromosome', 'position', 'ref', 'alt']).drop_duplicates(subset=['chromosome', 'position', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + + #Only keep SNPs (no indels) + vcf_df = vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & (vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "." 
+ + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Store intermediate SNPs + #vcf_df.to_csv("ge/GTEx_snps_" + tissue_name + ".bed.gz", sep='\t', index=False, header=False) + + #Load polyadenylation site annotation + apa_df = pd.read_csv(apa_file, sep='\t', names=['Chromosome', 'Start', 'End', 'pas_id', 'feat1', 'Strand']) + apa_df['Start'] += 1 + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + apa_pr = pr.PyRanges(apa_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + apa_gtf_pr = apa_pr.join(gtf_pr, strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + apa_gtf_df = apa_gtf_pr.df[['Chromosome', 'Start', 'End', 'pas_id', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id']].copy().reset_index(drop=True) + + apa_gtf_df['Start'] -= max_distance + apa_gtf_df['End'] += max_distance + + #Join vcf against polyadenylation annotation + apa_gtf_pr = pr.PyRanges(apa_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_apa_pr = vcf_gtf_pr.join(apa_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be 
same as the gene_id of the polyA site + vcf_apa_df = vcf_apa_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'pas_id', 'Strand_b', 'Start_b']] + + #PolyA site position + vcf_apa_df['Start_b'] += max_distance + vcf_apa_df = vcf_apa_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'pas_pos', 'Strand_b' : 'Strand'}) + + #Distance to polyA site + vcf_apa_df['distance'] = np.abs(vcf_apa_df['Pos'] - vcf_apa_df['pas_pos']) + + #Choose unique SNPs by shortest distance to polyA site + vcf_apa_df = vcf_apa_df.sort_values(by='distance', ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_apa_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'pas_id', 'strand', 'pas_pos', 'distance']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + #Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("ge/GTEx_snps_" + tissue_name + "_polya_filtered.bed.gz", sep='\t', index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_polya_filtered.bed.gz", sep='\t', compression='gzip') + + #Create variant identifier + vcf_df_filtered['variant'] = vcf_df_filtered['chrom'] + "_" + vcf_df_filtered['pos'].astype(str) + "_" + vcf_df_filtered['ref'] + "_" + vcf_df_filtered['alt'] + + #Load merged fine-mapping dataframe + finemap_df = pd.read_csv(finemap_file, sep='\t')[['variant', 'pip']] + + #Join against fine-mapping dataframe + neg_df = vcf_df_filtered.join(finemap_df.set_index('variant'), on='variant', how='left') + neg_df.loc[neg_df['pip'].isnull(), 'pip'] = 0. 
+ + #Only keep SNPs with PIP < cutoff + neg_df = neg_df.query("pip < " + str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of negative SNPs + neg_df.to_csv("ge/GTEx_snps_" + tissue_name + "_polya_negatives.bed.gz", sep='\t', index=False) + + print("len(neg_df) = " + str(len(neg_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/paqtl_make_positive_sets.py b/data/qtl/paqtl_make_positive_sets.py new file mode 100755 index 0000000..3d07fa3 --- /dev/null +++ b/data/qtl/paqtl_make_positive_sets.py @@ -0,0 +1,191 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +paqtl_make_positive_sets.py + +Build tables with positive (causal) SNPs for paQTLs. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + apa_file = '/home/drk/common/data/genomes/hg38/genes/polyadb/polyadb_exon3.bed' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 
'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Compile positive SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load fine-mapping table + vcf_df = pd.read_csv("txrev/GTEx_txrev_" + tissue_name + ".purity_filtered.txt.gz", sep='\t', usecols=['chromosome', 'position', 'ref', 'alt', 'variant', 'pip', 'molecular_trait_id'], low_memory=False) + + #Only keep SNPs (no indels) + vcf_df = vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & (vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + #Only keep SNPs associated with polyadenylation events (literal substring match, not a regex) + vcf_df = vcf_df.loc[vcf_df['molecular_trait_id'].str.contains(".downstream.", regex=False)].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "."
+ + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand', 'variant', 'pip', 'molecular_trait_id']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Load polyadenylation site annotation + apa_df = pd.read_csv(apa_file, sep='\t', names=['Chromosome', 'Start', 'End', 'pas_id', 'feat1', 'Strand']) + apa_df['Start'] += 1 + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + apa_pr = pr.PyRanges(apa_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + apa_gtf_pr = apa_pr.join(gtf_pr, strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + apa_gtf_df = apa_gtf_pr.df[['Chromosome', 'Start', 'End', 'pas_id', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id', 'variant', 'pip', 'molecular_trait_id']].copy().reset_index(drop=True) + + apa_gtf_df['Start'] -= max_distance + apa_gtf_df['End'] += max_distance + + #Join vcf against polyadenylation annotation + apa_gtf_pr = pr.PyRanges(apa_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_apa_pr = vcf_gtf_pr.join(apa_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be same as the gene_id of the polyA site + 
vcf_apa_df = vcf_apa_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'pas_id', 'Strand_b', 'Start_b', 'variant', 'pip', 'molecular_trait_id']] + + #Force gene_id of SNP to be same as the gene_id of the finemapped molecular trait + vcf_apa_df['molecular_trait_gene_id'] = vcf_apa_df['molecular_trait_id'].apply(lambda x: x.split(".")[0]) + vcf_apa_df = vcf_apa_df.query("gene_id == molecular_trait_gene_id").copy().reset_index(drop=True) + + #PolyA site position + vcf_apa_df['Start_b'] += max_distance + vcf_apa_df = vcf_apa_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'pas_pos', 'Strand_b' : 'Strand'}) + + #Distance to polyA site + vcf_apa_df['distance'] = np.abs(vcf_apa_df['Pos'] - vcf_apa_df['pas_pos']) + + #Choose unique SNPs by shortest distance to polyA site (and inverse PIP for tie-breaking) + vcf_apa_df['pip_inv'] = 1. - vcf_apa_df['pip'] + + vcf_apa_df = vcf_apa_df.sort_values(by=['distance', 'pip_inv'], ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_apa_df = vcf_apa_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_apa_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'pas_id', 'strand', 'pas_pos', 'distance', 'variant', 'pip', 'molecular_trait_id']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + #Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("txrev/GTEx_snps_" + tissue_name + "_polya_finemapped_filtered.bed.gz", sep='\t', index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_polya_finemapped_filtered.bed.gz", sep='\t', compression='gzip') + + #Only keep SNPs with PIP > cutoff + pos_df = vcf_df_filtered.query("pip > " + 
str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of positive SNPs + pos_df.to_csv("txrev/GTEx_snps_" + tissue_name + "_polya_positives.bed.gz", sep='\t', index=False) + + print("len(pos_df) = " + str(len(pos_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/paqtl_vcfs.py b/data/qtl/paqtl_vcfs.py new file mode 100755 index 0000000..f0884b1 --- /dev/null +++ b/data/qtl/paqtl_vcfs.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python +from optparse import OptionParser +import os +import pdb +import time + +import numpy as np +import pandas as pd +import pyranges as pr +from tqdm import tqdm + +''' +paqtl_vcfs.py + +Generate positive and negative paQTL sets from the QTL catalog txrevise. +''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options]' + parser = OptionParser(usage) + parser.add_option('--neg_pip', dest='neg_pip', + default=0.01, type='float', + help='PIP upper limit for negative examples. [Default: %default]') + parser.add_option('--pos_pip', dest='pos_pip', + default=0.9, type='float', + help='PIP lower limit for positive examples. [Default: %default]') + parser.add_option('--match_gene', dest='match_gene', + default=0, type='int', + help='Try finding negative in same gene as positive. [Default: %default]') + parser.add_option('--match_allele', dest='match_allele', + default=0, type='int', + help='Try finding negative with same ref and alt alleles. 
[Default: %default]') + parser.add_option('-o', dest='out_prefix', + default='qtlcat_paqtl') + (options,args) = parser.parse_args() + + tissue_name = options.out_prefix.split('txrev_')[1] + + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort_protein.gtf' + + # read variant table + qtlcat_df_neg = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_polya_negatives.bed.gz", sep='\t') + qtlcat_df_pos = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_polya_positives.bed.gz", sep='\t') + + # read TPM bin table and construct lookup dictionaries + tpm_df = pd.read_csv('ge/GTEx_ge_' + tissue_name + "_tpms.csv", sep='\t')[['gene_id', 'tpm', 'bin_index', 'bin_index_l', 'bin_index_r']] + gene_to_tpm_dict = tpm_df.set_index('gene_id').to_dict(orient='index') + + # filter on SNPs with genes in TPM bin dict + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + + #Load gene span annotation (protein-coding/categorized only) + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['id_str']) + gtf_genes = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]).unique().tolist() + + # filter on SNPs with genes in GTF file + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + + bin_to_genes_dict = {} + for _, row in tpm_df.iterrows() : + + if row['bin_index'] not in bin_to_genes_dict : + bin_to_genes_dict[row['bin_index']] = [] + + bin_to_genes_dict[row['bin_index']].append(row['gene_id']) + + for sample_bin in bin_to_genes_dict : + bin_to_genes_dict[sample_bin] = set(bin_to_genes_dict[sample_bin]) + + # split molecular trait id and filter for 
polyadenylation (for positives) + qtlcat_df_pos['gene'] = [mti.split('.')[0] for mti in qtlcat_df_pos.molecular_trait_id] + qtlcat_df_pos['event'] = [mti.split('.')[2] for mti in qtlcat_df_pos.molecular_trait_id] + + qtlcat_df_pos = qtlcat_df_pos[qtlcat_df_pos.event == 'downstream'] + qtlcat_df_pos = qtlcat_df_pos.rename(columns={'distance' : 'pas_dist'}) + + qtlcat_df_neg['molecular_trait_id'] = qtlcat_df_neg['gene_id'] + "." + "grp_0.downstream.negative" + qtlcat_df_neg['gene'] = qtlcat_df_neg['gene_id'] + qtlcat_df_neg['event'] = 'downstream' + qtlcat_df_neg = qtlcat_df_neg.rename(columns={'distance' : 'pas_dist'}) + + paqtl_df = pd.concat([qtlcat_df_neg, qtlcat_df_pos]).copy().reset_index(drop=True) + + # determine positive variants + paqtl_pos_df = paqtl_df[paqtl_df.pip >= options.pos_pip] + paqtl_neg_df = paqtl_df[paqtl_df.pip < options.neg_pip] + pos_variants = set(paqtl_pos_df.variant) + + neg_gene_and_allele_variants = 0 + neg_gene_variants = 0 + + neg_expr_and_allele_variants = 0 + neg_expr_variants = 0 + + unmatched_variants = 0 + + # choose negative variants + neg_variants = set() + neg_dict = {} + for pvariant in tqdm(pos_variants): + paqtl_this_df = paqtl_pos_df[paqtl_pos_df.variant == pvariant] + + neg_found = False + + # optionally prefer negative from positive's gene set + if options.match_gene == 1 and options.match_allele == 1 : + pgenes = set(paqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if neg_found : + neg_gene_and_allele_variants += 1 + + if not neg_found and options.match_gene == 1 : + pgenes = set(paqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if neg_found : + neg_gene_variants += 1 + + if not neg_found and options.match_allele == 1 : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, 
neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, True) + + if neg_found : + neg_expr_and_allele_variants += 1 + + if not neg_found : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[paqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, False) + + if neg_found : + neg_expr_variants += 1 + + if not neg_found : + print("[Warning] Could not find a matching negative for '" + pvariant + "'") + unmatched_variants += 1 + + print('%d positive variants' % len(pos_variants)) + print('%d negative variants' % 
len(neg_variants)) + print(' - %d gene-matched negatives with same alleles' % neg_gene_and_allele_variants) + print(' - %d gene-matched negatives ' % neg_gene_variants) + print(' - %d expr-matched negatives with same alleles' % neg_expr_and_allele_variants) + print(' - %d expr-matched negatives ' % neg_expr_variants) + print(' - %d unmatched negatives ' % unmatched_variants) + + pos_dict = {pv: pv for pv in pos_variants} + + # write VCFs + write_vcf('%s_pos.vcf' % options.out_prefix, paqtl_df, pos_variants, pos_dict) + write_vcf('%s_neg.vcf' % options.out_prefix, paqtl_df, neg_variants, neg_dict) + +def find_negative(neg_variants, neg_dict, pos_variants, paqtl_this_df, paqtl_neg_df, pgenes, match_allele) : + + gene_mask = np.array([gene in pgenes for gene in paqtl_neg_df.gene]) + paqtl_neg_gene_df = paqtl_neg_df[gene_mask] + + # match PAS distance + this_dist = paqtl_this_df.iloc[0].pas_dist + dist_cmp = np.abs(paqtl_neg_gene_df.pas_dist - this_dist) + dist_cmp_unique = np.sort(np.unique(dist_cmp.values)) + + this_ref = paqtl_this_df.iloc[0].ref + this_alt = paqtl_this_df.iloc[0].alt + + for ni_unique in dist_cmp_unique: + + paqtl_neg_gene_dist_df = paqtl_neg_gene_df.loc[dist_cmp == ni_unique] + + shuffle_index = np.arange(len(paqtl_neg_gene_dist_df), dtype='int32') + np.random.shuffle(shuffle_index) + + for npaqtl_i in range(len(paqtl_neg_gene_dist_df)) : + npaqtl = paqtl_neg_gene_dist_df.iloc[shuffle_index[npaqtl_i]] + + if not match_allele or (npaqtl.ref == this_ref and npaqtl.alt == this_alt): + if npaqtl.variant not in neg_variants and npaqtl.variant not in pos_variants: + + neg_variants.add(npaqtl.variant) + neg_dict[npaqtl.variant] = paqtl_this_df.iloc[0].variant + + return True + + return False + +def write_vcf(vcf_file, df, variants_write, variants_dict): + vcf_open = open(vcf_file, 'w') + print('##fileformat=VCFv4.2', file=vcf_open) + print('##INFO=<ID=MT,Number=1,Type=String,Description="Molecular trait id">', + file=vcf_open) + print('##INFO=<ID=PD,Number=1,Type=Integer,Description="Distance to polyadenylation site">', + file=vcf_open) + print('##INFO=<ID=PI,Number=1,Type=String,Description="Positive variant id">', + file=vcf_open) + cols = 
['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'] + print('\t'.join(cols), file=vcf_open) + + variants_written = set() + + for v in df.itertuples(): + if v.variant in variants_write and v.variant not in variants_written: + cols = [v.chrom, str(v.pos), v.variant, v.ref, v.alt, '.', '.'] + cols += ['MT=%s;PD=%d;PI=%s' % (v.molecular_trait_id, v.pas_dist, variants_dict[v.variant])] + print('\t'.join(cols), file=vcf_open) + variants_written.add(v.variant) + + vcf_open.close() + + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/sqtl_make_negative_sets.py b/data/qtl/sqtl_make_negative_sets.py new file mode 100755 index 0000000..7518ca4 --- /dev/null +++ b/data/qtl/sqtl_make_negative_sets.py @@ -0,0 +1,195 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +sqtl_make_negative_sets.py + +Build tables with negative (non-causal) SNPs for sQTLs. 
+''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + splice_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_protein_splice.gff' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + finemap_file = 'txrev/GTEx_txrev_finemapped_merged.csv.gz' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Compile negative SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load summary stats and extract unique set of SNPs + vcf_df = pd.read_csv("ge/GTEx_ge_" + tissue_name + ".all.tsv.gz", sep='\t', compression='gzip', usecols=['chromosome', 'position', 'ref', 
'alt']).drop_duplicates(subset=['chromosome', 'position', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + + #Only keep SNPs (no indels) + vcf_df = vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & (vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "." + + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Store intermediate SNPs + #vcf_df.to_csv("ge/GTEx_snps_" + tissue_name + ".bed.gz", sep='\t', index=False, header=False) + + #Load splice site annotation + splice_df = pd.read_csv(splice_file, sep='\t', names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str'], usecols=['Chromosome', 'Start', 'End', 'feature', 'feat1', 'Strand'])[['Chromosome', 'Start', 'End', 'feature', 'feat1', 'Strand']] + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + splice_pr = pr.PyRanges(splice_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + splice_gtf_pr = splice_pr.join(gtf_pr, 
strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + splice_gtf_df = splice_gtf_pr.df[['Chromosome', 'Start', 'End', 'feature', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id']].copy().reset_index(drop=True) + + splice_gtf_df['Start'] -= max_distance + splice_gtf_df['End'] += max_distance + + #Join vcf against splice annotation + splice_gtf_pr = pr.PyRanges(splice_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_splice_pr = vcf_gtf_pr.join(splice_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be same as the gene_id of the splice site + vcf_splice_df = vcf_splice_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_splice_df = vcf_splice_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'feature', 'Strand_b', 'Start_b']] + + #Splice site position + vcf_splice_df['Start_b'] += max_distance + vcf_splice_df = vcf_splice_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'splice_pos', 'Strand_b' : 'Strand'}) + + #Distance to splice site + vcf_splice_df['distance'] = np.abs(vcf_splice_df['Pos'] - vcf_splice_df['splice_pos']) + + #Choose unique SNPs by shortest distance to splice site + vcf_splice_df = vcf_splice_df.sort_values(by='distance', ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_splice_df = vcf_splice_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_splice_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'feature', 'strand', 'splice_pos', 'distance']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + #Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("ge/GTEx_snps_" + tissue_name + "_splice_filtered.bed.gz", sep='\t', 
index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_splice_filtered.bed.gz", sep='\t', compression='gzip') + + #Create variant identifier + vcf_df_filtered['variant'] = vcf_df_filtered['chrom'] + "_" + vcf_df_filtered['pos'].astype(str) + "_" + vcf_df_filtered['ref'] + "_" + vcf_df_filtered['alt'] + + #Load merged fine-mapping dataframe + finemap_df = pd.read_csv(finemap_file, sep='\t')[['variant', 'pip']] + + #Join against fine-mapping dataframe + neg_df = vcf_df_filtered.join(finemap_df.set_index('variant'), on='variant', how='left') + neg_df.loc[neg_df['pip'].isnull(), 'pip'] = 0. + + #Only keep SNPs with PIP < cutoff + neg_df = neg_df.query("pip < " + str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of negative SNPs + neg_df.to_csv("ge/GTEx_snps_" + tissue_name + "_splice_negatives.bed.gz", sep='\t', index=False) + + print("len(neg_df) = " + str(len(neg_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/sqtl_make_positive_sets.py b/data/qtl/sqtl_make_positive_sets.py new file mode 100755 index 0000000..954ab7e --- /dev/null +++ b/data/qtl/sqtl_make_positive_sets.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python +from optparse import OptionParser + +import os + +import util + +import numpy as np +import pandas as pd + +import pyranges as pr + +''' +sqtl_make_positive_sets.py + +Build tables with positive (causal) SNPs for sQTLs. 
+''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options] arg' + parser = OptionParser(usage) + #parser.add_option() + (options,args) = parser.parse_args() + + #Parameters + pip_cutoff = 0.01 + max_distance = 10000 + gene_pad = 50 + splice_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_protein_splice.gff' + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort.gtf' + + #Define tissues + tissue_names = [ + 'adipose_subcutaneous', + 'adipose_visceral', + 'adrenal_gland', + 'artery_aorta', + 'artery_coronary', + 'artery_tibial', + 'blood', + 'brain_amygdala', + 'brain_anterior_cingulate_cortex', + 'brain_caudate', + 'brain_cerebellar_hemisphere', + 'brain_cerebellum', + 'brain_cortex', + 'brain_frontal_cortex', + 'brain_hippocampus', + 'brain_hypothalamus', + 'brain_nucleus_accumbens', + 'brain_putamen', + 'brain_spinal_cord', + 'brain_substantia_nigra', + 'breast', + 'colon_sigmoid', + 'colon_transverse', + 'esophagus_gej', + 'esophagus_mucosa', + 'esophagus_muscularis', + 'fibroblast', + 'heart_atrial_appendage', + 'heart_left_ventricle', + 'kidney_cortex', + 'LCL', + 'liver', + 'lung', + 'minor_salivary_gland', + 'muscle', + 'nerve_tibial', + 'ovary', + 'pancreas', + 'pituitary', + 'prostate', + 'skin_not_sun_exposed', + 'skin_sun_exposed', + 'small_intestine', + 'spleen', + 'stomach', + 'testis', + 'thyroid', + 'uterus', + 'vagina', + ] + + #Compile positive SNP set for each tissue + for tissue_name in tissue_names : + + print("-- " + str(tissue_name) + " --") + + #Load fine-mapping table + vcf_df = pd.read_csv("txrev/GTEx_txrev_" + tissue_name + ".purity_filtered.txt.gz", sep='\t', usecols=['chromosome', 'position', 'ref', 'alt', 'variant', 'pip', 'molecular_trait_id'], low_memory=False) + + #Only keep SNPs (no indels) + vcf_df = 
vcf_df.loc[(vcf_df['ref'].str.len() == vcf_df['alt'].str.len()) & (vcf_df['ref'].str.len() == 1)].copy().reset_index(drop=True) + + #Only keep SNPs associated with splice events + vcf_df = vcf_df.loc[vcf_df['molecular_trait_id'].str.contains(".contained.")].copy().reset_index(drop=True) + + vcf_df['chromosome'] = 'chr' + vcf_df['chromosome'].astype(str) + vcf_df['start'] = vcf_df['position'].astype(int) + vcf_df['end'] = vcf_df['start'] + 1 + vcf_df['strand'] = "." + + vcf_df = vcf_df[['chromosome', 'start', 'end', 'ref', 'alt', 'strand', 'variant', 'pip', 'molecular_trait_id']] + vcf_df = vcf_df.rename(columns={'chromosome' : 'Chromosome', 'start' : 'Start', 'end' : 'End', 'strand' : 'Strand'}) + + print("len(vcf_df) = " + str(len(vcf_df))) + + #Load splice site annotation + splice_df = pd.read_csv(splice_file, sep='\t', names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str'], usecols=['Chromosome', 'Start', 'End', 'feature', 'feat1', 'Strand'])[['Chromosome', 'Start', 'End', 'feature', 'feat1', 'Strand']] + + #Load gene span annotation + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['Chromosome', 'havana_str', 'feature', 'Start', 'End', 'feat1', 'Strand', 'feat2', 'id_str']) + gtf_df = gtf_df.query("feature == 'gene'").copy().reset_index(drop=True) + + gtf_df['gene_id'] = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]) + + gtf_df = gtf_df[['Chromosome', 'Start', 'End', 'gene_id', 'feat1', 'Strand']].drop_duplicates(subset=['gene_id'], keep='first').copy().reset_index(drop=True) + + gtf_df['Start'] = gtf_df['Start'].astype(int) - gene_pad + gtf_df['End'] = gtf_df['End'].astype(int) + gene_pad + + #Join dataframes against gtf annotation + splice_pr = pr.PyRanges(splice_df) + gtf_pr = pr.PyRanges(gtf_df) + vcf_pr = pr.PyRanges(vcf_df) + + splice_gtf_pr = splice_pr.join(gtf_pr, strandedness='same') + vcf_gtf_pr = vcf_pr.join(gtf_pr, strandedness=False) + + splice_gtf_df 
= splice_gtf_pr.df[['Chromosome', 'Start', 'End', 'feature', 'gene_id', 'Strand']].copy().reset_index(drop=True) + vcf_gtf_df = vcf_gtf_pr.df[['Chromosome', 'Start', 'End', 'ref', 'alt', 'Strand', 'gene_id', 'variant', 'pip', 'molecular_trait_id']].copy().reset_index(drop=True) + + splice_gtf_df['Start'] -= max_distance + splice_gtf_df['End'] += max_distance + + #Join vcf against splice annotation + splice_gtf_pr = pr.PyRanges(splice_gtf_df) + vcf_gtf_pr = pr.PyRanges(vcf_gtf_df) + + vcf_splice_pr = vcf_gtf_pr.join(splice_gtf_pr, strandedness=False) + + #Force gene_id of SNP to be same as the gene_id of the splice site + vcf_splice_df = vcf_splice_pr.df.query("gene_id == gene_id_b").copy().reset_index(drop=True) + vcf_splice_df = vcf_splice_df[['Chromosome', 'Start', 'ref', 'alt', 'gene_id', 'feature', 'Strand_b', 'Start_b', 'variant', 'pip', 'molecular_trait_id']] + + #Force gene_id of SNP to be same as the gene_id of the finemapped molecular trait + vcf_splice_df['molecular_trait_gene_id'] = vcf_splice_df['molecular_trait_id'].apply(lambda x: x.split(".")[0]) + vcf_splice_df = vcf_splice_df.query("gene_id == molecular_trait_gene_id").copy().reset_index(drop=True) + + #Splice site position + vcf_splice_df['Start_b'] += max_distance + vcf_splice_df = vcf_splice_df.rename(columns={'Start' : 'Pos', 'Start_b' : 'splice_pos', 'Strand_b' : 'Strand'}) + + #Distance to splice site + vcf_splice_df['distance'] = np.abs(vcf_splice_df['Pos'] - vcf_splice_df['splice_pos']) + + #Choose unique SNPs by shortest distance to splice site (and inverse PIP for tie-breaking) + vcf_splice_df['pip_inv'] = 1. 
- vcf_splice_df['pip'] + + vcf_splice_df = vcf_splice_df.sort_values(by=['distance', 'pip_inv'], ascending=True).drop_duplicates(subset=['Chromosome', 'Pos', 'ref', 'alt'], keep='first').copy().reset_index(drop=True) + vcf_splice_df = vcf_splice_df.sort_values(['Chromosome', 'Pos', 'alt'], ascending=True).copy().reset_index(drop=True) + + vcf_df_filtered = vcf_splice_df.rename(columns={'Chromosome' : 'chrom', 'Pos' : 'pos', 'Strand' : 'strand'}) + vcf_df_filtered = vcf_df_filtered[['chrom', 'pos', 'ref', 'alt', 'gene_id', 'feature', 'strand', 'splice_pos', 'distance', 'variant', 'pip', 'molecular_trait_id']] + + print("len(vcf_df_filtered) = " + str(len(vcf_df_filtered))) + + #Store intermediate SNPs (filtered) + vcf_df_filtered.to_csv("txrev/GTEx_snps_" + tissue_name + "_splice_finemapped_filtered.bed.gz", sep='\t', index=False) + + #Reload filtered SNP file + vcf_df_filtered = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_splice_finemapped_filtered.bed.gz", sep='\t', compression='gzip') + + #Only keep SNPs with PIP > cutoff + pos_df = vcf_df_filtered.query("pip > " + str(pip_cutoff)).copy().reset_index(drop=True) + + #Store final table of positive SNPs + pos_df.to_csv("txrev/GTEx_snps_" + tissue_name + "_splice_positives.bed.gz", sep='\t', index=False) + + print("len(pos_df) = " + str(len(pos_df))) + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/qtl/sqtl_vcfs.py b/data/qtl/sqtl_vcfs.py new file mode 100755 index 0000000..d275a76 --- /dev/null +++ b/data/qtl/sqtl_vcfs.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python +from optparse import OptionParser +import os +import pdb +import time + +import numpy as np +import pandas as pd +import pyranges as pr +from tqdm import tqdm + +''' +sqtl_vcfs.py + +Generate positive and negative sQTL sets from the QTL catalog txrevise. 
+''' + +################################################################################ +# main +################################################################################ +def main(): + usage = 'usage: %prog [options]' + parser = OptionParser(usage) + parser.add_option('--neg_pip', dest='neg_pip', + default=0.01, type='float', + help='PIP upper limit for negative examples. [Default: %default]') + parser.add_option('--pos_pip', dest='pos_pip', + default=0.9, type='float', + help='PIP lower limit for positive examples. [Default: %default]') + parser.add_option('--match_gene', dest='match_gene', + default=0, type='int', + help='Try finding negative in same gene as positive. [Default: %default]') + parser.add_option('--match_allele', dest='match_allele', + default=0, type='int', + help='Try finding negative with same ref and alt alleles. [Default: %default]') + parser.add_option('-o', dest='out_prefix', + default='qtlcat_sqtl') + (options,args) = parser.parse_args() + + tissue_name = options.out_prefix.split('txrev_')[1] + + gtf_file = '/home/drk/common/data/genomes/hg38/genes/gencode41/gencode41_basic_nort_protein.gtf' + + # read variant table + qtlcat_df_neg = pd.read_csv("ge/GTEx_snps_" + tissue_name + "_splice_negatives.bed.gz", sep='\t') + qtlcat_df_pos = pd.read_csv("txrev/GTEx_snps_" + tissue_name + "_splice_positives.bed.gz", sep='\t') + + # read TPM bin table and construct lookup dictionaries + tpm_df = pd.read_csv('ge/GTEx_ge_' + tissue_name + "_tpms.csv", sep='\t')[['gene_id', 'tpm', 'bin_index', 'bin_index_l', 'bin_index_r']] + gene_to_tpm_dict = tpm_df.set_index('gene_id').to_dict(orient='index') + + # filter on SNPs with genes in TPM bin dict + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(tpm_df['gene_id'].values.tolist())].copy().reset_index(drop=True) + + #Load gene span annotation 
(protein-coding/categorized only) + gtf_df = pd.read_csv(gtf_file, sep='\t', skiprows=5, names=['id_str']) + gtf_genes = gtf_df['id_str'].apply(lambda x: x.split("gene_id \"")[1].split("\";")[0].split(".")[0]).unique().tolist() + + # filter on SNPs with genes in GTF file + qtlcat_df_neg = qtlcat_df_neg.loc[qtlcat_df_neg['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + qtlcat_df_pos = qtlcat_df_pos.loc[qtlcat_df_pos['gene_id'].isin(gtf_genes)].copy().reset_index(drop=True) + + bin_to_genes_dict = {} + for _, row in tpm_df.iterrows() : + + if row['bin_index'] not in bin_to_genes_dict : + bin_to_genes_dict[row['bin_index']] = [] + + bin_to_genes_dict[row['bin_index']].append(row['gene_id']) + + for sample_bin in bin_to_genes_dict : + bin_to_genes_dict[sample_bin] = set(bin_to_genes_dict[sample_bin]) + + # split molecular trait id and filter for 'contained' splice events (for positives) + qtlcat_df_pos['gene'] = [mti.split('.')[0] for mti in qtlcat_df_pos.molecular_trait_id] + qtlcat_df_pos['event'] = [mti.split('.')[2] for mti in qtlcat_df_pos.molecular_trait_id] + + qtlcat_df_pos = qtlcat_df_pos[qtlcat_df_pos.event == 'contained'] + qtlcat_df_pos = qtlcat_df_pos.rename(columns={'distance' : 'splice_dist'}) + + qtlcat_df_neg['molecular_trait_id'] = qtlcat_df_neg['gene_id'] + "."
+ "grp_0.contained.negative" + qtlcat_df_neg['gene'] = qtlcat_df_neg['gene_id'] + qtlcat_df_neg['event'] = 'contained' + qtlcat_df_neg = qtlcat_df_neg.rename(columns={'distance' : 'splice_dist'}) + + sqtl_df = pd.concat([qtlcat_df_neg, qtlcat_df_pos]).copy().reset_index(drop=True) + + # determine positive variants + sqtl_pos_df = sqtl_df[sqtl_df.pip >= options.pos_pip] + sqtl_neg_df = sqtl_df[sqtl_df.pip < options.neg_pip] + pos_variants = set(sqtl_pos_df.variant) + + neg_gene_and_allele_variants = 0 + neg_gene_variants = 0 + + neg_expr_and_allele_variants = 0 + neg_expr_variants = 0 + + unmatched_variants = 0 + + # choose negative variants + neg_variants = set() + neg_dict = {} + for pvariant in tqdm(pos_variants): + sqtl_this_df = sqtl_pos_df[sqtl_pos_df.variant == pvariant] + + neg_found = False + + # optionally prefer negative from positive's gene set + if options.match_gene == 1 and options.match_allele == 1 : + pgenes = set(sqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, True) + + if neg_found : + neg_gene_and_allele_variants += 1 + + if not neg_found and options.match_gene == 1 : + pgenes = set(sqtl_this_df.gene) + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, False) + + if neg_found : + neg_gene_variants += 1 + + if not neg_found and options.match_allele == 1 : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, True) + + if not neg_found and gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, True) + + if not neg_found and 
gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, True) + + if neg_found : + neg_expr_and_allele_variants += 1 + + if not neg_found : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_l'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_l']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, False) + + if not neg_found and gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index'] != gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_r'] : + pgenes = bin_to_genes_dict[gene_to_tpm_dict[sqtl_this_df.iloc[0].gene]['bin_index_r']] + neg_found = find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, False) + + if neg_found : + neg_expr_variants += 1 + + if not neg_found : + print("[Warning] Could not find a matching negative for '" + pvariant + "'") + unmatched_variants += 1 + + print('%d positive variants' % len(pos_variants)) + print('%d negative variants' % len(neg_variants)) + print(' - %d gene-matched negatives with same alleles' % neg_gene_and_allele_variants) + print(' - %d gene-matched negatives ' % neg_gene_variants) + print(' - %d expr-matched negatives with same alleles' % neg_expr_and_allele_variants) + print(' - %d expr-matched negatives ' % neg_expr_variants) + print(' - %d unmatched negatives ' % unmatched_variants) + + pos_dict = {pv: pv for pv in pos_variants} + + # write VCFs + 
write_vcf('%s_pos.vcf' % options.out_prefix, sqtl_df, pos_variants, pos_dict) + write_vcf('%s_neg.vcf' % options.out_prefix, sqtl_df, neg_variants, neg_dict) + +def find_negative(neg_variants, neg_dict, pos_variants, sqtl_this_df, sqtl_neg_df, pgenes, match_allele) : + + gene_mask = np.array([gene in pgenes for gene in sqtl_neg_df.gene]) + sqtl_neg_gene_df = sqtl_neg_df[gene_mask] + + # match splice site distance + this_dist = sqtl_this_df.iloc[0].splice_dist + dist_cmp = np.abs(sqtl_neg_gene_df.splice_dist - this_dist) + dist_cmp_unique = np.sort(np.unique(dist_cmp.values)) + + this_ref = sqtl_this_df.iloc[0].ref + this_alt = sqtl_this_df.iloc[0].alt + + for ni_unique in dist_cmp_unique: + + sqtl_neg_gene_dist_df = sqtl_neg_gene_df.loc[dist_cmp == ni_unique] + + shuffle_index = np.arange(len(sqtl_neg_gene_dist_df), dtype='int32') + np.random.shuffle(shuffle_index) + + for nsqtl_i in range(len(sqtl_neg_gene_dist_df)) : + nsqtl = sqtl_neg_gene_dist_df.iloc[shuffle_index[nsqtl_i]] + + if not match_allele or (nsqtl.ref == this_ref and nsqtl.alt == this_alt): + if nsqtl.variant not in neg_variants and nsqtl.variant not in pos_variants: + + neg_variants.add(nsqtl.variant) + neg_dict[nsqtl.variant] = sqtl_this_df.iloc[0].variant + + return True + + return False + +def write_vcf(vcf_file, df, variants_write, variants_dict): + vcf_open = open(vcf_file, 'w') + print('##fileformat=VCFv4.2', file=vcf_open) + print('##INFO=', + file=vcf_open) + print('##INFO=', + file=vcf_open) + print('##INFO=', + file=vcf_open) + cols = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO'] + print('\t'.join(cols), file=vcf_open) + + variants_written = set() + + for v in df.itertuples(): + if v.variant in variants_write and v.variant not in variants_written: + cols = [v.chrom, str(v.pos), v.variant, v.ref, v.alt, '.', '.'] + cols += ['MT=%s;SD=%d;PI=%s' % (v.molecular_trait_id, v.splice_dist, variants_dict[v.variant])] + print('\t'.join(cols), file=vcf_open) +
variants_written.add(v.variant) + + vcf_open.close() + + +################################################################################ +# __main__ +################################################################################ +if __name__ == '__main__': + main() diff --git a/data/training/Makefile b/data/training/Makefile new file mode 100755 index 0000000..e68c1bc --- /dev/null +++ b/data/training/Makefile @@ -0,0 +1,46 @@ +FASTA_HUMAN=$$BORZOI_HG38/assembly/ucsc/hg38.ml.fa +GAPS_HUMAN=$$BORZOI_HG38/assembly/ucsc/hg38_gaps.bed +UMAP_HUMAN=$$BORZOI_HG38/mappability/umap_k36_t10_l32.bed +BLACK_HUMAN=$$BORZOI_HG38/blacklist/blacklist_hg38_all.bed + +FASTA_MOUSE=$$BORZOI_MM10/assembly/ucsc/mm10.ml.fa +GAPS_MOUSE=$$BORZOI_MM10/assembly/ucsc/mm10_gaps.bed +UMAP_MOUSE=$$BORZOI_MM10/mappability/umap_k36_t10_l32.bed +BLACK_MOUSE=$$BORZOI_MM10/blacklist/blacklist_mm10_all.bed + +ALIGN=$$BORZOI_HG38/align/hg38.mm10.syn.net.gz + +OUT=data + +# LENGTH=393216 +# TSTRIDE=43691 # (393216-2*131072)/3 +# CROP=98304 + +LENGTH=524288 +TSTRIDE=49173 # (524288-2*163840)/4 + 21 +CROP=163840 +WIDTH=32 +FOLDS=8 + +AOPTS=--break 2097152 -c $(CROP) --nf 524288 --no 393216 -l $(LENGTH) --stride $(TSTRIDE) -f $(FOLDS) --umap_t 0.5 -w $(WIDTH) +DOPTS=-c $(CROP) -d 2 -f $(FOLDS) -l $(LENGTH) -p 64 -r 16 --umap_clip 0.5 -w $(WIDTH) --transform_old + +all: $(OUT)/hg38/tfrecords/train-0.tfr $(OUT)/mm10/tfrecords/train-0.tfr + +umap_human.bed: + cat $(UMAP_HUMAN) $(BLACK_HUMAN) | awk 'BEGIN {OFS="\t"} {print $$1, $$2, $$3}' | bedtools sort -i - | bedtools merge -i - > umap_human.bed + +umap_mouse.bed: + cat $(UMAP_MOUSE) $(BLACK_MOUSE) | awk 'BEGIN {OFS="\t"} {print $$1, $$2, $$3}' | bedtools sort -i - | bedtools merge -i - > umap_mouse.bed + +targets_human.txt targets_mouse.txt: + ./make_targets.py + +$(OUT)/hg38/sequences.bed $(OUT)/mm10/sequences.bed: umap_human.bed umap_mouse.bed + hound_data_align.py -a hg38,mm10 -g $(GAPS_HUMAN),$(GAPS_MOUSE) -u umap_human.bed,umap_mouse.bed $(AOPTS) 
-o $(OUT) $(ALIGN) $(FASTA_HUMAN),$(FASTA_MOUSE) + +$(OUT)/hg38/tfrecords/train-0.tfr: $(OUT)/hg38/sequences.bed targets_human.txt + hound_data.py --restart $(DOPTS) -b $(BLACK_HUMAN) -o $(OUT)/hg38 $(FASTA_HUMAN) -u umap_human.bed targets_human.txt + +$(OUT)/mm10/tfrecords/train-0.tfr: $(OUT)/mm10/sequences.bed targets_mouse.txt + hound_data.py --restart $(DOPTS) -b $(BLACK_MOUSE) -o $(OUT)/mm10 $(FASTA_MOUSE) -u umap_mouse.bed targets_mouse.txt diff --git a/data/training/README.md b/data/training/README.md new file mode 100755 index 0000000..461975f --- /dev/null +++ b/data/training/README.md @@ -0,0 +1,10 @@ +## Data Processing + +Processing of ENCODE, GTEx, FANTOM5, and CATlas training data is done through a Makefile. It requires a number of auxiliary files (e.g. genome alignments), which can be downloaded manually from [this](https://storage.googleapis.com/seqnn-share/helper/dependencies/) data bucket (GCP), or by running the script 'download_dependencies.sh'.
+ +The Makefile relies on the script 'hound_data.py' from the [baskerville repository](https://github.com/calico/baskerville/blob/main/src/baskerville/scripts/hound_data.py), which in turn calls the scripts 'hound_data_read.py' and 'hound_data_write.py' from the same repo, in order to (1) read coverage data (from bigwig-like .w5 files) along with a matched segment from a fasta genome file, and (2) write the (one-hot coded) sequence and coverage values into compressed TF records.
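As a rough illustration of the one-hot coding step mentioned above, the following is a minimal sketch and not the baskerville implementation. The function name `one_hot_dna`, the A/C/G/T column order, and the all-zero rows for ambiguous bases such as N are assumptions made for the example.

```python
import numpy as np


def one_hot_dna(seq):
    """Return a (len(seq), 4) float array with assumed column order A, C, G, T.

    Symbols outside A/C/G/T (e.g. N) are left as all-zero rows; this handling
    is an assumption for illustration, not necessarily what hound_data.py does.
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = index.get(base)
        if j is not None:
            encoded[i, j] = 1.0
    return encoded


if __name__ == "__main__":
    # Each known base gets a single 1.0 in its row; the N row stays zero.
    print(one_hot_dna("ACGTN"))
```

In the real pipeline the encoded sequence is paired with coverage values and serialized to TF records; that part is omitted here.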
+ +*Notes*: +- The attached Makefile shows the exact commands used to call hound_data.py and other related scripts to create the specific training data for the published model. +- The script(s) take as input a fasta genome file, optional blacklist+unmappable region files, as well as a .txt file where each row points to a .w5 coverage file location (see for example [this file](https://raw.githubusercontent.com/calico/borzoi/main/examples/targets_human.txt)). +- The .w5 coverage files were converted from bigwig format using [this script](https://github.com/calico/borzoi/tree/main/src/scripts/bw_h5.py).
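The `TSTRIDE` values in the Makefile carry their arithmetic as comments, `(524288-2*163840)/4 + 21` for the active settings and `(393216-2*131072)/3` for the commented-out ones. The sketch below only reproduces that arithmetic; the helper name `train_stride` and the description of `length - 2*crop` as the uncropped center are assumptions for illustration.

```python
def train_stride(length, crop, divisor, offset=0):
    """Reproduce the Makefile comment: (length - 2*crop) / divisor + offset.

    The quotient is rounded to the nearest integer, which matches the
    commented-out setting where the division is not exact.
    """
    center = length - 2 * crop  # assumed: sequence length left after cropping both sides
    return round(center / divisor) + offset


if __name__ == "__main__":
    # Active settings: LENGTH=524288, CROP=163840
    print(train_stride(524288, 163840, 4, offset=21))
    # Commented-out settings: LENGTH=393216, CROP=131072
    print(train_stride(393216, 131072, 3))
```

Both results agree with the `TSTRIDE` values written in the Makefile (49173 and 43691).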
diff --git a/data/training/download_dependencies.sh b/data/training/download_dependencies.sh new file mode 100755 index 0000000..57424a1 --- /dev/null +++ b/data/training/download_dependencies.sh @@ -0,0 +1,97 @@ +#!/bin/bash + +# create additional folders in borzoi data folders +mkdir -p "$BORZOI_HG38/assembly/ucsc" +mkdir -p "$BORZOI_HG38/assembly/gnomad" +mkdir -p "$BORZOI_HG38/mappability" +mkdir -p "$BORZOI_HG38/blacklist" +mkdir -p "$BORZOI_HG38/align" + +mkdir -p "$BORZOI_MM10/assembly/ucsc" +mkdir -p "$BORZOI_MM10/mappability" +mkdir -p "$BORZOI_MM10/blacklist" + + +# download and uncompress auxiliary files required for Makefile (hg38) +if [ -f "$BORZOI_HG38/assembly/ucsc/hg38_gaps.bed" ]; then + echo "hg38_gaps.bed already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/hg38_gaps.bed.gz | gunzip -c > "$BORZOI_HG38/assembly/ucsc/hg38_gaps.bed" +fi + +if [ -f "$BORZOI_HG38/mappability/umap_k36_t10_l32.bed" ]; then + echo "umap_k36_t10_l32.bed (hg38) already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/umap_k36_t10_l32_hg38.bed.gz | gunzip -c > "$BORZOI_HG38/mappability/umap_k36_t10_l32.bed" +fi + +if [ -f "$BORZOI_HG38/blacklist/blacklist_hg38_all.bed" ]; then + echo "blacklist_hg38_all.bed already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/blacklist_hg38_all.bed.gz | gunzip -c > "$BORZOI_HG38/blacklist/blacklist_hg38_all.bed" +fi + +if [ -f "$BORZOI_HG38/align/hg38.mm10.syn.net.gz" ]; then + echo "hg38.mm10.syn.net.gz already exists." +else + wget https://storage.googleapis.com/seqnn-share/helper/dependencies/hg38.mm10.syn.net.gz -O "$BORZOI_HG38/align/hg38.mm10.syn.net.gz" +fi + + +# download and uncompress auxiliary files required for Makefile (mm10) +if [ -f "$BORZOI_MM10/assembly/ucsc/mm10_gaps.bed" ]; then + echo "mm10_gaps.bed already exists."
+else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/mm10_gaps.bed.gz | gunzip -c > "$BORZOI_MM10/assembly/ucsc/mm10_gaps.bed" +fi + +if [ -f "$BORZOI_MM10/mappability/umap_k36_t10_l32.bed" ]; then + echo "umap_k36_t10_l32.bed (mm10) already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/umap_k36_t10_l32_mm10.bed.gz | gunzip -c > "$BORZOI_MM10/mappability/umap_k36_t10_l32.bed" +fi + +if [ -f "$BORZOI_MM10/blacklist/blacklist_mm10_all.bed" ]; then + echo "blacklist_mm10_all.bed already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/blacklist_mm10_all.bed.gz | gunzip -c > "$BORZOI_MM10/blacklist/blacklist_mm10_all.bed" +fi + + +# download and uncompress pre-compiled umap bed files +if [ -f "$BORZOI_DIR/examples/umap_human.bed" ]; then + echo "umap_human.bed already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/umap_human.bed.gz | gunzip -c > "$BORZOI_DIR/examples/umap_human.bed" +fi + +if [ -f "$BORZOI_DIR/examples/umap_mouse.bed" ]; then + echo "umap_mouse.bed already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/umap_mouse.bed.gz | gunzip -c > "$BORZOI_DIR/examples/umap_mouse.bed" +fi + + +# download and index hg38 ml genome +if [ -f "$BORZOI_HG38/assembly/ucsc/hg38.ml.fa" ]; then + echo "hg38.ml.fa already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/hg38.ml.fa.gz | gunzip -c > "$BORZOI_HG38/assembly/ucsc/hg38.ml.fa" + idx_genome.py "$BORZOI_HG38/assembly/ucsc/hg38.ml.fa" +fi + +# download and index hg38 ml genome (gnomad major alleles) +if [ -f "$BORZOI_HG38/assembly/gnomad/hg38.ml.fa" ]; then + echo "hg38.ml.fa (gnomad) already exists." 
+else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/hg38_gnomad.ml.fa.gz | gunzip -c > "$BORZOI_HG38/assembly/gnomad/hg38.ml.fa" + idx_genome.py "$BORZOI_HG38/assembly/gnomad/hg38.ml.fa" +fi + +# download and index mm10 ml genome +if [ -f "$BORZOI_MM10/assembly/ucsc/mm10.ml.fa" ]; then + echo "mm10.ml.fa already exists." +else + wget -O - https://storage.googleapis.com/seqnn-share/helper/dependencies/mm10.ml.fa.gz | gunzip -c > "$BORZOI_MM10/assembly/ucsc/mm10.ml.fa" + idx_genome.py "$BORZOI_MM10/assembly/ucsc/mm10.ml.fa" +fi diff --git a/model/README.md b/model/README.md new file mode 100644 index 0000000..b40c0f7 --- /dev/null +++ b/model/README.md @@ -0,0 +1,8 @@ +## Model Training + +The script 'train.sh' contains the command used to train the published Borzoi model ensemble. + +*Notes*: +- Model training is done through the script 'hound_train.py' from the [baskerville repository](https://github.com/calico/baskerville/blob/main/src/baskerville/scripts/hound_train.py). +- Multi-fold training is done through the script 'westminster_train_folds.py' from the [westminster repository](https://github.com/calico/westminster/blob/main/src/westminster/scripts/westminster_train_folds.py). +- Training parameters are specified in a .json file that is supplied to the script. The published model's .json can be found [here](https://storage.googleapis.com/seqnn-share/borzoi/params.json).
diff --git a/model/train.sh b/model/train.sh new file mode 100644 index 0000000..b5d6337 --- /dev/null +++ b/model/train.sh @@ -0,0 +1,3 @@ +#!/bin/sh + +westminster_train_folds.py -e borzoi_py310 --f_list 3 -c 4 --identical_crosses -q rtx4090 -o saved_models params.json data/hg38 data/mm10