{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know which datasets are most relevant, [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you perform a saturation mutagenesis of each variant and its surrounding region to identify the relevant nearby motifs.\n",
    "\n",
    "If you want scores measuring the influence of those variants on all datasets,\n",
    " * [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region (see the toy sketch below).\n",
    " * [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSSs.\n",
    "\n",
    "Here, I'll demonstrate saturation mutagenesis of a promoter region. You'll need\n",
    " * a trained model\n",
    " * an input file (FASTA or HDF5 with test_in/test_out keys)"
   ]
  },
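  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before diving in, here is a toy sketch of what the SAD score measures: the model's prediction for the alternate allele minus its prediction for the reference allele, summed across the region. `one_hot` and `toy_predict` below are illustrative stand-ins invented for this sketch--`toy_predict` is emphatically not the Basenji API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def one_hot(seq):\n",
    "    # Encode an ACGT string as a (length x 4) one-hot matrix.\n",
    "    idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}\n",
    "    x = np.zeros((len(seq), 4))\n",
    "    for i, nt in enumerate(seq):\n",
    "        x[i, idx[nt]] = 1\n",
    "    return x\n",
    "\n",
    "def toy_predict(x):\n",
    "    # Stand-in for a trained model: maps a one-hot sequence to a\n",
    "    # per-position coverage prediction. NOT part of Basenji.\n",
    "    w = np.array([0.1, 0.5, 0.3, 0.9])\n",
    "    return x.dot(w)\n",
    "\n",
    "ref_seq = 'ACGTACGTAC'\n",
    "alt_seq = 'ACGTACGAAC'  # single-nucleotide variant at position 7 (T->A)\n",
    "\n",
    "# SAD: predicted alt-allele coverage minus predicted ref-allele coverage.\n",
    "sad = (toy_predict(one_hot(alt_seq)) - toy_predict(one_hot(ref_seq))).sum()\n",
    "print('toy SAD score: %.3f' % sad)"
   ]
  },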
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, you can either train your own model in the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or download one that I pre-trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "\n",
    "os.makedirs('models', exist_ok=True)\n",
    "if not os.path.isfile('models/gm12878_d10.tf.meta'):\n",
    "    subprocess.call('curl -o models/gm12878_d10.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.index', shell=True)\n",
    "    subprocess.call('curl -o models/gm12878_d10.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.meta', shell=True)\n",
    "    subprocess.call('curl -o models/gm12878_d10.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.data-00000-of-00001', shell=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll bash the PIM1 promoter to see which motifs drive its expression. I placed a 262 kb FASTA file surrounding the PIM1 TSS in data/pim1.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).\n",
    "\n",
    "The most relevant options are:\n",
    "\n",
    "| Option/Argument | Value | Note |\n",
    "|:---|:---|:---|\n",
    "| -f | 20 | Figure width, which I usually set to a tenth of the saturation mutagenesis region length. |\n",
    "| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s). |\n",
    "| -o | pim1_sat | Output plot directory. |\n",
    "| -t | 0,38 | Target indexes. 0 is a DNase and 38 is CAGE, as you can see in data/gm12878_wigs.txt. |\n",
    "| params_file | models/params_small_sat.txt | Table of parameters to set up the model architecture and optimization. |\n",
    "| model_file | models/gm12878_d10.tf | Trained saved model prefix. |\n",
    "| input_file | data/pim1.fa | Either FASTA or HDF5 with test_in/test_out keys. |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'target_pool': 128, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'cnn_filters': [128, 128, 160, 200, 250, 256], 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'adam_beta2': 0.98, 'num_targets': 39, 'loss': 'poisson', 'batch_size': 1, 'learning_rate': 0.002, 'cnn_dropout': 0.05, 'adam_beta1': 0.97, 'dense': 1, 'cnn_pool': [1, 2, 4, 4, 4, 1], 'full_dropout': 0.05, 'link': 'softplus', 'full_units': 384, 'batch_buffer': 16384, 'batch_renorm': 1, 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'dcnn_dropout': 0.1}\n",
      "Targets pooled by 128 to length 2048\n",
      "Convolution w/ 128 4x22 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.050\n",
      "Convolution w/ 128 128x1 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Max pool 2\n",
      "Dropout w/ probability 0.050\n",
      "Convolution w/ 160 128x6 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Max pool 4\n",
      "Dropout w/ probability 0.050\n",
      "Convolution w/ 200 160x6 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Max pool 4\n",
      "Dropout w/ probability 0.050\n",
      "Convolution w/ 250 200x6 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Max pool 4\n",
      "Dropout w/ probability 0.050\n",
      "Convolution w/ 256 250x3 filters strided by 1\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.050\n",
      "Dilated convolution w/ 32 256x3 rate 2 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Dilated convolution w/ 32 288x3 rate 4 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Dilated convolution w/ 32 320x3 rate 8 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Dilated convolution w/ 32 352x3 rate 16 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Dilated convolution w/ 32 384x3 rate 32 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Dilated convolution w/ 32 416x3 rate 64 filters\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.100\n",
      "Linear transformation 448x384\n",
      "Batch normalization\n",
      "ReLU\n",
      "Dropout w/ probability 0.050\n",
      "Linear transform 384x39x1\n",
      "Model building time 8.534228\n",
      "2017-08-25 15:35:29.305208: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.\n",
      "2017-08-25 15:35:29.305239: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.\n",
      "2017-08-25 15:35:29.305244: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.\n",
      "2017-08-25 15:35:29.305248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.\n",
      "Mutating sequence 1 / 1\n"
     ]
    }
   ],
   "source": [
    "! basenji_sat.py -f 20 -l 200 -o pim1_sat -t 0,38 models/params_small_sat.txt models/gm12878_d10.tf data/pim1.fa"
   ]
  },
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The saturated mutagenesis heatmaps go into pim1_sat" | ||
] | ||
}, | ||
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import IFrame\n",
    "# Replace ??? with the name of a PDF that basenji_sat.py wrote into pim1_sat.\n",
    "IFrame('pim1_sat/???.pdf', width=1200, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In each heatmap, columns correspond to positions in the mutagenized region and rows to the four nucleotides; each entry shows the predicted change in the target signal if that substitution were made. Columns with large-magnitude changes mark bases the model considers influential, and runs of such columns suggest regulatory motifs."
   ]
  },
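  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One simple numeric summary of such a heatmap is to score each position by its largest-magnitude predicted change across the possible substitutions; peaks in that profile mark candidate motif positions. Continuing with the toy `delta` matrix from the sketch above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Per-position importance: the largest absolute predicted change across\n",
    "# substitutions (reference bases contribute 0).\n",
    "position_scores = np.abs(delta).max(axis=1)\n",
    "for i, score in enumerate(position_scores):\n",
    "    print('position %2d: max |delta| = %.3f' % (i + 5, score))"
   ]
  }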
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}