
Commit

updates
davek44 committed Sep 10, 2017
1 parent a76e72d commit 18cbc0f
Showing 4 changed files with 850 additions and 411 deletions.
17 changes: 8 additions & 9 deletions tutorials/genes.ipynb
@@ -10,7 +10,9 @@
" * Trained model\n",
" * Gene Transfer Format (GTF) gene annotations\n",
" * BigWig coverage tracks\n",
" * Gene sequences saved in my HDF5 format."
" * Gene sequences saved in my HDF5 format.\n",
" \n",
"First, make sure you have an hg19 FASTA file visible. If you have it already, put a symbolic link into the data directory. Otherwise, I have a machine learning friendly simplified version you can download in the next cell."
]
},
{
@@ -62,7 +64,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"collapsed": true
},
@@ -122,13 +124,10 @@
},
"outputs": [],
"source": [
"%% bash\n",
"if [ ! -e models/gm12878_best.tf.index ]\n",
"then\n",
" curl -o models/gm12878_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.index\n",
" curl -o models/gm12878_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.meta\n",
" curl -o models/gm12878_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.data-00000-of-00001 \n",
"fi"
"if not os.path.isfile('models/gm12878_d10.tf.meta'):\n",
" subprocess.call('curl -o models/gm12878_d10.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.index', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.meta', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.data-00000-of-00001', shell=True)"
]
},
{
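The change in this file swaps a %%bash/curl cell for a Python version that downloads the model weights only if they are missing. That pattern can be sketched as a standalone helper; the URL and path below are placeholders, not the tutorial's real files:

```python
import os
import subprocess

def download_if_missing(url, dest):
    """Fetch url to dest with curl, but only if dest does not already exist."""
    if not os.path.isfile(dest):
        subprocess.call('curl -o %s %s' % (dest, url), shell=True)
    return dest

# Example: skips the download when the weights file is already cached locally.
download_if_missing('https://example.invalid/weights.index', 'models/weights.index')
```

Guarding on the `.meta` file, as the notebook does, assumes all three TensorFlow checkpoint files arrive together; checking each file individually would be more robust if a download is interrupted.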
188 changes: 188 additions & 0 deletions tutorials/sad.ipynb
@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.\n",
"\n",
"If you want scores measuring the influence of those variants on all datasets,\n",
" * [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region.\n",
" * [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSS's.\n",
"\n",
"Here, I'll demonstrate those two programs. You'll need\n",
" * Trained model\n",
" * Input file (FASTA or HDF5 with test_in/test_out)"
]
},
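A toy numeric illustration of the SAD idea described above, the change in total predicted coverage between alleles. The `predict_coverage` scorer here is a hypothetical stand-in for a trained model, not Basenji code:

```python
import numpy as np

def predict_coverage(seq):
    # Stand-in for a trained model: a fixed score per base.
    score = {'A': 0.1, 'C': 0.2, 'G': 0.4, 'T': 0.3}
    return np.array([score[b] for b in seq])

def sad_score(ref_seq, alt_seq):
    """SNP activity difference: predicted coverage over the region, alt minus ref."""
    return float(predict_coverage(alt_seq).sum() - predict_coverage(ref_seq).sum())

# A T>A substitution at the final position lowers the toy prediction.
print(sad_score('ACGTACGT', 'ACGTACGA'))
```

SED follows the same alt-minus-ref logic, but sums predictions only over bins assigned to a gene's TSS rather than the whole region.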
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, you can either train your own model in the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or download one that I pre-trained."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"if not os.path.isfile('models/gm12878_d10.tf.meta'):\n",
" subprocess.call('curl -o models/gm12878_d10.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.index', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.meta', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.data-00000-of-00001', shell=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll bash the PIM1 promoter to see what motifs drive its expression. I placed a 262 kb FASTA file surrounding the PIM1 TSS in data/pim1.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).\n",
"\n",
"The most relevant options are:\n",
"\n",
"| Option/Argument | Value | Note |\n",
"|:---|:---|:---|\n",
"| -g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |\n",
"| -f | 20 | Figure width, that I usually scale to 10x the saturation mutageneis region |\n",
"| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |\n",
"| -o | pim1_sat | Outplot plot directory. |\n",
"| -t | 0,38 | Target indexes. 0 is a DNase and 38 is CAGE, as you can see in data/gm12878_wigs.txt. |\n",
"| params_file | models/params_small_sat.txt | Table of parameters to setup the model architecture and optimization parameters. |\n",
"| model_file | models/gm12878_d10.tf | Trained saved model prefix. |\n",
"| input_file | data/pim1.fa | Either FASTA or HDF5 with test_in/test_out keys. |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'target_pool': 128, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'cnn_filters': [128, 128, 160, 200, 250, 256], 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'adam_beta2': 0.98, 'num_targets': 39, 'loss': 'poisson', 'batch_size': 1, 'learning_rate': 0.002, 'cnn_dropout': 0.05, 'adam_beta1': 0.97, 'dense': 1, 'cnn_pool': [1, 2, 4, 4, 4, 1], 'full_dropout': 0.05, 'link': 'softplus', 'full_units': 384, 'batch_buffer': 16384, 'batch_renorm': 1, 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'dcnn_dropout': 0.1}\n",
"Targets pooled by 128 to length 2048\n",
"Convolution w/ 128 4x22 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 128 128x1 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 2\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 160 128x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 200 160x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 250 200x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 256 250x3 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Dilated convolution w/ 32 256x3 rate 2 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 288x3 rate 4 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 320x3 rate 8 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 352x3 rate 16 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 384x3 rate 32 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 416x3 rate 64 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Linear transformation 448x384\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Linear transform 384x39x1\n",
"Model building time 8.534228\n",
"2017-08-25 15:35:29.305208: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305239: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305244: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.\n",
"Mutating sequence 1 / 1\n"
]
}
],
"source": [
"! basenji_sat.py -f 20 -l 200 -o pim1_sat -t 0,38 models/params_small_sat.txt models/gm12878_d10.tf data/pim1.fa"
]
},
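The architecture printout above stacks six width-3 dilated convolutions at rates 2, 4, 8, 16, 32, and 64; each layer adds (kernel - 1) * rate positions of context. A quick check of that arithmetic (my own sketch, not notebook code):

```python
def dilated_receptive_field(kernel, rates):
    """Receptive field, in pooled positions, of a stack of dilated convolutions."""
    rf = 1
    for rate in rates:
        rf += (kernel - 1) * rate
    return rf

rf = dilated_receptive_field(3, [2, 4, 8, 16, 32, 64])
print(rf)  # 253 pooled positions; at 128-bp pooling, roughly 32 kb of sequence context
```

This ignores the earlier strided convolution and pooling layers, which widen the receptive field further; the point is that doubling the dilation rate at each layer grows context exponentially with depth.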
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saturated mutagenesis heatmaps go into pim1_sat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"IFrame('pim1_sat/???.pdf', width=1200, height=400)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Describe the output..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}