
Commit

updates
davek44 committed Sep 10, 2017
1 parent a76e72d commit 18cbc0f
Showing 4 changed files with 850 additions and 411 deletions.
17 changes: 8 additions & 9 deletions tutorials/genes.ipynb
@@ -10,7 +10,9 @@
" * Trained model\n",
" * Gene Transfer Format (GTF) gene annotations\n",
" * BigWig coverage tracks\n",
" * Gene sequences saved in my HDF5 format."
" * Gene sequences saved in my HDF5 format.\n",
" \n",
"First, make sure you have an hg19 FASTA file visible. If you have it already, put a symbolic link into the data directory. Otherwise, I have a machine learning friendly simplified version you can download in the next cell."
]
},
{
@@ -62,7 +64,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {
"collapsed": true
},
@@ -122,13 +124,10 @@
},
"outputs": [],
"source": [
"%% bash\n",
"if [ ! -e models/gm12878_best.tf.index ]\n",
"then\n",
" curl -o models/gm12878_best.tf.index https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.index\n",
" curl -o models/gm12878_best.tf.meta https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.meta\n",
" curl -o models/gm12878_best.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/gm12878_best.tf.data-00000-of-00001 \n",
"fi"
"if not os.path.isfile('models/gm12878_d10.tf.meta'):\n",
" subprocess.call('curl -o models/gm12878_d10.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.index', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.meta', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.data-00000-of-00001', shell=True)"
]
},
{
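The change in this file swaps a %%bash/curl cell for a Python version that downloads the model weights only if they are missing. That pattern can be sketched as a standalone helper; the URL and path below are placeholders, not the tutorial's real files:

```python
import os
import subprocess

def download_if_missing(url, dest):
    """Fetch url to dest with curl, but only if dest does not already exist."""
    if not os.path.isfile(dest):
        subprocess.call('curl -o %s %s' % (dest, url), shell=True)
    return dest

# Example: skips the download when the weights file is already cached locally.
download_if_missing('https://example.invalid/weights.index', 'models/weights.index')
```

Guarding on the `.meta` file, as the notebook does, assumes all three TensorFlow checkpoint files arrive together; checking each file individually would be more robust if a download is interrupted.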
188 changes: 188 additions & 0 deletions tutorials/sad.ipynb
@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyzing noncoding variation associated with disease is a major application of Basenji. I now offer several tools to enable that analysis. If you have a small set of variants and know what datasets are most relevant, [basenji_sat_vcf.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat_vcf.py) lets you perform a saturation mutagenesis of the variant and surrounding region to see the relevant nearby motifs.\n",
"\n",
"If you want scores measuring the influence of those variants on all datasets,\n",
" * [basenji_sad.py](https://github.com/calico/basenji/blob/master/bin/basenji_sad.py) computes my SNP activity difference (SAD) score--the predicted change in aligned fragments to the region.\n",
" * [basenji_sed.py](https://github.com/calico/basenji/blob/master/bin/basenji_sed.py) computes my SNP expression difference (SED) score--the predicted change in aligned fragments to gene TSS's.\n",
"\n",
"Here, I'll demonstrate those two programs. You'll need\n",
" * Trained model\n",
" * Input file (FASTA or HDF5 with test_in/test_out)"
]
},
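A toy numeric illustration of the SAD idea described above, the change in total predicted coverage between alleles. The `predict_coverage` scorer here is a hypothetical stand-in for a trained model, not Basenji code:

```python
import numpy as np

def predict_coverage(seq):
    # Stand-in for a trained model: a fixed score per base.
    score = {'A': 0.1, 'C': 0.2, 'G': 0.4, 'T': 0.3}
    return np.array([score[b] for b in seq])

def sad_score(ref_seq, alt_seq):
    """SNP activity difference: predicted coverage over the region, alt minus ref."""
    return float(predict_coverage(alt_seq).sum() - predict_coverage(ref_seq).sum())

# A T>A substitution at the final position lowers the toy prediction.
print(sad_score('ACGTACGT', 'ACGTACGA'))
```

SED follows the same alt-minus-ref logic, but sums predictions only over bins assigned to a gene's TSS rather than the whole region.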
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, you can either train your own model in the [Train/test tutorial](https://github.com/calico/basenji/blob/master/tutorials/train_test.ipynb) or download one that I pre-trained."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"if not os.path.isfile('models/gm12878_d10.tf.meta'):\n",
" subprocess.call('curl -o models/gm12878_d10.tf.index https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.index', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.meta https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.meta', shell=True)\n",
" subprocess.call('curl -o models/gm12878_d10.tf.data-00000-of-00001 https://storage.googleapis.com/basenji_tutorial_data/model_gm12878_d10.tf.data-00000-of-00001', shell=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll bash the PIM1 promoter to see what motifs drive its expression. I placed a 262 kb FASTA file surrounding the PIM1 TSS in data/pim1.fa, so we'll use [basenji_sat.py](https://github.com/calico/basenji/blob/master/bin/basenji_sat.py).\n",
"\n",
"The most relevant options are:\n",
"\n",
"| Option/Argument | Value | Note |\n",
"|:---|:---|:---|\n",
"| -g | data/human.hg19.genome | Genome assembly chromosome length to bound gene sequences. |\n",
"| -f | 20 | Figure width, that I usually scale to 10x the saturation mutageneis region |\n",
"| -l | 200 | Saturation mutagenesis region in the center of the given sequence(s) |\n",
"| -o | pim1_sat | Outplot plot directory. |\n",
"| -t | 0,38 | Target indexes. 0 is a DNase and 38 is CAGE, as you can see in data/gm12878_wigs.txt. |\n",
"| params_file | models/params_small_sat.txt | Table of parameters to setup the model architecture and optimization parameters. |\n",
"| model_file | models/gm12878_d10.tf | Trained saved model prefix. |\n",
"| input_file | data/pim1.fa | Either FASTA or HDF5 with test_in/test_out keys. |"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'target_pool': 128, 'cnn_filter_sizes': [22, 1, 6, 6, 6, 3], 'cnn_filters': [128, 128, 160, 200, 250, 256], 'dcnn_filters': [32, 32, 32, 32, 32, 32], 'adam_beta2': 0.98, 'num_targets': 39, 'loss': 'poisson', 'batch_size': 1, 'learning_rate': 0.002, 'cnn_dropout': 0.05, 'adam_beta1': 0.97, 'dense': 1, 'cnn_pool': [1, 2, 4, 4, 4, 1], 'full_dropout': 0.05, 'link': 'softplus', 'full_units': 384, 'batch_buffer': 16384, 'batch_renorm': 1, 'dcnn_filter_sizes': [3, 3, 3, 3, 3, 3], 'dcnn_dropout': 0.1}\n",
"Targets pooled by 128 to length 2048\n",
"Convolution w/ 128 4x22 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 128 128x1 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 2\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 160 128x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 200 160x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 250 200x6 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Max pool 4\n",
"Dropout w/ probability 0.050\n",
"Convolution w/ 256 250x3 filters strided by 1\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Dilated convolution w/ 32 256x3 rate 2 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 288x3 rate 4 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 320x3 rate 8 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 352x3 rate 16 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 384x3 rate 32 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Dilated convolution w/ 32 416x3 rate 64 filters\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.100\n",
"Linear transformation 448x384\n",
"Batch normalization\n",
"ReLU\n",
"Dropout w/ probability 0.050\n",
"Linear transform 384x39x1\n",
"Model building time 8.534228\n",
"2017-08-25 15:35:29.305208: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305239: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305244: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.\n",
"2017-08-25 15:35:29.305248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.\n",
"Mutating sequence 1 / 1\n"
]
}
],
"source": [
"! basenji_sat.py -f 20 -l 200 -o pim1_sat -t 0,38 models/params_small_sat.txt models/gm12878_d10.tf data/pim1.fa"
]
},
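The architecture printout above stacks six width-3 dilated convolutions at rates 2, 4, 8, 16, 32, and 64; each layer adds (kernel - 1) * rate positions of context. A quick check of that arithmetic (my own sketch, not notebook code):

```python
def dilated_receptive_field(kernel, rates):
    """Receptive field, in pooled positions, of a stack of dilated convolutions."""
    rf = 1
    for rate in rates:
        rf += (kernel - 1) * rate
    return rf

rf = dilated_receptive_field(3, [2, 4, 8, 16, 32, 64])
print(rf)  # 253 pooled positions; at 128-bp pooling, roughly 32 kb of sequence context
```

This ignores the earlier strided convolution and pooling layers, which widen the receptive field further; the point is that doubling the dilation rate at each layer grows context exponentially with depth.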
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saturated mutagenesis heatmaps go into pim1_sat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"IFrame('pim1_sat/???.pdf', width=1200, height=400)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Describe the output..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}