Biased allele expression, refers to the imbalanced expression of the two alleles in a diploid genome. Unequal transcription of alleles may occur due to cis-regulatory element variation or allele-specific epigenetic modifications. Allelic imbalance may be incorrectly inferred due to technical variation inherent in RNA-Seq data, including read depth, reference mapping bias, and the overdispersion of reads. To correct for technical variation we develop a logistic regression model with a mixed effects approach to combine information regarding biased allele expression from many individuals in a population, and across multiple genes. Here we describe a new method for inferring allelic imbalance that combines information from multiple SNP sites within a transcribed unit, using a logistic regression model that explicitly models the effects of reference bias and SNP type biases. Additionally, by using a mixed effects approach the method also makes it possible to combine information from many individuals in a population for each gene and to test hypothesis regarding allelic imbalance in specific genes within and across populations.
If you use ASEofBases package, please cite:
- Inference on imbalanced allelic expression from RNA sequencing data in individuals and across groups using logistic regression models. Skotte L, Olney K, Nielsen R, and Wilson Sayres M. (https://dx.doi.org/10.6084/m9.figshare.1564499.v1) Manuscript in preparation
If you use the 1000Genomes
or Geuvidas
data sets, please also cite:
- A global reference for human genetic variation, The 1000 Genomes Project Consortium, Nature 526, 68-74 (01 October 2015) doi: 10.1038/nature15393.
- Lappalainen et al. Nature 2013 : Transcriptome and genome sequencing uncovers functional variation in humans (http://dx.doi.org/10.1038/nature12531).
- method model of allele-specific expression in single individuals
- method model of allele-specific expression using "population data"
Additionally code includes simulations:
- simulations for modeling statistcal power for various read depth and various number of SNPs per gene
- simulations for modeling SNP-wise binomial, binomial-based logistic regression, and binomial-based logistic regression correcting for the over dispersion of read counts.
Download the latest release of ASEofBases, unzip and add the bin
directory to your PATH
. E.g.:
wget -O ASEofBases-master.zip https://github.com/WilsonSayresLab/ASEofBases/archive/master.zip
unzip ASEofBases-master.zip && cd ASEofBases-master
export PATH=`pwd`/bin:$PATH
You will also need the following perviosuly published programs installed:
- angsd0.563 https://github.com/ANGSD/angsd
- vcftools_0.1.12b http://sourceforge.net/projects/vcftools/files/
- zlib-1.2.8 http://www.zlib.net
- samtools/htslib https://github.com/samtools/htslib
- bigwig https://genome.ucsc.edu/goldenPath/help/bigWig.html
/ASEofBases/1_code/ # bash scripts
- 1_get.sh
- 2_run.sh
- 3_makeData.sh
- ASEofBasesAnalysis.R
- simAoB.R
/ASEofBases/2_prog/ # locally written C++ programs
- getliners
- ieatgor
- VCFmergeGTF3
/ASEofBases/3_raw/ # raw and parsed data
- target mapability files
- annotation file
- raw vcf file
- filtered vcf file
- individual ID list
/ASEofBases/4_data/ # parsed RNAseq data for filtering & analysis in R
- idividual ouptut data (sampleID.data)
/ASEofBases/5_out/ # analysis and output made in R
- tab delimited output files
/ASEofBases/6_plot/ # plots made from the output data made in R
- plots in pdf format
Getliners - Merge the tmp.keys (positions) with tmp.het (heterozygous protein coding SNPs for individual) and greps the set of keys in tmp.keys in column 2 of the file tmp.het
ieatgor - Filter for alignability output chr$chr.ind$ind.data, greps any entry in tmp.data that has chromosome and position within the regions and specified in the "targetfile" (regions of the genome that are "callable")
VCFmergeGTF - This code is merging the genotype calls (from vcf) with position of protein coding genes (from gencode)
cd 2_prog/
g++ -O3 -o VCFmergeGTF3 VCFmergeGTF3.cpp -lz
g++ -O3 -o ieatgor ieatgorV2.cpp -lz
g++ -O3 -o getliners getliners.cpp -lz
Make sure each of these programs are installed and added to the /ASEofBases/2_prog/ directory
-
bigwig (https://genome.ucsc.edu/goldenPath/help/bigWig.html)
-
angsd (https://github.com/ANGSD/angsd)
-
vcftools_0.1.12b (http://sourceforge.net/projects/vcftools/files/)
-
zlib-1.2.8 (http://www.zlib.net)
-
samtools/htslib (https://github.com/samtools/htslib)
# angsd Using a local folder containing htslib git clone https://github.com/samtools/htslib.git; git clone https://github.com/angsd/angsd.git; cd htslib;make;cd ../angsd;make HTSSRC=../htslib Systemwide installation of htslib git clone https://github.com/angsd/angsd.git; cd angsd;make # vcftoolss_0.1.12b wget https://sourceforge.net/projects/vcftools/files/vcftools_0.1.12b.tar.gz tar -xzvf vcftools_0.1.12b.tar.gz rm vcftools_0.1.12b.tar.gz # zlib wget http://zlib.net/zlib-1.2.8.tar.gz tar -xzvf zlib-1.2.8.tar.gz rm zlib-1.2.8.tar.gz # samtools wget https://github.com/samtools/samtools/releases/download/1.3/samtools-1.3.tar.bz2 tar -vxjf samtools-1.3.tar.bz2 rm samtools-1.3.tar.bz2 #BigWig mkdir BigWig cd BigWig rsync -aP rsync://hgdownload.cse.ucsc.edu/genome/admin/exe/linux.x86_64.v287/ ./
The mapability file is used for identifing uniqueness of the reference GRCh37/hg19 genome assembly. Mapability files were generated using different window sizes, high signal will be found in areas where the sequence is unique.
Download the Mapability file:
wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign50mer.bigWig
mv wgEncodeCrgMapabilityAlign50mer.bigWig /ASEofBases/2_prog/BigWig/
1_getRaw.sh is a bash script for downloading initial data and does some processing.
-
1_getRaw.sh Overview:
-
Convert Mapability file to bed format and create file to filter on mapability
-
Get protein coding gene annotations
-
Download & parse individual genotype information from
1000Genomes
individuals -
Download & parse individual information RNAseq data from
Geuvadis
sh /ASEofBases/bash_scripts/1_getRaw.sh
-
-
Outputs:
- chr#.50mer.target - target file to individual chromosome files. Example: chr1:10207-10212
- gencode.chr#.gore - protein coding gene entries. Example: chr1 69091 70008 gene_id "ENSG00000186092.4";
- gencode.protien_coding.genes.v19.gtf - gencode gene annotation file, containing only protein coding regions
- chr#.genotypes.vcf - 1000Genomes genotype file
- individual_ID_list - Geuvadis RNAseq individual id list
This bash script will call another script and together will conduct filtering and parsing of the genotype and RNAseq data.
sh /ASEofBases/1_code/2_run.sh # this program runs 3_makeData.sh
- Output: will result in a file for each individual with these columns:
- chr: chromosome
- pos: position
- ref: reference allele
- alt: alternative allele
- g: gene
- a1: observed phased allele 1
- a2: observed phased allele 2
- type: snp type (1 if allele 1 is the alternative allele and 2 if allele 2 is the alternative allele)
- n: total counts of reads overlapping
- nA: counts of As
- nC: counts of Cs
- nG: counts of Gs
- nT: counts of Ts
6. Infer allele specific expression using ASEofBases on the filtered data that was generated in the previous steps
For regression analysis using data generated in previous steps
ASEofBases.R
For simulations
simAoB.R