Borzoi - Predicting RNA-seq from DNA Sequence

Code repository for Borzoi models, which are convolutional neural networks trained to predict RNA-seq coverage at 32bp resolution given 524kb input sequences. The model is described in the following bioRxiv preprint:

https://www.biorxiv.org/content/10.1101/2023.08.30.555582v1.
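
To make the stated resolution concrete, the sketch below works out how many 32bp bins tile the input window and which bin a given position falls into. It assumes that "524kb" means 524,288 bp (2**19); the README only gives the rounded figure. The function names are illustrative and are not part of the borzoi code base. The number of bins a trained model actually outputs may be smaller if the sequence edges are cropped, so treat this as arithmetic only.

```python
# Assumption: "524kb" denotes 524,288 bp (2**19); only the rounded value is stated above.
INPUT_LENGTH_BP = 2 ** 19
BIN_SIZE_BP = 32  # prediction resolution stated above


def num_bins(span_bp: int, bin_size_bp: int = BIN_SIZE_BP) -> int:
    """Return how many whole bins tile a span of `span_bp` base pairs."""
    if span_bp % bin_size_bp != 0:
        raise ValueError("span is not a whole number of bins")
    return span_bp // bin_size_bp


def bin_index(offset_bp: int, bin_size_bp: int = BIN_SIZE_BP) -> int:
    """Return the 0-based bin containing a 0-based offset into the window."""
    if offset_bp < 0:
        raise ValueError("offset must be non-negative")
    return offset_bp // bin_size_bp


if __name__ == "__main__":
    print(num_bins(INPUT_LENGTH_BP))  # bins across the full input window
    print(bin_index(100))             # bin holding the base at offset 100
```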

Borzoi was trained on a large set of RNA-seq experiments from ENCODE and GTEx, as well as re-processed versions of the original Enformer training data (including ChIP-seq and DNase data from ENCODE, ATAC-seq data from CATlas, and CAGE data from FANTOM5). Here is a list of trained-on experiments: human / mouse.

The repository contains example usage code (including jupyter notebooks for predicting and visualizing genetic variants) as well as links for downloading model weights, training data, QTL benchmark tasks, etc.

Contact drk (at) calicolabs.com or jlinder (at) calicolabs.com with questions about the model or data.

Installation

Borzoi depends on the baskerville repository, which can be installed by issuing the following commands:

git clone https://github.com/calico/baskerville.git
cd baskerville
pip install -e .

Next, install the borzoi repository by issuing the following commands:

git clone https://github.com/calico/borzoi.git
cd borzoi
pip install -e .

To train new models, the westminster repository is also required and can be installed with these commands:

git clone https://github.com/calico/westminster.git
cd westminster
pip install -e .

These repositories also depend on a number of Python packages, which are installed automatically with borzoi. See pyproject.toml for the complete list. The most important version dependencies are:

Note: The example notebooks require jupyter, which can be installed with pip install notebook.
A new conda environment can be created with conda create -n borzoi_py310 python=3.10.
Some of the scripts in this repository start multi-process jobs and require slurm.

Finally, the code base relies on several environment variables. For convenience, these can be configured in the active conda environment with the 'env_vars.sh' script. First, open 'env_vars.sh' in each repository folder and edit the lines at the top to match your local paths. Then issue these commands:

cd borzoi
conda activate borzoi_py310
./env_vars.sh
cd ../baskerville
./env_vars.sh
cd ../westminster
./env_vars.sh

Alternatively, the environment variables can be set manually:

export BORZOI_DIR=/home/<user_path>/borzoi
export PATH=$BORZOI_DIR/src/scripts:$PATH
export PYTHONPATH=$BORZOI_DIR/src/scripts:$PYTHONPATH

export BASKERVILLE_DIR=/home/<user_path>/baskerville
export PATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PATH
export PYTHONPATH=$BASKERVILLE_DIR/src/baskerville/scripts:$PYTHONPATH

export WESTMINSTER_DIR=/home/<user_path>/westminster
export PATH=$WESTMINSTER_DIR/src/westminster/scripts:$PATH
export PYTHONPATH=$WESTMINSTER_DIR/src/westminster/scripts:$PYTHONPATH

export BORZOI_CONDA=/home/<user>/anaconda3/etc/profile.d/conda.sh
export BORZOI_HG38=$BORZOI_DIR/examples/hg38
export BORZOI_MM10=$BORZOI_DIR/examples/mm10
export BASKERVILLE_CONDA=$BORZOI_CONDA

Note: The baskerville and westminster variables are only required for data processing and model training.
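
Before running any scripts, it can help to confirm that the variables listed above are actually set in the current shell. The following is a minimal sketch of such a check. The variable names are taken from the export statements above; the helper `missing_vars` and the grouping into two lists are illustrative and are not part of the repository.

```python
import os

# Variables from the export statements above, grouped following the note:
# the baskerville and westminster ones are only needed for data processing
# and model training.
BORZOI_VARS = ["BORZOI_DIR", "BORZOI_CONDA", "BORZOI_HG38", "BORZOI_MM10"]
TRAINING_VARS = ["BASKERVILLE_DIR", "BASKERVILLE_CONDA", "WESTMINSTER_DIR"]


def missing_vars(names, environ=None):
    """Return the names that are unset or empty in `environ` (default: os.environ)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    for label, names in (("borzoi", BORZOI_VARS), ("training", TRAINING_VARS)):
        missing = missing_vars(names)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

This only checks that the variables exist; it does not verify that the paths they point to are valid.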

Model Availability

The model weights can be downloaded as .h5 files from the URLs below. We trained a total of 4 model replicates with identical train, validation and test splits (test = fold3, validation = fold4 from sequences_human.bed.gz).

Borzoi Replicate 0 (human) | (mouse)
Borzoi Replicate 1 (human) | (mouse)
Borzoi Replicate 2 (human) | (mouse)
Borzoi Replicate 3 (human) | (mouse)

Users can run the script download_models.sh to download all model replicates and annotations into the 'examples/' folder.

cd borzoi
./download_models.sh
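
To see how the train, validation and test split described above maps onto the sequence file, the sketch below tallies the records in a gzipped BED file by fold label. It assumes that the fourth tab-separated column of sequences_human.bed.gz holds a label such as "fold3"; this layout is an assumption and should be checked against the actual file. The helper names are hypothetical.

```python
import gzip
from collections import Counter


def count_by_label(bed_path):
    """Count records in a gzipped BED file by the label in the 4th column.

    Assumption: column 4 holds the fold label (e.g. "fold3").
    Lines with fewer than four columns are skipped.
    """
    counts = Counter()
    with gzip.open(bed_path, "rt") as bed:
        for line in bed:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue
            counts[fields[3]] += 1
    return counts


def split_of(label, test_label="fold3", valid_label="fold4"):
    """Map a fold label to its role, following test = fold3, validation = fold4."""
    if label == test_label:
        return "test"
    if label == valid_label:
        return "valid"
    return "train"
```

For example, `count_by_label("sequences_human.bed.gz")` would return a `Counter` keyed by fold label, and `split_of` can then be used to sum the counts per split.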

Mini Borzoi Models

We have trained a collection of smaller model instances, either on subsets of the data modalities or on all modalities but with architectural changes relative to the original model. For example, some models are trained only on RNA-seq data, while others are trained on DNase-, ATAC- and RNA-seq. Similarly, some model instances are trained on human data only, while others are trained on both human and mouse data. The models were trained with either 2- or 4-fold cross-validation and are available at the following URL:

Mini Borzoi Model Collection

For example, here are the weights, targets, and parameter file of a model trained on K562 RNA-seq:

Borzoi K562 RNA-seq Fold 0
Borzoi K562 RNA-seq Fold 1
Borzoi K562 RNA-seq Targets
Borzoi K562 RNA-seq Parameters

Note: To list the contents of the mini model repository, use gsutil:

gsutil ls gs://seqnn-share/borzoi/mini

Data Availability

The training data for Borzoi can be downloaded from the following URL:

Borzoi Training Data

Note: This data bucket is large (multiple TB) and is therefore set to "Requester Pays". To access the bucket, you must have a billable user project set up on the Google Cloud Platform (GCP) and pass it with the "-u" flag when issuing gsutil commands. For example, to list the contents of "gs://borzoi-paper/data", issue this command:

gsutil -u <user_project> ls gs://borzoi-paper/data
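
When scripting access to both buckets, the only difference is whether the "-u" flag is present. The sketch below builds the argument list for a `gsutil ls` call with or without a billing project. The helper `gsutil_ls_command` is hypothetical, and "my-project" is a placeholder, not a real project ID. The function only constructs the command; it does not run it.

```python
import shlex


def gsutil_ls_command(url, user_project=None):
    """Build the argv for `gsutil ls`, adding `-u <project>` for Requester Pays buckets."""
    cmd = ["gsutil"]
    if user_project:
        cmd += ["-u", user_project]
    cmd += ["ls", url]
    return cmd


if __name__ == "__main__":
    # Requester Pays bucket: a billing project is needed ("my-project" is a placeholder).
    print(shlex.join(gsutil_ls_command("gs://borzoi-paper/data", "my-project")))
    # Mini model bucket, listed without "-u" as in the command shown earlier.
    print(shlex.join(gsutil_ls_command("gs://seqnn-share/borzoi/mini")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where gsutil is installed and authenticated.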

QTL Availability

The curated e-/s-/pa-/ipaQTL benchmarking data can be downloaded from the following URLs:

eQTL Data
sQTL Data
paQTL Data
ipaQTL Data

Paper Replication

To replicate the results presented in the paper, visit the borzoi-paper repository. This repository contains scripts for training, evaluating, and analyzing the published model, and for processing the training data.

Tutorials

The following directories contain minimal tutorials regarding model training, variant scoring, and interpretation. The 'legacy' tutorials use data transformations that are similar to those used in the manuscript, while 'latest' use updated (and simpler) transformations. Note that these tutorials are only intended to showcase core functionality on sample data (such as processing an RNA-seq experiment, or training a simple model). For advanced analyses, we recommend studying the results presented in the manuscript (see Paper Replication).

Example Notebooks

The following notebooks contain example code for predicting and interpreting genetic variants.

Notebook 1a: Interpret eQTL SNP (expression) (fancy)
Notebook 1b: Interpret paQTL SNP (polyadenylation) (fancy)
Notebook 1c: Interpret sQTL SNP (splicing)
Notebook 1d: Interpret ipaQTL SNP (splicing and polya)