DRG

This repository is under development and may not be fully functional at its current stage. Please try to fix small bugs yourself and contribute the fixes as pull requests.

This is a repository with code to develop Convolutional Neural Networks (CNNs) for genomic sequence-to-function models that predict molecular phenotypes from genomic sequence, such as gene expression or chromatin accessibility.

There are also scripts to visualize, compare and assess model performance or to compute and visualize feature importance.

Summary

The cis-regulatory code encodes when, where, and how much of each gene product is made. It enables multicellular organisms to create many cell types from a single genome by regulating the transcription, processing, and degradation of gene products. Unsurprisingly, variants that disturb the information in cis-regulatory elements can result in various genetic diseases. However, measuring the impact of every cis-regulatory sequence variation on function in each cell type is combinatorially infeasible. Deep sequence-to-function models learn the relationship between genomic sequence and genome-wide functional molecular measurements. Trained on data from multiple cell types, these models can develop a foundational understanding of the cis-regulatory code for individual cell types. However, recent evaluations have determined that the learned grammar is far from perfect, and that the models miss effects from distal elements as well as from common and rare variants [1,2].

Multi-modal multi-species multi-cell type models

To improve the foundational knowledge of these models about gene regulation, three directions seem promising: 1) improving cell type resolution with single-cell data; 2) including data modalities of different scales, which measure different aspects of the multi-layered gene-regulatory process, in a single model; 3) including data from other species to increase the number of available data modalities from different cell types, and to increase the sequence variation from which the model can learn evolutionarily conserved cis-regulatory elements. Incorporating these data types in a biologically meaningful way at cell type resolution is a major challenge on the way to a more comprehensive view of gene expression regulation.

Generating Genomic sequences from Sequence-to-function models

Deep neural networks can learn the link between genomic sequence and functional molecular measurements in a cell-type-specific manner from large-scale measurements of molecular phenotypes. These models enable us not only to extract the learned regulatory features, but also to design new sequences with cell-type-specific regulatory functions, which can be exploited in bioengineering or new therapeutics. To generate these sequences, accurate sequence-to-function models are combined with generative processes that exploit the knowledge of the model to generate artificial sequences with target-specific functionalities. However, how to effectively generate sequences that are cell type specific and have improved functionality compared to wild-type sequences is an open research question.

Learning complex long-range interactions

Gene expression is a multi-layered process, and current models mostly focus on transcriptional activity. To improve the models' ability to predict gene expression, we require new model architectures that can effectively learn complex interactions between variably spaced sequence elements. Moreover, new feature attribution methods will have to be developed to extract information about complex long-range interactions, such as mRNA structure, from these models.

Building cell type agnostic sequence-to-function models

Deep sequence-to-function models learn the relationship between genomic sequence and genome-wide functional molecular measurements. Trained on data from multiple cell types, these models can develop a foundational understanding of the cis-regulatory code for individual cell types. However, current architectures have to learn the cell-type-specific code from the data of each cell type individually or in a multi-task fashion. These models cannot reason about which trans-factors cause the differences between cell types. It is hoped that next-generation models will be able to use information about the abundance of trans-acting factors to interpolate to unseen cell types. Such models would take readily accessible data, such as the gene expression of regulatory factors, as an additional input to determine the cell type and adjust the model parameters accordingly.

Installation

Download the repository and set up the conda environment.

Then navigate to the location of the local repository and install the package in editable mode:

pip install -e .

Usage

Single sequence --> multi-task modality

Single sequence --> multi-modal modalities

Multiple sequence --> multi-modal modalities

Load pre-trained models

You can download pre-trained model parameters from ..., and load them with ...

mkdir data
cd data
wget https://zenodo.org/record/3402406/files/deepsea.beluga.pth

Sequence attributions

Sequence attributions are derived from linear approximations of the model that describe the impact of each input feature around the sequence of interest $s_0$. From the linear model's coefficients $m_{s_0}$, also referred to as multipliers, sequence attributions are derived as local, global, and hypothetical attributions.

$$a_{local} = m_{s_0}$$

$$a_{global} = m_{s_0} \cdot (s_0 - s_{baseline})$$

$$a_{hypo}(j) = m_{s_0}(j) - \sum_{i \in \{A,C,G,T\}} b(i) \cdot m_{s_0}(i), \quad j \in \{A,C,G,T\}$$
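The three definitions above can be sketched in a few lines of NumPy. This is only an illustration and not part of this repository or of TISM: the helper names `one_hot` and `attributions` are hypothetical, the multipliers are random numbers standing in for a model's output, and $b(i)$ is assumed to be the baseline value of base $i$ at each position (here a uniform 0.25).

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, L) one-hot array with rows ordered A, C, G, T."""
    x = np.zeros((4, len(seq)), dtype=np.float64)
    for pos, base in enumerate(seq):
        x[BASES.index(base), pos] = 1.0
    return x

def attributions(multipliers, s0, baseline):
    """Return local, global, and hypothetical attributions.

    multipliers : (4, L) coefficients m_{s_0} of the linear approximation
    s0          : (4, L) one-hot sequence of interest
    baseline    : (4, L) baseline sequence (assumed: b(i) per base and position)
    """
    a_local = multipliers
    a_global = multipliers * (s0 - baseline)
    # Subtract the baseline-weighted sum of multipliers at each position.
    a_hypo = multipliers - (baseline * multipliers).sum(axis=0, keepdims=True)
    return a_local, a_global, a_hypo

rng = np.random.default_rng(0)
s0 = one_hot("ACGTTGCA")
m = rng.normal(size=s0.shape)        # stand-in multipliers, not from a real model
baseline = np.full(s0.shape, 0.25)   # uniform baseline (assumption)
a_local, a_global, a_hypo = attributions(m, s0, baseline)
```

Under these definitions, summing the global attributions over the four bases at a position gives the hypothetical attribution of the observed base, which is a quick consistency check for any implementation.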

Sequence attributions can be easily determined with the model's gradient or ISM. See TISM for more details.

import numpy as np
import torch
import matplotlib.pyplot as plt

from tism.models import Beluga
from tism.utils import plot_attribution, ism, deepliftshap
from tism.torch_grad import correct_multipliers, takegrad
from tangermeme.utils import random_one_hot
from tangermeme.ersatz import substitute

# Load the pre-trained Beluga parameters downloaded above
parameters = '../data/deepsea.beluga.pth'
model = Beluga()
model.load_state_dict(torch.load(parameters))

N = 1                # number of sequences
b = 4                # number of bases (A, C, G, T)
input_length = 2000  # input length in bp

# Random one-hot sequence with a motif inserted
x = random_one_hot((N, b, input_length), random_state = 1).type(torch.float32)
x = substitute(x, "CTCAGTGATG")
x = x.detach().cpu().numpy()

track = 267  # index of the output track to attribute
vis_seq = 0  # index of the sequence to visualize

# Local attributions from the gradient, plotted for the central 200 bp
grad_local = takegrad(x, model, tracks = track, output = 'local', device = None, baseline = None)
grad_local0 = grad_local[vis_seq,0][...,900:1100]
fig_local = plot_attribution(grad_local0, heatmap = grad_local0, ylabel = 'Grad\n(local)')
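
For intuition about how gradients relate to ISM, the following self-contained sketch compares exact in silico saturation mutagenesis with a first-order Taylor approximation on a toy model. It does not use TISM or Beluga: the model `toy_model`, its weights `W`, and all function names are hypothetical and chosen only so that the example runs with NumPy alone.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 12
W = 0.1 * rng.normal(size=(4, L))  # hypothetical weights of a toy model

def toy_model(x):
    """Toy sequence-to-function model: scalar output for a (4, L) one-hot input."""
    return np.tanh((W * x).sum())

def toy_gradient(x):
    """Analytic gradient of toy_model with respect to every input entry."""
    z = (W * x).sum()
    return (1.0 - np.tanh(z) ** 2) * W

def exact_ism(x):
    """Effect of every single-base substitution, one forward pass per mutant."""
    ref = toy_model(x)
    effects = np.zeros_like(x)
    for pos in range(x.shape[1]):
        for base in range(4):
            mutant = x.copy()
            mutant[:, pos] = 0.0
            mutant[base, pos] = 1.0
            effects[base, pos] = toy_model(mutant) - ref
    return effects

def taylor_ism(x):
    """First-order approximation: gradient at each base minus gradient at the reference base."""
    grad = toy_gradient(x)
    ref_grad = (grad * x).sum(axis=0, keepdims=True)
    return grad - ref_grad

# Random one-hot sequence of length L
idx = rng.integers(0, 4, size=L)
x = np.zeros((4, L))
x[idx, np.arange(L)] = 1.0

exact = exact_ism(x)
approx = taylor_ism(x)
corr = np.corrcoef(exact.ravel(), approx.ravel())[0, 1]
```

The exact version needs one forward pass per substitution (3 x L mutants), while the approximation needs a single gradient. For a real network the agreement between the two depends on how non-linear the model is around the sequence of interest.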

References

[1] Alexander Sasse, Maria Chikina, Sara Mostafavi. Quick and effective approximation of in silico saturation mutagenesis experiments with first-order Taylor expansion. bioRxiv 2023.11.10.566588; doi: https://doi.org/10.1101/2023.11.10.566588
