GX_alBERTo

Gene Expression Prediction with Self-Attentive Multimodal Architecture

This project aims to predict gene expression levels by combining DNA sequence data with methylation values using a multimodal architecture. The project builds on the GX-BERT architecture to enhance prediction accuracy by incorporating epigenetic information.

Overview

Gene expression prediction is a critical task in oncology and other fields of biology. Traditional models primarily use DNA sequence data; however, incorporating additional modalities, such as methylation data, can potentially improve prediction accuracy. This project evaluates the performance of a multimodal architecture that integrates DNA sequences and methylation values.

Repository Structure

dataset/: Directory where the dataset must be saved.
src/: Contains the source code.
models/: Implementation of the GX-BERT baseline and the multimodal models.
notebooks/: Jupyter notebooks for data preprocessing, training, and analysis.
scripts/: Python scripts for various tasks like data preprocessing and model training.
results/: Results and performance metrics of the models.
README.md: Project overview and instructions.

Dataset

The dataset comprises human gene sequences and their corresponding methylation values and gene expression levels. Key details:

DNA Sequences: Single-strand DNA sequences of 131,072 bases, centered around the Transcription Start Site (TSS).
Methylation Values: Sparse arrays containing methylation beta values.
Gene Expression: mRNA gene expression levels used as labels.
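
The input format described above can be illustrated with a minimal NumPy sketch. It is not the repository's loader: the function names, the A/C/G/T channel order, and the (position, beta) representation of the sparse methylation array are assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"      # assumed channel order
SEQ_LEN = 131_072   # window length stated in the dataset description


def one_hot_encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) float32 one-hot matrix.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    lookup = np.full(256, -1, dtype=np.int64)
    for channel, base in enumerate(BASES):
        lookup[ord(base)] = channel
    codes = np.frombuffer(seq.upper().encode("ascii"), dtype=np.uint8)
    idx = lookup[codes]
    out = np.zeros((len(seq), 4), dtype=np.float32)
    rows = np.flatnonzero(idx >= 0)   # positions holding a known base
    out[rows, idx[rows]] = 1.0
    return out


def dense_methylation(positions, betas, length=SEQ_LEN) -> np.ndarray:
    """Scatter sparse (position, beta) pairs into a dense track.

    Positions without a measurement are left at 0, which this sketch
    does not distinguish from a measured beta value of 0.
    """
    track = np.zeros(length, dtype=np.float32)
    track[np.asarray(positions, dtype=np.int64)] = np.asarray(betas, dtype=np.float32)
    return track
```

For a full window, `one_hot_encode` would return an array of shape (131072, 4) and `dense_methylation` a vector of length 131072 aligned to the same positions.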

Data Sources

DNA sequences were extracted from the GENCODE gene annotations.
Methylation and gene expression data were sourced from the National Cancer Institute GDC data portal.

Methods

Preprocessing

Removal of samples with null values.
Normalization of gene expression values.
Methylation values are already in the range [0, 1] and require no normalization.
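
A minimal sketch of these preprocessing steps follows. The README does not say which normalization is applied to the expression values, so z-scoring is an assumption here, as are the function name and the array layout (one row per sample).

```python
import numpy as np


def preprocess(expression, methylation):
    """Drop samples containing nulls, then z-score the expression labels.

    expression:  shape (n_samples,)
    methylation: shape (n_samples, n_positions), beta values
    """
    expression = np.asarray(expression, dtype=np.float64)
    methylation = np.asarray(methylation, dtype=np.float64)

    # Keep only samples with no NaN in either the label or the methylation row.
    keep = ~np.isnan(expression) & ~np.isnan(methylation).any(axis=1)
    expression, methylation = expression[keep], methylation[keep]

    # Beta values are expected in [0, 1], so they are only checked, not rescaled.
    if methylation.size and (methylation.min() < 0 or methylation.max() > 1):
        raise ValueError("methylation beta values must lie in [0, 1]")

    # Assumed normalization: zero mean, unit (population) standard deviation.
    normalized = (expression - expression.mean()) / expression.std()
    return normalized, methylation
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and reused for validation and test data.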

Architecture

The project employs an enhanced version of the GX-BERT model:

Baseline Model: Uses DNA sequences only.
Unimodal Model: Processes DNA sequences and integrates summed methylation values.
Multimodal Model: Combines DNA sequences with methylation values through an embedding layer.
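
One way to picture how the multimodal variant combines the two inputs through an embedding layer is the sketch below. It is only an illustration under stated assumptions: additive fusion of two linear projections, the dimension `d`, and all names are hypothetical and are not taken from the repository's model code, which is written in PyTorch.

```python
import numpy as np


def fuse_embeddings(dna_one_hot, methylation, w_dna, w_met):
    """Additive fusion: project each modality to d dims per position and sum.

    dna_one_hot: (L, 4)   one-hot DNA
    methylation: (L,)     beta value per position
    w_dna:       (4, d)   DNA embedding weights (hypothetical)
    w_met:       (1, d)   methylation embedding weights (hypothetical)
    """
    dna_emb = dna_one_hot @ w_dna            # (L, 4) @ (4, d) -> (L, d)
    met_emb = methylation[:, None] @ w_met   # (L, 1) @ (1, d) -> (L, d)
    return dna_emb + met_emb


# Toy example with random inputs and weights.
rng = np.random.default_rng(0)
L, d = 8, 16
dna = np.eye(4, dtype=np.float32)[rng.integers(0, 4, size=L)]
met = rng.random(L).astype(np.float32)
w_dna = rng.normal(size=(4, d)).astype(np.float32)
w_met = rng.normal(size=(1, d)).astype(np.float32)

tokens = fuse_embeddings(dna, met, w_dna, w_met)  # (L, d) input to the encoder
```

With this kind of fusion, a position with a methylation value of zero reduces to the plain DNA embedding, so the sequence-only baseline is recovered as a special case.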

Training and Evaluation

Models were trained on separate datasets to test different configurations.
The performance was evaluated using the R2 score metric.
Approximately 40GB of RAM was used, and training was conducted with a batch size of 128 on a single GPU with 12GB VRAM.
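
For reference, the R2 metric can be computed as in the sketch below, assuming the standard coefficient-of-determination definition (1 minus the ratio of residual to total sum of squares). The evaluation scripts in the repository may compute it differently, for example through a library call.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

A perfect prediction scores 1, and always predicting the mean of the labels scores 0, so values around 0.57 mean the model explains a little over half of the variance in expression.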

Results

Integrating methylation data yielded a small improvement in median R2 for gene expression prediction:

Baseline Model (DNA only): Median R2 ~ 0.568
Unimodal Model (with methylation): Median R2 ~ 0.569
Multimodal Model (combined data): Median R2 ~ 0.570

The results suggest that combining DNA and methylation data slightly improves prediction accuracy (a gain of about 0.002 in median R2 over the baseline), with the benefit most visible for low-expressed genes.

Conclusion

This project demonstrates the potential of multimodal approaches in gene expression prediction. Future work will focus on refining the integration of epigenetic data and exploring additional modalities.

References

Vittorio Pipoli, et al. Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers.
Vikram Agarwal, et al. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks.
Žiga Avsec, et al. Effective gene expression prediction from sequence by integrating long-range interactions.

Getting Started

Prerequisites

Python 3.x
PyTorch
Transformers library (Hugging Face)
NumPy
Pandas

Installation

Clone the repository:

git clone https://github.com/IloDan/GX-alBERTo.git
cd GX-alBERTo

Install the required packages:

pip install -r requirements.txt

Author	Contact	GitHub
Cristian Bellucci	[email protected]	cleb98
Danilo Caputo	[email protected]	Ilodan
Riccardo Santi	[email protected]	RiccardoSanti092

License

This project is licensed under the MIT License - see the LICENSE file for details.

