This project aims to predict gene expression levels by combining DNA sequence data with methylation values using a multimodal architecture. The project builds on the GX-BERT architecture to enhance prediction accuracy by incorporating epigenetic information.
Gene expression prediction is a critical task in oncology and other fields of biology. Traditional models primarily use DNA sequence data; however, incorporating additional modalities, such as methylation data, can potentially improve prediction accuracy. This project evaluates the performance of a multimodal architecture that integrates DNA sequences and methylation values.
- dataset/: where the dataset must be saved.
- src/: contains the source code.
- models/: implementation of the GX-BERT baseline and the multimodal models.
- notebooks/: Jupyter notebooks for data preprocessing, training, and analysis.
- scripts/: Python scripts for tasks such as data preprocessing and model training.
- results/: results and performance metrics of the models.
- README.md: project overview and instructions.
The dataset comprises human gene sequences and their corresponding methylation values and gene expression levels. Key details:
- DNA Sequences: Single-strand DNA sequences of 131,072 bases, centered around the Transcription Start Site (TSS).
- Methylation Values: Sparse arrays containing methylation beta values.
- Gene Expression: mRNA gene expression levels used as labels.
- DNA sequences were extracted from the GENCODE gene annotation.
- Methylation and gene expression data were sourced from the National Cancer Institute GDC data portal.
- Removal of samples with null values.
- Normalization of gene expression values.
- Methylation values are already in the range [0, 1] and require no normalization.
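The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the field names (`sequence`, `methylation`, `expression`) and the `preprocess` helper are hypothetical, and the README does not say which normalization is applied to expression values, so the sketch assumes a log transform followed by z-scoring.

```python
import math
from statistics import mean, pstdev


def preprocess(samples):
    """Drop samples with any null field, then normalize expression values.

    Assumption: normalization is log1p followed by z-scoring. The project
    may use a different scheme.
    """
    # Step 1: remove samples that contain a null (None) value in any field.
    complete = [s for s in samples if all(v is not None for v in s.values())]

    # Step 2: normalize expression. Methylation beta values are left as-is
    # because they already lie in [0, 1].
    log_expr = [math.log1p(s["expression"]) for s in complete]
    mu = mean(log_expr)
    sigma = pstdev(log_expr) or 1.0  # guard against a zero standard deviation
    return [
        {**s, "expression": (x - mu) / sigma}
        for s, x in zip(complete, log_expr)
    ]


# Toy records; sparse methylation is stored as {position: beta value}.
samples = [
    {"sequence": "ACGT", "methylation": {1: 0.8}, "expression": 10.0},
    {"sequence": "TTGA", "methylation": None, "expression": 3.0},
    {"sequence": "GGCA", "methylation": {0: 0.1, 3: 0.5}, "expression": 0.0},
]
clean = preprocess(samples)
```

The second record is dropped because its methylation field is null; the remaining expression values end up with zero mean and unit variance.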
The project employs an enhanced version of the GX-BERT model:
- Baseline Model: Uses DNA sequences only.
- Unimodal Model: Processes DNA sequences and integrates summed methylation values.
- Multimodal Model: Combines DNA sequences with methylation values through an embedding layer.
- Models were trained on separate datasets to test different configurations.
- Performance was evaluated using the R2 score.
- Training used approximately 40 GB of RAM and a batch size of 128 on a single GPU with 12 GB of VRAM.
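To make the multimodal variant more concrete, here is one way methylation values could be combined with DNA tokens through an embedding layer. This is a sketch under stated assumptions, not the GX-BERT implementation: the class name `MethylationFusion`, the vocabulary size, the additive fusion, and the mean pooling are all hypothetical choices, and the transformer encoder is omitted entirely.

```python
import torch
from torch import nn


class MethylationFusion(nn.Module):
    """Hypothetical fusion of DNA tokens with per-position methylation."""

    def __init__(self, vocab_size=5, d_model=32):
        super().__init__()
        # One embedding vector per nucleotide token (A, C, G, T, N assumed).
        self.base_embedding = nn.Embedding(vocab_size, d_model)
        # Projects each scalar beta value into the same embedding space.
        self.meth_projection = nn.Linear(1, d_model)
        # Regression head producing one expression value per sequence.
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens, methylation):
        # tokens: (batch, length) integer ids
        # methylation: (batch, length) beta values in [0, 1], 0 where absent
        x = self.base_embedding(tokens)
        x = x + self.meth_projection(methylation.unsqueeze(-1))
        # Placeholder for the encoder: average over sequence positions.
        pooled = x.mean(dim=1)
        return self.head(pooled).squeeze(-1)


torch.manual_seed(0)
model = MethylationFusion()
tokens = torch.randint(0, 5, (4, 128))   # toy batch, far shorter than 131,072
methylation = torch.rand(4, 128)
prediction = model(tokens, methylation)  # one predicted value per sequence
```

Adding the projected methylation to the token embedding keeps the sequence length unchanged, which is one reason this style of fusion is convenient for very long inputs.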
Integrating methylation data yielded a marginal improvement in median R2 over the DNA-only baseline:
- Baseline Model (DNA only): Median R2 ~ 0.568
- Unimodal Model (with methylation): Median R2 ~ 0.569
- Multimodal Model (combined data): Median R2 ~ 0.570
The results suggest that combining DNA and methylation data modestly improves prediction accuracy, particularly for lowly expressed genes.
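For reference, the median R2 figures reported above could be computed along these lines. The README does not state what the median is taken over, so this sketch assumes it is taken over several evaluation runs; the `r2_score` helper and the toy values are illustrative only.

```python
from statistics import mean, median


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy (labels, predictions) pairs standing in for separate evaluation runs.
runs = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    ([0.5, 1.5, 2.5], [0.5, 1.5, 2.5]),
    ([2.0, 4.0, 6.0], [3.0, 4.0, 5.0]),
]
scores = [r2_score(y_true, y_pred) for y_true, y_pred in runs]
median_r2 = median(scores)
```

The median is less sensitive than the mean to a single unusually good or bad run, which makes it a reasonable summary when run-to-run variation is comparable to the differences between models.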
This project demonstrates the potential of multimodal approaches in gene expression prediction. Future work will focus on refining the integration of epigenetic data and exploring additional modalities.
- Vittorio Pipoli, et al. Predicting gene expression levels from DNA sequences and post-transcriptional information with transformers.
- Vikram Agarwal, et al. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks.
- Žiga Avsec, et al. Effective gene expression prediction from sequence by integrating long-range interactions.
- Python 3.x
- PyTorch
- Transformers library (Hugging Face)
- NumPy
- Pandas
Clone the repository:
git clone https://github.com/IloDan/GX-alBERTo.git
cd GX-alBERTo
Install the required packages:
pip install -r requirements.txt
AUTHORs | CONTACTs | GITHUBs |
---|---|---|
Cristian Bellucci | [email protected] | cleb98 |
Danilo Caputo | [email protected] | Ilodan |
Riccardo Santi | [email protected] | RiccardoSanti092 |
This project is licensed under the MIT License - see the LICENSE file for details.