Hadaca3 Data Challenge
The Data Challenge took place on December 2-6, 2024 in Aussois, France.
The detailed description of the challenge can be found here: https://hadaca3.sciencesconf.org/
- Table of contents
- Contributors
- Overview
- Workflow
- Data
- Methods
- Exploratory Data Analysis
- Pre-processing
- Model
- Usage
- Codabench Platform
- Results
- Conclusions
- Future Ideas
- Acknowledgments
- References
- Solène Weill
- Jędrzej Kubica
- Vesna Lukic
- Guillaume Appe
The aim of the project was to design and develop a bioinformatic workflow to quantify pancreatic tumor heterogeneity using supervised deconvolution methods and multi-omics data. Previous studies have introduced various deconvolution methods [1]; however, a number of challenges persist in the field. The first challenge was the integration of multi-omics data (RNA-seq, single-cell RNA-seq, and DNA methylation) into a reference for the deconvolution process, and the second was the selection and combination of the best deconvolution software packages. The project results were measured and compared to other approaches on the Codabench platform [2].
This documentation provides comprehensive details of our contribution, which focused on performing cell-type deconvolution using bulk RNA and methylation data, and trying both uni- and multi-modal predictions.
The challenge was split into three phases as follows:
- Phase 1: Discovery of the data and the Codabench platform
- Phase 2: Estimation of cell type heterogeneity and submissions of methods/results into the platform
- Phase 3: Migration of the best methods from phase 2 and their evaluation
Data was provided in the first two phases. Phase 1 data consisted of one simulated multi-omic dataset having an in-silico mixture of 5 cell types with explicit dependence between genes/CpG probes. Phase 2 data consisted of xxxxx.
Our approach consisted of feature selection followed by cell-type deconvolution.
Feature selection was performed on both the methylation and the single-cell RNA data.
For methylation data:
- Begin with full bulk methylation data
- Map CpG islands to genes from bulk RNA
- Find cell-type-specific methylation sites by (a) ranking CpG islands by the difference between the maximum methylation across all cell types and the mean of the remaining cell types, and (b) selecting the top CpG islands based on a threshold (minimum of mean methylation across all cell types)
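The ranking step above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the repository: the function name `rank_specific_cpgs`, the `top_n` and `min_score` parameters, and the toy matrix are assumptions, and the threshold is simplified to a cut-off on the score itself rather than the exact rule used during the challenge.

```python
import numpy as np

def rank_specific_cpgs(ref_met, top_n=2, min_score=0.0):
    """Rank CpG sites by how specific their methylation is to one cell type.

    ref_met: array of shape (n_cpgs, n_cell_types) holding beta values.
    Score = max across cell types minus the mean of the remaining cell types.
    Returns the indices of the top_n sites and the score of every site.
    """
    ref_met = np.asarray(ref_met, dtype=float)
    n_types = ref_met.shape[1]
    row_max = ref_met.max(axis=1)
    # Mean of the other cell types: remove the maximum from the row sum.
    rest_mean = (ref_met.sum(axis=1) - row_max) / (n_types - 1)
    score = row_max - rest_mean
    # Hypothetical threshold: keep only sites whose score reaches min_score.
    keep = np.flatnonzero(score >= min_score)
    order = keep[np.argsort(-score[keep], kind="stable")]
    return order[:top_n], score

# Toy reference: 4 CpG sites (rows) x 3 cell types (columns).
toy_ref_met = [
    [0.9, 0.1, 0.1],  # highly specific to cell type 0
    [0.5, 0.5, 0.5],  # uninformative
    [0.2, 0.8, 0.3],  # fairly specific to cell type 1
    [0.6, 0.4, 0.2],  # weakly specific
]
selected, scores = rank_specific_cpgs(toy_ref_met, top_n=2)
print(selected, scores)
```

Sites with a flat methylation profile get a score near zero and fall to the bottom of the ranking, which is the behaviour the feature selection relies on.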
For single-cell RNA:
- Identify marker genes for cell types based on differential expression analysis
- Create pseudo-bulk data
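As an illustration of the pseudo-bulk step, the sketch below sums single-cell counts over all cells of the same type. Summation is an assumption suggested by the file name peng_pseudo_bulk_sum.csv; the function name `make_pseudo_bulk` and the toy data are hypothetical, and the project itself prepared this data in R with Seurat.

```python
import numpy as np

def make_pseudo_bulk(counts, cell_labels):
    """Sum single-cell counts over the cells of each cell type.

    counts: array of shape (n_genes, n_cells).
    cell_labels: sequence of length n_cells with one cell-type label per cell.
    Returns (cell_types, pseudo_bulk), pseudo_bulk has shape (n_genes, n_types).
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(cell_labels)
    cell_types = sorted(set(labels.tolist()))
    # One column per cell type: sum of the columns belonging to that type.
    pseudo_bulk = np.column_stack(
        [counts[:, labels == t].sum(axis=1) for t in cell_types]
    )
    return cell_types, pseudo_bulk

# Toy data: 2 genes x 4 cells, two cell types.
toy_counts = [[1, 2, 3, 4],
              [0, 1, 0, 5]]
toy_labels = ["T", "B", "T", "B"]
types, pb = make_pseudo_bulk(toy_counts, toy_labels)
print(types)
print(pb)
```

The resulting matrix has the same layout as a bulk reference (features by cell types), so it can be passed to the same deconvolution routine.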
Deconvolution: we ran SCDC + NNLS and an Ensemble method (this was abandoned, however, as it achieved poor results).
Unimodal predictions:
- a) NNLS applied separately to bulk RNA, bulk methylation, and pseudo-bulk created from single-cell RNA-seq
- b) Mixing bulk RNA and methylation by changing the reference to (1 - bulk_methylation) * bulk_RNA, computed element-wise
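The two unimodal variants above can be sketched as follows, using `scipy.optimize.nnls`. The helper names `estimate_proportions` and `combined_reference` and the toy matrices are assumptions made for illustration; the sketch also assumes that the rows of the RNA and methylation references are already aligned to the same genes.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(mix, ref):
    """Estimate cell-type proportions sample by sample with NNLS.

    mix: (n_features, n_samples) mixture matrix.
    ref: (n_features, n_cell_types) reference matrix.
    Returns an (n_cell_types, n_samples) matrix whose columns sum to 1.
    """
    mix = np.asarray(mix, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Solve min ||ref @ x - mix[:, j]|| subject to x >= 0 for each sample j.
    props = np.column_stack(
        [nnls(ref, mix[:, j])[0] for j in range(mix.shape[1])]
    )
    totals = props.sum(axis=0)
    totals[totals == 0] = 1.0  # avoid dividing by zero for all-zero solutions
    return props / totals

def combined_reference(ref_met, ref_rna):
    """Element-wise (1 - methylation) * RNA reference, rows aligned by gene."""
    return (1.0 - np.asarray(ref_met, dtype=float)) * np.asarray(ref_rna, dtype=float)

# Toy example: 3 genes, 2 cell types, 1 sample mixed at 30% / 70%.
toy_ref_rna = np.array([[10.0, 1.0], [2.0, 8.0], [5.0, 5.0]])
toy_ref_met = np.array([[0.2, 0.9], [0.5, 0.5], [1.0, 0.0]])
toy_mix = toy_ref_rna @ np.array([[0.3], [0.7]])

print(estimate_proportions(toy_mix, toy_ref_rna))
print(combined_reference(toy_ref_met, toy_ref_rna))
```

Normalising each column to sum to 1 is a common convention for proportion estimates; whether the submitted script applies exactly this normalisation is not stated here.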
Multimodal predictions:
- All unimodal predictions from (a) combined with an Ensemble method. This allowed using the intersection of the genes mapped from CpG sites, the bulk RNA genes, and the scRNA genes.
The notebook cellType_specific_CpGmet.ipynb shows some preliminary analysis of the data. It shows how genes are clustered based on their expression, and extracts the genes that have the most distinct methylation across the 5 cell types.
The script single_cell_preprocessing.R reads the single-cell reference data, then creates and processes a Seurat object for each single-cell dataset. Differential expression is computed, and the markers for each cell type are obtained, as well as the final gene list.
The script meth_data_analysis.R installs the annotation package IlluminaHumanMethylation450kanno.ilmn12.hg19 for Illumina's 450K methylation arrays. After loading this annotation, it:
- Reads the reference methylation data
- Filters the data to keep only CpG islands from ref_met, which reduces computation time
- Filters further to keep only CpG islands matching a gene name from a bulk reference
The notebook meth_rna_mapping.ipynb produces a mapping between CpG sites and UCSC RefGene names. The mapping is saved as mapping_meth_rna.csv.
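A mapping of this kind can be built roughly as shown below. The sketch assumes that the annotation stores gene names per CpG site as a semicolon-separated string, possibly with repeated names, as in the UCSC_RefGene_Name field of the 450K annotation. The function name, the toy identifiers, and the (CpG, gene) pair layout are hypothetical; the actual column layout of mapping_meth_rna.csv is not described in this document.

```python
def build_cpg_gene_mapping(annotation):
    """Expand a CpG -> "GENE1;GENE2;..." annotation into unique (CpG, gene) pairs.

    annotation: dict mapping a CpG identifier to a semicolon-separated string
    of gene names. Sites without any gene name are dropped.
    """
    pairs = []
    for cpg, names in annotation.items():
        # Deduplicate repeated gene names and skip empty entries.
        genes = sorted({g for g in names.split(";") if g})
        for gene in genes:
            pairs.append((cpg, gene))
    return pairs

# Toy annotation with placeholder identifiers (not real probes or genes).
toy_annotation = {
    "cg_toy_1": "GENE_A",
    "cg_toy_2": "GENE_C;GENE_B;GENE_C",
    "cg_toy_3": "",
}
mapping = build_cpg_gene_mapping(toy_annotation)
print(mapping)
```

A single CpG site can map to several genes, so the output has one row per (site, gene) pair rather than one row per site.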
The main model submitted to the Codabench platform is submission_script.py, a Python pipeline that estimates the proportions of the different components in a biological mixture from RNA and methylation data. The program integrates multiple data modalities and combines their estimates. The main steps are listed below:
- Data Alignment and Filtering:
  - Aligns RNA and methylation datasets based on shared features.
  - Filters features for variability.
- Proportion Estimation:
  - Uses Non-Negative Least Squares (NNLS) to estimate component proportions in mixtures.
  - Supports RNA, methylation, and pseudo-bulk RNA datasets.
- Optimal Weighting:
  - Combines results from RNA and methylation datasets.
  - Finds optimal weights to minimize RMSE using additionnal_script.py.
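The weighting step can be illustrated with a simple grid search. This is a sketch under stated assumptions and not a description of what additionnal_script.py does: it assumes a single weight `w` shared by all samples, a convex combination `w * RNA + (1 - w) * methylation`, and access to ground-truth proportions (for example from a simulated dataset). The names `rmse` and `best_weight` are hypothetical.

```python
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error between two proportion matrices."""
    diff = np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def best_weight(pred_rna, pred_met, truth, n_steps=101):
    """Grid-search the weight w in [0, 1] that minimises the RMSE of
    w * pred_rna + (1 - w) * pred_met against the known proportions."""
    pred_rna = np.asarray(pred_rna, dtype=float)
    pred_met = np.asarray(pred_met, dtype=float)
    weights = np.linspace(0.0, 1.0, n_steps)
    errors = [rmse(w * pred_rna + (1.0 - w) * pred_met, truth) for w in weights]
    best = int(np.argmin(errors))
    return float(weights[best]), errors[best]

# Toy example: 2 cell types, 1 sample, true proportions 30% / 70%.
toy_truth = [[0.3], [0.7]]
toy_pred_rna = [[0.5], [0.5]]
toy_pred_met = [[0.2], [0.8]]
w, err = best_weight(toy_pred_rna, toy_pred_met, toy_truth, n_steps=4)
print(w, err)
```

Because the weight is tuned against known proportions, it should be selected on data that is separate from the data used to report the final score.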
Prepare Input Data:
- RNA and methylation mixture data (mix_rna, mix_met).
- RNA and methylation reference data (ref_rna, ref_met).
- Mapping file linking RNA and methylation features (mapping_meth_rna.csv).
- Pseudo-bulk RNA data (peng_pseudo_bulk_sum.csv).
Run the program:
python submission_script.py
Note: Ensure proper preprocessing of input data for accurate results.
Codabench Platform [2]
Competition website: https://www.codabench.org/competitions/4714/
The best deconvolution results: Methylation
Approach: NNLS + feature selection (CpG-to-gene mapping, cell-type-specific methylation) (Python)
Codabench score = 0.66
The decomposition of scores for 9 validation datasets:
- It is difficult to conclude which deconvolution algorithm performs best, given the limited time available in the competition.
- Feature selection is key to improving performance.
- Given our team's expertise in Python, we preferred Python over R (the InMoose Python package, developed by Epigene Labs, is a good example).
- Test different pre-processing methods (noise reduction, batch correction, integration of single-cell data, TMM or DESeq2 normalisation)
- Improve Ensemble methods for late integration
- Use M-values for methylation instead of beta-values
- Test feature selection based on prior biological knowledge
- Gene set enrichment analysis using MSigDB and epigenetic hallmarks
- Use gene signatures allowing molecular classification of the cancer (and predict based on subtypes)
- Generalize pipeline to other cancer types
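For the M-value idea listed above, the standard conversion from beta-values is M = log2(beta / (1 - beta)). The short sketch below shows it in Python; the function name `beta_to_m` and the clipping constant `eps` are assumptions added to keep the logarithm finite at beta = 0 or 1.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert methylation beta-values in [0, 1] to M-values.

    Betas are clipped to [eps, 1 - eps] so that the log ratio stays finite.
    """
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

print(beta_to_m([0.5, 0.8, 0.2]))
```

M-values are unbounded and roughly symmetric around 0, which is why they are often preferred for statistical modelling, while beta-values remain easier to interpret biologically.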