Challenge: Multimodal data integration to quantify tumor heterogeneity in cancer

Hadaca3 Data Challenge

The Data Challenge took place on December 2-6, 2024 in Aussois in France.

The detailed description of the challenge can be found here: https://hadaca3.sciencesconf.org/

Contributors

Team B

Solène Weill ([email protected])
Jędrzej Kubica ([email protected])
Vesna Lukic ([email protected])
Guillaume Appe ([email protected])

Overview

The aim of the project was to design and develop a bioinformatic workflow to quantify pancreatic tumor heterogeneity using supervised deconvolution methods and multi-omics data. Previous studies have introduced various deconvolution methods¹, but a number of challenges persist in the field. The first challenge was the integration of multi-omics data (RNA-seq, single-cell RNA-seq, and DNA methylation) as a reference for the deconvolution process; the second was the selection and combination of the best deconvolution software packages. The project results were scored and compared with other approaches on the Codabench platform².

This documentation provides comprehensive details of our contribution, which focused on performing cell-type deconvolution using bulk RNA and methylation data, and trying both uni- and multi-modal predictions.

Workflow

Data

The challenge was split into three phases as follows:

Phase 1: Discovery of the data and the Codabench platform
Phase 2: Estimation of cell type heterogeneity and submissions of methods/results into the platform
Phase 3: Migration of the best methods from phase 2 and their evaluation

Data was provided in the first two phases. Phase 1 data consisted of one simulated multi-omic dataset having an in-silico mixture of 5 cell types with explicit dependence between genes/CpG probes. Phase 2 data consisted of xxxxx.

Methods

Our approach consisted of feature selection followed by cell type deconvolution.

Feature selection

Feature selection is performed on both the methylation and the single-cell RNA data.

For methylation data:

Begin with full bulk methylation data
Map CpG islands to genes from bulk RNA
Find cell-type-specific methylation sites by a) ranking CpG islands by the difference between the maximum methylation across all cell types and the mean of the remaining cell types, and b) selecting the top CpG islands based on a threshold (minimum of mean methylation across all cell types)
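The ranking in step a) can be sketched as follows. This is a minimal illustration, not the code from the repository: the function name `rank_specific_cpgs`, the `top_n` cut-off, and the toy table are assumptions, and the threshold rule from step b) is replaced here by a plain top-N selection.

```python
import pandas as pd

def rank_specific_cpgs(ref_met: pd.DataFrame, top_n: int = 2) -> pd.Index:
    """Rank CpGs by (max over cell types) - (mean of the remaining cell types).

    ref_met: rows = CpG sites, columns = cell types, values = beta values.
    top_n is a hypothetical cut-off used only for this sketch.
    """
    row_max = ref_met.max(axis=1)
    n_types = ref_met.shape[1]
    # Mean over all cell types except the one holding the maximum.
    rest_mean = (ref_met.sum(axis=1) - row_max) / (n_types - 1)
    specificity = (row_max - rest_mean).sort_values(ascending=False)
    return specificity.index[:top_n]

# Toy reference with three CpGs and three cell types (made-up values).
ref_met = pd.DataFrame(
    {"A": [0.9, 0.5, 0.1], "B": [0.1, 0.5, 0.2], "C": [0.1, 0.5, 0.8]},
    index=["cg1", "cg2", "cg3"],
)
selected = rank_specific_cpgs(ref_met, top_n=2)
```

In this toy table `cg2` is identical across cell types, so it carries no cell-type signal and is ranked last.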

For single-cell RNA:

Identify marker genes for cell types based on differential expression analysis
Create pseudo-bulk data
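One common way to build pseudo-bulk profiles is to sum the counts of all cells that share a cell type label. The sketch below assumes a genes x cells count table and a per-cell label series; the names `make_pseudo_bulk`, `counts`, and `cell_types` are illustrative and do not come from the repository.

```python
import pandas as pd

def make_pseudo_bulk(counts: pd.DataFrame, cell_types: pd.Series) -> pd.DataFrame:
    """Sum single-cell counts per cell type.

    counts: rows = genes, columns = cell ids.
    cell_types: cell type label for each cell, indexed by cell id.
    Returns a genes x cell types table.
    """
    labels = cell_types.reindex(counts.columns)
    # Transpose so cells are rows, group cells by label, sum, transpose back.
    return counts.T.groupby(labels).sum().T

# Toy data: two genes, three cells, two cell types (made-up values).
counts = pd.DataFrame(
    {"c1": [1, 0], "c2": [2, 1], "c3": [0, 5]}, index=["g1", "g2"]
)
cell_types = pd.Series({"c1": "T", "c2": "T", "c3": "B"})
pseudo_bulk = make_pseudo_bulk(counts, cell_types)
```

Summing keeps the result on the count scale; averaging per cell type would be an alternative if cell numbers differ strongly between types.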

Cell type deconvolution

Deconvolution: running SCDC + NNLS and an Ensemble method (this was abandoned, however, as it achieved poor results)

Unimodal predictions:

a) NNLS on bulk RNA, bulk methylation, and pseudo-bulk created from single-cell RNA-seq, each separately
b) Mixing bulk RNA and methylation by changing the reference to (1 - bulk_methylation) * bulk_RNA, element-wise
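The two unimodal variants can be illustrated with `scipy.optimize.nnls`. This is a sketch under assumptions: the helper names `nnls_proportions` and `combined_reference` are hypothetical, the normalisation of coefficients to sum to 1 is our choice for the example, and variant b) presumes that the RNA and methylation references are already aligned on the same genes.

```python
import numpy as np
from scipy.optimize import nnls

def nnls_proportions(mix: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Estimate cell type proportions per sample with NNLS.

    mix: features x samples, ref: features x cell types.
    Returns cell types x samples, each column scaled to sum to 1.
    """
    props = np.zeros((ref.shape[1], mix.shape[1]))
    for j in range(mix.shape[1]):
        coef, _ = nnls(ref, mix[:, j])
        total = coef.sum()
        props[:, j] = coef / total if total > 0 else coef
    return props

def combined_reference(ref_rna: np.ndarray, ref_met: np.ndarray) -> np.ndarray:
    """Variant b): element-wise (1 - methylation) * RNA on aligned matrices."""
    return (1.0 - ref_met) * ref_rna

# Toy check: a noise-free mixture of three cell types (random reference).
rng = np.random.default_rng(0)
ref = rng.uniform(0.1, 1.0, size=(20, 3))
true_props = np.array([0.2, 0.3, 0.5])
mix = (ref @ true_props).reshape(-1, 1)
estimated = nnls_proportions(mix, ref)
```

On noise-free data the estimate should match `true_props` closely; on real mixtures the quality depends heavily on which features are kept in the reference.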

Multimodal predictions:

All unimodal predictions from a) + Ensemble method. This allowed using the intersection of the genes mapped from CpG sites, the bulk RNA genes, and the scRNA genes.

Exploratory data analysis

The notebook cellType_specific_CpGmet.ipynb shows some preliminary analysis of the data, including how genes cluster based on their expression.

We extract the genes whose methylation is most distinct across the 5 cell types.

Pre-processing

The script single_cell_preprocessing.R reads the single cell reference data, then creates and processes a Seurat object for each single cell dataset. The differential expression is computed, and the markers for each cell type are obtained, as well as the final gene list.

meth_data_analysis.R installs the annotation package IlluminaHumanMethylation450kanno.ilmn12.hg19 for Illumina's 450K methylation arrays. After loading the annotation, it performs the following steps:

Read the reference methylation data
Filter the data to keep only CpG islands from ref_met, which reduces computation time
Filter further to keep only CpG islands matching a gene name from the bulk reference

meth_rna_mapping.ipynb produces a mapping between CpG sites and UCSC RefGene names. The mapping is saved as mapping_meth_rna.csv.
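A CpG-to-gene mapping of this kind can be sketched with pandas. The example assumes an annotation table indexed by CpG id with a `UCSC_RefGene_Name` column holding semicolon-separated gene names, as in the Illumina 450K annotation; the function name `build_cpg_gene_mapping` and the output column names `cpg` and `gene` are illustrative and may differ from the actual layout of mapping_meth_rna.csv.

```python
import pandas as pd

def build_cpg_gene_mapping(anno: pd.DataFrame, genes: set) -> pd.DataFrame:
    """Return one row per (CpG, gene) pair, restricted to the given genes.

    anno: indexed by CpG id, with a 'UCSC_RefGene_Name' column containing
    semicolon-separated gene names (possibly repeated or missing).
    """
    names = anno["UCSC_RefGene_Name"].fillna("").str.split(";")
    mapping = (
        names.explode()          # one gene name per row
        .rename("gene")
        .rename_axis("cpg")
        .reset_index()
        .drop_duplicates()       # a gene can be listed several times per CpG
    )
    return mapping[mapping["gene"].isin(genes)].reset_index(drop=True)

# Toy annotation (made-up CpG ids) and a toy bulk RNA gene list.
anno = pd.DataFrame(
    {"UCSC_RefGene_Name": ["TP53;TP53", "KRAS", None]},
    index=["cg1", "cg2", "cg3"],
)
mapping = build_cpg_gene_mapping(anno, {"TP53", "KRAS", "CDKN2A"})
# mapping.to_csv("mapping_meth_rna.csv", index=False) would persist the table.
```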

Model

The main model submitted to the Codabench platform is submission_script.py, a Python-based pipeline for estimating the proportions of different components in a biological mixture from RNA and methylation data. The program integrates multiple data modalities and combines their predictions for proportion estimation. The main steps are listed below:

Data Alignment and Filtering:
- Aligns RNA and methylation datasets based on shared features.
- Filters features for variability.
Proportion Estimation:
- Uses Non-Negative Least Squares (NNLS) to estimate component proportions in mixtures.
- Supports RNA, methylation, and pseudo-bulk RNA datasets.
Optimal Weighting:
- Combines results from RNA and methylation datasets.
- Finds optimal weights to minimize RMSE using additionnal_script.py.
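The weighting step can be sketched as a one-dimensional grid search. This is an illustration under assumptions, not the content of additionnal_script.py: it presumes that ground-truth proportions are available for some dataset, that the two predictions are combined as a convex combination with a single weight, and the names `rmse` and `best_weight` are hypothetical.

```python
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root mean squared error between two proportion matrices."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def best_weight(p_rna, p_met, truth, n_grid: int = 101):
    """Grid-search w in [0, 1] minimising RMSE of w*p_rna + (1-w)*p_met."""
    weights = np.linspace(0.0, 1.0, n_grid)
    errors = [rmse(w * p_rna + (1.0 - w) * p_met, truth) for w in weights]
    i = int(np.argmin(errors))
    return float(weights[i]), errors[i]

# Toy example (made-up values): the RNA prediction errs by +e and the
# methylation prediction by -3e, so the errors cancel at w = 0.75.
truth = np.array([[0.2, 0.5], [0.8, 0.5]])
e = np.array([[0.05, -0.05], [-0.05, 0.05]])
p_rna = truth + e
p_met = truth - 3 * e
w, err = best_weight(p_rna, p_met, truth)
```

A weight tuned this way is only as reliable as the ground truth it was fitted on, so it should be checked on data that was not used for the search.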

Usage

Prepare Input Data:

RNA and methylation mixture data (mix_rna, mix_met).
RNA and methylation reference data (ref_rna, ref_met).
Mapping file linking RNA and methylation features (mapping_meth_rna.csv).
Pseudo-bulk RNA data (peng_pseudo_bulk_sum.csv).

Run the program: python submission_script.py

Note: Ensure proper preprocessing of input data for accurate results.

Codabench-Platform²

Competition website: https://www.codabench.org/competitions/4714/

Results

The best deconvolution results: methylation

Approach: NNLS + feature selection (CpG to gene mapping, cell-type-specific methylation) (Python)

Codabench score = 0.66

[Figure: decomposition of scores for the 9 validation datasets]

Conclusions

It is difficult to draw conclusions about which deconvolution algorithm performs best, given the limited time available in the competition.
Feature selection is the key to improving performance.
Given our team's expertise in Python, we preferred Python over R (the InMoose Python package, developed by Epigene Labs, is a good example).

Future ideas

Test different pre-processing methods (reduce noise, batch correction and integrate single cell data, TMM or DESeq2 normalisation)
Improve Ensemble methods for late integration
Use M-values for methylation instead of beta-values
Test feature selection with biological a priori
- Gene set enrichment analysis using MSigDB and epigenetic hallmarks
- Use gene signatures that allow molecular classification of the cancer (and predict based on subtypes)
Generalize pipeline to other cancer types
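For the M-value idea in the list above, the standard conversion is M = log2(beta / (1 - beta)). The sketch below is a minimal helper; the function name `beta_to_m` and the clipping constant `eps` are assumptions introduced to avoid infinite values at beta = 0 or 1.

```python
import numpy as np

def beta_to_m(beta, eps: float = 1e-6) -> np.ndarray:
    """Convert methylation beta-values to M-values: log2(beta / (1 - beta)).

    Values are clipped to [eps, 1 - eps] so that the logarithm stays finite.
    """
    beta = np.clip(np.asarray(beta, dtype=float), eps, 1.0 - eps)
    return np.log2(beta / (1.0 - beta))

m_values = beta_to_m([0.2, 0.5, 0.8])
```

Because M-values can be negative and mixtures are linear on the beta scale rather than the M scale, they may be better suited to feature selection than to direct use in NNLS; this would need to be tested.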

Special thank you to the Data Challenge Organizers!

References

¹ Epigenomic Deconvolution of Breast Tumors Reveals Metabolic Coupling between Constituent Cell Types. DOI: 10.1016/j.celrep.2016.10.057
² Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. DOI: 10.1016/j.patter.2022.100543
