_viash.yaml

name: task_perturbation_prediction
label: Perturbation Prediction
summary: Predicting how small molecules change gene expression in different cell types.
description: |
  Human biology can be complex, in part due to the function and interplay of the body's
  approximately 37 trillion cells, which are organized into tissues, organs, and systems.
  However, recent advances in single-cell technologies have provided unparalleled insight
  into the function of cells and tissues at the level of DNA, RNA, and proteins. Yet
  leveraging single-cell methods to develop medicines requires mapping causal links
  between chemical perturbations and the downstream impact on cell state. These experiments
  are costly and labor intensive, and not all cells and tissues are amenable to
  high-throughput transcriptomic screening. If data science could help accurately predict
  chemical perturbations in new cell types, it could accelerate and expand the development
  of new medicines.

  Several methods have been developed for drug perturbation prediction, most of which are
  variations on the autoencoder architecture (Dr.VAE, scGEN, and ChemCPA). However, these
  methods lack proper benchmarking datasets with diverse cell types to determine how well
  they generalize. The largest available training dataset is the NIH-funded Connectivity
  Map (CMap), which comprises over 1.3M small molecule perturbation measurements. However,
  the CMap includes observations of only 978 genes, less than 5% of all genes. Furthermore,
  the CMap data is comprised almost entirely of measurements in cancer cell lines, which
  may not accurately represent human biology.

  This task aims to predict how small molecules change gene expression in different cell
  types. This task was a [Kaggle competition](https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/overview)
  as part of the [NeurIPS 2023 competition track](https://neurips.cc/virtual/2023/competition/66586).

  The task is to predict the gene expression profile of a cell after a small molecule
  perturbation. For this competition, we designed and generated a novel single-cell
  perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We
  selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures
  (LINCS) Connectivity Map dataset ([PMID: 29195078](https://pubmed.ncbi.nlm.nih.gov/29195078/))
  and measured single-cell gene
  expression profiles after 24 hours of treatment. The experiment was repeated in three
  healthy human donors, and the compounds were selected based on diverse transcriptional
  signatures observed in CD34+ hematopoietic stem cells (data not released). We performed
  this experiment in human PBMCs because the cells are commercially available with
  pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue
  that contains multiple mature cell types (including T-cells, B-cells, myeloid cells,
  and NK cells) with established markers for annotation of cell types. To supplement this
  dataset, we also measured cells from each donor at baseline with joint scRNA and
  single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope
  that the addition of rich multi-omic data for each donor and cell type at baseline will
  help establish biological priors that explain the susceptibility of particular genes to
  exhibit perturbation responses in difference biological contexts.

version: dev
license: MIT
keywords: [single-cell, perturbation prediction, perturbation, benchmark]
links:
  issue_tracker: https://github.com/openproblems-bio/task_perturbation_prediction/issues
  repository: https://github.com/openproblems-bio/task_perturbation_prediction
  docker_registry: ghcr.io
references:
  bibtex: |
    @article{slazata2024benchmark,
      title = {A benchmark for prediction of transcriptomic responses to chemical perturbations across cell types},
      author = {Artur Szałata and Andrew Benz and Robrecht Cannoodt and Mauricio Cortes and Jason Fong and Sunil Kuppasani and Richard Lieberman and Tianyu Liu and Javier A. Mas-Rosario and Rico Meinl and Jalil Nourisa and Jared Tumiel and Tin M. Tunjic and Mengbo Wang and Noah Weber and Hongyu Zhao and Benedict Anchang and Fabian J Theis and Malte D Luecken and Daniel B Burkhardt},
      booktitle = {The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
      year = {2024},
      url = {https://openreview.net/forum?id=WTI4RJYSVm}
    }

authors:
  - name: Artur Szałata
    roles: [ author ]
    info:
      github: szalata
      orcid: "000-0001-8413-234X"
  - name: Robrecht Cannoodt
    roles: [ author ]
    info:
      github: rcannood
      orcid: "0000-0003-3641-729X"
  - name: Daniel Burkhardt
    roles: [ author ]
    info:
      github: dburkhardt
      orcid: 0000-0001-7744-1363
  - name: Malte D. Luecken
    roles: [ author ]
    info:
      github: LuckyMD
      orcid: 0000-0001-7464-7921
  - name: Tin M. Tunjic
    roles: [ contributor ]
    info:
      github: ttunja
      orcid: 0000-0001-8842-6548
  - name: Mengbo Wang
    roles: [ contributor ]
    info:
      github: wangmengbo
      orcid: 0000-0002-0266-9993
  - name: Andrew Benz
    roles: [ author ]
    info:
      github: andrew-benz
      orcid: 0009-0002-8118-1861
  - name: Tianyu Liu
    roles: [ contributor ]
    info:
      github: HelloWorldLTY
      orcid: 0000-0002-9412-6573
  - name: Jalil Nourisa
    roles: [ contributor ]
    info:
      github: janursa
      orcid: 0000-0002-7539-4396
  - name: Rico Meinl
    roles: [ contributor ]
    info:
      github: ricomnl
      orcid: 0000-0003-4356-6058


# technical settings
organization: openproblems-bio
viash_version: 0.9.0
info:
  test_resources:
    - type: s3
      path: s3://openproblems-data/resources/task_perturbation_prediction/datasets
      dest: resources/datasets

# set default labels
config_mods: |
  .runners[.type == "nextflow"].config.labels := { lowmem : "memory = 20.Gb", midmem : "memory = 50.Gb", highmem : "memory = 100.Gb", lowcpu : "cpus = 5", midcpu : "cpus = 15", highcpu : "cpus = 30", lowtime : "time = 1.h", midtime : "time = 4.h", hightime : "time = 8.h", veryhightime : "time = 24.h" }