## Task

A benchmarking task to evaluate the performance of different methods in solving a specific problem when analysing omics data. A task typically consists of a dataset processor, methods, control methods and metrics. Each component has a well-defined input-output interface, for which the file formats of the corresponding AnnData objects are also described.

The source code of a task is located in `src/tasks/<task_id>` and is structured as follows:

* `api/task_info.yaml`: Contains metadata about the task.
* `api/comp_*.yaml`: Files defining component interfaces.
* `api/file_*.yaml`: Files specifying file formats.
* `dataset_processor/`: Converts common datasets into task-specific datasets.
* `methods/`: Implements methods to solve the task.
* `control_methods/`: Tests and controls the quality of other methods.
* `metrics/`: Metrics for evaluating method performance.
* `workflows/`: Nextflow workflow for benchmarking tasks.
* `resources_scripts/`: Scripts to execute workflows.
* `resources_test_scripts/`: Scripts to create test resources.

See the [reference documentation](/documentation/reference/openproblems/src-task_id.qmd) for more information.

## Metric

A metric is a quantitative measure used to evaluate the performance of different methods in solving a specific problem in single-cell omics data analysis.

## Control method

A control method is used to test the relative performance of all other methods, and also serves as a quality control for the pipeline as a whole. A control method can either be a positive control or a negative control. The positive and negative control methods set a maximum and minimum threshold for performance, so any new method should perform better than the negative control methods and worse than the positive control methods.

## Method

A method is a computational tool that can be used to solve a specific problem in single-cell omics data analysis.

## AnnData

AnnData, short for "Annotated Data", is a file format for handling annotated,
high-dimensional biological data [@virshup2021anndataannotateddata]. It is
a standard data format in the single-cell community, and is supported by
many single-cell analysis tools, including Scanpy and CellxGene.

AnnData objects have a structured format that includes 
the main data matrix (`X`, e.g. gene expression values), 
annotations of observations (`obs`, e.g. cell metadata),
annotations of variables (`var`, e.g. gene metadata),
and unstructured annotations (`uns`).
This organization makes it easy to work with complex datasets while maintaining
data integrity and ensuring a standardized structure across different components.

![Overview of the different data structures inside an AnnData object.](/documentation/images/anndata.svg){#fig-anndata-format width="60%"}
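
As a minimal sketch of how these slots fit together, such an object can be constructed directly in Python (the cell and gene names below are made up for illustration):

```python
import anndata as ad
import numpy as np
import pandas as pd

# Toy example with 3 cells and 2 genes; all names and values are illustrative.
adata = ad.AnnData(
    X=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),   # main data matrix
    obs=pd.DataFrame(
        {"cell_type": ["B cell", "T cell", "B cell"]},   # cell metadata
        index=["cell_1", "cell_2", "cell_3"],
    ),
    var=pd.DataFrame(
        {"feature_name": ["GeneA", "GeneB"]},            # gene metadata
        index=["gene_1", "gene_2"],
    ),
    uns={"dataset_id": "toy_dataset"},                   # unstructured metadata
)

print(adata)  # prints an overview of the populated slots
```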

Files with the `.h5ad` extension represent AnnData objects stored in an HDF5 file.
AnnData objects can be opened in Python using the
[`anndata.read_h5ad()`](https://anndata.readthedocs.io/en/latest/generated/anndata.read_h5ad.html#anndata.read_h5ad)
function, and in R using the
[`anndata::read_h5ad()`](https://anndata.dynverse.org/reference/read_h5ad.html)
function. Technically it can be read in any language using an HDF5 library.
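
For example, in Python (using a placeholder file name):

```python
import anndata as ad

# "dataset.h5ad" is a placeholder; substitute the path to any AnnData file.
adata = ad.read_h5ad("dataset.h5ad")

print(adata)             # which slots are populated
print(adata.obs.head())  # cell-level annotations
print(adata.var.head())  # gene/feature-level annotations
```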

## Viash

Viash is a meta-framework for creating modular [Nextflow](#nextflow) workflows from [Viash components](#viash-component) [@cannoodt2021viashfromscripts]. It allows developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.

Specific benefits of Viash include:

* **Reproducible**: Viash generates a Docker container for each component, ensuring that the component can be run in a reproducible environment.

* **Modular**: Nextflow modules generated by Viash are more reusable and modular than typical Nextflow modules, since default parameter values and default directives can be overwritten by the user at runtime.

* **Robust**: Viash allows for easy unit testing of a component's functionality.

* **Less boilerplate**: Viash components are easier to write than typical Nextflow modules, since Viash takes care of a lot of the boilerplate Nextflow code (such as parsing and validating input sheets, generating a CLI, generating documentation).

## Viash component

A Viash component is a combination of an R or Python script and a small amount of metadata that makes it easy to generate pipeline modules, facilitating the separation of component functionality from the pipeline workflow [@cannoodt2021viashfromscripts]. This enables developers to create reusable, modular, and robust components for OpenProblems, focusing on the specific functionality without having to worry about the chosen pipeline framework.

A Viash component consists of three main parts: a Viash config, a script, and one or more unit tests.

![Viash supports robust pipeline development by allowing users to build their component as a standalone executable (with auto-generated CLI), build a Docker container to run the script inside, or turn the component into a standalone Nextflow module.](/documentation/images/viash_figure_2.svg){#fig-viash-runtime width="80%"}
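
As a sketch of what the script part of a Python-based component might look like (the `## VIASH START`/`## VIASH END` block and the `par` dictionary follow the Viash convention for injecting arguments; the argument names and file paths below are hypothetical):

```python
import anndata as ad

## VIASH START
# Placeholder values used when running the script directly during development;
# at runtime, Viash replaces this block with the actual arguments.
par = {
    "input": "dataset.h5ad",
    "output": "output.h5ad",
}
## VIASH END

adata = ad.read_h5ad(par["input"])

# ... the component-specific functionality goes here ...

adata.write_h5ad(par["output"])
```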

## Nextflow

A workflow management system that enables the development of portable and reproducible workflows.

## Dataset

A dataset is one or more [AnnData](#anndata) objects that are used as input for a benchmarking task. To ensure interoperability between components, each component has a strict [file format specification](#file-format-specification) for validating whether an input or output AnnData object is valid.

OpenProblems offers a collection of datasets that can be used to test components and run the benchmarking tasks. Raw datasets are generated by [dataset loaders](#dataset-loader) and processed into [common datasets](#common-datasets) by a [dataset processing workflow](#dataset-processing-workflow). Testing resources are typically subsampled versions of the common datasets.

Testing resources are stored at `s3://openproblems-bio/resources_test/`, while the processed datasets are stored at `s3://openproblems-bio/resources/`.

## Raw dataset

An unprocessed dataset as generated by a [dataset loader](#dataset-loader). The [file format specification](#file-format-specification) for raw datasets is based on the [CELLxGENE schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md) and is stored at [`src/datasets/api/file_raw.yaml`](https://github.com/openproblems-bio/openproblems/blob/main/src/datasets/api/file_raw.yaml).

## Dataset loader

A Viash component that downloads a dataset and stores it as an AnnData file.

## Common dataset

An [AnnData](#anndata) object that follows the [common dataset format](#common-dataset-format). A common dataset is generated by the [dataset processing workflow](#dataset-processing-workflow) and is used as input for multiple benchmarking tasks.

Common datasets are stored at `s3://openproblems-bio/resources/datasets/` and are processed into task-specific AnnData objects by [dataset processors](#dataset-processor) and subsequently stored at `s3://openproblems-bio/resources/<task_id>`.

For a complete list of available common datasets, see the [dataset overview](/dataset) page.

## Common dataset format

The format of common datasets is based on the [CELLxGENE schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md) along with additional metadata that is specific to OpenProblems (in the `.uns` slot) and some additional output generated by our dataset preprocessors (in the `.layers`, `.obsm`, `.obsp` and `.varm` slots).

Here is what a typical common dataset looks like when printed to the console:

    AnnData object
      obs: 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', ...
      var: 'feature_id', 'feature_name', 'soma_joinid', 'hvg', 'hvg_score'
      obsm: 'X_pca'
      obsp: 'knn_distances', 'knn_connectivities'
      varm: 'pca_loadings'
      layers: 'counts', 'normalized'
      uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', ...

Some slots might not be available depending on the origin of the dataset. Please visit the [reference documentation](/documentation/reference/openproblems/src-datasets.html#file-format-common-dataset) for a detailed description of the available slots and their purpose.
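
As a sketch of how these slots could be accessed from Python (the file path is a placeholder for a downloaded common dataset):

```python
import anndata as ad

# Placeholder path; in practice this would be a common dataset obtained from
# s3://openproblems-bio/resources/datasets/
adata = ad.read_h5ad("common_dataset.h5ad")

counts = adata.layers["counts"]          # raw counts
normalized = adata.layers["normalized"]  # normalized expression values
embedding = adata.obsm["X_pca"]          # PCA embedding
hvg = adata.var["hvg"]                   # highly variable gene flags
dataset_id = adata.uns["dataset_id"]     # OpenProblems-specific metadata
```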

## Dataset processor

A Viash component that converts a common dataset into task-specific dataset objects.

## Test resources

Test resources are a set of files located at `s3://openproblems-bio/resources_test/` that are used to test components and workflows. Test resources are typically subsampled versions of the [common datasets](#common-datasets).

## Dataset processing workflow

A workflow that processes a [raw dataset](#raw-dataset) into a [common dataset](#common-dataset). See the [reference](/documentation/reference/openproblems/src-datasets.qmd) for more information.

## File format specification

A file format specification is a metadata file which describes the expected structure of an AnnData object. This is used to verify whether an AnnData object is valid and to automatically generate documentation for tasks and the components therein. File format specification files are typically stored in `src/**/api/file_*.yaml`.

## Component interface

A component interface is a metadata file which describes the expected inputs and outputs of a component. This is used to verify whether a component is valid and to automatically generate documentation for tasks and the components therein. Component interface files are typically stored in `src/**/api/comp_*.yaml`.

## Docker

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package an application with all of the parts it needs, such as libraries and other dependencies, and ship it as a single package. This assures the developer that the application will run on any other Linux machine, regardless of customized settings that might differ from the machine used for writing and testing the code.

## CELLxGENE

A cloud-based library of single-cell RNA-seq datasets, developed by the Chan Zuckerberg Initiative. It provides a user-friendly interface for exploring and retrieving single-cell RNA-seq data, and is widely used in the single-cell community.

OpenProblems uses the CELLxGENE Census to retrieve raw datasets, and the CELLxGENE schema to define the file format specification for raw datasets.
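
As an illustrative sketch using the `cellxgene-census` Python package (the tissue filter is purely an example, and argument names may differ between package versions):

```python
import cellxgene_census

# Illustrative only: the filter and arguments are examples, not the exact
# query used by the OpenProblems dataset loaders.
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter='tissue_general == "lung" and is_primary_data == True',
    )

print(adata)  # an AnnData object ready for further processing
```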

## GitHub Actions

GitHub Actions is a CI/CD service provided by GitHub that allows developers to automate their software development workflows. It is integrated into GitHub and is commonly used for testing, building, and deploying code.

OpenProblems uses GitHub Actions to automatically run tests and build Docker containers and Nextflow modules for each component.

## Amazon Web Services

Amazon Web Services (AWS) is a subsidiary of Amazon providing on-demand cloud computing platforms and APIs to individuals, companies, and governments, on a metered pay-as-you-go basis. OpenProblems uses AWS to store and distribute datasets and resources on S3, and to run Nextflow workflows using AWS Batch.