In March of 2022, the Childhood Cancer Data Lab launched the Single-cell Pediatric Cancer Atlas (ScPCA) Portal to make uniformly processed, summarized single-cell and single-nuclei RNA-seq data and de-identified metadata from pediatric tumor samples available for download.
Data available on the Portal was obtained using two different mechanisms: raw data was accepted from ALSF-funded investigators and processed using our open-source pipeline scpca-nf
, or investigators processed their raw data using scpca-nf
and submitted the output for inclusion on the Portal.
All samples on the Portal include a core set of metadata obtained from investigators, including age, sex, diagnosis, subdiagnosis (if applicable), tissue location, and disease stage. Some investigators submitted additional metadata, such as treatment and tumor stage, which can also be found on the Portal. All submitted metadata was standardized to maintain consistency across projects before adding to the Portal. In addition to providing a human-readable value for the submitted metadata, we also provide ontology term identifiers, if applicable. Submitted metadata was mapped to associated ontology term identifiers obtained from HsapDV (age) [@url:https://www.ebi.ac.uk/ols4/ontologies/hsapdv], PATO (sex) [@doi:10.1093/bib/bbx035; @url:https://www.ebi.ac.uk/ols4/ontologies/pato], NCBI taxonomy (organism) [@doi:10.1093/database/baaa062; @url:https://www.ncbi.nlm.nih.gov/taxonomy], MONDO (disease) [@doi:10.1101/2022.04.13.22273750; @url:https://www.ebi.ac.uk/ols4/ontologies/mondo], UBERON (tissue) [@doi:10.1186/2041-1480-5-21; @doi:10.1186/gb-2012-13-1-r5; @url:https://www.ebi.ac.uk/ols4/ontologies/uberon], and Hancestro (ethnicity, if applicable) [@doi:10.1186/s13059-018-1396-2; @url:https://www.ebi.ac.uk/ols4/ontologies/hancestro]. By providing these ontology term identifiers for each sample, users have access to standardized metadata terms that facilitate comparisons among datasets within the Portal as well as to data from other research projects.
The Portal contains data from over 500 samples and over 50 tumor types [@doi:10.1016/j.devcel.2022.04.003; @doi:10.21203/rs.3.rs-2517703/v1; @doi:10.21203/rs.3.rs-2517758/v1; @doi:10.1038/nature23647; @doi:10.1038/s41467-021-24781-7; @doi:10.1093/neuonc/noad207; @doi:10.1101/2023.12.26.573390].
Figure {@fig:fig1}A summarizes all samples from patient tumors and patient-derived xenografts currently available on the Portal. The total number of samples for each diagnosis is shown, along with the proportion of samples from each disease stage within a diagnosis group. The largest number of samples found on the Portal were obtained from patients with leukemia (n = 191). The Portal also includes samples from brain and central nervous system tumors (n = 166), sarcoma and soft tissue tumors (n = 68), and a variety of other solid tumors (n = 86). Most samples were collected at initial diagnosis (n = 426), with a smaller number of samples collected either at recurrence (n = 67), during progressive disease (n = 12), or post-mortem (n = 5). Along with the patient tumors, the Portal contains a small number of human tumor cell line samples (n = 4).
Each of the available samples contains summarized gene expression data from either single-cell or single-nuclei RNA sequencing. However, some samples also include additional data, such as quantified expression data from tagging cells with antibody-derived tags (ADT), such as CITE-seq antibodies [@doi:10.1038/nmeth.4380], or multiplexing samples with hashtag oligonucleotides (HTO) [@doi:10.1186/s13059-018-1603-1] prior to sequencing. Out of the 518 samples, 96 have associated CITE-seq data, and 19 have associated multiplexing data. In some cases, multiple libraries from the same sample were collected for additional sequencing, either for bulk RNA-seq or spatial transcriptomics. Specifically, 118 samples on the Portal were sequenced using bulk RNA-seq and 94 samples were sequenced using spatial transcriptomics. A summary of the number of samples with each additional modality is shown in Figure {@fig:fig1}B, and a detailed summary of the total samples with each sequencing method broken down by project is available in Table S1.
Samples on the Portal are organized by project, where each project is a collection of similar samples from an individual lab. Users can filter projects based on diagnosis, included modalities (e.g., CITE-seq, bulk RNA-seq), 10x Genomics kit version (e.g., 10Xv2, 10Xv3), and whether or not a project includes samples derived from patient-derived xenografts or cell lines. The project card displays an abstract, the total number of samples included, a list of diagnoses for all samples included in the Project, and links to any external information associated with the project, such as publications and links to external data, such as SRA or GEO (Figure {@fig:fig1}C). The project card also indicates the type(s) of sequencing performed, including the 10x Genomics kit version, the suspension type (cell or nucleus), and if additional sequencing is present, like bulk RNA-seq or multiplexing.
We developed scpca-nf
, an open-source and efficient Nextflow [@doi:10.1038/nbt.3820] workflow for quantifying single-cell and single-nuclei RNA-seq data and processed all data available on the Portal with it.
Using Nextflow as the backbone for the scpca-nf
workflow ensures both reproducibility and portability.
All dependencies for the workflow are handled automatically, as each process in the workflow is run in a Docker container.
Nextflow is compatible with various computing environments, including high-performance computing clusters and cloud-based computing, allowing users to run the workflow in their preferred environment.
Setup requires organizing input files and updating a single configuration file for the computing environment after installing Nextflow and either Docker or Singularity.
Nextflow will also handle parallelizing sample processing as allowed by the environment, minimizing run time.
The combination of being able to execute a Nextflow workflow in any environment and run individual processes in Docker containers makes this workflow easily portable for external use.
When building scpca-nf
, we sought a fast and memory-efficient tool for gene expression quantification to minimize processing costs.
We expected many users of the Portal to have their own single-cell or single-nuclei data processed with Cell Ranger [@doi:10.1038/ncomms14049; @url:https://www.10xgenomics.com/support/software/cell-ranger/latest], due to its popularity.
Thus, selecting a tool with comparable results to Cell Ranger was also desirable.
In comparing alevin-fry
[@doi:10.1038/s41592-022-01408-3] to Cell Ranger, we found alevin-fry
had a lower run time and memory usage (Figure {@fig:figS1}A), while retaining comparable mean gene expression for all genes (Figure {@fig:figS1}B), total UMIs per cell (Figure {@fig:figS1}C), and total genes detected per cell (Figure {@fig:figS1}D).
(All analyses comparing gene expression quantification tools are available in a public analysis repository [@url:https://github.com/AlexsLemonade/alsf-scpca].)
Based on these results, we elected to use salmon alevin
and alevin-fry
[@doi:10.1038/s41592-022-01408-3] in scpca-nf
to quantify gene expression data.
scpca-nf
takes FASTQ files as input (Figure {@fig:fig2}A).
Reads are aligned using the selective alignment option in salmon alevin
to an index with transcripts corresponding to spliced cDNA and intronic regions, denoted by alevin-fry
as a splici
index.
The output from alevin-fry
includes a gene by cell count matrix for all barcodes identified, even those that may not contain true cells.
This unfiltered counts matrix is stored in a SingleCellExperiment
object [@doi:10.1038/s41592-019-0654-x] and output from the workflow as a file with the suffix _unfiltered.rds
.
scpca-nf
performs filtering of empty droplets, removal of low-quality cells, normalization, dimensionality reduction, and cell type annotation (Figure {@fig:fig2}A).
The unfiltered gene by cell counts matrices are filtered to remove any barcodes that are not likely to contain cells using DropletUtils::emptyDropsCellRanger()
[@doi:10.1186/s13059-019-1662-y], and all cells that pass are saved in a SingleCellExperiment
object and a file with the suffix _filtered.rds
.
Low-quality cells are identified and removed with miQC
[@doi:10.1371/journal.pcbi.1009290], which jointly models the proportion of mitochondrial reads and detected genes per cell and calculates a probability that each cell is compromised.
The remaining cells' counts are normalized [@doi:10.1186/s13059-016-0947-7], and reduced-dimension representations are calculated using both principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) [@arxiv:1802.03426].
Finally, cell types are classified using two automated methods, SingleR
[@doi:10.1038/s41590-018-0276-y] and CellAssign
[@doi:10.1038/s41592-019-0529-1].
The results from this analysis are stored in a processed SingleCellExperiment
object saved to a file with the suffix _processed.rds
.
To make downloading from the Portal convenient for R and Python users, downloads are available as either SingleCellExperiment
or AnnData
[@doi:10.1101/2021.12.16.473007] objects.
scpca-nf
converts all SingleCellExperiment
objects to AnnData
objects, which are saved as .hdf5
files (Figure {@fig:fig2}A).
Downloads contain the unfiltered, filtered, and processed objects from scpca-nf
to allow users to choose to perform their own filtering and normalization or to start their analysis from a processed object.
All downloads from the Portal include a quality control (QC) report with a summary of processing information (e.g., alevin-fry
version), library statistics (e.g., the total number of cells), and a collection of diagnostic plots for each library (Figure {@fig:fig2}B-G).
A knee plot displaying total UMI counts for all droplets (i.e., including empty droplets) indicates the effects of the empty drop filtering (Figure {@fig:fig2}B).
For each cell that remains after filtering empty droplets, the number of total UMIs, genes detected, and mitochondrial reads are calculated and summarized in a scatter plot (Figure {@fig:fig2}C).
We include plots showing the miQC
model and which cells are kept and removed after filtering with miQC
(Figure {@fig:fig2}D-E).
A UMAP plot with cells colored by the total number of genes detected and a faceted UMAP plot where cells are colored by the expression of a set of highly variable genes are also provided (Figure {@fig:fig2}F-G).
scpca-nf
includes modules for processing samples with sequencing modalities beyond single-cell or single-nuclei RNA-seq data: corresponding ADT or CITE-seq data [@doi:10.1038/nmeth.4380], multiplexed data via cell hashing [@doi:10.1186/s13059-018-1603-1], spatial transcriptomics, or bulk RNA-seq.
To process ADT libraries, the ADT FASTQ files were provided as input into scpca-nf
and quantified using salmon alevin
and alevin-fry
(Figure {@fig:figS2}A).
Along with the FASTQ files, scpca-nf
takes a tab-separated values (TSV) file with one row for each ADT – containing the name used for the ADT and associated barcode – required to build an ADT-specific index for quantifying ADT expression with alevin-fry
.
The output from alevin-fry
is the unfiltered ADT by cell counts matrix.
The ADT by cell counts matrix is read into R alongside the gene by cell counts matrix and saved as an alternative experiment (altExp
) within the main SingleCellExperiment
object containing the unfiltered RNA counts.
This SingleCellExperiment
object containing both RNA and ADT counts is output from the workflow to a file with the suffix _unfiltered.rds
.
scpca-nf
does not filter any cells based on ADT expression or remove cells with low-quality ADT expression.
Any cells removed after filtering empty droplets based on the unfiltered RNA counts matrix are also removed from the ADT counts matrix.
The workflow calculates QC statistics for ADT counts using DropletUtils::cleanTagCounts()
that are stored alongside the ADT by cell counts matrix in the filtered SingleCellExperiment
object.
The SingleCellExperiment
object containing the filtered RNA and ADT counts matrix and associated ADT QC statistics is saved to a file with the suffix _filtered.rds
.
The ADT by cell counts matrix is normalized by first determining the ambient profile and then using that profile to calculate median size factors with scuttle::computeMedianFactors()
[@doi:10.18129/B9.bioc.scuttle; @url:https://bioconductor.org/books/3.16/OSCA.advanced/integrating-with-protein-abundance.html#cite-seq-median-norm].
We skip normalization for cells with low-quality ADT expression, as indicated by DropletUtils::cleanTagCounts()
.
Although scpca-nf
normalizes ADT counts, the workflow does not perform any dimensionality reduction of ADT data; only the RNA counts data are used as input for dimensionality reduction.
The normalized ADT data are saved as an altExp
within the processed SingleCellExperiment
containing the normalized RNA data and is output to a file with the suffix _processed.rds
.
All files containing SingleCellExperiment
objects and associated altExp
objects are converted to AnnData
objects and exported as separate RNA (_rna.hdf5
) and ADT (_adt.hdf5
) AnnData
objects.
If a library contains associated ADT data, the QC report output by scpca-nf
will include an additional section with a summary of ADT-related statistics, such as how many cells express each ADT, and ADT-specific diagnostic plots (Figure {@fig:figS2}B-D).
As mentioned above, scpca-nf
uses DropletUtils::cleanTagCounts()
to calculate QC statistics for each cell using ADT expression but does not filter any cells from the object.
We include plots summarizing the potential effects of removing of low-quality cells based on RNA and ADT counts in the QC report (Figure {@fig:figS2}B).
The first quadrant indicates which cells would be kept if the object was filtered using both RNA and ADT quality measures.
The other facets highlight which cells would be removed if filtering was done using only RNA counts, only ADT counts, or both.
The top four ADTs with the most variable expression are also identified and visualized using density plots to show the normalized ADT expression across all cells (Figure {@fig:figS2}C) and UMAPs – calculated from RNA data – with cells colored by ADT expression (Figure {@fig:figS2}D).
To process multiplexed libraries, the HTO FASTQ files are input to scpca-nf
and quantified using salmon alevin
and alevin-fry
(Figure {@fig:figS2}C).
Along with the FASTQ files, scpca-nf
requires two TSV files to process multiplexed data: one to build an HTO-specific index for quantifying HTO expression with alevin-fry
, and a second to indicate which HTO was used for which sample when multiplexing the library.
The unfiltered HTO by cell counts matrix output from alevin-fry
is saved as an alternative experiment (altExp
) within the main SingleCellExperiment
containing the unfiltered RNA counts.
This SingleCellExperiment
object containing both RNA and HTO counts is output from the workflow to a file with the suffix _unfiltered.rds
.
As with ADT data, scpca-nf
does not filter any cells based on HTO expression, and any cells removed after filtering empty droplets based on the unfiltered RNA counts matrix are also removed from the HTO counts matrix with the remainder saved to a file with the _filtered.rds
suffix.
scpca-nf
does not perform any additional filtering or processing of the HTO by cell counts matrix, so the same filtered matrix is saved to the file with the _processed.rds
suffix.
Although scpca-nf
quantifies the HTO data and includes an HTO by cell counts matrix in all objects, scpca-nf
does not demultiplex the samples into one sample per library.
Instead, scpca-nf
applies multiple demultiplexing methods, including demultiplexing with DropletUtils::hashedDrops()
[@doi:10.18129/B9.bioc.DropletUtils], demultiplexing with Seurat::HTODemux()
[@doi:10.1186/s13059-018-1603-1], and genetic demultiplexing when bulk RNA-seq data are available.
scpca-nf
uses the genetic demultiplexing method described in Weber et al. [@doi:10.1093/gigascience/giab062], which uses bulk RNA-seq as a reference for the expected genotypes found in each single-cell RNA-seq sample.
The results from all available demultiplexing methods are saved in the filtered and processed SingleCellExperiment
objects.
If a library has associated HTO data, an additional section is included in the scpca-nf
QC report.
This section summarizes HTO-specific library statistics, such as how many cells express each HTO.
No additional plots are produced, but a table summarizing the results from all three demultiplexing methods is included.
Some samples also included data from bulk RNA-seq and/or spatial transcriptomics libraries.
Both of these additional sequencing methods are supported by scpca-nf
.
To quantify bulk RNA-seq data, scpca-nf
takes bulk FASTQ files as input, trims reads using fastp
[@doi:10.1093/bioinformatics/bty560], and then aligns and quantifies reads with salmon
(Figure {@fig:figS3}A) [@doi:10.1038/nmeth.4197].
The output is a single TSV file with the gene by sample counts matrix for all samples in a given ScPCA project.
This gene by sample matrix is only included with project downloads on the Portal.
To quantify spatial transcriptomics data, scpca-nf
takes the RNA FASTQ and slide image as input (Figure {@fig:figS3}B).
As alevin-fry
does not yet fully support spatial transcriptomics data, scpca-nf
uses Space Ranger to quantify all spatial transcriptomics data [@url:https://www.10xgenomics.com/support/software/space-ranger/latest].
The output includes the spot by gene matrix along with a summary report produced by Space Ranger.
On the Portal, users can select to download data from individual samples or all data from an entire ScPCA project.
When downloading data for an entire project, users can choose between receiving the individual files for each sample (default) or one file containing the gene expression data and metadata for all samples in the project as a merged object.
Users also have the option to choose their desired format and receive the data as SingleCellExperiment
(.rds
) or AnnData
(.hdf5
) objects.
For downloads with samples as individual files, the download folder will include a sub-folder for each sample in the project (Figure {@fig:fig3}A). Each sample folder contains all three object types (unfiltered, filtered, and processed) in the requested file format and the QC and cell type summary report for all libraries from the given sample. The objects house the summarized gene expression data and associated metadata for the library indicated in the filename.
All project downloads include a metadata file, single_cell_metadata.tsv
, containing relevant metadata for all samples, and a README.md
with information about the contents of each download, contact and citation information, and terms of use for data downloaded from the Portal (Figure {@fig:fig3}A-B).
If the ScPCA project includes samples with bulk RNA-seq, two additional files are included: a gene by sample counts matrix (bulk_quant.tsv
) with the quantified gene expression data for all samples in the project, and a metadata file (bulk_metadata.tsv
).
Providing data for all samples within a single file facilitates performing joint gene-level analyses, such as differential expression or gene set enrichment analyses, on multiple samples simultaneously.
Therefore, we provide a single, merged object for each project containing all raw and normalized gene expression data and metadata for all single-cell and single-nuclei RNA-seq libraries within a given ScPCA project.
We provide merged objects for all projects in the Portal except for those with multiplexing, due to potential ambiguity in identifying samples across multiplexed libraries.
The data in the merged object has simply been combined without further processing; no batch-corrected or integrated data are included.
If downloading data from an ScPCA project as a single, merged file, the download will include a single .rds
or .hdf5
file, a summary report for the merged object, and a folder with all individual QC and cell type reports for each library found in the merged object (Figure {@fig:fig3}B).
To build the merged objects, we created an additional stand-alone workflow for merging the output from scpca-nf
, merge.nf
(Figure {@fig:fig3}C).
merge.nf
takes as input the processed SingleCellExperiment
objects output by scpca-nf
for all single-cell and single-nuclei libraries included in a given ScPCA project.
The gene expression data stored in all SingleCellExperiment
objects are then merged to produce a single merged gene by cell counts matrix containing all cells from all libraries.
The genes available in the merged object will be the same as those in each individual object, as all objects on the Portal were quantified using the same index.
Where possible, library-, cell- and gene-specific metadata found in the individual processed SingleCellExperiment
objects are also merged.
The merged normalized counts matrix is then used to select high-variance genes in a library-aware manner before performing dimensionality reduction with both PCA and UMAP.
merge.nf
outputs the merged and processed object as a SingleCellExperiment
object.
The more samples that are included in a merged object, the larger the object, and the more difficult it is to work with that object in R or Python.
Therefore, we do not provide merged objects for projects with more than 100 samples.
We also account for additional modalities in merge.nf
.
If at least one library in a project contains ADT data, the raw and normalized ADT data are also merged and saved as an altExp
in the merged SingleCellExperiment
object.
If any libraries in a project are multiplexed, no merged object is created, as there is no guarantee that a unique HTO was used for each sample in a given project.
All merged SingleCellExperiment
objects are converted to AnnData
objects and exported as .hdf5
files.
If the merged object contains an altExp
with merged ADT data, two AnnData
objects are exported to create separate RNA (_rna.hdf5
) and ADT (_adt.hdf5
) objects.
merge.nf
outputs a summary report for each merged object, which includes a set of tables summarizing the types of samples and libraries included in the project, such as types of diagnosis, and a faceted UMAP showing all cells from all libraries.
In the UMAP, each panel represents a different library included in the merged object, with all cells from the specified library shown in color, while all other cells are gray.
An example of this UMAP showing a subset of libraries from an ScPCA project is available in Figure {@fig:fig3}D.
Assigning cell type labels to single-cell and single-nuclei RNA-seq data is often an essential step in analysis.
Cell type annotation requires knowledge of the expected cell types in a dataset and the associated gene expression patterns for each cell type, which is available in publications or other public databases for some biological contexts.
Automated cell type annotation methods leveraging public databases are an excellent initial step in the labeling process, as they can be applied consistently and transparently across all samples in a data set.
As such, we include cell type annotations determined using two different automated methods, SingleR
[@doi:10.1038/s41590-018-0276-y] and CellAssign
[@doi:10.1038/s41592-019-0529-1], in all processed SingleCellExperiment
and AnnData
objects available for download on the Portal, saving users analysis time.
Annotating cell types with automated methods like SingleR
and CellAssign
requires the use of previously annotated reference data.
For SingleR
, this can be in the form of an annotated gene expression dataset from a microarray, bulk RNA-seq, or single-cell RNA-seq experiment.
CellAssign
requires a matrix of cell types and expected marker genes.
Most public annotated reference datasets that can be used with these methods – including those we use for the Portal – are derived from normal tissue, making accurately annotating tumor datasets particularly difficult.
Because there are limitations to the annotations provided on the Portal, comparing the two methods and observing consistent cell type annotations across methods can indicate higher confidence in the provided labels.
For some ScPCA projects, submitters provided their own curated cell type annotations, including annotation of tumor cells and disease-specific cell states.
These submitter-provided annotations can be found in all SingleCellExperiment
and AnnData
objects (unfiltered, filtered, and processed).
SingleR
is a reference-based annotation method that requires an existing bulk or single-cell RNA-seq dataset with annotations.
To identify an appropriate reference to use with SingleR
, we annotated a small number of samples across multiple disease types with all human-specific references available in the celldex
package [@doi:10.1038/s41590-018-0276-y].
The output from SingleR
includes a score matrix containing a score for each cell and all possible cell types found in the reference, where higher scores are associated with assigned cell types.
We calculated the delta median statistic for each cell in the dataset by subtracting the median score from the score associated with the assigned cell type label.
The delta median statistic helps evaluate how confident SingleR
is in assigning each cell to a specific cell type, where low delta median values indicate ambiguous assignments and high delta median values indicate confident assignments [@url:https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells].
Using this measure, we found that the BlueprintEncodeData
reference [@doi:10.3324/haematol.2013.094243; @doi:10.1038/nature11247], which includes a variety of normal cell types, tended to perform better than or at least similarly to other references across samples from different disease types (Figure {@fig:figS4}).
Based on these findings, we used the BlueprintEncodeData
reference to annotate cells from all libraries on the Portal, as using a single reference is potentially valuable for cross-project analyses.
In contrast, CellAssign
is a marker-gene-based annotation method that requires a binary matrix with all cell types and all associated marker genes as the reference.
We used the list of marker genes available as part of PanglaoDB
[@doi:10.1093/database/baz046] to construct organ-specific marker gene matrices with marker genes from all cell types listed for the specified organ.
Since many cancers may have infiltrating immune cells, all immune cells were also included in each organ-specific reference.
For each ScPCA project, we provided the organ-specific marker gene matrix relevant to the disease and tissue type from which the sample was obtained (e.g., for brain tumors, we used a brain-specific marker gene matrix with all brain and immune cell types).
If CellAssign
cannot find a likely cell type from the marker gene matrix, it does not assign a cell type.
Because we annotate cells from tumor samples using references containing only normal cells, we anticipate that many cells, particularly the tumor cells, may not have an exact match; reporting this to the end user is valuable.
Indeed, when applying CellAssign
to tumor samples with our chosen reference, we observed that many of the cells were unassigned.
We included an example in Figure {@fig:figS5}A where unassigned cell types are labeled with Unknown
.
When comparing annotations obtained from CellAssign
to submitter-provided annotations, we noticed the labels for non-tumor cells are similar between CellAssign
and submitter annotations, while the tumor cells were not assigned using CellAssign
(Figure {@fig:figS5}B).
scpca-nf
adds cell type annotations from SingleR
and CellAssign
to all processed SingleCellExperiment
objects (Figure {@fig:fig4}A).
This requires two additional reference files as input to the workflow: a classification model built from a reference dataset for SingleR
and a marker gene by cell type matrix for CellAssign
.
SingleR::trainSingleR()
was used to build a classification model from the provided BlueprintEncodeData
dataset and create the required SingleR
input for scpca-nf
.
The classification model and processed SingleCellExperiment
were used as input for SingleR::classifySingleR()
, resulting in annotations for all cells and an associated score matrix.
The score matrix containing a score for all cells and each possible cell type and the assigned cell types are added to the processed SingleCellExperiment
object output by scpca-nf
.
Simultaneously, processed SingleCellExperiment
objects are converted to AnnData
objects for classification with CellAssign
.
CellAssign
uses the converted AnnData
object and the marker gene matrix to train a model and predict the most likely cell type from the possible cell types in the marker gene matrix.
The prediction matrix, which contains a probability that each cell is one of each possible cell types, and the assigned cell types are added to the processed SingleCellExperiment
object output by scpca-nf
.
The processed SingleCellExperiment
object is then converted to an AnnData
object to ensure cell type annotations are included in both data formats provided by scpca-nf
.
An additional cell type report with information about reference sources, comparisons among cell type annotation methods, and diagnostic plots is also output by scpca-nf
.
Tables summarizing the number of cells assigned to each cell type for each method are shown alongside UMAPs coloring cells by the assigned cell type.
The concordance of cell type annotations assigned between both methods can indicate higher confidence in the provided annotations.
We therefore used the Jaccard similarity index to compare annotations between the two methods, as well as submitter-provided annotations, if available.
This index is calculated between pairs of labels from each method and ranges from 0-1, with a value close to 1 indicating high agreement and a high proportion of overlapping cells and values close to 0 indicating a low proportion of overlapping cells.
The Jaccard similarity index is displayed in a heatmap, an example of which is shown in Figure {@fig:fig4}B.
The report also includes a diagnostic plot evaluating the confidence of cell type annotations determined by each method.
To evaluate confidence in SingleR
cell type annotations, the delta median statistic is calculated by subtracting the median score from the score associated with the assigned cell type label [@url:https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells].
The distribution of delta median values for each cell type is shown in the cell type report, where a higher delta median statistic for a cell indicates higher confidence in the final cell type annotation (Figure {@fig:figS6}A).
CellAssign
calculates the probability that each cell belongs to each possible cell type provided in the reference, and the cell type label with the highest probability is assigned as the cell type for that cell.
These values range from 0 to 1, with larger values indicating greater confidence in a given cell type label, so we expect more confident labels to have most values close to 1. An example of the plot included in the report displaying the distribution of all probabilities for each cell type is shown in Figure {@fig:figS6}B.
If the submitter provided cell types, the submitter annotations are compared to the annotations from both SingleR
and CellAssign
.
A summary of this comparison is included in the cell type report along with a table summarizing the submitter cell type annotations and a UMAP plot where each cell is colored by the submitter annotation.
The Jaccard similarity index is calculated for all pairs of cell type labels in submitter annotations and SingleR
annotations and in submitter annotations and CellAssign
annotations.
The results from both comparisons are displayed in a stacked heatmap available in the report, an example of which is shown in Figure {@fig:figS7}.