Raw data and metadata were generated and compiled by each lab and institution contributing to the Portal.
Single-cell or single-nuclei libraries were generated using one of the commercially available kits from 10x Genomics.
For bulk RNA-seq, RNA was collected and sequenced using either paired-end or single-end sequencing.
For spatial transcriptomics, cDNA libraries were generated using the Visium kit from 10x Genomics.
All libraries were processed using our open-source pipeline, scpca-nf
, to produce summarized gene expression data.
A detailed summary with the total number of samples and libraries collected for each sequencing method broken down by project is available in Table S1.
Submitters were required to submit the age, sex, organism, diagnosis, subdiagnosis (if applicable), and tissue of origin for each sample.
The submitted metadata was standardized across projects, including converting all ages to years, removing abbreviations used in diagnosis, subdiagnosis, or tissue of origin, and using standard values across projects as much as possible for diagnosis, subdiagnosis, disease timing, and tissue of origin.
For example, all samples obtained at diagnosis were assigned the value Initial diagnosis
for disease timing.
In an effort to ensure sample metadata for ScPCA are compatible with CZI's CELLxGENE, ontology term identifiers were assigned to metadata categories for each sample following the guidelines present in the CELLxGENE schema [@url:https://cellxgene.cziscience.com; @url:https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md], as shown in Table {@tbl:metadata}.
Metadata field | Ontology term description |
---|---|
Age | Ontology term obtained from HsapDv [@url:https://www.ebi.ac.uk/ols4/ontologies/hsapdv]. For ages 0-11 months, the HsapDv for age in months was used. For ages 12 months and greater, the HsapDv for age in years was used. |
Sex | Ontology term obtained from PATO, either male (PATO:0000384), female (PATO:0000383), or unknown [@doi:10.1093/bib/bbx035; @url:https://www.ebi.ac.uk/ols4/ontologies/pato]. |
Organism | NCBI taxonomy term for organism. All current samples available on the Portal are from Homo sapiens or NCBITaxon:9606 [@doi:10.1093/database/baaa062; @url:https://www.ncbi.nlm.nih.gov/taxonomy]. |
Diagnosis | The most appropriate MONDO term based on the provided diagnosis [@doi:10.1101/2022.04.13.22273750; @url:https://www.ebi.ac.uk/ols4/ontologies/mondo]. An exact match was identified for most samples, but in a handful of cases, the most closely related term was used. |
Tissue of origin | The most appropriate UBERON term based on the provided tissue of origin [@doi:10.1186/2041-1480-5-21; @doi:10.1186/gb-2012-13-1-r5; @url:https://www.ebi.ac.uk/ols4/ontologies/uberon]. An exact match was identified for most samples, but in a handful of cases, the most closely related term was used. |
Ethnicity (if applicable) | If the submitter provided ethnicity, the associated Hancestro term [@doi:10.1186/s13059-018-1396-2; @url:https://www.ebi.ac.uk/ols4/ontologies/hancestro]. If ethnicity is unavailable, unknown is used. |
Table: Assignment of metadata fields to ontology terms. {#tbl:metadata}
To quantify RNA-seq gene expression for each cell or nucleus in a library, scpca-nf
uses salmon alevin
[@doi:10.1186/s13059-020-02151-8] and alevin-fry
[@doi:10.1038/s41592-022-01408-3] to generate a gene by cell counts matrix.
Prior to mapping, we generated an index using transcripts from both spliced cDNA and unspliced cDNA sequences, denoted as the splici
index [@doi:10.1038/s41592-022-01408-3].
The index was generated from the human genome, GRCh38, Ensembl version 104.
salmon alevin
was run using selective alignment to the splici
index with the --rad
option to generate a reduced alignment data (RAD) file required for input to alevin-fry
.
The RAD file was used as input to the recommended alevin-fry
workflow, with the following customizations.
At the generate-permit-list
step, we used the --unfiltered-pl
option to provide a list of expected barcodes specific to the 10x kit used to generate each library.
The quant
step was run using the cr-like-em
resolution strategy for feature quantification and UMI de-duplication.
The output from running alevin-fry
includes a gene by cell counts matrix, with reads from both spliced and unspliced reads for all potential cell barcodes.
This output is read into R to create a SingleCellExperiment
using fishpond::load_fry()
.
The resulting SingleCellExperiment
contains a counts
assay with a gene by cell counts matrix where all spliced and unspliced reads for a given gene are totaled together.
We also include a spliced
assay that contains a gene by cell counts matrix with only spliced reads.
These matrices include all potential cells, including empty droplets, and are provided for all Portal downloads in the unfiltered objects saved as .rds
files with the _unfiltered.rds
suffix.
Each droplet was tested for deviation from the ambient RNA profile using DropletUtils::emptyDropsCellRanger()
and those with an FDR ≤ 0.01 were retained as likely cells.
If a library did not have a sufficient number of droplets and DropletUtils::emptyDropsCellRanger()
failed, cells with fewer than 100 UMIs were removed.
Gene expression data for any cells that remain after filtering are provided in the filtered objects saved as .rds
files with the _filtered.rds
suffix.
In addition to removing empty droplets, scpca-nf
also removes cells that are likely to be compromised by damage or low-quality sequencing.
miQC
was used to calculate the posterior probability that each cell is compromised [@doi:10.1371/journal.pcbi.1009290].
Any cells with a probability of being compromised greater than 0.75 and fewer than 200 genes detected were removed before further processing.
The gene expression counts from the remaining cells were log-normalized using the deconvolution method from Lun, Bach, and Marioni [@doi:10.1186/s13059-016-0947-7].
scran::modelGeneVar()
was used to model gene variance from the log-normalized counts and scran::getTopHVGs()
was used to select the top 2000 high-variance genes.
These were used as input to calculate the top 50 principal components using scater::runPCA()
.
Finally, UMAP embeddings were calculated from the principal components with scater::runUMAP()
.
The raw and log-normalized counts, list of 2000 high-variance genes, principal components, and UMAP embeddings are all stored in the processed objects saved as .rds
files with the _processed.rds
suffix.
All libraries with antibody-derived tags (ADTs) or hashtag oligonucleotides (HTOs) were mapped to a reference index using salmon alevin
and quantified using alevin-fry
.
The reference indices were constructed using the salmon index
command with the --feature
option.
References were custom-built for each ScPCA project and constructed using the submitter-provided list of ADTs or HTOs and their barcode sequences.
The ADT by cell or HTO by cell counts matrix produced by alevin-fry
were read into R as a SingleCellExperiment
object and saved as an alternative experiment (altExp
) in the same SingleCellExperiment
object with the unfiltered gene expression counts data.
The altExp
within the unfiltered object contains all identified ADTs or HTOs and all barcodes identified in the RNA-seq gene expression data.
Any barcodes that only appeared in either ADT or HTO data were discarded, and cell barcodes that were only found in the gene expression data (i.e., did not appear in the ADT or HTO data) were assigned zero counts for all ADTs and HTOs.
Any cells removed after filtering empty droplets were also removed from the ADT and HTO counts matrices and before creating the filtered SingleCellExperiment
object.
The ADT count matrix stored in the unfiltered object was used to calculate an ambient profile with DropletUtils::ambientProfileEmpty()
.
This ambient profile was used to calculate quality-control statistics with DropletUtils::cleanTagCounts()
for all cells remaining after removing empty droplets.
Any negative or isotype controls were taken into account when calculating QC statistics.
Cells with a high level of ambient contamination or negative/isotype controls were flagged as having low-quality ADT expression, but we did not remove any cells based on ADT quality from the object.
The filtered and processed objects contain the results from running DropletUtils::cleanTagCounts()
.
ADT count data were then normalized by calculating median size factors using the ambient profile with scuttle::computeMedianFactors()
.
If median-based normalization failed for any reason, ADT counts were log-transformed after adding a pseudocount of 1.
Normalized counts are only available for any cells that would be retained after ADT filtering, and any cells that would be filtered out based on DropletUtils::cleanTagCounts()
are assigned NA
.
The normalized ADT data are available in the altExp
of the processed object.
To identify which cells come from which samples in a multiplexed library, we applied three different demultiplexing methods: genetic demultiplexing, HTO demultiplexing using DropletUtils::hashedDrops()
, and HTO demultiplexing using Seurat::HTODemux()
.
We do not provide separate SingleCellExperiment
objects for each sample in a library.
Each multiplexed library object contains the counts data from all samples and the results from all three demultiplexing methods to allow users to select which method(s) to use.
If all samples in a multiplexed library were also sequenced using bulk RNA-seq, we performed genetic demultiplexing using genotype data from both bulk RNA-seq and single-cell or single-nuclei RNA-seq [@doi:10.1093/gigascience/giab062]. If bulk RNA-seq was not available, no genetic demultiplexing was performed.
Bulk RNA-seq reads for each sample were mapped to a reference genome using STAR
[@doi:10.1093/bioinformatics/bts635], and multiplexed single-cell or single-nuclei RNA-seq reads were mapped to the same reference genome using STARsolo
[@doi:10.1101/2021.05.05.442755].
The mapped bulk reads were used to call variants and assign genotypes with bcftools mpileup
[@doi:10.1093/gigascience/giab008].
cellsnp-lite
was then used to genotype single-cell data at the identified sites found in the bulk RNA-seq data [@doi:10.1093/bioinformatics/btab358].
Finally, vireo
was used to identify the sample of origin [@doi:10.1093/bioinformatics/btab358].
For all multiplexed libraries, we performed demultiplexing using DropletUtils::hashedDrops()
and Seurat::HTODemux()
.
For both methods, we used the default parameters and only performed demultiplexing on the filtered cells present in the filtered object.
The results from both these methods are available in the filtered and processed objects.
10x Genomics' Space Ranger [@url:https://www.10xgenomics.com/support/software/space-ranger/latest] was used to quantify gene expression data from spatial transcriptomics libraries.
cellranger mkref
was used to create a reference index from the human genome, GRCh38, Ensembl version 104.
The FASTQ files, microscopic slide image, and slide serial number were provided as input to spaceranger count
.
The raw and filtered counts matrix and the summary report output by spaceranger count
are included in the folder output from scpca-nf
.
fastp
was used to trim adapters and perform quality and length filtering on all FASTQ files from bulk RNA-seq.
We used a decoy-aware reference created from spliced cDNA sequences with the entire human genome sequence (GRCh38, Ensembl version 104) as the decoy [@doi:10.1038/nmeth.4197].
The trimmed reads were then provided as input to salmon quant
for selective alignment.
In addition to using the default parameters for salmon quant
, we applied the --seqBias
and --gcBias
flags to correct for sequence-specific biases due to random hexamer priming and fragment-level GC biases, respectively.
Cell type labels determined by both SingleR
[@doi:10.1038/s41590-018-0276-y] and CellAssign
[@doi:10.1038/s41592-019-0529-1] were added to processed SingleCellExperiment
objects.
If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all SingleCellExperiment
objects (unfiltered, filtered, and processed).
To prepare the references used for assigning cell types, we developed a separate workflow build-celltype-index.nf
within scpca-nf
.
For SingleR
, we used the BlueprintEncodeData
from the celldex
package [@doi:10.3324/haematol.2013.094243; @doi:10.1038/nature11247] to train the SingleR
classification model with SingleR::trainSingleR()
.
In the main scpca-nf
workflow, this model and the processed SingleCellExperiment
object were input to SingleR::classifySingleR()
.
The SingleR
output of cell type annotations and a score matrix for each cell and all possible cell types were added to the processed SingleCellExperiment
object output.
To evaluate confidence in SingleR
cell type assignments, we also calculated a delta median statistic for each cell by subtracting the median cell type score from the score associated with the assigned cell type [@url:https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells].
For CellAssign
, marker gene references were created using the marker gene lists available on PanglaoDB
[@doi:10.1093/database/baz046].
Organ-specific references were built using all cell types in a specified organ listed in PanglaoDB
to accommodate all ScPCA projects encompassing a variety of disease and tissue types.
If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined.
For example, we created a reference containing bone, connective tissue, smooth muscle, and immune cells for sarcomas that appear in bone or soft tissue.
Given the processed SingleCellExperiment
object and organ-specific reference, scvi.external.CellAssign
was used in the main scpca-nf
workflow to train the model and predict the assigned cell type.
For each cell, CellAssign
calculates a probability of assignment to each cell type in the reference.
The probability matrix and a prediction based on the most probable cell type were added as cell type annotations to the processed SingleCellExperiment
object output.
Merged objects are created with the merge.nf
workflow within scpca-nf
.
This workflow takes as input the processed SingleCellExperiment
objects in a given ScPCA project output by scpca-nf
and creates a single merged SingleCellExperiment
object containing gene expression data and metadata from all libraries in that project.
The merged object includes both raw and normalized counts for all cells from all libraries.
Because the same reference index was used to quantify all single-cell and single-nuclei RNA-seq data, the set of genes is the same in the merged object and the individual objects.
Library-, cell- and gene-specific metadata from each of the processed SingleCellExperiment
objects are also combined and stored in the merged object.
The merge.nf
workflow does not perform batch-correction or integration.
The counts in the merged object are therefore not batch-corrected.
The top 2000 shared high-variance genes are identified from the merged counts matrix by modeling variance using scran::modelGeneVar()
and specifying library IDs for the block
argument.
These genes are used to calculate library-aware principal components with batchelor::multiBatchPCA()
.
The top 50 principal components were selected and used to calculate UMAP embeddings for the merged object.
If any libraries included in the ScPCA project contain additional ADT data, the ADT data are also merged and stored in the altExp
slot of the merged SingleCellExperiment
object.
By contrast, if any libraries included in the ScPCA project are multiplexed and contain HTO data, no merged object is created.
Merged objects were not created for projects with more than 100 samples because of the computational resources that would be required for working with those objects.
zellkonverter::writeH5AD()
[@doi:10.18129/B9.bioc.zellkonverter] was used to convert SingleCellExperiment
objects to AnnData
format and export the objects as .hdf5
files.
For any SingleCellExperiment
objects containing an altExp
(e.g., ADT data), the RNA and ADT data were exported and saved separately as RNA (_rna.hdf5
) and ADT (_adt.hdf5
) files.
Multiplexed libraries were not converted to AnnData
objects, due to the potential for ambiguity in sample origin assignments.
All merged SingleCellExperiment
objects were converted to AnnData
objects and saved as .hdf5
files.
If a merged SingleCellExperiment
object contained any ADT data, the RNA and ADT data were exported and saved separately as RNA (_rna.hdf5
) and ADT (_adt.hdf5
) objects.
In contrast, if a merged SingleCellExperiment
object contained HTO data due to the presence of any multiplexed libraries in the merged object, the HTO data was removed from the SingleCellExperiment
object and not included in the exported AnnData
object.
All summarized gene expression data and de-identified metadata are available for download on the ScPCA Portal, https://scpca.alexslemonade.org/.
Documentation for the Portal can be found at https://scpca.readthedocs.io.
All original code was developed within the following repositories and is publicly available as follows:
- The
scpca-nf
workflow used to process all samples available on the Portal can be found at https://github.com/AlexsLemonade/scpca-nf. - The Single-cell Pediatric Cancer Atlas Portal code can be found at https://github.com/AlexsLemonade/scpca-portal.
- Benchmarking of tools used to build
scpca-nf
can be found at https://github.com/AlexsLemonade/alsf-scpca/tree/main/analysis and https://github.com/AlexsLemonade/sc-data-integration/tree/main/celltype_annotation. - All code for the underlying figures can be found at https://github.com/AlexsLemonade/scpca-paper-figures.
- The manuscript can be found at https://github.com/AlexsLemonade/ScPCA-manuscript.