This document summarises thoughts about improving data analysis workflows for mass spectrometry-based single-cell proteomics (SCP). It is a follow-up to our recent articles. In our overview of data processing workflows (Vanderaa and Gatto 2023), we highlighted the lack of consensus when processing data, leading to a "one lab, one workflow" situation. Although this might be a consequence of heterogeneous data structures, we could show that this is not the case and that it is rather due to the lack of a principled approach. This document provides future tracks for scp and beyond to harmonise current practices.
Van der Watt et al. 2021 showed that it is important to benchmark spectrum identification tools. Furthermore, the same lab later reported that the properties of spectra from SCP experiments do not fulfil the assumptions of identification tools designed for bulk proteomics samples (Boekweg et al. 2021).
There is definitely room for improvement; here are some thoughts about SCP spectrum identification:
- Is there a rigorous platform to test/benchmark different identification tools, or to have a standardised workflow (cf. snakemake)? Maybe MassIVE.quant, or Ryan Kelly's data management software.
- ionbot seems a promising tool to improve spectrum identification, especially when including PTMs. It has not been tested on SCP data.
- Prosit and DeepLC (both Python) seem very promising for peptide identification and validation. These tools are based on deep learning and may adapt to/learn the features of SCP data if we provide a "good" dataset.
Low-quality features (PSMs) and low-quality samples (single cells) may introduce noise (e.g. low signal due to sample preparation issues) or bias (e.g. a PSM that is wrongly matched).
It seems sensible to identify and remove low-quality features and samples before they propagate through later processing steps. Data quality control should therefore be the very first step. I find it surprising that many feature or sample quality controls are performed at a later stage: none of the surveyed QC metrics requires upstream processing, so all of them can be applied immediately.
The most difficult part of sample preparation is to get rid of the background signal/contamination such as FACS carry-over (cf. discussion with Harrison Specht and Andrew Leduc). Negative controls may be used to model systematic background contamination. This modelling should take into account that the background changes over time.
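A minimal sketch of this idea, with fully simulated numbers: fit the background level measured in negative-control runs as a function of acquisition time, then predict the background expected for single-cell runs acquired in between. The variable names and the linear drift are assumptions for illustration.

```r
## Simulated negative-control runs: log2 background level drifts with time
set.seed(11)
ctrl_time <- 1:20                                      # acquisition order of the controls
ctrl_bg <- 5 + 0.05 * ctrl_time + rnorm(20, sd = 0.1)  # background level (simulated)
bg_fit <- lm(ctrl_bg ~ ctrl_time)                      # time-dependent background model
## Predict the background for single-cell runs acquired between controls
sc_runs <- data.frame(ctrl_time = c(2.5, 10.5, 18.5))
bg_pred <- predict(bg_fit, newdata = sc_runs)
```

A richer model (e.g. a spline) would be needed if the background does not drift linearly.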
Checking for protein coverage is barely explored in SCP, except in Orsburn et al. 2022.
Identify failed cells and failed runs based on:
- Number of detected features, number of proteins with >3 precursors
- Intensity distributions. For some experiments this is difficult, as the intensity distribution in single cells is similar to that of blank samples (Grégoire et al. 2023).
- Coefficient of variation. One should also be mindful not to compute the CV in log scale, or at least to use the adapted formula described by Canchola et al. 2017. I compiled some thoughts about CV in this issue.
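To illustrate the Canchola et al. 2017 point, here is a small base R simulation (invented numbers) comparing the naive CV computed on log2 values with the CV computed in the linear scale and with the adapted log-scale formula, CV = sqrt(2^(ln(2) * s^2) - 1) for log2-transformed data:

```r
set.seed(42)
logx <- rnorm(1000, mean = 10, sd = 0.5)  # simulated log2 intensities of one cell
x <- 2^logx                               # back to the linear scale

cv_linear <- sd(x) / mean(x)              # correct CV, computed in the linear scale
cv_naive <- sd(logx) / mean(logx)         # naive CV on log2 values: wrong
s <- sd(logx)
cv_adapted <- sqrt(2^(log(2) * s^2) - 1)  # Canchola-style formula for log2 data
```

The naive log-scale CV grossly underestimates the linear-scale CV, while the adapted formula recovers it.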
- Transform as soon as possible
- You cannot use pseudo-counts when log-transforming MS data! MS data are intensities, not counts. Zeros should be NAs anyway.
- Note that according to Harris et al. 2023, the main objective of log-transformation is variance stabilisation, and they observed that log-transformation may not achieve this goal since the data seems overdispersed.
- Ahlmann-Eltze and Huber 2023 offer an interesting discussion and comparison about different variance-stabilising approaches for scRNA-Seq.
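As a concrete illustration of the pseudo-count point above, a toy base R example (invented matrix) that replaces zeros with NAs before log2-transformation, avoiding both pseudo-counts and -Inf values; within scp/QFeatures the equivalent steps are zeroIsNA() and logTransform():

```r
## Toy peptide x cell intensity matrix; zeros denote absent measurements
m <- matrix(c(1200, 0, 350, 80, 0, 640), nrow = 3,
            dimnames = list(paste0("pep", 1:3), c("cell1", "cell2")))
m[m == 0] <- NA   # zeros are missing values, not true zero intensities
logm <- log2(m)   # no pseudo-count needed, no -Inf produced
```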
A reference in single-cell proteomics is about a 5-cell equivalent and is generated by dilution of bulk lysates. This sample is then used to subtract or divide the peptide intensities of the samples in the same batch. Therefore, we should call this batch correction rather than normalisation. I dislike this approach since noise is associated with each reference acquisition. Olga Vitek suggested including at least 2 reference channels to better estimate the "normalisation" factor. We also refrained from "normalising" in the linear scale (division by the reference), but rather perform it in the log scale (subtraction).
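A toy sketch of this log-scale reference subtraction (all values simulated; the channel layout is an assumption):

```r
## Simulated log2 intensities: one reference channel and three single cells
set.seed(2)
logx <- matrix(rnorm(20, mean = 10), nrow = 5,
               dimnames = list(paste0("pep", 1:5),
                               c("ref", "sc1", "sc2", "sc3")))
## Subtract the reference profile from every single-cell channel of the run
corrected <- sweep(logx[, c("sc1", "sc2", "sc3")], 1, logx[, "ref"], FUN = "-")
```

With two reference channels, as suggested above, one would subtract rowMeans() of the two reference columns instead.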
Only applicable to multiplexed experiments: since we perform batch correction for MS acquisition runs and for multiplexing labels (TMT, mTRAQ), is normalisation still required?
I answered this question with scplainer, and yes, normalisation is very important and captures cell-specific variations that are not captured by the combined modelling of label and MS batch. scplainer currently uses the median peptide abundance as a normalisation variable during modelling. In this case, normalisation captures both technical variation, such as small differences in sample preparation, and biological variation, for instance differences in cell size or metabolic activity. Whether the latter should be included in or excluded from downstream analyses is a conceptual consideration. My intuition is that:
- Relative abundances in a cell (concentrations) matter more for proteins such as enzymes, ligand-receptors
- Absolute abundances in a cell (copy numbers) matter more for proteins such as structural proteins
If indeed these two types of abundance matter, the model should be able to decouple cell-specific technical variation from cell-specific biological variation. To do so, we could model the normalisation factor as a function of biological variables such as the cell size. For instance, FACS experiments measure SSC and FSC, which provide proxies for granularity and size, respectively. Alternatively, the CellenOne measures cell sizes during dispensing. I however noticed in the schoof2021 dataset a poor correlation between FSC/SSC and the median intensity. One explanation is that the factors that drive the need for normalisation are more technical than biological and happen after cell dispensing. Leduc et al. 2022 observed a good correlation between total protein intensities and cell sizes, as measured by the CellenOne. I could confirm this finding with our modelling approach using minimal data processing. However, the relationship is not perfect, indicating that technical variability also influences the intensity. Modelling the normalisation factor as a function of cell size would enable an effective decoupling of biological and technical effects: the estimated trend would represent biological normalisation while the residuals would represent technical normalisation, each being included in the scplainer modelling approach.
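A hypothetical sketch of this decoupling (simulated data; the variable names are assumptions): regress the per-cell normalisation factor on cell size, keep the fitted trend as the biological component and the residuals as the technical component.

```r
set.seed(1)
n <- 100
cell_size <- rnorm(n, mean = 12, sd = 2)             # e.g. CellenOne diameter (simulated)
norm_factor <- 0.3 * cell_size + rnorm(n, sd = 0.5)  # per-cell median log2 intensity (simulated)
fit <- lm(norm_factor ~ cell_size)
biological <- fitted(fit)    # size-driven (biological) normalisation
technical <- residuals(fit)  # leftover (technical) normalisation
```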
The common practice to estimate sample normalisation factors is to compute the median, mean or total intensity for each cell. I would choose the median over the mean as it is more robust against outliers, but is this the best solution?
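A toy base R example of median-based normalisation, illustrating the robustness argument (matrix values invented):

```r
set.seed(4)
logx <- matrix(rnorm(30, mean = 10), nrow = 10,
               dimnames = list(paste0("pep", 1:10), paste0("cell", 1:3)))
logx[1, 1] <- 25                            # a single outlier barely moves the median
f <- apply(logx, 2, median, na.rm = TRUE)   # one normalisation factor per cell
normalised <- sweep(logx, 2, f, FUN = "-")  # centre each cell's log2 intensities at 0
```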
Another approach is to align quantiles. While it makes sense for bulk samples, I'm not convinced that different cells should share the same feature intensity distributions, especially in the presence of highly missing data.
Another approach would be to take inspiration from the deconvolution approach implemented in the scran package by Lun et al. 2016, but this would require some effort to explore its suitability and to adapt the algorithm.
Replacing values with batch-corrected values does not propagate the estimation uncertainty.
It was clear from the Jupyter notebooks from Schoof et al. 2021 that the TMT label has an impact on quantification. We further confirmed, for most multiplexed experiments, that the label has an impact on quantifications, but this impact can be modelled.
- Ratio compression is often mentioned as a drawback of TMT labelling. However, LFQ data does not allow modelling the MS acquisition-level bias either.
- One way to estimate the MS acquisition bias in LFQ is to model it with respect to acquisition time (e.g. using a spline basis), or to include LC/MS maintenance or calibration events as discrete batches.
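A sketch of the spline idea with simulated data (the drift shape, time scale and variable names are assumptions); splines::ns ships with base R:

```r
library(splines)
set.seed(7)
acq_time <- sort(runif(200, 0, 48))         # hours since the start of the batch
drift <- sin(acq_time / 8)                  # smooth, simulated instrument drift
y <- 10 + drift + rnorm(200, sd = 0.2)      # per-run log2 intensity summary
fit <- lm(y ~ ns(acq_time, df = 4))         # model the drift with a natural spline
corrected <- residuals(fit) + coef(fit)[1]  # drift removed, recentred at the intercept
```

Maintenance or calibration events would instead enter the model as a discrete factor.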
SCP data seem not to be MNAR, but I couldn't find a convincing framework to demonstrate this. Instead, I rely on the relationship between average feature intensity and feature detection rate, as suggested by Li and Smyth 2023. However, their approach does not account for the presence of batch effects and may be biased for SCP.
I therefore think SCP data are mostly MCAR. I find that the MCAR definition carries only little information: it says that missingness is unrelated to the underlying value, but there must be other reasons for values to be missing, for which we could model the mechanism. For instance, in DDA, we could model the detection rate of a peptide with respect to the MS1 rank of its precursor in each run. Is a peptide not detected because it was not selected for MS2? For DIA, we could adapt the approach and model the detection rate with the expected spectral complexity determined in MS1. Other variables could be peptide properties. Could we predict peptide ionisation based on its physico-chemical properties?
We discussed this in Vanderaa and Gatto 2023.
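A hypothetical sketch of the DDA idea above, with fully simulated data: a logistic regression of peptide detection on the MS1 rank of its precursor (the simulation parameters are invented).

```r
set.seed(3)
n <- 500
ms1_rank <- sample(1:1000, n, replace = TRUE)  # precursor MS1 intensity rank in the run
p_detect <- plogis(4 - 0.01 * ms1_rank)        # simulated: low rank = more detectable
detected <- rbinom(n, 1, p_detect)             # 1 = peptide identified in the run
fit <- glm(detected ~ ms1_rank, family = binomial())
## A clearly negative ms1_rank coefficient would support a rank-driven
## (hence modellable) missingness mechanism
coef(fit)["ms1_rank"]
```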
Some alternative approaches that serve as inspiration:
- The RUV-III-C model is a batch correction model where the quantification data are decomposed into a component with known biological factors and a component with unknown factors that are treated as unwanted (batch) variation. It can handle missing data by fitting the model to each peptide separately, ignoring samples with missing values. This removes the need for imputation, but I don't think the method will scale to single-cell data, and the unknown variation is considered unwanted whereas unknown biological variation is certainly present in SCP data. Furthermore, this model does not support statistical inference and hence requires transforming the data (cf. the unaccounted uncertainty mentioned above).
- ZINB-WaVE is a model for scRNA-Seq that extends the RUV model. It models RNA count data using the NB error family (which should be replaced by a Gaussian error for proteomics) as well as the probability of detection. Both components are modelled as a function of gene-wise known factors, sample-wise known factors, and gene- and sample-wise unknown factors. The latter could represent wanted biological variation such as undocumented cell types or cell states. The issue is that this method does not scale well (although there is the NewWave upgrade), and it does not perform statistical inference.
- scVI is also taken from the scRNA-Seq field, but it models the data using variational auto-encoders. It uses the learned latent space to optimise a generative model that predicts batch-corrected counts and expected dropout, but it can also directly perform Bayesian inference on the parameters. The method scales well, but it requires that the number of cells be much larger than the number of features, and it needs to be adapted to proteomics data.
- The hurdle model is a two-component model with one focusing on differential abundance (MSqRob model) and the other on differential detection (binomial model). However, further research is needed to speed-up the optimisation (cf Lieven Clement).
- Philippe Hauchamps did his master's thesis about an improvement of proDA to enable statistical inference from peptide data without the need for imputation.
- Multiple imputation with different methods and parameters allows deriving distribution statistics (hence uncertainty metrics) for the imputed values that could be used in downstream analyses.
- The variance of imputed values could be used as weights for downstream statistical modelling.
- Get inspiration from Bramer et al. 2021, Lazar et al. 2016, Kong et al. 2022.
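A deliberately simplistic multiple-imputation sketch (invented values): impute a missing intensity several times by resampling the observed values of the same peptide, then use the across-imputation spread as an uncertainty metric.

```r
set.seed(5)
obs <- c(9.8, 10.1, 10.4, 9.9, NA, 10.2)  # one peptide across cells, one value missing
m <- 20                                   # number of imputations
imputed <- replicate(m, sample(obs[!is.na(obs)], 1))
imp_mean <- mean(imputed)                 # consensus imputed value
imp_var <- var(imputed)                   # uncertainty; could weight downstream models
```

A real implementation would draw from a proper imputation model (e.g. KNN or a probabilistic model) rather than resampling observed values.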
- Multiple PSMs (~ precursor ions) can match a peptide and multiple peptides can match a protein; cf. the recent development by Laurent: MsCoreUtils::aggregate_by_matrix()
- PSMs should not be seen as independent observations (cf discussion with Thomas Burger)
- With MaxQuant, some PSMs match the same sequence, modification and charge but display different retention times (cf Lieven Clement)
library(scpdata)
library(dplyr)
library(ggplot2)

## Fetch the Specht et al. 2019 (v3) dataset from scpdata
specht <- specht2019v3()
featAnn <- rowData(specht[[1]])
## Flag PSMs sharing the same set, sequence, charge and modifications
featAnn$ID <- paste0(featAnn$Set, featAnn$Sequence, featAnn$Charge,
                     featAnn$Modifications)
dupl <- featAnn$ID[duplicated(featAnn$ID)]
## Plot DART-ID q-value against retention time for the duplicated PSMs
featAnn %>%
    data.frame %>%
    filter(Set == Set[[1]],
           ID %in% dupl) %>%
    ggplot() +
    aes(y = dart_qval,
        x = Retention.time,
        colour = ID,
        group = ID) +
    geom_point() +
    geom_line() +
    theme(legend.position = "none")
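To connect this with the aggregation point above, a base R sketch that mimics the adjacency-matrix idea behind MsCoreUtils::aggregate_by_matrix() (toy values; a sum is used as the aggregation function):

```r
## Toy PSM/peptide x cell matrix
pep <- matrix(c(10, 12, 11, 9, 14, 13), nrow = 3,
              dimnames = list(paste0("pep", 1:3), c("cell1", "cell2")))
## Adjacency matrix: pep1 and pep2 map to protA, pep3 to protB
adj <- matrix(c(1, 1, 0,
                0, 0, 1), nrow = 3,
              dimnames = list(rownames(pep), c("protA", "protB")))
## Sum-based aggregation as a single matrix product
prot <- t(adj) %*% pep
```

The adjacency matrix naturally encodes shared peptides (a row with several 1s), which a simple group-by vector cannot.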
Perform protein-level inference using peptide-level data, as performed by the MSqRob model.
For most datasets, linear regression is sufficient and we can retrieve unknown sources of variation from the residuals. This two-step approach could be formalised through the optimisation of latent variables, as performed in RUV-III-C (Poulos et al. 2020) or ZINB-WaVE (Risso et al. 2018).