EVALFQ

Requires R > 3.5. Installed with devtools.

EVALFQ provides an open-access online service enabling (1) label-free proteome quantification (LFQ) based on three quantification measurements (SWATH-MS, Peak Intensity and Spectral Counting), (2) the evaluation of LFQ performance from multiple perspectives and (3) the identification of the optimal LFQs based on comprehensive performance ranking. It not only automatically detects the diverse formats of data generated by a variety of quantification software, but also provides the most complete set of processing methods among available tools, including methods for transformation, pretreatment (centering, scaling and normalization) and missing value imputation.

For function descriptions and analyses of sample datasets, you can also use the "??EVALFQ" command in R.

Description of methods

Transformation

Box-Cox Transformation (BOX)
Method’s Introduction: This method aims at transforming asymmetrical data to fulfill normality assumption in a regression model (Lo and Gottardo, 2012) by ensuring that the usual assumptions for a linear model hold (Thygesen and Zwinderman, 2004).
Reported Applicable Domain(s): Many real-world datasets fail to approximate a normal distribution, and this method is thus considered a way to meet the normality assumption (Martinez-Arranz, et al., 2015).
Research Application(s): It has been used to discover new therapeutic targets for seven liver diseases (Kohl, et al., 2014), and to identify novel diagnostic molecules indicating liver fibrosis severity between rapid and slow progression (Cano, et al., 2017).

Log Transformation (LOG)
Method’s Introduction: This method tends to transform the distribution of protein abundance ratio to a more symmetrical (almost normal) distribution by minimizing the proteins of extreme abundance (Callister, et al., 2006).
Reported Applicable Domain(s): This method was widely adopted in current OMICs, and carried out almost routinely for obtaining a more symmetric distribution prior to statistical analysis (De Livera, et al., 2012).
Research Application(s): It has been used to identify new drug targets for treating early-stage hepatocellular carcinoma (Jiang, et al., 2019), and quantify proteins from formalin-fixed and paraffin-embedded colorectal cancer (Wisniewski, et al., 2012).
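
The two transformations above can be sketched in a few lines of base R. This is a minimal illustration, not the code EVALFQ runs internally: the function names `box_cox` and `log_transform`, the fixed `lambda`, and the choice of base 2 for the logarithm are all assumptions made for the example.

```r
# Box-Cox transform for strictly positive intensities.
# lambda is supplied by the caller here (assumption); lambda = 0 reduces
# to the natural logarithm.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0, na.rm = TRUE))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Log transform; base 2 is an assumption, chosen because it is common
# for intensity data.
log_transform <- function(x) log2(x)

intensities <- c(1, 4, 16, 256, NA)
log_transform(intensities)           # missing values stay NA
box_cox(intensities, lambda = 0.5)
```

In practice lambda is usually estimated from the data rather than fixed by hand.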

Variance Stabilization Normalization (VSN)
Method’s Introduction: This method approaches the logarithm for large values to remove heteroscedasticity using the inverse hyperbolic sine (Kohl, et al., 2012), and keeps the variance constant over the entire data range (Huber, et al., 2002).
Reported Applicable Domain(s): This method makes the individual observations more directly comparable, based on the assumption that most proteins in different samples are not differentially expressed (Lin, et al., 2008; Rausch, et al., 2016).
Research Application(s): It has been used to address the accuracy and precision issues in MS-based isobaric tags for relative and absolute proteomic and metabolomic quantification (Karp, et al., 2010).

Pretreatment: Centering

Mean Centering (MEC)
Method’s Introduction: This method aims at centering data distribution at the origin in the multidimensional space by subtracting the mean value of each peak from the corresponding variable in each sample (Xi, et al., 2014).
Reported Applicable Domain(s): Based on the assumption that all proteins/peptides are equally important, this method transforms all intensities to around zero instead of the mean of protein intensities (Van den Berg, et al., 2006).
Research Application(s): It has been used to facilitate the improvement of the sensitivity of significance test in spectral counting-based comparative discovery proteomics (Gregori, et al., 2012).

Median Centering (MDC)
Method’s Introduction: This method converts all concentrations into fluctuations around zero, rather than the median of the protein intensities (Tang, et al., 2019), by subtracting the median of each sample (Jauhiainen, et al., 2014).
Reported Applicable Domain(s): This method is based on the assumption that all studied proteins are equally important (Van den Berg, et al., 2006), and proves to be relatively non-robust if the total number of proteins is small (Jauhiainen, et al., 2014).
Research Application(s): It has been used to facilitate normalization procedures in LC-MS-based proteomic experiments through dataset dependent ranking of normalization scaling factors (Webb-Robertson, et al., 2011).
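
Both centering methods subtract a location estimate from the data. The sketch below assumes a numeric matrix with samples in rows and proteins in columns, and leaves the direction of centering as a parameter because the descriptions above mention both per-variable and per-sample subtraction. `center_by` and the toy matrix are hypothetical names and values used only for illustration.

```r
# Subtract a location statistic (mean or median) along one margin.
# margin = 2 centers each protein (column); margin = 1 centers each sample (row).
center_by <- function(x, stat = mean, margin = 2) {
  centers <- apply(x, margin, stat, na.rm = TRUE)
  sweep(x, margin, centers, "-")
}

x <- rbind(s1 = c(p1 = 10, p2 = 100),
           s2 = c(p1 = 12, p2 = 140),
           s3 = c(p1 = 17, p2 = 120))

mean_centered   <- center_by(x, mean)     # MEC-style centering
median_centered <- center_by(x, median)   # MDC-style centering
```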

Pretreatment: Scaling

Auto Scaling (ATO)
Method’s Introduction: This method is one of the simplest methods to adjust the proteomic variances, which scales protein intensities based on the standard deviation of the proteomic data (Chung and Kang, 2019; Kohl, et al., 2012).
Reported Applicable Domain(s): Based on the assumption that all proteins are equally important (Van den Berg, et al., 2006), it changes the emphasis from proteins of high concentrations to those of moderate/small abundances (Wang, et al., 2013; Xi, et al., 2014).
Research Application(s): It has been used to identify proteomic markers for psoriasis and psoriasis arthritis (Reindl, et al., 2016) and normalize LC-MS proteomics based on scan-level data (Nezami Ranjbar, et al., 2013).

Pareto Scaling (PAR)
Method’s Introduction: This method uses the square root of the standard deviation of the data as the scaling factor, which can reduce the weight of a large fold change in protein intensities (Kohl, et al., 2012).
Reported Applicable Domain(s): This method works based on the assumption that all proteins are equally important (Van den Berg, et al., 2006), and its disadvantage lies in its high sensitivity to large fold changes (van den Berg, et al., 2006).
Research Application(s): It has been implemented into the proteomic experiments that are based on the LC-MS/MS, and it is expected with great potential to be applied to metaproteomic research (Bereman, et al., 2014).

Vast Scaling (VAS)
Method’s Introduction: This method is an extension of auto scaling that focuses on stable variables and uses standard deviation and the so-called coefficient of variation as the scaling factor (Di Guida, et al., 2016; van den Berg, et al., 2006).
Reported Applicable Domain(s): Based on the assumption that all proteins are equally important (Van den Berg, et al., 2006), it is suitable for intensities of small fluctuations, but not suited for large variations without group structure (van den Berg, et al., 2006).
Research Application(s): It has been used to investigate the feasibility of proteomics and metabolomics for immediate analysis of resection margins during breast cancer surgery (Bathen, et al., 2013).

Range Scaling (RAN)
Method’s Introduction: This method scales the protein abundances for a systematic variance according to the abundance range of proteins of all samples as the scaling factor (Smilde, et al., 2005).
Reported Applicable Domain(s): Based on the assumption that all proteins are equally important (Van den Berg, et al., 2006), this method is usually used to shift the emphasis from proteins of high concentration to those of medium/small abundance (Parastar and Bazrafshan, 2016).
Research Application(s): It has been used to manipulate the datasets of non-targeted ultra-high performance liquid chromatography tandem mass spectrometry (UHPLC-MS) proteomics/metabolomics (Di Guida, et al., 2016).
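
The four scaling methods differ only in the scaling factor applied after mean centering. The sketch below follows the standard definitions in van den Berg et al. (2006) and assumes samples in rows and proteins in columns; `scale_columns` is a hypothetical helper, not an EVALFQ function.

```r
# Center each protein (column) on its mean, then divide by a
# method-specific scaling factor.
scale_columns <- function(x, method = c("auto", "pareto", "vast", "range")) {
  method <- match.arg(method)
  mu <- colMeans(x, na.rm = TRUE)
  s  <- apply(x, 2, sd, na.rm = TRUE)
  centered <- sweep(x, 2, mu, "-")
  divisor <- switch(method,
    auto   = s,            # standard deviation
    pareto = sqrt(s),      # square root of the standard deviation
    vast   = s^2 / mu,     # same as (x - mean) / sd * (mean / sd)
    range  = apply(x, 2, max, na.rm = TRUE) - apply(x, 2, min, na.rm = TRUE)
  )
  sweep(centered, 2, divisor, "/")
}

x <- cbind(p1 = c(2, 4, 6), p2 = c(10, 10, 40))
scale_columns(x, "auto")
scale_columns(x, "pareto")
```

Note that vast scaling divides by the column mean, so it is undefined for a protein whose mean is zero.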

Pretreatment: Normalization

Mean Normalization (MEA)
Method’s Introduction: This method reduces variability among replicates by calculating the intensity of each protein in a given sample as the mean of intensities of all variables in samples (De Livera, et al., 2012; Paredi, et al., 2019).
Reported Applicable Domain(s): Based on the assumption that the mean level of abundances is consistent in all samples (Valikangas, et al., 2018), it ensures that the protein abundance values in all samples are comparable with one another (Craig, et al., 2006).
Research Application(s): It has been used to analyze the dataset of gel-free quantitative proteomics (Bennike, et al., 2016) and to profile MALDI-TOF urine peptidome with enhanced reproducibility (Padoan, et al., 2015).

Median Normalization (MED)
Method’s Introduction: This method removes unwanted variation among all samples (De Livera, et al., 2012) by calculating the intensity of each protein in a given sample as the median of intensities of all variables in samples (Valikangas, et al., 2018).
Reported Applicable Domain(s): Based on the assumption that the median level of abundances is consistent in all samples (Craig, et al., 2006; Valikangas, et al., 2018), it is suitable for the situation in which samples in a dataset are separated by a constant and the protein intensity of each sample has the same median (Valikangas, et al., 2018).
Research Application(s): This method has been applied to normalize the data of Saccharomyces cerevisiae proteome using the advanced technique of SWATH-mass spectrometry (Bennike, et al., 2016).

Median Absolute Deviation (MAD)
Method’s Introduction: This method measures the median of the absolute deviations based on the median in protein intensity with more robustness and less sensitivity to abnormal values (Chawade, et al., 2014).
Reported Applicable Domain(s): This method is appropriate for the situation in which the values of the spread of expression and the median expression are consistent in measured samples (Fundel, et al., 2008).
Research Application(s): It has been used to process the data of peptide-centric LC-MS proteomics (Selevsek, et al., 2015) and identify subtype specific markers and therapeutic target for metaplastic breast carcinoma (Djomehri, et al., 2020).

Total Ion Current (TIC)
Method’s Introduction: This method sums all the separate ion currents carried by the ions of different m/z contributing to a complete mass spectrum or in a specified m/z range of a mass spectrum (Gaspari, et al., 2016).
Reported Applicable Domain(s): This method assumes that most proteins are unchanged under the studied condition and that there are roughly equal numbers of proteins that are both up and down-regulated (Wulff and Mitchell, 2018).
Research Application(s): It has been used to achieve SELDI-TOF-MS proteomic profiling of serum, urine, and amniotic fluid under the condition of neural tube defects (Liu, et al., 2014).

Cyclic Loess (CYC)
Method’s Introduction: This method combines MA-plot and Bland-Altman plot by assuming the existence of non-linear bias (Kohl, et al., 2012), and it estimates a regression surface using multivariate smoothing procedure (Webb-Robertson, et al., 2014).
Reported Applicable Domain(s): This method is based on the assumption that the majority of the intensities are unchanged in all samples (Ballman, et al., 2004; Cox, et al., 2014) and that the systematic bias nonlinearly depends on abundances (Valikangas, et al., 2018).
Research Application(s): It has been used to normalize the data of the quantitative label-free proteomics (Valikangas, et al., 2018), and to conduct proteomic profiling in the context of common experimental designs (Keeping and Collins, 2011).

Linear Baseline Scaling (LIN)
Method’s Introduction: This method maps linearly from each protein spectrum to a baseline by multiplying the protein intensities in all spectra using a particular scaling factor (Bolstad, et al., 2003; Kohl, et al., 2012).
Reported Applicable Domain(s): Based on the assumption that most protein abundances are unchanged in all samples (Adriaens, et al., 2012; Ballman, et al., 2004), this method is applicable for the situation in which a constant linear relationship exists between each feature of a given spectrum and the baseline in the studied samples (Kohl, et al., 2012).
Research Application(s): It facilitates the investigation of the preservative effects of morin on banana during the postharvest storage using the metabolites profiles based on NMR spectroscopy (Zhu, et al., 2018).

Robust Linear Regression (RLR)
Method’s Introduction: This method is used for transference when you want to rescale one reference interval to another scale, which is more robust against outliers in the data than linear regression (Valikangas, et al., 2018).
Reported Applicable Domain(s): Based on the assumption that most intensities are unchanged in all samples (Ballman, et al., 2004; Wang, et al., 2011), this method is suitable for OMICs when the number of common features is five or larger (Wehrens, et al., 2014).
Research Application(s): It has been used to enable multidimensional normalization for minimizing the plate effects in the analysis of suspension bead array proteomic data (Lin, et al., 2008).

Locally Weighted Scatterplot Smoothing (LOW)
Method’s Introduction: This method creates a smooth line in regression analysis using a time plot or scatter plot to help to describe the relationship between variables and foresee trends (Yang, et al., 2002).
Reported Applicable Domain(s): It assumes that most abundances are unchanged in all samples (Adriaens, et al., 2012; Ballman, et al., 2004), and that the systematic bias is non-linearly dependent on the magnitude of peptide abundances (Adriaens, et al., 2012).
Research Application(s): This method has been used to process peptide-centric LC-MS proteomics data (Matzke, et al., 2011) and profile the two-color array expression dataset of retinal ganglion cell layer (Kim, et al., 2006).

EigenMS (EIG)
Method’s Introduction: This method preserves the true differences by estimating the treatment effects using ANOVA model, and is applied in the profiling of MS-based quantitative label-free proteomics (Karpievitch, et al., 2012; Karpievitch, et al., 2014).
Reported Applicable Domain(s): It is well-suited to data with widespread missing measurements, and can be included in a proteomic pipeline as it does not require any special downstream steps/housekeeping (Karpievitch, et al., 2009).
Research Application(s): This method has been applied to normalize the protein intensities in the bottom-up MS-based proteomic profiling and the label-free LC-MS based proteomics analysis (Karpievitch, et al., 2009).

Probabilistic Quotient Normalization (PQN)
Method’s Introduction: This method integrally normalizes each spectrum and calculates a quotient between test and reference spectra, then all variables of the test spectrum are divided by the median quotient (Dieterle, et al., 2006).
Reported Applicable Domain(s): Based on the assumption that most protein intensities are unchanged in all samples (Tobin, et al., 2017), it ensures that the protein abundance values in all samples are comparable with one another (Craig, et al., 2006).
Research Application(s): This method has been used to conduct variance decomposition of protein profiles from antibody arrays using a longitudinal twin model based on affinity-based proteomic technologies (Lo and Gottardo, 2012).
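
The PQN steps described above (integral normalization, quotient against a reference spectrum, division by the median quotient) can be sketched as follows. The sketch assumes samples in rows, strictly positive intensities, and the median spectrum across samples as the reference; `pqn` is a hypothetical name and this is not EVALFQ's internal implementation.

```r
pqn <- function(x) {
  # Step 1: integral normalization, each sample sums to 1.
  x <- x / rowSums(x, na.rm = TRUE)
  # Step 2: reference spectrum, here the per-protein median (assumption).
  reference <- apply(x, 2, median, na.rm = TRUE)
  # Step 3: quotient of every variable against the reference.
  quotients <- sweep(x, 2, reference, "/")
  # Step 4: one median quotient per sample.
  median_quotient <- apply(quotients, 1, median, na.rm = TRUE)
  # Step 5: divide all variables of each sample by its median quotient.
  x / median_quotient
}

# Three samples with the same profile at different dilutions.
x <- rbind(c(1, 2, 3), c(2, 4, 6), c(10, 20, 30))
normalized <- pqn(x)
```

After normalization the three rows of this toy matrix are identical, because they differ only by a dilution factor.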

Quantile Normalization (QUA)
Method’s Introduction: This method replaces each point in the samples with the mean of the corresponding quantile and the distribution of the sample is made consistent on the basis of the sample quantile (Bolstad, et al., 2003).
Reported Applicable Domain(s): Based on the assumption that most protein intensities are unchanged in all samples (Adriaens, et al., 2012), it ensures that the protein abundances in all samples are comparable with one another (Craig, et al., 2006).
Research Application(s): This method has been applied to perform the rapid mass spectrometric conversion of tissue biopsy samples into the permanent quantitative digital proteome map (Guo, et al., 2015).
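
A minimal sketch of quantile normalization in the sense of Bolstad et al. (2003), assuming a complete matrix (no missing values) with samples in rows. The helper name `quantile_normalize` is hypothetical, and ties are broken by position, which is a simplification.

```r
quantile_normalize <- function(x) {
  # Sort the intensities within each sample (row).
  sorted <- t(apply(x, 1, sort))
  # Mean intensity at each rank across samples.
  rank_means <- colMeans(sorted)
  # Replace every value by the mean of its rank.
  t(apply(x, 1, function(sample_row) {
    rank_means[rank(sample_row, ties.method = "first")]
  }))
}

x <- rbind(c(5, 2, 3),
           c(4, 1, 6))
normalized <- quantile_normalize(x)
```

Every sample ends up with the same set of values, only assigned to different proteins.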

Trimmed Mean of M Values (TMM)
Method’s Introduction: This method is conducted to estimate the scale factors between samples that can be incorporated into current statistical analysis in proteomics, and remove the low-expressed proteins (Lin, et al., 2016).
Reported Applicable Domain(s): This method assumes that the protein intensity values are the same among all samples (Branson and Freitas, 2016) and is sensitive to the removal of low-expressed proteins (Lin, et al., 2016); it ensures that the protein abundance values of all samples are comparable with each other (Craig, et al., 2006).
Research Application(s): This method has been used to reveal gender-associated mutagenesis in a cohort of mostly non-smokers from deep proteogenomic landscape of early stage lung adenocarcinoma (Chen, et al., 2020).

Imputation

Background Imputation (BAK)
Method’s Introduction: This method replaces missing values with the lowest detected intensity value of the data set, which is supposed as a representative of the background information (Chai, et al., 2014).
Reported Applicable Domain(s): This method is based on the assumption that the protein value is lost due to the low concentration in the sample and therefore cannot be detected during operation (Tang, et al., 2019).
Research Application(s): This method has been used to analyze a profiling benchmark dataset consisting of 12 non-human proteins spiked into a constant human embryonic kidney background (Valikangas, et al., 2018).

Bayesian Principal Component Imputation (BPC)
Method’s Introduction: This method is capable of auto-selecting relevant parameters used in estimation (Chai, et al., 2014), which allows it to provide a better performance than other imputation such as KNN and SVD (Chai, et al., 2014).
Reported Applicable Domain(s): This method does not need parameter optimization (Brock, et al., 2008), leads to improved performance estimation when the total number of the studied samples is huge (Chai, et al., 2014), and imputes based on the variational Bayesian framework that does not force orthogonality between the principal components (Stacklies, et al., 2007).
Research Application(s): This method has been used to conduct integrative OMIC analyses for Shewanella oneidensis (Torres-Garcia, et al., 2011), and to statistically profile the dataset based on gel-based proteomics (Pedreschi, et al., 2008).

Censored Imputation (CEN)
Method’s Introduction: In this method, a single ‘Not Available’ value found for a protein in a sample group is considered to be ‘missing completely at random’, and no value is imputed for it (Valikangas, et al., 2018).
Reported Applicable Domain(s): Based on the assumption that missing values are due to low detection capacity, this method is suitable for proteins that contain multiple missing values in a sample group (Tang, et al., 2019; Valikangas, et al., 2018).
Research Application(s): This method has been used to quantify the grapevine red blotch virus in grapevine leaf and petiole tissues (Buchs, et al., 2018), and to improve detection of differentially abundant proteins (Koopmans, et al., 2014).

K-nearest Neighbor Imputation (KNN)
Method’s Introduction: This method identifies K proteins that are similar to the proteins with missing value, and these missing values are imputed with the weighted average values of these neighboring proteins (Chai, et al., 2014).
Reported Applicable Domain(s): This method identifies the most similar proteins and uses a weighted average to estimate missing values (Troyanskaya, et al., 2001; Valikangas, et al., 2018), and it outperforms others when processing relatively small datasets (Chai, et al., 2014).
Research Application(s): This method has been used to identify the diagnostic markers for the HCV-induced progression of fibrosis to cirrhosis in HALT-C patients based on SRM targeted proteomics (Qin, et al., 2012).

Local Least Squares Imputation (LLS)
Method’s Introduction: This method is a nonparametric missing value estimation method, which is designed by introducing an automatic K-value estimator based on a linear combination of similar proteins (Kim, et al., 2005).
Reported Applicable Domain(s): This method takes advantage of local similarity structures and optimization process by the least squares, and is a robust and accurate missing value estimation method (Kim, et al., 2005).
Research Application(s): This method has been utilized in missing value imputation for the OMIC data that can be represented as a matrix form, such as NGS data, proteomics and metabolomics (Wu and Jhou, 2017).

Singular Value Decomposition (SVD)
Method’s Introduction: The method aims at finding the dominant components summarizing the entire matrix and then predicting missing value of the target proteins by regressing against the dominant components (Gan, et al., 2006).
Reported Applicable Domain(s): This method can only be performed on complete matrices (Troyanskaya, et al., 2001), and is very sensitive to the type of analyzed datasets (Troyanskaya, et al., 2001). It obtains a set of orthogonal expression patterns that can be combined linearly to approximate the expression of all proteins in the dataset (Troyanskaya, et al., 2001).
Research Application(s): This method has been used to facilitate the realization, visualization, manipulation, and quantitation of the proteomic dataset based on the isobaric tagged mass spectrometry (Gatto and Lilley, 2012).

Zero Imputation (ZER)
Method’s Introduction: This method is deemed to be one of the simplest imputation methods, replacing the missing values with zeros (Gan, et al., 2006), and can be used as a pre-processing step in a more advanced algorithm (Tuikkala, et al., 2008).
Reported Applicable Domain(s): This method does not utilize any information about the data (Gan, et al., 2006), and the integrity and usefulness of the studied data can be seriously jeopardized by this imputation since erroneous relationships among proteins can be artificially created due to this “simplest” imputation (Gan, et al., 2006).
Research Application(s): It has been used to impute the missing values for multivariate statistical analysis of gel-based proteomics (Pedreschi, et al., 2008), and facilitate the analysis of quantitative proteomics using isobaric tagging (Gatto and Lilley, 2012).
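
The two simplest strategies in this section, background imputation (BAK) and zero imputation (ZER), need only base R. The sketch below is illustrative; the helper names are hypothetical and the matrix is a toy example.

```r
# BAK: replace missing values with the lowest detected intensity
# in the whole data set.
impute_background <- function(x) {
  x[is.na(x)] <- min(x, na.rm = TRUE)
  x
}

# ZER: replace missing values with zero.
impute_zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

x <- rbind(c(5, NA, 9),
           c(NA, 3, 7))
bak <- impute_background(x)
zer <- impute_zero(x)
```

The model-based methods (BPC, KNN, LLS, SVD) rely on the dependency packages listed under Installation and are not reproduced here.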

Construction of the LFQ Chain by Sequential Method Integration

An LFQ chain is composed of five sequentially integrated steps: transformation, centering, scaling, normalization, and imputation (as described in the methods above). In other words, a random, comprehensive, and sequential integration of 27 methods (excluding VSN) can result in 3,120 LFQ chains of five steps (2 × 3 × 5 × 13 × 8 = 3,120, taking into account the non-centering, non-scaling, non-normalization, and non-imputation options, which have been widely adopted in previous publications (Guo, et al., 2015; Liu, et al., 2015; Wu, et al., 2016)). Transformation was reported to be essential prior to the downstream analyses in any proteomics study (Rausch, et al., 2016); non-transformation was therefore not allowed in either the analyses of this study or the R package EVALFQ. Moreover, since VSN is unique in having a built-in transformation and subsequent normalization technique, it does not combine with other transformation/pretreatment methods and only combines with imputation in EVALFQ, which results in eight additional LFQ chains. As a result, there are in total 3,128 potential LFQ chains provided in EVALFQ.
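
The count of 3,128 chains can be reproduced by enumerating the combinations. In the sketch below the three-letter codes are the abbreviations used in the method descriptions, while the "NON" label for a skipped step and the column layout are assumptions made for illustration.

```r
transformation <- c("BOX", "LOG")
centering      <- c("NON", "MEC", "MDC")
scaling        <- c("NON", "ATO", "PAR", "VAS", "RAN")
normalization  <- c("NON", "MEA", "MED", "MAD", "TIC", "CYC", "LIN", "RLR",
                    "LOW", "EIG", "PQN", "QUA", "TMM")
imputation     <- c("NON", "BAK", "BPC", "CEN", "KNN", "LLS", "SVD", "ZER")

# All five-step chains that do not involve VSN.
chains <- expand.grid(transformation = transformation,
                      centering      = centering,
                      scaling        = scaling,
                      normalization  = normalization,
                      imputation     = imputation,
                      stringsAsFactors = FALSE)

# VSN only combines with the imputation step; the pretreatment
# columns are marked "NON" here purely as a placeholder.
vsn_chains <- data.frame(transformation = "VSN",
                         centering      = "NON",
                         scaling        = "NON",
                         normalization  = "NON",
                         imputation     = imputation,
                         stringsAsFactors = FALSE)

all_chains <- rbind(chains, vsn_chains)
nrow(chains)      # 2 * 3 * 5 * 13 * 8
nrow(all_chains)  # plus the 8 VSN chains
```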

Five Criteria Enabling the Assessment from Multiple Perspectives

Criterion Ca. Precision of LFQ Based on Proteomes among Replicates
Quantification precision is profoundly affected by different modes of acquisition, various types of quantification software tools, and diverse LFQ chains, which could be evaluated by the pooled intragroup median absolute deviation (PMAD) of the protein intensities among replicates (Kuharev, et al., 2015; Navarro, et al., 2016). PMAD measures the median of the absolute deviations around sample median, and thus is more robust and less sensitive to outliers (Chawade, et al., 2014). PMAD denoted LFQ’s ability to reduce variation among replicates and thus enhance technical reproducibility (Chawade, et al., 2014). Lower PMAD denoted more thorough removal of experimentally induced noise and indicated better precision of the LFQ chain (Muller, et al., 2018).
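
As a rough illustration of Criterion Ca, the sketch below computes one common form of a pooled intragroup MAD: the MAD of each protein across the replicates of a group, summarized by the median over proteins and then averaged over groups. The exact pooling used inside EVALFQ is not stated here, so the pooling steps, the unscaled MAD (`constant = 1`) and the function name `pmad` are assumptions.

```r
# x: numeric matrix, samples in rows and proteins in columns.
# groups: one group label per sample (row).
pmad <- function(x, groups) {
  per_group <- sapply(split(seq_len(nrow(x)), groups), function(idx) {
    # MAD of each protein across the replicates of this group.
    protein_mad <- apply(x[idx, , drop = FALSE], 2, mad,
                         constant = 1, na.rm = TRUE)
    median(protein_mad, na.rm = TRUE)
  })
  mean(per_group)
}

x <- rbind(c(1, 10), c(2, 10), c(3, 10),   # group A replicates
           c(5, 20), c(5, 22), c(5, 24))   # group B replicates
groups <- rep(c("A", "B"), each = 3)
pmad(x, groups)
```

A lower value indicates less variation among replicates, which is the direction the criterion rewards.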

Criterion Cb. Classification Ability of LFQ between Distinct Sample Groups
To assess the performance of the classification ability of LFQ, the variations between two distinct sample groups in the proteomic data analyzed using different LFQs are evaluated. An appropriate LFQ is expected to retain/even enlarge the difference in proteomics data between distinct sample groups (Griffin, et al., 2010; Williams, et al., 2016). Heatmap hierarchically clustering samples based on their protein intensities is therefore frequently used as the effective assessment metric to assess LFQ’s classification ability (Griffin, et al., 2010). First, the total number of protein intensities in each sample is reduced by feature elimination. Then, proteins and samples are clustered by their similarities in protein intensity profile. Detailed process on how to apply Criterion Cb can be found in the study by Griffin NM, et al. (Griffin, et al., 2010).

Criterion Cc. Differential Abundance Analysis Based on Reproducibility Optimization
A well-performing LFQ would generate data whose p-values are uniformly distributed for the majority of non-differentially expressed proteins, along with a peak in the [0.00, 0.05] interval corresponding to proteins with differential intensities (Risso, et al., 2014). To avoid overfitting/confounding, the distribution of p-values of protein intensities between distinct sample groups is examined (Risso, et al., 2014). In proteomic studies that explore the mechanisms underlying complex biological processes, a limited number of differentially expressed proteins may lead to false discovery (Blaise, 2013). Thus, the differential significance of protein intensities between sample groups is first calculated by the reproducibility-optimized test statistic (Pursiheimo, et al., 2015), and a skewed distribution of p-values may indicate overfitting/confounding (Karpievitch, et al., 2012).

Criterion Cd. Consistency of the Identified Markers among Different Datasets
The consistency score is a popular criterion used to represent the robustness of marker discovery (Li, et al., 2017), and it is calculated to quantitatively measure the overlap among protein markers identified from different partitions of a given dataset (Wang, et al., 2015). A higher consistency score indicates more robust results in marker discovery (Caron, et al., 2017). First, the studied datasets were randomly sampled 50 times to create multiple sub-datasets. Second, each protein was ranked by its statistical significance measured using q-value and fold change. Third, the top-ranked proteins in each sub-dataset were selected as markers. Finally, a consistency score was calculated (Wang, et al., 2015).
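
The idea of measuring overlap among marker lists can be illustrated with a deliberately simple stand-in: the mean pairwise Jaccard index between marker sets. This is not the consistency score formula of Wang et al. (2015); it is a substitute chosen only to show the shape of the computation, and the function name and protein identifiers are hypothetical.

```r
# marker_sets: a list of character vectors, one per sub-dataset,
# each holding the top-ranked proteins selected as markers.
mean_pairwise_jaccard <- function(marker_sets) {
  stopifnot(length(marker_sets) >= 2)
  pairs <- combn(length(marker_sets), 2)
  jaccard <- apply(pairs, 2, function(p) {
    a <- marker_sets[[p[1]]]
    b <- marker_sets[[p[2]]]
    length(intersect(a, b)) / length(union(a, b))
  })
  mean(jaccard)
}

marker_sets <- list(c("A", "B", "C"),
                    c("A", "B", "D"),
                    c("A", "B", "C"))
mean_pairwise_jaccard(marker_sets)
```

Values close to 1 mean that nearly the same markers are recovered from every sub-dataset.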

Criterion Ce. Accuracy of LFQ based on Spiked and Background Proteins
To evaluate quantification accuracy, additional experimental data containing spiked proteins are frequently utilized as the gold-standard references (Kuharev, et al., 2015; Navarro, et al., 2016). In this case, the expected log fold changes (logFCs) of both spiked and background proteins are used. In particular, the expected logFC of the background proteins should equal zero (Valikangas, et al., 2018). Here, the logFCs of protein intensity for both spiked and background proteins between distinct sample groups were first calculated. Then, the mean squared error was applied to assess the level of correspondence between the quantified and expected logFCs. The performance is reflected by how well the quantified logFCs correspond to the expected values of the references (Valikangas, et al., 2018). Ideally, the deviation between the quantified and expected logFCs would equal zero (Valikangas, et al., 2018).
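
The mean squared error used in Criterion Ce can be written directly. The sketch below uses made-up logFC values for two spiked proteins (expected logFC of 1) and two background proteins (expected logFC of 0); the function name `logfc_mse` is hypothetical.

```r
# Mean squared error between quantified and expected log fold changes.
logfc_mse <- function(observed, expected) {
  stopifnot(length(observed) == length(expected))
  mean((observed - expected)^2, na.rm = TRUE)
}

observed <- c(spiked1 = 1.1, spiked2 = 0.9, background1 = 0.1, background2 = -0.1)
expected <- c(1, 1, 0, 0)
logfc_mse(observed, expected)
```

A value of zero would mean that every quantified logFC matches its expected value exactly.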

Enabling the Comprehensive Assessment from Multiple Perspectives

Based on these independent criteria shown above, EVALFQ enabled the performance assessment of LFQ chain from multiple (five) perspectives. Particularly, the performances of 3,128 potential LFQ chains can first be independently ranked using each criterion. Then, an overall ranking of a studied LFQ chain was defined by the sum of multiple (≤5) rankings under multiple criteria (the smaller the sum is, the higher an LFQ chain ranks).
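
The rank-sum aggregation described above can be sketched as follows. The per-criterion ranks and the chain names are invented for illustration, and only three of the criteria are used to show that any subset of up to five can be combined.

```r
# Per-criterion ranks of three hypothetical chains (1 = best).
ranks <- data.frame(
  chain = c("chain_A", "chain_B", "chain_C"),
  Ca = c(1, 2, 3),
  Cb = c(3, 1, 2),
  Cc = c(2, 1, 3)
)

criteria <- c("Ca", "Cb", "Cc")               # criteria selected by the user
ranks$rank_sum <- rowSums(ranks[, criteria])  # smaller sum = better chain
ranks$overall  <- rank(ranks$rank_sum, ties.method = "min")

ranks[order(ranks$rank_sum), ]
```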

Installation

# EVALFQ depends on several packages, which can be installed using the commands below:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biobase")
BiocManager::install("BiocGenerics")
BiocManager::install("ROTS")
BiocManager::install("limma")
BiocManager::install("ProteoMM")
BiocManager::install("impute")
BiocManager::install("pcaMethods")
BiocManager::install("vsn")
BiocManager::install("affy")

# devtools is needed before any package can be installed from GitHub:
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("cran/metabolomics")

# To install the package, type this command in R:
devtools::install_github("idrblab/EVALFQ")

# Or download the source package EVALFQ_0.1.0.tar.gz and install it:
install.packages("EVALFQ_0.1.0.tar.gz", repos = NULL, type = "source", INSTALL_opts = "--no-multiarch")

Usage

library(EVALFQ)

1. Prepare input file for evaluating label-free proteome quantification.

my_data <- PrepareInuputFiles(acquisitionmethods, rawdataset, lable)

acquisitionmethods Input the corresponding "number" of acquisition techniques as follows:
If set 1, the user chooses to process the data based on SWATH-MS.
If set 2, the user chooses to process the data based on Peak Intensity.
If set 3, the user chooses to process the data based on Spectral Counting.

rawdataset Input the name of your raw dataset directly obtained from software.
EVALFQ supports a variety of data generated by 18 kinds of popular quantification software.
The format of each software could be readily found as follows (Right Click to Save).
(1) A list of software for pre-processing the data acquired based on SWATH-MS.
DIA-UMPIRE; OpenSWATH; PeakView; Skyline; Spectronaut
(2) A list of software for pre-processing the data acquired based on Peak Intensity.
MaxQuant; MFPaQ; OpenMS; PEAKS; Progenesis; Proteios SE; Scaffold; Thermo Proteome Discoverer
(3) A list of software for pre-processing the data acquired based on Spectral Counting.
Abacus; Census; DTASelect; IRMa-hEIDI; MaxQuant; MFPaQ; ProteinProphet; Scaffold

lable Input the label of your dataset.

2. Conduct LFQ and assess performance of all possible LFQ workflows.

allranks <- lfqevalueall(data_q,
                         assum_a = "Y",
                         assum_b = "Y",
                         assum_c = "Y",
                         Ca = "1", 
                         Cb = "1", 
                         Cc = "1", 
                         Cd = "1")

data_q This input file should be numeric except for the first and second columns, which contain the names and labels (control or case) of the studied samples, respectively. The intensity data should be provided with samples in rows and proteins/peptides in columns. Missing values (NA) of protein intensity are allowed.
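
The layout described for data_q can be illustrated with a small data frame. The column names, sample names and intensity values below are invented; whether lfqevalueall expects such an object directly or the output of PrepareInuputFiles is not stated here, so treat this only as a picture of the expected shape.

```r
# Samples in rows; first column = sample name, second = label,
# remaining columns = numeric protein/peptide intensities (NA allowed).
data_q_example <- data.frame(
  sample = paste0("S", 1:6),
  label  = rep(c("control", "case"), each = 3),
  P001   = c(1520.4, 1498.7, NA, 2310.9, 2250.1, 2402.6),
  P002   = c(88.1, 91.5, 85.0, NA, 40.2, 43.8),
  P003   = c(640.0, 655.3, 630.8, 648.9, 661.2, 652.4),
  stringsAsFactors = FALSE
)

# Quick structural checks before handing the data to an analysis.
intensity_cols <- data_q_example[, -(1:2)]
all(sapply(intensity_cols, is.numeric))
table(data_q_example$label)
```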

assum_a All proteins are assumed to be equally important.
Enter “Y” if this assumption holds for the studied dataset and “N” if it does not.

assum_b The level of protein abundance is assumed to be constant among all samples.
Enter “Y” if this assumption holds for the studied dataset and “N” if it does not.

assum_c The intensities of the vast majority of the proteins are assumed to be unchanged under the studied conditions.
Enter “Y” if this assumption holds for the studied dataset and “N” if it does not.

Ca Criterion (a): precision of LFQ based on the proteomes among replicates.
If set to 1, Criterion (a) is used to assess the LFQ workflows.
If set to 0, Criterion (a) is excluded from the performance assessment.
The default is “1”.

Cb Criterion (b): classification ability of LFQ between distinct sample groups.
If set to 1, Criterion (b) is used to assess the LFQ workflows.
If set to 0, Criterion (b) is excluded from the performance assessment.
The default is “1”.

Cc Criterion (c): differential expression analysis by reproducibility-optimization.
If set to 1, Criterion (c) is used to assess the LFQ workflows.
If set to 0, Criterion (c) is excluded from the performance assessment.
The default is “1”.

Cd Criterion (d): reproducibility of the identified protein markers among different datasets.
If set to 1, Criterion (d) is used to assess the LFQ workflows.
If set to 0, Criterion (d) is excluded from the performance assessment.
The default is “1”.
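
As a rough illustration of the layout described for data_q, the base-R sketch below builds a toy table with samples in rows and proteins in columns. The sample names, protein names, and intensity values are invented, and the column names `sample` and `label` are assumptions made for this example rather than names required by EVALFQ.

```r
# Toy input in the layout described above: samples in rows, proteins in columns.
# All names and values are made up for illustration only.
toy_q <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),                # column 1: sample names
  label  = c("control", "control", "case", "case"),  # column 2: group labels
  P1 = c(1520.3, 1498.7, 2210.4, NA),                # missing values are NA
  P2 = c(310.2, NA, 295.8, 301.1),
  P3 = c(8754.0, 8621.5, 9102.3, 8990.6),
  stringsAsFactors = FALSE
)

# Count the missing intensities per protein (all columns after the first two).
na_per_protein <- colSums(is.na(toy_q[, -(1:2), drop = FALSE]))
print(na_per_protein)
```

A table shaped like `toy_q` is the kind of object that would be passed as data_q in the call shown above.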

3. Conduct LFQ and assess performance by collectively considering the spiked proteins.

allranks <- lfqspikedall(data_s,
                         spiked,
                         assum_a = "Y",
                         assum_b = "Y",
                         assum_c = "Y",
                         Ca = "1", 
                         Cb = "1", 
                         Cc = "1", 
                         Cd = "1",
                         Ce = "1")

data_s The input data. All columns should be numeric except the first and second, which contain the names and labels (control or case) of the studied samples, respectively. The intensity data should be arranged with samples in rows and proteins/peptides in columns. Missing values (NA) of protein intensity are allowed.

spiked This file should provide the concentrations of known proteins (such as spiked proteins). It is required if the user wants to conduct the assessment using Criterion (e). The file should contain the class of samples and the Sample ID. The Sample ID should be unique and can be defined by the EVALFQ user, and the class of samples refers to the group to which each Sample ID belongs. The IDs of the spiked proteins should be consistent between “data_s” and “spiked”. Detailed information is given in the online “Example”.

assum_a All proteins are assumed to be equally important.
Enter “Y” if this assumption holds for the studied dataset, or “N” if it does not.

assum_b The level of protein abundance is assumed to be constant among all samples.
Enter “Y” if this assumption holds for the studied dataset, or “N” if it does not.

assum_c The intensities of the vast majority of the proteins are assumed to be unchanged under the studied conditions.
Enter “Y” if this assumption holds for the studied dataset, or “N” if it does not.

Ca Criterion (a): precision of LFQ based on the proteomes among replicates.
If set to 1, Criterion (a) is used to assess the LFQ workflows.
If set to 0, Criterion (a) is excluded from the performance assessment.
The default is “1”.

Cb Criterion (b): classification ability of LFQ between distinct sample groups.
If set to 1, Criterion (b) is used to assess the LFQ workflows.
If set to 0, Criterion (b) is excluded from the performance assessment.
The default is “1”.

Cc Criterion (c): differential expression analysis by reproducibility-optimization.
If set to 1, Criterion (c) is used to assess the LFQ workflows.
If set to 0, Criterion (c) is excluded from the performance assessment.
The default is “1”.

Cd Criterion (d): reproducibility of the identified protein markers among different datasets.
If set to 1, Criterion (d) is used to assess the LFQ workflows.
If set to 0, Criterion (d) is excluded from the performance assessment.
The default is “1”.

Ce Criterion (e): accuracy of LFQ based on spiked and background proteins.
If set to 1, Criterion (e) is used to assess the LFQ workflows.
If set to 0, Criterion (e) is excluded from the performance assessment.
The default is “1”.
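
The requirement that spiked-protein IDs match between data_s and spiked can be checked before running the assessment. The sketch below is only one possible check: `missing_spiked_ids()` is a hypothetical helper, not an EVALFQ function, and it assumes that both tables keep sample information in their first two columns and use one column per protein. The actual layout of the spiked file should be taken from the online “Example”.

```r
# Hypothetical helper (not part of EVALFQ): returns the spiked-protein IDs that
# appear in the spiked table but not among the protein columns of data_s.
# Assumes both tables keep sample information in their first two columns.
missing_spiked_ids <- function(data_s, spiked) {
  data_ids   <- colnames(data_s)[-(1:2)]
  spiked_ids <- colnames(spiked)[-(1:2)]
  setdiff(spiked_ids, data_ids)
}

# Toy tables with invented names and values.
toy_s <- data.frame(
  sample = c("S1", "S2"), label = c("A", "B"),
  UPS1 = c(10.5, 20.1), UPS2 = c(5.2, 11.0), BG1 = c(100.3, 99.8),
  stringsAsFactors = FALSE
)
toy_spiked <- data.frame(
  sample = c("S1", "S2"), class = c("A", "B"),
  UPS1 = c(1, 2), UPS3 = c(1, 2),   # UPS3 is deliberately absent from toy_s
  stringsAsFactors = FALSE
)

# Sample IDs should be unique; anyDuplicated() returns 0 when they are.
ids_unique <- anyDuplicated(toy_spiked$sample) == 0
print(ids_unique)

print(missing_spiked_ids(toy_s, toy_spiked))
```

An empty result from `missing_spiked_ids()` would indicate that every spiked ID has a matching column in the intensity table.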

4. Draw a heatmap and save it as EVALFQ-OUTPUT.Figure-Top.XXX.workflows.pdf.

lfqvisualize(object, top = 100)

object The output file of lfqevalueall or lfqspikedall.

top The number of top-ranked workflows to show in the heatmap.
The default value is 100.

5. Conduct LFQ and assess performance for a selected subset of LFQ workflows.

res <- lfqevalupart(data_q,
                    selectFile,
                    Ca = "1", 
                    Cb = "1", 
                    Cc = "1", 
                    Cd = "1")

data_q Same as the description of data_q for 'lfqevalueall' above.

selectFile Input the names of your preferred strategies. Sample data of this type is available in the GitHub repository at “idrblab/EVALFQ/data/selectworkflows.rda”.

Ca Criterion (a): precision of LFQ based on the proteomes among replicates.
If set to 1, Criterion (a) is used to assess the LFQ workflows.
If set to 0, Criterion (a) is excluded from the performance assessment.
The default is “1”.

Cb Criterion (b): classification ability of LFQ between distinct sample groups.
If set to 1, Criterion (b) is used to assess the LFQ workflows.
If set to 0, Criterion (b) is excluded from the performance assessment.
The default is “1”.

Cc Criterion (c): differential expression analysis by reproducibility-optimization.
If set to 1, Criterion (c) is used to assess the LFQ workflows.
If set to 0, Criterion (c) is excluded from the performance assessment.
The default is “1”.

Cd Criterion (d): reproducibility of the identified protein markers among different datasets.
If set to 1, Criterion (d) is used to assess the LFQ workflows.
If set to 0, Criterion (d) is excluded from the performance assessment.
The default is “1”.

6. Conduct LFQ and assess performance for a selected subset of LFQ workflows by collectively considering the spiked proteins.

res <- lfqspikepart(data_s,
                    spiked,
                    selectFile,
                    Ca = "1", 
                    Cb = "1", 
                    Cc = "1", 
                    Cd = "1")

data_s Same as the description of data_s for 'lfqspikedall' above.

spiked Same as the description of spiked for 'lfqspikedall' above.

selectFile Input the names of your preferred strategies. Sample data of this type is available in the GitHub repository at “idrblab/EVALFQ/data/selectworkflows.rda”. The abbreviations of all LFQ chains can be downloaded here.

Ca Criterion (a): precision of LFQ based on the proteomes among replicates.
If set to 1, Criterion (a) is used to assess the LFQ workflows.
If set to 0, Criterion (a) is excluded from the performance assessment.
The default is “1”.

Cb Criterion (b): classification ability of LFQ between distinct sample groups.
If set to 1, Criterion (b) is used to assess the LFQ workflows.
If set to 0, Criterion (b) is excluded from the performance assessment.
The default is “1”.

Cc Criterion (c): differential expression analysis by reproducibility-optimization.
If set to 1, Criterion (c) is used to assess the LFQ workflows.
If set to 0, Criterion (c) is excluded from the performance assessment.
The default is “1”.

Cd Criterion (d): reproducibility of the identified protein markers among different datasets.
If set to 1, Criterion (d) is used to assess the LFQ workflows.
If set to 0, Criterion (d) is excluded from the performance assessment.
The default is “1”.

Ce Criterion (e): accuracy of LFQ based on spiked and background proteins.
If set to 1, Criterion (e) is used to assess the LFQ workflows.
If set to 0, Criterion (e) is excluded from the performance assessment.
The default is “1”.

Examples

# Step 1: Prepare input file for evaluating label-free proteome quantification.

my_df <- PrepareInuputFiles(acquisitionmethods = "2", 
                            rawdataset = "MaxQuant_proteinGroups_LFQ.txt", 
                            lable = "MaxQuant_LFQ_Label.txt")
OR

my_df <- read.csv(file = "EVALFQ_Unified_Data.csv", header = TRUE, stringsAsFactors = FALSE)

# Note: the file should be in Comma-Separated Values (CSV) format and provide the intensity data of proteins/peptides. All columns should be numeric except the first and second, which contain the names and labels (control or case) of the studied samples, respectively. The intensity data should be arranged with samples in rows and proteins/peptides in columns. Missing values (NA) of protein intensity are allowed.

The format of input files could be readily found HERE:
MaxQuant_proteinGroups_LFQ.txt
MaxQuant_LFQ_Label.txt
EVALFQ_Unified_Data.csv
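
Before passing a CSV to the functions below, it can help to confirm that it follows the layout described in the note above. The following base-R sketch writes a small invented CSV to a temporary file, reads it back with the same `read.csv()` arguments, and checks the column types. `check_unified_input()` is a hypothetical helper written for this example and is not provided by EVALFQ.

```r
# Hypothetical check (not an EVALFQ function): stops with a message if the
# table does not follow the layout described in the note above.
check_unified_input <- function(df) {
  if (ncol(df) < 3) {
    stop("expected sample name, label, and at least one protein column")
  }
  numeric_cols <- vapply(df[, -(1:2), drop = FALSE], is.numeric, logical(1))
  if (!all(numeric_cols)) {
    stop("non-numeric intensity columns: ",
         paste(names(numeric_cols)[!numeric_cols], collapse = ", "))
  }
  invisible(TRUE)
}

# Write a toy CSV (invented values) to a temporary file and read it back.
tmp <- tempfile(fileext = ".csv")
writeLines(c("sample,label,P1,P2",
             "S1,control,10.5,NA",
             "S2,case,12.1,8.3"), tmp)
toy_df <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)

ok <- check_unified_input(toy_df)
print(ok)
```

A column that contains text other than `NA` would be read as character and reported by the check, which is usually a sign of a stray header or unit string in the file.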

# Step 2: conduct LFQ and assess performance of all possible LFQ workflows or assess performance by collectively considering the spiked proteins.

allranks <- lfqevalueall(data_q = my_df,
                         assum_a = "Y",
                         assum_b = "Y",
                         assum_c = "Y",
                         Ca = "1",
                         Cb = "1",
                         Cc = "1",
                         Cd = "1")

# Note: the spiked file should be in Comma-Separated Values (CSV) format and provide the concentrations of known proteins (such as spiked proteins). It is required if the user wants to conduct the assessment using Criterion (e). The file should contain the class of samples and the Sample ID. The Sample ID should be unique and can be defined by the EVALFQ user, and the class of samples refers to the group to which each Sample ID belongs. The IDs of the spiked proteins should be consistent between "my_spiked" and "spiked_data".

allranks <- lfqspikedall(data_s = my_spiked,
                         spiked = spiked_data,
                         assum_a = "Y",
                         assum_b = "Y",
                         assum_c = "Y",
                         Ca = "1", 
                         Cb = "1", 
                         Cc = "1", 
                         Cd = "1",
                         Ce = "1")

# Note: 'allranks' contains all information on the performance assessment, the criteria selected, and the ranking.
# Step 3: a heatmap illustrating the performance ranking of all LFQ workflows based on the criteria selected by user.

lfqvisualize(object = "EVALFQ-OUTPUT.Data-Overall.Ranking.csv", top = 100)

# Note: the file 'EVALFQ-OUTPUT.Figure-Top.XXX.workflows.pdf' is saved in the current working directory. Use 'getwd()' to find that directory.
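
To confirm where the heatmap was written, the working directory can be inspected directly. The sketch below uses only base R; the file-name pattern is an assumption derived from the 'EVALFQ-OUTPUT.Figure-Top.XXX.workflows.pdf' name given above.

```r
# List heatmap PDFs in the current working directory whose names follow the
# 'EVALFQ-OUTPUT.Figure-Top.XXX.workflows.pdf' pattern mentioned above.
out_dir <- getwd()
heatmaps <- list.files(
  out_dir,
  pattern = "^EVALFQ-OUTPUT\\.Figure-Top\\..*\\.workflows\\.pdf$",
  full.names = TRUE
)
print(out_dir)
print(heatmaps)   # empty character vector if no heatmap has been written yet
```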

Exclusion or inclusion list for the methods based on different assumptions:
InlusionExclusionList.xlsx
The format of input files could be readily found HERE:
selectedmethods.csv

# Users can also run EVALFQ on a selection of their preferred LFQ workflows as follows:

my_df <- read.csv(file = "EVALFQ_Unified_Data.csv", header = TRUE, stringsAsFactors = FALSE)

res <- lfqevalupart(data_q = my_df,
                    selectFile = "selectedmethods.csv",
                    Ca = "1",
                    Cb = "1",
                    Cc = "1",
                    Cd = "1")

OR

res <- lfqspikepart(data_s = my_spiked,
                    spiked = spiked_data,
                    selectFile = "selectedmethods.csv",
                    Ca = "1", 
                    Cb = "1", 
                    Cc = "1", 
                    Cd = "1")            
                               
Note: please select the number codes that represent the appropriate transformation, centering, scaling, normalization, and imputation methods (see the details above).

Contributors

Jianbo Fu, Feng Zhu

Should you have any questions, please contact Jianbo Fu at [email protected]

Reference

Adriaens, M.E., et al. An evaluation of two-channel ChIP-on-chip and DNA methylation microarray normalization strategies. BMC genomics 2012;13(1):42.

Ballman, K.V., et al. Faster cyclic loess: normalizing RNA arrays via linear models. Bioinformatics 2004;20(16):2778-2786.

Bathen, T.F., et al. Feasibility of MR metabolomics for immediate analysis of resection margins during breast cancer surgery. PloS one 2013;8(4):e61578.

Bennike, T.B., et al. Proteome stability analysis of snap frozen, RNAlater preserved, and formalin-fixed paraffin-embedded human colon mucosal biopsies. Data in brief 2016;6(1):942-947.

Bereman, M.S., et al. Implementation of statistical process control for proteomic experiments via LC MS/MS. Journal of the American Society for Mass Spectrometry 2014;25(4):581-587.

Blaise, B.J. Data-driven sample size determination for metabolic phenotyping studies. Analytical chemistry 2013;85(19):8943-8950.

Bolstad, B.M., et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19(2):185-193.

Branson, O.E. and Freitas, M.A. A multi-model statistical approach for proteomic spectral count quantitation. Journal of proteomics 2016;144:23-32.

Brock, G.N., et al. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC bioinformatics 2008;9(1):12.

Buchs, N., et al. Absolute quantification of grapevine red blotch virus in grapevine leaf and petiole tissues by proteomics. Frontiers in plant science 2018;9(1):1735.

Callister, S.J., et al. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. Journal of proteome research 2006;5(2):277-286.

Cano, A., et al. A Metabolomics Signature Linked To Liver Fibrosis In The Serum Of Transplanted Hepatitis C Patients. Scientific reports 2017;7(1):10497.

Caron, E., et al. Precise Temporal Profiling of Signaling Complexes in Primary Cells Using SWATH Mass Spectrometry. Cell reports 2017;18(13):3219-3226.

Chai, L.E., et al. Investigating the effects of imputation methods for modelling gene networks using a dynamic bayesian network from gene expression data. The Malaysian journal of medical sciences : MJMS 2014;21(2):20-27.

Chawade, A., Alexandersson, E. and Levander, F. Normalyzer: a tool for rapid evaluation of normalization methods for omics data sets. Journal of proteome research 2014;13(6):3114-3120.

Chen, Y.J., et al. Proteogenomics of non-smoking lung cancer in east asia delineates molecular signatures of pathogenesis and progression. Cell 2020;182(1):226-244.

Chung, R.H. and Kang, C.Y. A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. GigaScience 2019;8(5):giz045.

Cox, J., et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Molecular & cellular proteomics : MCP 2014;13(9):2513-2526.

Craig, A., et al. Scaling and normalization effects in NMR spectroscopic metabonomic data sets. Analytical chemistry 2006;78(7):2262-2267.

De Livera, A.M., et al. Normalizing and integrating metabolomics data. Analytical chemistry 2012;84(24):10768-10776.

Di Guida, R., et al. Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling. Metabolomics : Official journal of the Metabolomic Society 2016;12(1):93.

Dieterle, F., et al. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Analytical chemistry 2006;78(13):4281-4290.

Djomehri, S.I., et al. Quantitative proteomic landscape of metaplastic breast carcinoma pathological subtypes and their relationship to triple-negative tumors. Nature communications 2020;11(1):1723.

Fundel, K., et al. Normalization strategies for mRNA expression data in cartilage research. Osteoarthritis and cartilage 2008;16(8):947-955.

Gan, X., Liew, A.W. and Yan, H. Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic acids research 2006;34(5):1608-1619.

Gaspari, M., et al. Proteome speciation by mass spectrometry: characterization of composite protein mixtures in milk replacers. Analytical chemistry 2016;88(23):11568-11574.

Gatto, L. and Lilley, K.S. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics 2012;28(2):288-289.

Gregori, J., et al. Batch effects correction improves the sensitivity of significance tests in spectral counting-based comparative discovery proteomics. Journal of proteomics 2012;75(13):3938-3951.

Griffin, N.M., et al. Label-free, normalized quantification of complex mass spectrometry data for proteomic analysis. Nature biotechnology 2010;28(1):83-89.

Guo, T., et al. Rapid mass spectrometric conversion of tissue biopsy samples into permanent quantitative digital proteome maps. Nature medicine 2015;21(4):407-413.

Huber, W., et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002;18(1):96-104.

Jauhiainen, A., et al. Normalization of metabolomics data with applications to correlation maps. Bioinformatics 2014;30(15):2155-2161.

Jiang, Y., et al. Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma. Nature 2019;567(7747):257-261.

Karp, N.A., et al. Addressing accuracy and precision issues in iTRAQ quantitation. Molecular & cellular proteomics : MCP 2010;9(9):1885-1897.

Karpievitch, Y.V., Dabney, A.R. and Smith, R.D. Normalization and missing value imputation for label-free LC-MS analysis. BMC bioinformatics 2012;13 Suppl 16:S5.

Karpievitch, Y.V., et al. Metabolomics data normalization with EigenMS. PloS one 2014;9(12):e116221.

Karpievitch, Y.V., et al. Normalization of peak intensities in bottom-up MS-based proteomics using singular value decomposition. Bioinformatics 2009;25(19):2573-2580.

Keeping, A.J. and Collins, R.A. Data variance and statistical significance in 2D-gel electrophoresis and DIGE experiments: comparison of the effects of normalization methods. Journal of proteome research 2011;10(3):1353-1360.

Kim, C.Y., et al. Gene expression profile of the adult human retinal ganglion cell layer. Molecular vision 2006;12(1):1640-1648.

Kim, H., Golub, G.H. and Park, H. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005;21(2):187-198.

Kohl, M., et al. A practical data processing workflow for multi-OMICS projects. Biochimica et biophysica acta 2014;1844(1 Pt A):52-62.

Kohl, S.M., et al. State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics : Official journal of the Metabolomic Society 2012;8(1):146-160.

Koopmans, F., et al. Empirical bayesian random censoring threshold model improves detection of differentially abundant proteins. Journal of proteome research 2014;13(9):3871-3880.

Kuharev, J., et al. In-depth evaluation of software tools for data-independent acquisition based label-free quantification. Proteomics 2015;15(18):3140-3151.

Li, B., et al. NOREVA: normalization and evaluation of MS-based metabolomics data. Nucleic acids research 2017;45(W1):W162-W170.

Lin, S.M., et al. Model-based variance-stabilizing transformation for Illumina microarray data. Nucleic acids research 2008;36(2):e11.

Lin, Y., et al. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC genomics 2016;17(1):28.

Liu, Y., et al. Quantitative variability of 342 plasma proteins in a human twin population. Molecular systems biology 2015;11(1):786.

Liu, Z., Yuan, Z. and Zhao, Q. SELDI-TOF-MS proteomic profiling of serum, urine, and amniotic fluid in neural tube defects. PloS one 2014;9(7):e103276.

Lo, K. and Gottardo, R. Flexible mixture modeling via the multivariate t distribution with the Box-Cox transformation: an alternative to the skew-t distribution. Statistics and computing 2012;22(1):33-52.

Martinez-Arranz, I., et al. Data in support of enhancing metabolomics research through data mining. Data in brief 2015;3(1):155-164.

Matzke, M.M., et al. Improved quality control processing of peptide-centric LC-MS proteomics data. Bioinformatics 2011;27(20):2866-2872.

Muller, F., et al. On the reproducibility of label-free quantitative cross-linking/mass spectrometry. Journal of the American Society for Mass Spectrometry 2018;29(2):405-412.

Navarro, P., et al. A multicenter study benchmarks software tools for label-free proteome quantification. Nature biotechnology 2016;34(11):1130-1136.

Nezami Ranjbar, M.R., et al. Gaussian process regression model for normalization of LC-MS data using scan-level information. Proteome science 2013;11(Suppl 1):S13.

Padoan, A., et al. Reproducibility in urine peptidome profiling using MALDI-TOF. Proteomics 2015;15(9):1476-1485.

Parastar, H. and Bazrafshan, A. Fuzzy C-means clustering for chromatographic fingerprints analysis: a gas chromatography-mass spectrometry case study. Journal of chromatography. A 2016;1438(1):236-243.

Paredi, G., et al. Is the protein profile of pig Longissimus dorsi affected by gender and diet? Journal of proteomics 2019;206(1):103437.

Pedreschi, R., et al. Treatment of missing values for multivariate statistical analysis of gel-based proteomics data. Proteomics 2008;8(7):1371-1383.

Pursiheimo, A., et al. Optimization of statistical methods impact on quantitative proteomics data. Journal of proteome research 2015;14(10):4118-4126.

Qin, S., et al. SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics 2012;12(8):1244-1252.

Rausch, T.K., et al. Comparison of pre-processing methods for multiplex bead-based immunoassays. BMC genomics 2016;17(1):601.

Reindl, J., et al. Proteomic biomarkers for psoriasis and psoriasis arthritis. Journal of proteomics 2016;140(1):55-61.

Risso, D., et al. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature biotechnology 2014;32(9):896-902.

Selevsek, N., et al. Reproducible and consistent quantification of the Saccharomyces cerevisiae proteome by SWATH-mass spectrometry. Molecular & cellular proteomics : MCP 2015;14(3):739-749.

Smilde, A.K., et al. Fusion of mass spectrometry-based metabolomics data. Analytical chemistry 2005;77(20):6729-6736.

Stacklies, W., et al. PcaMethods--a bioconductor package providing PCA methods for incomplete data. Bioinformatics 2007;23(9):1164-1167.

Tang, J., et al. Simultaneous improvement in the precision, accuracy, and robustness of label-free proteome quantification by optimizing data manipulation chains. Molecular & cellular proteomics : MCP 2019;18(8):1683-1699.

Thygesen, H.H. and Zwinderman, A.H. Comparing transformation methods for DNA microarray data. BMC bioinformatics 2004;5(1):77.

Tobin, J., et al. Untargeted analysis of chromatographic data for green and fermented rooibos: Problem with size effect removal. Journal of chromatography. A 2017;1525(1):109-115.

Torres-Garcia, W., et al. Integrative analysis of transcriptomic and proteomic data of Shewanella oneidensis: missing value imputation using temporal datasets. Molecular bioSystems 2011;7(4):1093-1104.

Troyanskaya, O., et al. Missing value estimation methods for DNA microarrays. Bioinformatics 2001;17(6):520-525.

Tuikkala, J., et al. Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC bioinformatics 2008;9(1):202.

Valikangas, T., Suomi, T. and Elo, L.L. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Briefings in bioinformatics 2018;19(6):1344-1355.

Valikangas, T., Suomi, T. and Elo, L.L. A systematic evaluation of normalization methods in quantitative label-free proteomics. Briefings in bioinformatics 2018;19(1):1-11.

Van den Berg, R.A., et al. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics 2006;7(1):142.

Wang, B., Wang, X.F. and Xi, Y. Normalizing bead-based microRNA expression data: a measurement error model-based approach. Bioinformatics 2011;27(11):1506-1512.

Wang, S.Y., Kuo, C.H. and Tseng, Y.J. Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Analytical chemistry 2013;85(2):1037-1046.

Wang, X., Gardiner, E.J. and Cairns, M.J. Optimal consistency in microRNA expression analysis using reference-gene-based normalization. Molecular bioSystems 2015;11(5):1235-1240.

Webb-Robertson, B.J., et al. A statistical analysis of the effects of urease pre-treatment on the measurement of the urinary metabolome by gas chromatography-mass spectrometry. Metabolomics : Official journal of the Metabolomic Society 2014;10(5):897-908.

Webb-Robertson, B.J., et al. A statistical selection strategy for normalization procedures in LC-MS proteomics experiments through dataset-dependent ranking of normalization scaling factors. Proteomics 2011;11(24):4736-4741.

Wehrens, R., Weingart, G. and Mattivi, F. metaMS: an open-source pipeline for GC-MS-based untargeted metabolomics. Journal of chromatography. B, Analytical technologies in the biomedical and life sciences 2014;966(1):109-116.

Williams, K.E., et al. Quantitative proteomic analyses of mammary organoids reveals distinct signatures after exposure to environmental chemicals. Proceedings of the National Academy of Sciences of the United States of America 2016;113(10):E1343-1351.

Wisniewski, J.R., et al. Extensive quantitative remodeling of the proteome between normal colon tissue and adenocarcinoma. Molecular systems biology 2012;8(1):611.

Wu, J.X., et al. SWATH mass spectrometry performance using extended peptide MS/MS assay libraries. Molecular & cellular proteomics : MCP 2016;15(7):2501-2514.

Wu, W.S. and Jhou, M.J. MVIAeval: a web tool for comprehensively evaluating the performance of a new missing value imputation algorithm. BMC bioinformatics 2017;18(1):31.

Wulff, J.E. and Mitchell, M.W. A comparison of various normalization methods for LC/MS metabolomics data. Advances in bioscience and biotechnology 2018;9(1):339-351.

Xi, B., et al. Statistical analysis and modeling of mass spectrometry-based metabolomics data. Methods in molecular biology 2014;1198(1):333-353.

Yang, Y.H., et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic acids research 2002;30(4):e15.

Zhu, H., et al. Morin as a preservative for delaying senescence of banana. Biomolecules 2018;8(3):52.