This repository contains the software for [Automated Outlier Detection and Estimation of Missing Data] which can be used for data imputation while minimizing the impact of outliers. This software is associated with the paper 'Automated Outlier Detection and Estimation of Missing Data' by Jinwook Rhyu et al.
The software is implemented in Python, where main_demonstration and main_validation are the main functions for [Section 3. Demonstration] and [Section 4. Validation], respectively. Users may edit the parameters to suit their dataset, up to the line "### Step A-0: Preprocessing before Step A (Only use variables_mask and observations_mask)".
Please cite this Software as: Jinwook Rhyu, Dragana Bozinovski, Alexis B. Dubs, Naresh Mohan, Elizabeth M. Cummings Bende, Andrew J. Maloney, Miriam Nieves, Jose Sangerman, Amos E. Lu, Moo Sun Hong, Anastasia Artamonova, Rui Wen Ou, Paul W. Barone, James C. Leung, Jacqueline M. Wolfrum, Anthony J. Sinskey, Stacy L. Springs, Richard D. Braatz, Automated outlier detection and estimation of missing data, Computers & Chemical Engineering, Volume 180, 2024, 108448, https://doi.org/10.1016/j.compchemeng.2023.108448.
The major files under the Codes folder are:
Addmissingness: Adds missing patterns to the full dataset. Please refer to [Severson, K. A., Molaro, M. C., & Braatz, R. D. (2017). Principal component analysis of process datasets with missing values. Processes, 5(3), 38.] for more information.
Algorithms: Stores 9 imputation algorithms (MI, Alternating, SVDImpute, PCADA, PPCA, PPCA-M, BPCA, SVT, and ALM) described in [Section 2.3. Imputation algorithms for missing values (Step B)].
Determine_A: Determines the number of principal components based on cross-validation and calculates statistical metrics (e.g., T^2 and Q contributions, thresholds for each contribution, etc.).
Fill_missing: Iterates (a) data imputation and (b) determination of the number of principal components until the number of principal components converges.
Plot_dataset: Generates plots where blue circles indicate normal data, cyan triangles indicate temporarily imputed missing values, red stars indicate detected outliers, green triangles indicate estimated missing values, and olive stars indicate replaced outliers.
Preprocessing: Preprocesses the data by (A0) keeping only the masked variables and observations, and (A1) temporarily imputing missing values using mean imputation, interpolation, or last observed values.
main_demonstration: The main code used in [Section 3. Demonstration].
main_validation: The main code used in [Section 4. Validation].
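As a rough illustration of two ideas mentioned in the list above, the sketch below shows the three temporary imputation options of step A1 and a textbook computation of Hotelling's T^2 and Q statistics for a PCA model. It is not taken from the repository: the function names temporary_impute and t2_and_q, their arguments, and the use of autoscaling are assumptions made for this example, and the actual Preprocessing and Determine_A code may differ.

```python
import numpy as np
import pandas as pd


def temporary_impute(X, method="mean"):
    """Fill NaNs column by column (hypothetical helper, not the repository's API)."""
    df = pd.DataFrame(np.asarray(X, dtype=float))
    if method == "mean":
        # Replace each gap with the mean of the observed values in its column.
        filled = df.fillna(df.mean())
    elif method == "interpolation":
        # Linear interpolation down each column; edge gaps take the nearest observed value.
        filled = df.interpolate(method="linear", limit_direction="both")
    elif method == "last_observed":
        # Carry the last observation forward; back-fill any leading gap.
        filled = df.ffill().bfill()
    else:
        raise ValueError(f"unknown method: {method}")
    return filled.to_numpy()


def t2_and_q(X, n_components):
    """Per-observation Hotelling's T^2 and Q (squared prediction error).

    Assumes X has no missing values and no constant columns.
    """
    X = np.asarray(X, dtype=float)
    # Autoscale: zero mean and unit sample variance per variable (an assumption here).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    P = Vt[:n_components].T                      # loadings
    T = Xs @ P                                   # scores
    eigvals = s[:n_components] ** 2 / (X.shape[0] - 1)
    t2 = np.sum(T ** 2 / eigvals, axis=1)        # variation inside the PCA model
    residual = Xs - T @ P.T
    q = np.sum(residual ** 2, axis=1)            # variation left outside the model
    return t2, q
```

A typical use under these assumptions would be to call temporary_impute on the masked data first and then pass the filled matrix to t2_and_q with a chosen number of principal components; observations with unusually large T^2 or Q values are candidates for outliers.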
The MATLAB version of this software, which is around 5-10 times faster than the Python version, is located in the Codes_MATLAB folder.
Reference for AddMissingness software: Severson, K.A., Molaro, M.C., and Braatz, R.D. Methods for applying principal component analysis to process datasets with missing values. Processes 2017, 5(3), 38. [http://web.mit.edu/braatzgroup/links.html]
Reference for BPCA algorithm: Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., and Ishii, S. A Bayesian Missing value estimation method, Bioinformatics 19, pp.2088-2096 (2003). [http://ishiilab.jp/member/oba/tools/BPCAFill.html]
The Dataset folder contains the following two datasets:
mAb_dataset_demonstration.xlsx: The original dataset used in [Section 3. Demonstration].
mAb_dataset_validation.xlsx: The preprocessed dataset used in [Section 4. Validation].
Please contact Richard Braatz at [email protected] for any inquiry.
This study was supported by the U.S. Food and Drug Administration, Contract No. 75F40121C00090. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the financial sponsor. MIT thanks Sartorius Stedim Cellca GmbH for the generous support of the adalimumab-producing CHO cell line.