Automated-Outlier-Detection-and-Estimation-of-Missing-Data

This repository contains the software for [Automated Outlier Detection and Estimation of Missing Data], which can be used to impute missing data while minimizing the impact of outliers. The software accompanies the paper 'Automated Outlier Detection and Estimation of Missing Data' by Jinwook Rhyu et al.

The software is written in Python, where main_demonstration and main_validation are the main scripts for [Section 3. Demonstration] and [Section 4. Validation], respectively. Users may edit the parameters to suit their dataset, up to the line "### Step A-0: Preprocessing before Step A (Only use variables_mask and observations_mask)".

Please cite this Software as: Jinwook Rhyu, Dragana Bozinovski, Alexis B. Dubs, Naresh Mohan, Elizabeth M. Cummings Bende, Andrew J. Maloney, Miriam Nieves, Jose Sangerman, Amos E. Lu, Moo Sun Hong, Anastasia Artamonova, Rui Wen Ou, Paul W. Barone, James C. Leung, Jacqueline M. Wolfrum, Anthony J. Sinskey, Stacy L. Springs, Richard D. Braatz, Automated outlier detection and estimation of missing data, Computers & Chemical Engineering, Volume 180, 2024, 108448, https://doi.org/10.1016/j.compchemeng.2023.108448.

`Codes` folder

The major files under Codes folder are:

Addmissingness: Add missing patterns to the full dataset. Please refer to [Severson, K. A., Molaro, M. C., & Braatz, R. D. (2017). Principal component analysis of process datasets with missing values. Processes, 5(3), 38.] for more information.
Algorithms: Stores 9 imputation algorithms (MI, Alternating, SVDImpute, PCADA, PPCA, PPCA-M, BPCA, SVT, and ALM) described in [Section 2.3. Imputation algorithms for missing values (Step B)].
Determine_A: Determines the number of principal components based on cross-validation and calculates statistical metrics (e.g. T^2 and Q contributions, thresholds for each contribution, etc.).
Fill_missing: Iterates (a) data imputation and (b) determination of principal components until the number of principal components converges.
Plot_dataset: Generates plots where blue circles indicate normal data, cyan triangles indicate temporarily imputed missing values, red stars indicate detected outliers, green triangles indicate estimated missing values, and olive stars indicate replaced outliers.
Preprocessing: Preprocesses the data by (A0) keeping only the masked variables and observations, and (A1) temporarily imputing missing values using mean imputation, interpolation, or last observed values.
main_demonstration: The main code used in [Section 3. Demonstration].
main_validation: The main code used in [Section 4. Validation].
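
As a rough illustration of two ideas mentioned in the list above, the sketch below shows (1) the three temporary imputation options named under Preprocessing and (2) the standard PCA-based Hotelling T^2 and Q (squared prediction error) statistics that Determine_A refers to. It is not the code in the Codes folder: the function names `temporary_impute` and `t2_and_q`, their arguments, and the choice to standardize the data are assumptions made for this example, and the cross-validation, contributions, and thresholds used by the actual software are not covered.

```python
import numpy as np


def temporary_impute(X, method="mean"):
    """Fill NaN entries column by column (hypothetical helper, not the repository's API)."""
    X = np.array(X, dtype=float)  # copy so the caller's data is untouched
    idx = np.arange(X.shape[0])
    for j in range(X.shape[1]):
        col = X[:, j]  # view: writes go straight into X
        missing = np.isnan(col)
        if not missing.any() or missing.all():
            continue  # nothing to fill, or nothing to fill from
        observed = ~missing
        if method == "mean":
            col[missing] = col[observed].mean()
        elif method == "interpolation":
            # Linear in the row index; gaps at either end take the nearest observed value.
            col[missing] = np.interp(idx[missing], idx[observed], col[observed])
        elif method == "last_observed":
            # Index of the most recent observed row; leading gaps use the first observed row.
            last = np.maximum.accumulate(np.where(observed, idx, -1))
            last[last < 0] = idx[observed][0]
            col[:] = col[last]
        else:
            raise ValueError(f"unknown method: {method}")
    return X


def t2_and_q(X, n_components):
    """Hotelling T^2 and Q per observation from a PCA of standardized, complete data."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                                # loadings
    T = Z @ P                                              # scores
    score_var = s[:n_components] ** 2 / (Z.shape[0] - 1)   # variance of each score
    t2 = np.sum(T ** 2 / score_var, axis=1)                # distance within the PCA model
    q = np.sum((Z - T @ P.T) ** 2, axis=1)                 # squared residual outside it
    return t2, q


# Example data invented for this sketch.
X = [[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [4.0, 41.0]]
X_filled = temporary_impute(X, method="interpolation")
t2, q = t2_and_q(X_filled, n_components=1)
```

In this sketch an observation with a large T^2 is unusual within the retained principal components, while a large Q means the observation is poorly described by them. How the repository turns such statistics into outlier decisions is described in the paper.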

`Codes_MATLAB` folder

The MATLAB version of this software, which is around 5-10 times faster than the Python version, is located in the Codes_MATLAB folder.

Reference for AddMissingness software: Severson, K.A., Molaro, M.C., and Braatz, R.D. Methods for applying principal component analysis to process datasets with missing values. Processes 2017, 5(3), 38. [http://web.mit.edu/braatzgroup/links.html]

Reference for BPCA algorithm: Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., and Ishii, S. A Bayesian Missing value estimation method, Bioinformatics 19, pp.2088-2096 (2003). [http://ishiilab.jp/member/oba/tools/BPCAFill.html]

`Dataset` folder

The Dataset folder contains the following two datasets:

mAb_dataset_demonstration.xlsx: The original dataset used in [Section 3. Demonstration].
mAb_dataset_validation.xlsx: The preprocessed dataset used in [Section 4. Validation].

Please contact Richard Braatz at [email protected] for any inquiry.

Acknowledgement

This study was supported by the U.S. Food and Drug Administration, Contract No. 75F40121C00090. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the financial sponsor. MIT thanks Sartorius Stedim Cellca GMBH for the generous support of the adalimumab-producing CHO cell line.
