rewrite outlier removal method
chhoumann committed Jun 12, 2024
1 parent 654cfa6 commit 2e6a561
Showing 1 changed file with 6 additions and 13 deletions.
19 changes: 6 additions & 13 deletions report_thesis/src/sections/pyhat_contribution.tex
@@ -4,24 +4,17 @@ \section{PyHAT Contribution}\label{sec:pyhat_contribution}
\gls{pyhat} offers a user-friendly interface designed for performing machine learning and data analysis tasks specifically for hyperspectral data.
Our collaboration was initiated through a series of discussions with two members of \gls{usgs} who are responsible for \gls{pyhat}, in which we identified mutual challenges and opportunities for integrating our solutions into the tool.

% The largest contribution involved the integration of an automatic outlier detection method into \gls{pyhat}.
% This method calculates the Mahalanobis distance for each data point and uses the chi-squared distribution to establish a threshold.
% Any datapoint exceeding this threshold is considered an outlier and removed from the dataset.
% Utilizing two intermediary \gls{pls} models, one as a reference and the other to evaluate the impact of outlier removal, the method iteratively identifies and eliminates outliers while assessing the performance of the second model against the reference model.
% If the second model demonstrates improved performance compared to the reference model, it replaces the reference model, and the process continues until no further significant improvement is detected.
% To conserve computational resources, the method halts if the error of the second model increases relative to the reference model, thus providing an early stopping mechanism.

We implemented an outlier detection method in \gls{pyhat} that uses the Mahalanobis distance and the chi-squared test.
This statistical approach identifies outliers without relying on qualitative assessments.
The process involves computing leverage, which measures a sample's influence, and spectral residuals, which are the differences between observed and predicted values, for each sample using a \gls{pls} model.
These metrics are combined into a two-dimensional dataset, and the Mahalanobis distance for each sample is calculated.
Samples are classified as outliers if their Mahalanobis distance exceeds the chi-squared critical value at the confidence level set by the user-configurable threshold.
Outliers are then excluded, and the model is retrained iteratively until no further performance improvement is observed.
We developed this method as a part of our work on the \gls{moc} model replica presented in \citet{p9_paper}, where it served as an automated version of the one presented by \citet{andersonImprovedAccuracyQuantitative2017}.
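
As an illustrative sketch of the outlier criterion, and not necessarily the exact formulation implemented in \gls{pyhat}, let $h_i$ denote the leverage and $q_i$ the spectral residual of sample $i$, collected in the vector $\mathbf{z}_i = (h_i, q_i)^\top$.
The symbols $h_i$, $q_i$, $\mathbf{z}_i$, $\bar{\mathbf{z}}$, $\mathbf{S}$, and $\alpha$ are our notation for this sketch, and we assume the common convention of comparing the squared distance against a chi-squared quantile with two degrees of freedom, one per metric.
With $\bar{\mathbf{z}}$ and $\mathbf{S}$ the sample mean and covariance matrix of the $\mathbf{z}_i$, the squared Mahalanobis distance is
\begin{equation}
    D_i^2 = (\mathbf{z}_i - \bar{\mathbf{z}})^\top \mathbf{S}^{-1} (\mathbf{z}_i - \bar{\mathbf{z}}),
\end{equation}
and sample $i$ is flagged as an outlier when
\begin{equation}
    D_i^2 > \chi^2_{2,\,1-\alpha},
\end{equation}
where $\chi^2_{2,\,1-\alpha}$ is the $(1-\alpha)$ quantile of the chi-squared distribution with two degrees of freedom.
For two degrees of freedom this quantile has the closed form $\chi^2_{2,\,1-\alpha} = -2\ln\alpha$, so that, for example, $\alpha = 0.05$ gives a critical value of approximately $5.99$.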

This method was integrated into \gls{pyhat}'s library and GUI, allowing users to configure the chi-squared threshold, number of \gls{pls} components, and maximum iterations.
Users can select their dataset and regression target, configure the method, and run it through the GUI.


This contribution also included the development of a graphical user interface (GUI) component for the existing \gls{pyhat} GUI to configure and visualize the outlier removal process.
This included utilities to select a threshold and an oxide for which to perform outlier removal, as well as a logging mechanism that displays the number of outliers removed at each iteration in the GUI.
