Merge pull request #45 from chhoumann/experimental-evaluation-section

chhoumann authored Jan 26, 2024
2 parents 492e879 + 4e174a0 commit 21805f3
Showing 6 changed files with 293 additions and 265 deletions.
166 changes: 112 additions & 54 deletions report_pre_thesis/diagrams.drawio

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions report_pre_thesis/src/index.tex
@@ -16,6 +16,7 @@
\input{sections/related_work}
\input{sections/data_overview}
\input{sections/methodology}
\input{sections/experiments}
\input{sections/results}
\input{sections/discussion}
\input{sections/conclusion}
149 changes: 119 additions & 30 deletions report_pre_thesis/src/sections/background.tex
@@ -1,31 +1,120 @@
\section{Background}\label{sec:background}
%The use of LIBS technology in planetary exploration has proven to be effective in analyzing soil and rock samples \citep{knight2000}.

%A laser pulses to ablate and remove any surface contaminants, such as dust and weathering layers, to expose the underlying material.
%The laser generates a plasma plume from the now-exposed sample material.
%This plasma plume emits light, which, when collected and analyzed, reveals the elemental composition of the sample by correlating the intensity of emitted light with specific wavelengths in a LIBS spectrum.
%The LIBS technique enables remote analysis of materials without the need for sample preparation.
%It allows for rapid analysis because of the immediate spectrum collection from the subsequent plasma, while maintaining a high spatial resolution due to its small observation footprints.
%This high resolution is essential for pinpointing and investigating small features. \cite{wiensChemcam2012}



% In 2013, \citet{wiensPreFlight3} published a paper describing the pre-flight calibration and initial data processing for the ChemCam LIBS instrument.
% This paper introduces methods for preprocessing spectra samples and a regression model based on Partial Least Squares (PLS2) used to predict the composition of geological samples on Mars.
% The model was trained on a dataset of 69 rock samples from Earth, which were created in a laboratory to simulate the conditions of the Martian surface.
% This dataset is referred to as the calibration dataset.
% Two key conclusions were drawn from this paper:
% \begin{enumerate}
% \item A larger dataset is needed to improve the accuracy of the model.
% \item The PLS2 model is not ideal for this type of data, and an argument is made for using PLS1 instead because of its ability to optimize each element separately, which can improve accuracy, although it suffers from slower run times.
% \end{enumerate}

% Based on this work, \citet{cleggRecalibrationMarsScience2017} published a paper in 2017 describing a new approach to the ChemCam LIBS calibration model.
% This paper introduces a new model based on PLS1 with a sub-model approach (PLS-SM) and Independent Component Analysis (ICA).
% In addition, a much larger calibration dataset was used, consisting of 408 samples.
% Using this, the team was able to improve the predictions by employing a \textit{submodel} PLS approach in tandem with ICA.
% This model is referred to as the Multivariate Oxide Composition (MOC) model.
% The MOC model is currently used by the ChemCam team to analyze the LIBS data collected by the Curiosity rover.

%\input{sections/known_limitations}
\input{sections/moc}
\subsection{The Multivariate Oxide Composition Model}\label{sec:moc}
\begin{figure}[ht]
\centering
\includegraphics[width=0.225\textwidth]{images/pipeline.png}
\caption{High level overview of the pipeline for deriving the Multivariate Oxide Composition (MOC) from raw LIBS data.}
\label{fig:libs_data_processing}
\end{figure}

In Figure~\ref{fig:libs_data_processing} we illustrate the inference steps for deriving the Multivariate Oxide Composition (MOC) from LIBS data.
In Section~\ref{sec:methodology}, we detail the training process of the models.
A comprehensive description of the MOC model is presented in \citet{cleggRecalibrationMarsScience2017} and \citet{andersonImprovedAccuracyQuantitative2017}.
Here, we offer a concise summary of the MOC model, laying the groundwork for further exploration of our contributions to the subject.

\subsubsection{Data Preprocessing}\label{sec:data_preprocessing}
Before the MOC model can be applied to the data, the raw LIBS data must be preprocessed to produce a clean calibrated spectrum (CCS).
We do not go into further detail about this process here; a full description can be found in \citet{wiensPreFlight3}.
We only concern ourselves with the CCS data in this work.

The CCS data format is a table of intensity values for each wavelength, with each row representing the intensities for a given shot, which is a single laser pulse on the sample.
We give a more detailed description of the CCS data format in Section \ref{sec:data_overview} where we also describe the ChemCam calibration dataset as a whole.
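
A minimal sketch of how such a table might be handled, assuming a hypothetical CSV layout with one row per shot and one column per wavelength channel (the file name is a placeholder):
\begin{verbatim}
import pandas as pd

# Hypothetical layout: one column per wavelength channel, one row per
# shot (a single laser pulse). The path is a placeholder.
ccs = pd.read_csv("ccs_target_001.csv", index_col="shot")
mean_spectrum = ccs.mean(axis=0)  # average spectrum over all shots
\end{verbatim}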

\subsubsection{Multivariate Oxide Composition Derivation}\label{sec:moc_derivation}
The multivariate analysis adopts a hybrid methodology, blending Partial Least Squares Regression with Submodels (PLS-SM) and Independent Component Analysis (ICA) to calculate the Multivariate Oxide Composition (MOC).
The outcomes of these two techniques are merged using a weighted average for each oxide, with the weighting skewed in favor of the technique that demonstrates superior performance for the specific oxide.
The PLS-SM approach utilizes tailored sub-models that specialize in targeting distinct composition ranges, along with a comprehensive full model used for initial composition estimation.
Independent Component Analysis assists in distinguishing elemental emission lines, contributing to a refined multivariate model.
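
Schematically, for each oxide the merged MOC prediction takes the form

\begin{equation}
y_{\text{MOC}} = w_{\text{PLS}} \cdot y_{\text{PLS}} + w_{\text{ICA}} \cdot y_{\text{ICA}}, \qquad w_{\text{PLS}} + w_{\text{ICA}} = 1,
\end{equation}

where the weights are chosen per oxide in favor of the better-performing technique; their specific values follow from the per-oxide performance comparison described in \citet{cleggRecalibrationMarsScience2017}.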

Two normalization methods are employed in the analysis: Norm 1 and Norm 3.
Norm 1 standardizes the full spectrum across all three spectrometers such that the sum total is unity.
In contrast, Norm 3 conducts normalization on a per-spectrometer basis, culminating in a full normalized spectrum summing to three.
The optimal normalization technique is selected based on its efficacy in model performance for the specific analysis task at hand.
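
A minimal Python sketch of the two normalizations, assuming the spectrum is a NumPy array and that the channel boundaries between the three spectrometers are known (the boundary indices below are hypothetical placeholders):
\begin{verbatim}
import numpy as np

def norm1(spectrum):
    # Norm 1: scale the full spectrum so that it sums to one.
    return spectrum / spectrum.sum()

def norm3(spectrum, bounds=(2048, 4096)):
    # Norm 3: normalize each spectrometer's segment separately so that
    # each segment sums to one and the full spectrum sums to three.
    # The boundary indices are hypothetical placeholders.
    segments = np.split(spectrum, list(bounds))
    return np.concatenate([s / s.sum() for s in segments])
\end{verbatim}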

\subsubsection{Outlier Removal}\label{sec:outlier_removal}

In their analysis, \citet{andersonImprovedAccuracyQuantitative2017} employed a methodical outlier removal process to enhance model accuracy in multivariate regression. To detect outliers, they used influence plots, which combine statistical measures of each data point's deviation from the model's predictions and of its influence on the model due to its position in predictor space.
In this context, the deviation, or error, is known as the residual, and the influence a data point has on the model is known as its leverage.
Applying the following equation to the latent variables yields the vector of observation leverages \(h_t\):

\begin{equation}
h_t = \text{diag}\left[ t(t^T t)^{-1} t^T \right]
\end{equation}

where $t$ represents the matrix of PLS scores.
Given that leverage quantifies the distance of each observation from the model's center, it can be interpreted as the squared Mahalanobis distance, a measure of how far a point lies from the center of mass of the points in multivariate space.
If the original data follow a multivariate normal distribution, the squared distances follow a chi-squared distribution; using this property, outliers can be detected with a chi-squared test \cite{brereton_chi_2015}.
The Mahalanobis distance is defined as follows:

\begin{equation}
D_M(p)^2 = (p - \mu)^T \Sigma^{-1} (p - \mu)
\end{equation}

where $p$ is the point in question, $\mu$ is the mean of the distribution, and $\Sigma^{-1}$ is the inverse of the covariance matrix of the distribution, known as the precision matrix.
Observations with a high leverage score deviate more strongly from the rest in predictor space and thereby have a greater influence on the model.
The measure of model fit, denoted $Q$, is computed from the squared differences between the actual spectrum $x$ and the spectrum reconstructed by the model. The reconstruction uses the model's scores $t$ and loadings $P$ to calculate the residuals $e$:

\begin{equation}
e = x - t \cdot P^T
\end{equation}

The model fit $Q_i$ for the $i$th observation is then computed by multiplying the residual vector $e_i$ by its own transpose, which gives the sum of squared residuals \citep{marini_chemometrics_2013, andersonImprovedAccuracyQuantitative2017}:

\begin{equation}
Q_i = e_{i}e_{i}^T
\end{equation}

Outlier removal is performed iteratively: an initial PLS model is constructed with cross-validation to determine the optimal number of latent variables, followed by an inspection of the influence plot to pinpoint outliers. Identified outliers are removed, and the model is re-evaluated. This procedure is repeated as needed, ensuring that the removals do not degrade the model's general performance.
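
The following Python sketch illustrates one pass of this procedure, assuming spectra in a matrix X and compositions in y; the significance level and the use of $(n-1)h$ as the squared Mahalanobis distance are illustrative assumptions rather than the published implementation:
\begin{verbatim}
import numpy as np
from scipy.stats import chi2
from sklearn.cross_decomposition import PLSRegression

def influence_stats(X, y, n_components=5):
    # One pass: fit PLS, then compute leverage and Q for each observation.
    Xc = X - X.mean(axis=0)  # PLS operates on centered data
    pls = PLSRegression(n_components=n_components, scale=False).fit(Xc, y)
    t, P = pls.x_scores_, pls.x_loadings_
    h = np.diag(t @ np.linalg.inv(t.T @ t) @ t.T)  # diag[t (t^T t)^-1 t^T]
    E = Xc - t @ P.T                               # residuals e = x - t P^T
    Q = np.einsum("ij,ij->i", E, E)                # Q_i = e_i e_i^T
    return h, Q

def chi2_outlier_mask(h, n_samples, n_components, alpha=0.95):
    # Treat (n - 1) * h as an approximate squared Mahalanobis distance
    # and apply a chi-squared cutoff; alpha = 0.95 is an illustrative choice.
    return (n_samples - 1) * h > chi2.ppf(alpha, df=n_components)
\end{verbatim}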


\subsubsection{Partial Least Squares Sub-Models}\label{sec:pls_submodels}

\citet{andersonImprovedAccuracyQuantitative2017} proposed an approach referred to as the Partial Least Squares Sub-Models (PLS1-SM).
The inherent variability of LIBS spectral responses to different element concentrations necessitates a nuanced analysis.
High element concentrations can lead to spectral signal obscuration, often due to phenomena like self-absorption in strong lines, while at lower concentrations, there is a risk of spectral lines disappearing.
A single regression model typically falls short in accounting for such variations, leading to compromises in predictive precision for specific samples.

They deployed multiple regression models, each tailored to a subset of the entire composition range, targeting ``low'', ``mid'', and ``high'' concentrations along with a comprehensive ``full'' model. This led to the formation of 32 distinct models, with sub-model ranges selected to prioritize both a robust dataset and a precise compositional response.

Each sub-model was subjected to training, cross-validation, and optimization phases, which included the iterative outlier removal strategy mentioned in Section~\ref{sec:outlier_removal}.
The full model's preliminary composition estimate for an unknown target dictates the choice of subsequent sub-model(s) for the refined prediction.
If the full model's prediction falls within the ``low'', ``mid'', or ``high'' range, the corresponding sub-model is selected for the final prediction.
However, if a prediction falls within two ranges, the corresponding sub-models are blended to produce the final prediction.
For example, a prediction falling within both the ``low'' and ``mid'' ranges would be calculated by blending the predictions from the ``low'' and ``mid'' sub-models as follows:

\begin{align*}
w_{\text{mid}} &= \frac{y_{\text{full}}-y_{\text{blend range, min}}}{y_{\text{blend range, max}} - y_{\text{blend range, min}}} \\
w_{\text{low}} &= 1 - w_{\text{mid}} \\
y_{\text{final}} &= w_{\text{low}}\cdot y_{\text{low}} + w_{\text{mid}}\cdot y_{\text{mid}}
\end{align*}

where,

\begin{itemize}
\item $w_{\text{low}}$ is the weight applied to the lower of the two models,
\item $w_{\text{mid}}$ is the weight applied to the higher of the two models,
\item $y_{\text{low}}$ is the prediction from the lower of the two models, and
\item $y_{\text{mid}}$ is the prediction from the higher of the two models.
\end{itemize}

This applies analogously for predictions in the ``mid''--``high'' overlap to prevent prediction discontinuities.
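
A Python sketch of the blending rule for the low/mid overlap, with the blend boundaries passed in as parameters (their optimized values are discussed below):
\begin{verbatim}
def blend_prediction(y_full, y_low, y_mid, blend_min, blend_max):
    # Blend the "low" and "mid" sub-model predictions when the full
    # model's estimate falls inside the low/mid overlap region.
    w_mid = (y_full - blend_min) / (blend_max - blend_min)
    w_mid = min(max(w_mid, 0.0), 1.0)  # clamp to [0, 1]
    w_low = 1.0 - w_mid
    return w_low * y_low + w_mid * y_mid
\end{verbatim}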

The exact delineations of the blending ranges are adjustable, and they were optimized using the Broyden--Fletcher--Goldfarb--Shanno (BFGS) algorithm to minimize the RMSE on the full-model dataset. Initial blend boundaries were based on the intersections of the sub-models, with exceptional outliers handled separately due to their deviation from the expected value ranges.

\subsubsection{Independent Component Analysis}\label{sec:ica}
\citet{cleggRecalibrationMarsScience2017} and \citet{forniIndependentComponentAnalysis2013} proposed the use of Independent Component Analysis (ICA) to identify the elemental emission lines in LIBS spectra. ICA is a computational method that separates a multivariate signal into additive, statistically independent components, which is particularly useful when the signal sources overlap, as they do in LIBS data.

ICA yields the independent source components and the associated mixing matrix, which describes how the independent sources combine to form the observed spectral data.

After extracting the independent components, the key task is to associate each independent component with a specific elemental emission line.
This involves examining the emission lines of the elements and evaluating the ICA scores, which reflect the correlation between each independent component and the full spectrum of wavelengths.
These correlation values (ICA scores) are then used to establish a calibration curve that relates the ICA score to the elemental composition.

To ascertain the accuracy of this calibration, a regression analysis is performed using multiple regression functions. The function that provides the most reliable fit (often assessed through chi-square values) is used to predict the composition for each element.

Model refinement is facilitated by techniques such as normalization, outlier removal (via Median Absolute Deviation), and k-fold cross-validation. These methods ensure the robustness and reliability of the predictive model constructed through ICA.
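
A condensed Python sketch of this ICA workflow, assuming CCS spectra in X and reference compositions y for one element; the component index and the linear calibration function are illustrative simplifications:
\begin{verbatim}
import numpy as np
from sklearn.decomposition import FastICA

def ica_calibration(X, y, comp_idx, n_components=8):
    # X: (n_spectra, n_channels) CCS spectra; y: known compositions for
    # one element. comp_idx is the component matched to that element's
    # emission lines (matching is done by inspecting the lines).
    ica = FastICA(n_components=n_components, random_state=0)
    scores = ica.fit_transform(X)  # per-spectrum ICA scores
    # Calibration curve: a linear fit for illustration; several
    # regression functions are compared and the best fit is kept.
    coeffs = np.polyfit(scores[:, comp_idx], y, deg=1)
    return np.poly1d(coeffs)
\end{verbatim}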

% TODO: Add a paragraph about the IRF
% "In simpler terms, the IRF in a spectrometer represents how the instrument's inherent characteristics alter the appearance of the spectral data it measures. By understanding and minimizing the IRF, scientists can obtain more accurate and precise spectroscopic data."
% Source: https://www.sciencedirect.com/science/article/pii/S0022407304002523?via=ihub
59 changes: 59 additions & 0 deletions report_pre_thesis/src/sections/experiments.tex
@@ -0,0 +1,59 @@
\section{Experiments}\label{sec:experiments}
To evaluate the performance of each of the components in the pipeline, we focus our experiments on three main aspects:

\begin{itemize}
\item \textbf{Outlier removal} to assess the impact of leaving outliers in the dataset or using a different outlier removal method.
\item \textbf{Hyperparameter tuning} to assess the impact of different hyperparameter configurations.
\item \textbf{Other models} to compare the performance of the PLS1-SM and ICA models to other models.
\end{itemize}

\noindent
The original authors did not conduct experiments with alternative methods to demonstrate the efficacy of their chosen approach, which leaves the full potential of the pipeline's performance unclear.
While they did perform hyperparameter tuning, they did not experiment with different outlier removal methods or alternative models.
This raises questions about the optimality of the chosen methodology, as a comparative analysis with different methodologies could reveal superior approaches.
Experimenting with alternative methods lets us uncover which components contribute the most to the overall error and would therefore benefit the most from further research and development.
If substituting a component of the pipeline with an alternative method yields improved outcomes, the currently employed method represents a limitation of the pipeline, highlighting an area that necessitates enhancement.

\subsection{Experiment: Outlier Removal}\label{sec:experiment_outlier_removal}
The original PLS1-SM approach identified outliers manually by inspecting the leverage and spectral residual plots.
We have instead chosen to automate this process for the reasons described in Section~\ref{sec:methodology_outlier_removal}.
It is therefore worth examining how the pipeline's performance changes when this process is adjusted.
Firstly, examining the performance implications of completely omitting outlier removal would be worthwhile.
This experiment is justified given the substantial efforts dedicated to developing the ChemCam calibration dataset as mentioned in Section~\ref{sec:ica_data_preprocessing}, which implies a minimal presence of significant outliers.
Furthermore, experimenting with various significance levels for the chi-squared test could reveal whether a more or less conservative approach is advantageous.
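
A small Python sketch of such a sweep, given the squared Mahalanobis distances from the outlier-removal step (the candidate significance levels are illustrative):
\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def sweep_alpha(d2, df, alphas=(0.90, 0.95, 0.975, 0.99)):
    # d2: squared Mahalanobis distances of the observations;
    # df: number of latent variables. Returns the number of points
    # flagged as outliers at each candidate significance level.
    return {a: int(np.sum(d2 > chi2.ppf(a, df))) for a in alphas}
\end{verbatim}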

In the ICA phase, the original authors employed the Median Absolute Deviation (MAD) for outlier removal, yet the detailed methodology of their approach was not fully delineated.
Consequently, in our version of the pipeline, we chose to exclude the outlier removal step during the ICA phase to avoid introducing unsubstantiated assumptions, as described in Section~\ref{sec:ica_data_preprocessing}.
This decision allows us to evaluate the intrinsic effectiveness of the ICA phase without the influence of outlier removal.
Introducing outlier removal using MAD in our replication of the pipeline presents an opportunity to assess its impact on the pipeline's efficacy.
By comparing the results with and without MAD, we can quantitatively measure the utility of this step.
Such an experiment is crucial for understanding whether MAD significantly contributes to reducing noise and improving data quality, thereby enhancing the overall performance of the machine learning pipeline.
This experiment would also offer insights into the robustness of the ICA phase against outliers, providing a more comprehensive understanding of the pipeline's capabilities and limitations.
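
Since the original MAD procedure is not documented, the sketch below uses a common modified z-score formulation; the threshold of 3.5 is a conventional choice, not the authors':
\begin{verbatim}
import numpy as np

def mad_outliers(x, threshold=3.5):
    # Flag points more than `threshold` scaled MADs from the median.
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 makes the MAD consistent with the standard deviation
    # under normality.
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold
\end{verbatim}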

\subsection{Experiment: Hyperparameter Tuning}\label{sec:experiment_hyperparameter_tuning}
\citet{cleggRecalibrationMarsScience2017} use qualitative judgement to identify hyperparameters for their PLS1-SM model.
This approach carries a risk of inaccuracies without sufficient domain expertise, given the challenges in guaranteeing the optimality of chosen hyperparameters.
Lacking such expertise, we opted for a more systematic and automated methodology to determine hyperparameters for our PLS1-SM model.

Similarly, the authors use eight independent components for their ICA algorithm, but do not provide any experimental results justifying that this is the optimal number of components.
As such, it is possible that the performance of the ICA phase could be improved by experimenting with a different number of components.

For the PLS1-SM model, we decided to use the common grid search algorithm to test different hyperparameter configurations for the PLS models.
% Explain set up...
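
A minimal sketch of this setup (the parameter range and cross-validation scheme shown are illustrative, not our final configuration):
\begin{verbatim}
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

# Grid over the number of latent variables; the range and CV scheme
# are illustrative placeholders.
grid = GridSearchCV(
    PLSRegression(scale=False),
    param_grid={"n_components": list(range(2, 16))},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
# grid.fit(X_train, y_train); grid.best_params_["n_components"]
\end{verbatim}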

Since the independent components do not necessarily correspond one-to-one with the elements one wishes to identify in a spectrum, we decided to experiment with component counts ranging from 4 to 25.
This range lies in the vicinity of the original selection of components while providing a set of reasonable extremes.

% Probably show the setup in some way
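
A sketch of the component sweep, where fit_calibration stands in for the downstream calibration-and-scoring step (a hypothetical helper returning an RMSE):
\begin{verbatim}
from sklearn.decomposition import FastICA

def sweep_ica_components(X, y, fit_calibration, low=4, high=25):
    # Fit FastICA for each component count in [low, high] and score the
    # downstream calibration model; fit_calibration is a hypothetical
    # user-supplied helper returning an RMSE for the given ICA scores.
    rmse = {}
    for k in range(low, high + 1):
        scores = FastICA(n_components=k, random_state=0).fit_transform(X)
        rmse[k] = fit_calibration(scores, y)
    return min(rmse, key=rmse.get), rmse
\end{verbatim}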

\subsection{Experiment: Other Models}\label{sec:experiment_other_models}
\citet{cleggRecalibrationMarsScience2017} have only compared their new approach with the original method presented by \citet{wiensPreFlight3}, and have not conducted experiments using alternative methods to establish the superiority of their chosen approach.
Therefore, we decided to compare the performance of the PLS1-SM and ICA models to other models.
The objective is to evaluate two distinct scenarios.
In the first scenario, we conduct a direct comparison between the MOC model and an alternative model.
The second scenario revolves around substituting either PLS or ICA with a different model and then calculating a weighted average, as sketched after the list below.
We have decided to conduct the experiments using the following models:

\begin{itemize}
\item \textbf{XGBoost}, a gradient boosting algorithm \cite{chen_xgboost_2016}.
\item \textbf{ANN}, an artificial neural network model \cite{scikit-learn}.
% More? Random Forest, SVM, etc.
\end{itemize}
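
A Python sketch of the second scenario, substituting one branch of the weighted average with an alternative model (the hyperparameters and the weight are illustrative placeholders):
\begin{verbatim}
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

def moc_with_substitute(y_kept, y_alt, w_kept=0.5):
    # Weighted average of the retained branch (PLS or ICA) and the
    # substitute model's predictions; the weight is a placeholder.
    return w_kept * y_kept + (1.0 - w_kept) * y_alt

xgb = XGBRegressor(n_estimators=500)                            # XGBoost
ann = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)  # ANN
# xgb.fit(X_train, y_train); y_alt = xgb.predict(X_test)
\end{verbatim}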