Merge pull request #85 from chhoumann/KB-107
chhoumann authored Feb 22, 2024
2 parents 255cada + 43a916b commit a8ac83a
Showing 3 changed files with 54 additions and 5 deletions.
4 changes: 3 additions & 1 deletion report_thesis/src/glossary.tex
@@ -8,4 +8,6 @@
\newacronym{ann}{ANN}{Artificial Neural Network}
\newacronym{gbr}{GBR}{Gradient Boosting Regression}
\newacronym{rf}{RF}{Random Forest}
\newacronym{lasso}{LASSO}{Least Absolute Shrinkage and Selection Operator}
\newacronym{lasso}{LASSO}{Least Absolute Shrinkage and Selection Operator}
\newacronym{pca}{PCA}{Principal Component Analysis}
\newacronym{rmse}{RMSE}{Root Mean Squared Error}
5 changes: 1 addition & 4 deletions report_thesis/src/index.tex
@@ -11,6 +11,7 @@
\subsubsection*{Acknowledgements:}

\input{sections/introduction.tex}
\input{sections/problem_definition.tex}

\section{Background}
Background / Preliminaries (what you need to know in order to understand the story)
@@ -24,10 +25,6 @@ \subsection{Related Work}

Related Work (What others have done and why our method is different / novel)

\subsection{LIBS Setup}
Detailed explanation of the LIBS setup, including equipment, configurations, and settings.
Explain any variables, controls, and calibrations involved in the setup.

\subsection{Data Analysis}
Description of the samples used and their relevance.
Explain how and why these samples were chosen.
50 changes: 50 additions & 0 deletions report_thesis/src/sections/problem_definition.tex
@@ -0,0 +1,50 @@
\section{Problem Definition}
\label{sec:problem_definition}

Our work aims to improve the accuracy and robustness of major oxide predictions derived from \gls{libs} data, building upon the baseline established in \citet{p9_paper}.
Predicting major oxides from \gls{libs} data presents many challenges, including the high dimensionality and non-linearity of the data, as well as the presence of multicollinearity.
Some of these challenges are caused by \textit{matrix effects}~\cite{andersonImprovedAccuracyQuantitative2017}, a catch-all term for any effect that causes the intensity of an element's emission lines to vary independently of that element's concentration; in other words, unknown variables that distort the measured signal.
Furthermore, due to the high cost of data collection, datasets are often small, which further complicates the task of building accurate and robust models.

Based on the limitations of the current \gls{moc} pipeline reported in \citet{p9_paper}, we identified three key areas for further investigation: dimensionality reduction, model selection, and outlier removal.

In this work, we prioritize dimensionality reduction and model selection over outlier removal.
This choice is justified by the low incidence of outliers in the \gls{chemcam} \gls{libs} calibration dataset, as reported in \citet{p9_paper}.
Dimensionality reduction is crucial for managing the high-dimensional nature of \gls{libs} data. % TODO: There are lots of related works which explore DR in LIBS data. We can back this up with citations.
Model selection likewise shows promise in addressing the limitations of the current \gls{moc} pipeline, as it allows for the exploration of a wider range of algorithms, potentially leading to improved performance.
In \citet{p9_paper}, we showed that advanced ensemble methods such as \gls{gbr} and deep learning models such as \gls{ann}s have the potential to outperform the current \gls{moc} pipeline.
To address the aforementioned challenges, we therefore propose to explore such methods, selected for their promise in handling high-dimensional, non-linear data and, ideally, their feasibility for small datasets, a common scenario in \gls{libs} analyses.
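
As a sketch of how such a pipeline might be composed, \gls{pca}-based dimensionality reduction can be chained with a \gls{gbr} model. The data shapes and hyperparameters below are illustrative placeholders, not our actual configuration:
\begin{verbatim}
# Minimal sketch (not the actual pipeline): PCA for dimensionality
# reduction chained with gradient boosting, via scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((100, 6144))  # 100 spectra, 6144 channels (placeholder data)
y = rng.random(100)          # e.g. SiO2 concentration in wt% (placeholder)

model = make_pipeline(
    PCA(n_components=20),                  # compress spectra to 20 components
    GradientBoostingRegressor(n_estimators=200),
)
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {-scores.mean():.3f}")
\end{verbatim}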

To evaluate the performance of the models, it is necessary to establish suitable metrics.
In \citet{p9_paper}, we proposed using \gls{rmse} as a proxy for accuracy.

\gls{rmse} is given by:

\begin{equation}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\end{equation}

where $y_i$ represents the actual values, $\hat{y}_i$ the predicted values, and $n$ the number of observations.
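
For illustration, \gls{rmse} can be computed directly from a vector of predictions; the values below are placeholders, not our data:
\begin{verbatim}
import numpy as np

y_true = np.array([48.2, 51.7, 43.9])  # measured oxide wt% (placeholder)
y_pred = np.array([47.5, 53.0, 44.6])  # model predictions (placeholder)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # ~0.94
\end{verbatim}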

To address robustness, we propose considering the standard deviation of prediction errors across each oxide and test instance, defined as:

\begin{equation}
\sigma_{\text{error}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2},
\end{equation}

where $e_i = y_i - \hat{y}_i$ and $\bar{e}$ is the mean error.
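
Correspondingly, $\sigma_{\text{error}}$ is the sample standard deviation of the residuals, which NumPy exposes via the \texttt{ddof=1} argument (same placeholder values as above):
\begin{verbatim}
import numpy as np

y_true = np.array([48.2, 51.7, 43.9])  # placeholder values
y_pred = np.array([47.5, 53.0, 44.6])

errors = y_true - y_pred
sigma_error = np.std(errors, ddof=1)  # ddof=1 gives the 1/(n-1) form
print(sigma_error)  # ~1.03
\end{verbatim}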

To narrow the scope of our research, we impose the following constraints:
\begin{itemize}
\item Prioritize normalization across individual spectrometers' wavelength ranges (Norm 3) over full-spectrum normalization (Norm 1).
\item Focus on techniques proven effective for non-linear, high-dimensional data, even outside the \gls{libs} context.
\item Ensure methods are feasible for small datasets.
\end{itemize}

In \citet{p9_paper}, we used both full-spectrum normalization (Norm 1) and normalization across individual spectrometers' wavelength ranges (Norm 3).
In this work, however, we opt to always normalize across individual spectrometers' wavelength ranges (Norm 3).
This decision is guided by the approach taken by the SuperCam team, who normalize across individual spectrometers' wavelength ranges rather than across the entire spectrum~\cite{andersonPostlandingMajorElement2022}.
To ensure the future applicability of our methods, we follow the same normalization approach.
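
For concreteness, the following sketch shows what Norm 3 might look like in code: each spectrometer's channel range is normalized by its own total intensity. The channel boundaries are hypothetical, not ChemCam's actual ones:
\begin{verbatim}
import numpy as np

def norm3(spectrum, ranges):
    # Normalize each spectrometer's channel range by its own total
    # intensity (Norm 3), rather than by the full spectrum (Norm 1).
    out = spectrum.astype(float).copy()
    for start, stop in ranges:
        total = out[start:stop].sum()
        if total > 0:
            out[start:stop] /= total
    return out

# Hypothetical boundaries for three spectrometers (e.g. UV, VIO, VNIR).
ranges = [(0, 2048), (2048, 4096), (4096, 6144)]
spectrum = np.random.default_rng(0).random(6144)
normalized = norm3(spectrum, ranges)
\end{verbatim}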

% TODO: Write tail when we have more structure in the report
