Merge pull request #85 from chhoumann/KB-107
Showing 3 changed files with 54 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
\section{Problem Definition} | ||
\label{sec:problem_definition} | ||
|
||
Our work aims to improve the accuracy and robustness of major oxide predictions derived from \gls{libs} data, building upon the baseline established in \citet{p9_paper}. | ||
There are many challenges in predicting major oxides from \gls{libs} data, including the high dimensionality and non-linearity of the data, as well as the presence of multicollinearity. | ||
Some of these challenges are caused by \textit{matrix effects}~\cite{andersonImprovedAccuracyQuantitative2017}, a catch-all term for any effect that causes the intensity of an element's emission lines to vary independently of that element's concentration. In other words, variables that are not directly observed affect the measured results. | ||
Furthermore, due to the high cost of data collection, datasets are often small, which further complicates the task of building accurate and robust models. | ||
|
||
Based on the limitations of the current \gls{moc} pipeline reported in \citet{p9_paper}, we identified three key areas for further investigation: dimensionality reduction, model selection, and outlier removal. | ||
|
||
In this work, we focus on dimensionality reduction and model selection over outlier removal. | ||
This is justified by the low incidence of outliers in the \gls{chemcam} \gls{libs} calibration dataset, as reported in \citet{p9_paper}. | ||
Dimensionality reduction is crucial for managing the high-dimensional nature of \gls{libs} data. % TODO: There are lots of related works which explore DR in LIBS data. We can back this up with citations. | ||
Furthermore, model selection shows promise in addressing the limitations of the current \gls{moc} pipeline, as it allows for the exploration of a wider range of algorithms, potentially leading to improved performance. | ||
We showed that advanced ensemble methods such as \gls{gbr} and deep learning models such as \glspl{ann} have the potential to outperform the current \gls{moc} pipeline. | ||
To address the aforementioned challenges, we therefore propose to explore advanced ensemble methods and deep learning models, selected for their promise in handling high-dimensional, non-linear data. | ||
Ideally, the selected methods should also be feasible for small datasets, a common scenario in \gls{libs} analyses. | ||
|
||
To evaluate the performance of the models, we first need to establish metrics. | ||
In \citet{p9_paper}, we proposed using the \gls{rmse} as a proxy for accuracy. | ||
|
||
\gls{rmse} is given by: | ||
|
||
\begin{equation} | ||
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, | ||
\end{equation} | ||
|
||
where $y_i$ is the actual value and $\hat{y}_i$ the predicted value for observation $i$, and $n$ is the number of observations. | ||
|
||
To assess robustness, we propose using the sample standard deviation of the prediction errors, computed for each oxide over the test instances and defined as: | ||
|
||
\begin{equation} | ||
\sigma_{\mathrm{error}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2}, | ||
\end{equation} | ||
|
||
where $e_i = y_i - \hat{y}_i$ is the prediction error for observation $i$ and $\bar{e} = \frac{1}{n} \sum_{i=1}^{n} e_i$ is the mean error. | ||
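|
||
The two metrics are related. Assuming both are computed over the same $n$ test instances, expanding the squared errors around the mean error $\bar{e}$ gives the following identity, which we include as an illustrative derivation rather than as a result from \citet{p9_paper}: | ||
|
||
\begin{equation} | ||
\mathrm{RMSE}^2 = \frac{1}{n} \sum_{i=1}^{n} e_i^2 = \bar{e}^2 + \frac{n-1}{n} \, \sigma_{\mathrm{error}}^2. | ||
\end{equation} | ||
|
||
Hence, \gls{rmse} reflects both the systematic bias $\bar{e}$ and the spread of the errors, whereas $\sigma_{\mathrm{error}}$ reflects the spread alone, so the two metrics carry complementary information. | ||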
|
||
To narrow the scope of our research, we set the following constraints: | ||
\begin{itemize} | ||
\item Prioritize normalization across individual spectrometers' wavelength ranges (Norm 3) over full-spectrum normalization (Norm 1). | ||
\item Focus on techniques proven effective for non-linear, high-dimensional data, even outside the \gls{libs} context. | ||
\item Ensure methods are feasible for small datasets. | ||
\end{itemize} | ||
|
||
In \citet{p9_paper}, we used both full-spectrum normalization (Norm 1) and normalization across individual spectrometers' wavelength ranges (Norm 3). | ||
In this work, however, we always normalize across individual spectrometers' wavelength ranges (Norm 3). | ||
This decision follows the approach taken by the SuperCam team, who normalize across individual spectrometers' wavelength ranges rather than across the entire spectrum~\cite{andersonPostlandingMajorElement2022}. | ||
To ensure the future applicability of our methods, we follow the same normalization approach. | ||
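|
||
For concreteness, the two schemes can be written as follows. This is a sketch under our reading of the descriptions above; the symbols $I_j$ (intensity in channel $j$), $\mathcal{C}$ (the set of all channels), and $\mathcal{C}_{s(j)}$ (the set of channels belonging to the spectrometer that contains channel $j$) are notation introduced here for illustration: | ||
|
||
\begin{equation} | ||
\tilde{I}_j^{(1)} = \frac{I_j}{\sum_{k \in \mathcal{C}} I_k}, \qquad \tilde{I}_j^{(3)} = \frac{I_j}{\sum_{k \in \mathcal{C}_{s(j)}} I_k}, | ||
\end{equation} | ||
|
||
where $\tilde{I}_j^{(1)}$ and $\tilde{I}_j^{(3)}$ denote the intensity in channel $j$ after Norm 1 and Norm 3, respectively. Under this formulation, the Norm 1 intensities sum to one over the whole spectrum, whereas the Norm 3 intensities sum to one within each spectrometer's wavelength range. | ||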
|
||
% TODO: Write tail when we have more structure in the report |