
Merge pull request #90 from chhoumann/KB-131
[KB-131] Narrow in on the CS problem we’re solving
chhoumann authored Feb 26, 2024
2 parents f0e9855 + 82e384c commit c94d003
Showing 2 changed files with 25 additions and 22 deletions.
report_thesis/src/sections/introduction.tex (5 changes: 4 additions & 1 deletion)
@@ -38,7 +38,10 @@ \section{Introduction}\label{sec:introduction}
However, there remains considerable uncertainty about which machine learning techniques best predict the composition of major oxides in Martian geological samples using \gls{libs} data.
This underscores the importance of a detailed study into advanced machine learning models for improving predictions in these applications.

-\textit{In this work, we aim to investigate the application of advanced machine learning models to predict the composition of major oxides in Martian geological samples using \gls{libs} data.}
+In addition, the high dimensionality of \gls{libs} data poses a significant challenge for computational models.
+Therefore, techniques that reduce the dimensionality of the data are crucial for mitigating the effects of multicollinearity and enhancing the model's ability to discern the underlying patterns within the spectral data.
+
+\textit{In this work, we aim to investigate the application of dimensionality reduction techniques and advanced machine learning models to predict the composition of major oxides in Martian geological samples using \gls{libs} data.}

The remainder of this paper is organized as follows:
\textit{Structure of the paper will be added here after the paper is written.}
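[Editor's note: for illustration only, not part of the commit. A minimal sketch of the kind of dimensionality reduction the revised introduction refers to; scikit-learn, NumPy, and the ChemCam-like width of 6144 channels are assumptions not stated in this diff:]

    # Sketch only: PCA as one possible dimensionality reduction step for
    # high-dimensional, multicollinear LIBS spectra.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.random((100, 6144))   # stand-in for 100 spectra of 6144 channels

    pca = PCA(n_components=20)    # compress correlated channels into 20 components
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)        # (100, 20)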
report_thesis/src/sections/problem_definition.tex (42 changes: 21 additions & 21 deletions)
@@ -1,50 +1,50 @@
-\section{Problem Definition}
+\section{Problem}
\label{sec:problem_definition}

-Our work aims to improve the accuracy and robustness of major oxide predictions derived from \gls{libs} data, building upon the baseline established in \citet{p9_paper}.
-There are many challenges in predicting major oxides from \gls{libs} data, including the high dimensionality and non-linearity of the data, as well as the presence of multicollinearity.
-Some of these are caused by \textit{matrix effects}\cite{andersonImprovedAccuracyQuantitative2017}, which is a catch-all term for any effect that can cause the intensity of emission lines from an element to vary, independent of that element's concentration. So it's unknown variables that affect the results.
+Predicting major oxide compositions from \gls{libs} data presents significant computational challenges, including the high dimensionality and non-linearity of the data, compounded by multicollinearity and the phenomenon known as \textit{matrix effects}.
+These effects can cause the intensity of emission lines from an element to vary independently of that element's concentration, introducing unknown variables that complicate the analysis.
Furthermore, due to the high cost of data collection, datasets are often small, which further complicates the task of building accurate and robust models.

-Based on the limitations with the current \gls{moc} pipeline, as reported in \citet{p9_paper}, we identified three key areas for further investigation: dimensionality reduction, model selection, and outlier removal.
+Building upon the baseline established in \citet{p9_paper}, our work aims to address the significant challenges inherent in predicting major oxide compositions from \gls{libs} data by improving the accuracy and robustness of these predictions.
+The presence of multicollinearity within the spectral data, for example, makes it difficult to discern distinct patterns due to the strong correlations among variables that can obscure the impact of individual predictors.
+Additionally, the high dimensionality of \gls{libs} data necessitates dimensionality reduction to manage the vast number of variables efficiently.

-In this work, we focus on dimensionality reduction and model selection over outlier removal.
-This is justified by the low incidence of outliers in the \gls{chemcam} \gls{libs} calibration dataset, as reported in \citet{p9_paper}.
-Dimensionality reduction is crucial for managing the high-dimensional nature of \gls{libs} data. % TODO: There are lots of related works which explore DR in LIBS data. We can back this up with citations.
-Furthermore, model selection shows promise in addressing the limitations of the current \gls{moc} pipeline, as it allows for the exploration of a wider range of algorithms, potentially leading to improved performance.
-We showed that advanced ensemble methods like \gls{gbr} and deep learning models like \gls{ann}s have the potential to outperform the current \gls{moc} pipeline.
-Methods are selected based on their promise in handling high-dimensional, non-linear data. Ideally, the selected methods should also be feasible for small datasets, a common scenario in \gls{libs} analyses.
+In order to address the aforementioned challenges, we propose to explore advanced ensemble methods and deep learning models, which have shown promise in handling high-dimensional, non-linear data.
+Our research objectives include exploring sophisticated modeling techniques that can navigate the non-linear relationships between spectral features and the concentrations of major oxides.
+Given the limited size of available datasets, our methodologies must also be robust against overfitting and capable of generalizing well from small sample sizes.
+To this end, we propose the exploration of advanced ensemble methods and deep learning models, selected for their potential to handle high-dimensional, non-linear data effectively.

-It is necessary to establish metrics to evaluate the performance of the models.
-In \cite{p9_paper}, we proposed to use the \gls{rmse} as a proxy for accuracy.
+In addressing the limitations of the current \gls{moc} pipeline identified in \citet{p9_paper}, we have prioritized dimensionality reduction and model selection.
+This decision is supported by the low incidence of outliers in the \gls{chemcam} \gls{libs} calibration dataset.
+Dimensionality reduction is essential for managing the high-dimensional nature of \gls{libs} data, and model selection offers the opportunity to explore a wider range of algorithms, potentially leading to improved performance.
+Our focus on advanced ensemble methods like \gls{gbr} and deep learning models like \gls{ann}s is motivated by their demonstrated ability to outperform the existing \gls{moc} pipeline in handling complex data scenarios, as shown in \citet{p9_paper} and \citet{andersonPostlandingMajorElement2022}.
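[Editor's note: for illustration only, not part of the commit. A minimal sketch of the kind of ensemble model the paragraph above names, here a gradient boosting regressor; scikit-learn and the data shapes are assumptions, not something the diff specifies:]

    # Sketch only: one gradient boosting regressor per oxide, fit on spectra.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((100, 6144))   # stand-in LIBS spectra
    y = rng.random(100)           # stand-in concentrations of a single oxide

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)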

-\gls{rmse} is given by:
+To evaluate the performance of these models, we will use \gls{rmse} as a proxy for accuracy, defined by the equation:

\begin{equation}
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\end{equation}

where $y_i$ represents the actual values, $\hat{y}_i$ the predicted values, and $n$ the number of observations.

-To address robustness, we propose considering the standard deviation of prediction errors across each oxide and test instance, defined as:
+To address robustness, we will consider the standard deviation of prediction errors across each oxide and test instance, defined as:

\begin{equation}
\sigma_{error} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2},
\end{equation}

where $e_i = y_i - \hat{y}_i$ and $\bar{e}$ is the mean error.
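[Editor's note: a small sketch of how the two metrics above could be computed, using NumPy as an assumed choice:]

    # Sketch only: RMSE and the sample standard deviation of prediction errors,
    # matching the equations above (ddof=1 gives the 1/(n-1) factor).
    import numpy as np

    def rmse(y_true, y_pred):
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def error_std(y_true, y_pred):
        errors = y_true - y_pred
        return np.std(errors, ddof=1)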

The goal of improving both robustness and accuracy is to ensure that our models can generalize well to new data and provide reliable predictions in the presence of noise and uncertainty.
Essentially, the model should be as accurate as possible, as often as possible.
It is undesirable for a model to be accurate only in specific scenarios, as this would limit its practical utility.

In order to narrow down the scope of our research, we set the following constraints:
\begin{itemize}
\item Prioritize normalization across individual spectrometers' wavelength ranges (Norm 3) over full-spectrum normalization (Norm 1).
\item Focus on techniques proven effective for non-linear, high-dimensional data, even outside the \gls{libs} context.
\item Ensure methods are feasible for small datasets.
\end{itemize}

-In \cite{p9_paper}, we used both full-spectrum normalization (Norm 1) and normalization across individual spectrometers' wavelength ranges (Norm 3).
-However, in this work, we opt to always normalize across individual spectrometers' wavelength ranges (Norm 3).
+Following the approach taken by the SuperCam team, we opt to always normalize across individual spectrometers' wavelength ranges (Norm 3), rather than normalizing across the entire spectrum (Norm 1).
This decision is guided by the approach taken by the SuperCam team, where they do not normalize across the entire spectrum, but rather across individual spectrometers' wavelength ranges\cite{andersonPostlandingMajorElement2022}.
In order to ensure the future applicability of our methods, we follow the same normalization approach.
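[Editor's note: for illustration only, not part of the commit. A sketch of what per-spectrometer normalization (Norm 3) could look like, assuming a ChemCam-like spectrum of three 2048-channel spectrometer ranges; the exact ranges are an assumption:]

    # Sketch only: scale each spectrometer's range by its own total intensity
    # (Norm 3), rather than dividing the full spectrum by one total (Norm 1).
    import numpy as np

    def norm3(spectrum, ranges=((0, 2048), (2048, 4096), (4096, 6144))):
        out = np.empty_like(spectrum, dtype=float)
        for start, stop in ranges:
            segment = spectrum[start:stop]
            out[start:stop] = segment / segment.sum()
        return out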

-% TODO: Write tail when we have more structure in the report
+Through these focused objectives and methodologies, our work seeks to improve the prediction of major oxide compositions in Martian geological samples.
