Merge pull request #132 from chhoumann/KB-183-PLS-definition-in-background

[KB-183 ] Introduction for model overview and definition for PLS
Pattrigue authored May 9, 2024
2 parents fdbde79 + d8aa231 commit 03f54f5
Showing 4 changed files with 106 additions and 23 deletions.
2 changes: 1 addition & 1 deletion report_thesis/src/glossary.tex
@@ -45,4 +45,4 @@
\newacronym{uv}{UV}{Ultraviolet}
\newacronym{vio}{VIO}{Violet}
\newacronym{vnir}{VNIR}{Visible and Near-Infrared}

\newacronym{kernel-pca}{Kernel-PCA}{Kernel Principal Component Analysis}
7 changes: 7 additions & 0 deletions report_thesis/src/references.bib
@@ -391,3 +391,10 @@ @article{wiensPreflightCalibrationInitial2013
url = {https://www.sciencedirect.com/science/article/pii/S0584854713000505},
urldate = {2023-10-03},
}

@book{James2023AnIS,
  title     = {An Introduction to Statistical Learning with Applications in Python},
  author    = {Gareth James and Daniela Witten and Trevor Hastie and Robert Tibshirani},
  year      = {2023},
  publisher = {Springer},
}
97 changes: 97 additions & 0 deletions report_thesis/src/sections/background.tex
@@ -126,3 +126,100 @@ \subsection{Data Normalization}\label{sec:data_normalization}
\end{itemize}

This normalization method results in a total of $3N = 6144$ normalized features for each sample, as each of the three spectrometers contributes 2048 channels.
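To make this concrete, the following is a minimal sketch of the normalization, assuming the $3N = 6144$ features are stored with each spectrometer's 2048 channels in one contiguous block per sample and that each block is divided by its own total intensity; the function name `norm3` and the array layout are illustrative assumptions, not the pipeline's actual code:

```python
import numpy as np

N_CHANNELS = 2048  # channels contributed by each of the three spectrometers

def norm3(spectra: np.ndarray) -> np.ndarray:
    """Normalize each spectrometer's block of channels by that
    spectrometer's total intensity (assumed layout: three contiguous
    2048-channel blocks per sample)."""
    spectra = np.asarray(spectra, dtype=float)
    out = np.empty_like(spectra)
    for i in range(3):
        block = spectra[:, i * N_CHANNELS : (i + 1) * N_CHANNELS]
        totals = block.sum(axis=1, keepdims=True)
        out[:, i * N_CHANNELS : (i + 1) * N_CHANNELS] = block / totals
    return out

# Five synthetic samples with 3 * 2048 = 6144 channels each.
X = np.random.rand(5, 3 * N_CHANNELS)
X_norm = norm3(X)
print(X_norm[:, :N_CHANNELS].sum(axis=1))  # each first block now sums to 1
```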

\subsection{Overview of Core Models}
In this section, we provide an overview of our core models, beginning with a definition of \gls{pls}, based primarily on the methodology described by \citet{James2023AnIS}.
These models form the basis of the final architecture of our proposed pipeline, detailed further in Section~\ref{sec:methodology}.

\subsubsection{PLS}
To understand \gls{pls}, it is essential to first understand \gls{pca} and \gls{pcr}.

\gls{pca} is a dimensionality reduction technique that transforms a set of possibly correlated variables into a smaller set of uncorrelated variables called \textit{principal components}.
First, the data matrix $\mathbf{X}$ is centered by subtracting the mean of each variable, so that the data is centered at the origin:

$$
\mathbf{\bar{X}} = \mathbf{X} - \boldsymbol{\mu},
$$

where $\mathbf{\bar{X}}$ is the centered data matrix and $\boldsymbol{\mu}$ contains the mean of each variable.

The covariance matrix of the centered data is then computed:

$$
\mathbf{C} = \frac{1}{n-1} \mathbf{\bar{X}}^T \mathbf{\bar{X}},
$$

where $n$ is the number of samples.

Then, the covariance matrix $\mathbf{C}$ is decomposed into its eigenvectors $\mathbf{V}$ and eigenvalues $\mathbf{D}$:

$$
\mathbf{C} = \mathbf{V} \mathbf{D} \mathbf{V}^T,
$$

where matrix $\mathbf{V}$ contains the eigenvectors of $\mathbf{C}$ and represents the principal component loadings.
These loadings indicate the directions of maximum variance in $\mathbf{X}$.
The matrix $\mathbf{D}$ is diagonal and holds the eigenvalues, each of which quantifies the variance captured by its corresponding loading.

The principal components themselves are given by the scores $\mathbf{T}$, calculated as follows:

$$
\mathbf{T} = \mathbf{\bar{X}} \mathbf{V}_k,
$$

where $\mathbf{V}_k$ contains only the top $k$ eigenvectors, ranked by eigenvalue.
The scores $\mathbf{T}$ are the new, uncorrelated features that reduce the dimensionality of the original data while capturing its most significant patterns and trends.

Finally, projecting the original data points onto the space spanned by these top $k$ principal components transforms $\mathbf{X}$ into its lower-dimensional representation:

$$
\mathbf{X}_{\text{reduced}} = \mathbf{\bar{X}} \mathbf{V}_k = \mathbf{T},
$$

which is exactly the score matrix computed above.
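As a concrete illustration, the steps above (centering, covariance, eigendecomposition, projection) can be sketched in NumPy as follows; this is an illustrative implementation, not the pipeline's actual code:

```python
import numpy as np

def pca(X: np.ndarray, k: int):
    """PCA via eigendecomposition of the covariance matrix,
    mirroring the equations above."""
    n = X.shape[0]
    # Center the data: X_bar = X - mu.
    mu = X.mean(axis=0)
    X_bar = X - mu
    # Covariance matrix: C = (1 / (n - 1)) X_bar^T X_bar.
    C = (X_bar.T @ X_bar) / (n - 1)
    # Eigendecomposition: C = V D V^T; eigh returns eigenvalues in
    # ascending order, so re-sort to put the largest variance first.
    eigvals, V = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    V_k = V[:, order[:k]]          # loadings: top-k eigenvectors
    # Scores / reduced representation: T = X_bar V_k.
    T = X_bar @ V_k
    return T, V_k, eigvals[order[:k]]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
T, V_k, variances = pca(X, k=3)
print(T.shape)  # (100, 3)
```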

\gls{pcr} extends \gls{pca} to regression analysis.
First, \gls{pca} is applied to the dataset $\mathbf{X}$, transforming it into a set of uncorrelated variables, the principal components.
These components, represented by the scores $\mathbf{T}$, are derived from the eigenvectors $\mathbf{V}_k$ associated with the largest eigenvalues.

In \gls{pcr}, the dataset $\mathbf{X}$ is decomposed using \gls{pca} as:

$$
\mathbf{X} = \mathbf{TV}^T + \mathbf{E},
$$

where $\mathbf{T}$ represents the scores, and $\mathbf{V}$ represents the loadings.
\gls{pcr} utilizes these scores $\mathbf{T}$ in a linear regression model to predict the target variable $\mathbf{y}$:

$$
\mathbf{y} = \mathbf{Tb} + \mathbf{e},
$$

where $\mathbf{b}$ are the regression coefficients correlating $\mathbf{T}$ to $\mathbf{y}$, and $\mathbf{e}$ is the vector of residuals, capturing the prediction errors.
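For illustration, a minimal \gls{pcr} sketch chains scikit-learn's `PCA` and `LinearRegression`, so that the least-squares fit is performed on the scores $\mathbf{T}$; the data here is synthetic, chosen so the predictive features have high variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# PCR: reduce X to its top principal components, then regress y
# on the resulting scores T via least squares.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X[:, 0] *= 5.0  # give the predictive features high variance so the
X[:, 1] *= 5.0  # top principal components capture them
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

pcr.fit(X, y)
print(pcr.score(X, y))  # R^2 of the fit on the training data
```

Note that this example works precisely because the predictive directions also have high variance; when they do not, the drawback discussed next applies.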

However, one drawback of \gls{pcr} is that the decomposition does not consider the target, so it implicitly assumes that the low-variance components are less correlated with the target than the high-variance ones.
This assumption does not always hold, which is what \gls{pls} aims to address.

\gls{pls} uses an iterative method to identify components that maximize the covariance between the features and the target.
These components, $Z$, are linear combinations of the original features, $X_j$, weighted by coefficients, $\phi_j$, which are specifically calculated to reflect this covariance.
The formula for each component is expressed as:

$$
Z = \sum_{j=1}^{p} \phi_j X_j,
$$

where $Z$ represents the component, $X_j$ is the $j$-th feature, and $\phi_j$ is the weight for the $j$-th feature.
For the first component, the weights $\phi_j$ are determined by the formula:

$$
\phi_j = \frac{\text{cov}(X_j, Y)}{\text{var}(X_j)},
$$

which is the coefficient from a simple linear regression of $Y$ onto $X_j$ alone; features that covary strongly with the target therefore receive larger weights.

To refine the model iteratively, \gls{pls} uses the residuals from the previous components to calculate the next component.
The $m$-th component is derived from the features adjusted for the first $m-1$ components:

$$
Z_m = \sum_{j=1}^{p} \phi_{jm} \hat{X}_{j, m-1},
$$

where $\hat{X}_{j, m-1}$ denotes the residual of the $j$-th feature after regressing it onto the first $m-1$ components, and the weights $\phi_{jm}$ are computed from these residuals in the same manner as above.

The components are then used to predict the target variable by fitting a linear model via least squares regression.
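A from-scratch sketch of this iterative procedure is given below; it follows the description above (one of several closely related formulations of \gls{pls}) rather than any particular library, and in practice scikit-learn's `PLSRegression` would typically be used instead:

```python
import numpy as np

def pls_fit(X: np.ndarray, y: np.ndarray, n_components: int):
    """Iterative PLS as described above: each component maximizes
    covariance with the target, computed from the residual features."""
    X = X - X.mean(axis=0)            # centered features
    y = y - y.mean()                  # centered target
    Xr, yr = X.copy(), y.copy()       # residuals, updated per component
    Z, theta = [], []
    for _ in range(n_components):
        # Weights phi_jm: simple-regression coefficient of the target on
        # each residual feature, i.e. cov(X_j, Y) / var(X_j).
        phi = (Xr * yr[:, None]).sum(axis=0) / (Xr ** 2).sum(axis=0)
        z = Xr @ phi                  # Z_m = sum_j phi_jm * X_j
        t = (z @ yr) / (z @ z)        # least-squares fit of y on Z_m
        yr = yr - t * z               # remove the part of y explained by Z_m
        # Orthogonalize each feature with respect to z for the next round.
        Xr = Xr - np.outer(z, (Xr.T @ z) / (z @ z))
        Z.append(z)
        theta.append(t)
    return np.column_stack(Z), np.array(theta)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)
Z, theta = pls_fit(X, y, n_components=3)
y_hat = y.mean() + Z @ theta               # in-sample prediction
print(np.sqrt(np.mean((y - y_hat) ** 2)))  # training RMSE
```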
23 changes: 1 addition & 22 deletions report_thesis/src/sections/methodology.tex
@@ -1,23 +1,2 @@
\section{Methodology}\label{sec:methodology}
\textit{We will write an introduction to the methodology section here, as well as add more subsections in the future. Below is the first subsection describing the data normalization process and the reasons for choosing to only do Norm 3. Please let us know if the explanation and mathematical notation are clear.}

\subsection{Evaluation Metrics}
To evaluate the performance of our models in predicting major oxide compositions from \gls{libs} data, we will use two key metrics: \gls{rmse} and standard deviation of prediction errors.

\gls{rmse} will be used as a measure of accuracy, quantifying the difference between the predicted and actual values of the major oxides in the samples. It is defined by the equation:

\begin{equation}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\end{equation}

where $y_i$ represents the actual values, $\hat{y}_i$ the predicted values, and $n$ the number of observations. A lower RMSE indicates better accuracy.

To assess the robustness of our models, we will consider the standard deviation of prediction errors across each oxide and test instance. This metric measures the variability of the prediction errors and provides insight into the consistency of the model's performance. It is defined as:

\begin{equation}
\sigma_{\text{error}} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (e_i - \bar{e})^2},
\end{equation}

where $e_i = y_i - \hat{y}_i$ and $\bar{e}$ is the mean error. A lower standard deviation indicates better robustness.

By using these two metrics, we aim to evaluate model performance in terms of both accuracy and robustness, which are crucial for the reliable prediction of major oxide compositions from \gls{libs} data.
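A minimal sketch of these two metrics in NumPy (the values below are synthetic, purely for illustration):

```python
import numpy as np

def evaluation_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """RMSE (accuracy) and standard deviation of the prediction
    errors (robustness), as defined above."""
    e = y_true - y_pred                # prediction errors e_i
    rmse = np.sqrt(np.mean(e ** 2))
    sigma_error = np.std(e, ddof=1)    # uses the n - 1 denominator
    return rmse, sigma_error

# Synthetic actual vs. predicted values for a single oxide.
y_true = np.array([10.2, 48.1, 5.4, 3.3])
y_pred = np.array([9.8, 49.0, 5.9, 3.1])
print(evaluation_metrics(y_true, y_pred))
```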
\textit{We will write an introduction to the methodology section here, as well as add more subsections in the future.}
