diff --git a/report_thesis/src/sections/background/preprocessing/index.tex b/report_thesis/src/sections/background/preprocessing/index.tex
index 82f08828..a0612dfb 100644
--- a/report_thesis/src/sections/background/preprocessing/index.tex
+++ b/report_thesis/src/sections/background/preprocessing/index.tex
@@ -1,7 +1,8 @@
 \subsection{Preprocessing}
 In this subsection, we discuss the preprocessing methods used in our machine learning pipeline.
-We cover various normalization techniques such as Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
+We cover the following normalization techniques: Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation.
 These techniques are essential for standardizing data, handling different scales, and improving the performance of machine learning models.
+For the purposes of this discussion, let $\mathbf{x}$ be a feature vector with values $x_1, x_2, \ldots, x_n$.
 
 \input{sections/background/preprocessing/z-score.tex}
 \input{sections/background/preprocessing/max_abs.tex}
diff --git a/report_thesis/src/sections/background/preprocessing/max_abs.tex b/report_thesis/src/sections/background/preprocessing/max_abs.tex
index 8a5fd063..b615d143 100644
--- a/report_thesis/src/sections/background/preprocessing/max_abs.tex
+++ b/report_thesis/src/sections/background/preprocessing/max_abs.tex
@@ -2,9 +2,11 @@ \subsubsection{Max Absolute Scaler}
 Max absolute scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
 This results in the data being normalized to a range between -1 and 1.
 The formula for max absolute scaling is given by:
+
 $$
-    X_{\text{scaled}} = \frac{x}{\max(|x|)},
+x'_i = \frac{x_i}{\max(|\mathbf{x}|)},
 $$
-where $x$ is the original feature value and $X_{\text{scaled}}$ is the normalized feature value.
-This scaling method is useful for data that has been centered at zero or data that is sparse, as max absolute scaling does not center the data.
-This maintains the sparsity of the data by not introducing non-zero values in the zero entries of the data~\cite{Vasques2024}.
\ No newline at end of file
+
+where $x_i$ is the original feature value, $\max(|\mathbf{x}|)$ is the maximum absolute value of the feature vector $\mathbf{x}$, and $x'_i$ is the normalized feature value.
+This scaling method is particularly useful for data that has been centered at zero or is sparse, as max absolute scaling does not shift or center the data.
+Additionally, it preserves the sparsity of the data by ensuring that zero entries remain zero, thereby not introducing any non-zero values~\cite{Vasques2024}.
\ No newline at end of file
diff --git a/report_thesis/src/sections/background/preprocessing/min-max.tex b/report_thesis/src/sections/background/preprocessing/min-max.tex
index b18485de..176e981c 100644
--- a/report_thesis/src/sections/background/preprocessing/min-max.tex
+++ b/report_thesis/src/sections/background/preprocessing/min-max.tex
@@ -1,10 +1,12 @@
 \subsubsection{Min-Max Normalization}\label{subsec:min-max}
-Min-max normalization rescales the range of features to $[0, 1]$ or $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively.
+Min-max normalization rescales the range of features to a specific range $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively.
 The goal is to normalize the range of the data to a specific scale, typically 0 to 1.
-Mathematically, min-max normalization is defined as:
+The min-max normalization of a feature vector $\mathbf{x}$ is given by:
+
 $$
-    v' = \frac{v - \min(F)}{\max(F) - \min(F)} \times (b - a) + a,
+x'_i = \frac{x_i - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}(b - a) + a,
 $$
-where $v$ is the original value, $\min(F)$ and $\max(F)$ are the minimum and maximum values of the feature $F$, respectively.
-This type of normalization is beneficial because it ensures that each feature contributes equally to the analysis, regardless of its original scale.
\ No newline at end of file
+where $x_i$ is the original value, $\min(\mathbf{x})$ and $\max(\mathbf{x})$ are the minimum and maximum values of the feature vector $\mathbf{x}$, respectively, and $x'_i$ is the normalized feature value.
+
+This type of normalization is beneficial because it ensures that each feature contributes equally to the analysis, regardless of its original scale~\cite{dataminingConcepts}.
\ No newline at end of file
diff --git a/report_thesis/src/sections/background/preprocessing/norm3.tex b/report_thesis/src/sections/background/preprocessing/norm3.tex
index 906c376e..ea5406c7 100644
--- a/report_thesis/src/sections/background/preprocessing/norm3.tex
+++ b/report_thesis/src/sections/background/preprocessing/norm3.tex
@@ -13,7 +13,8 @@ \subsubsection{Norm 3}
 \label{fig:spectral_plot}
 \end{figure}
 
-Formally, Norm 3 is defined as
+Let $\gamma$ represent the spectrometer index, where $\gamma \in \{1, 2, 3\}$, corresponding to the \gls{uv}, \gls{vio}, and \gls{vnir} spectrometers, respectively.
+Then, Norm 3 is formally defined as:
 \begin{equation}
 \tilde{X}_{i,j}^{(\gamma)} = \frac{X_{i,j}^{(\gamma)}}{\sum_{j=1}^{N} X_{i,j}^{(\gamma)}},
@@ -22,7 +23,7 @@ \subsubsection{Norm 3}
 where
 \begin{itemize}
-    \item $\tilde{X}_{i,j}^{(\gamma)}$ is the normalized wavelength intensity for the $i$-th sample in the $j$-th channel on the $\gamma$-th spectrometer, with $\gamma \in \{1, 2, 3\}$ representing the \gls{uv}, \gls{vio}, and \gls{vnir} spectrometers, respectively,
+    \item $\tilde{X}_{i,j}^{(\gamma)}$ is the normalized wavelength intensity for the $i$-th sample in the $j$-th channel on the $\gamma$-th spectrometer,
     \item $X_{i,j}^{(\gamma)}$ is the original wavelength intensity for the $i$-th sample in the $j$-th channel on the $\gamma$-th spectrometer, and
     \item $N = 2048$ is the number of channels in each spectrometer.
 \end{itemize}
diff --git a/report_thesis/src/sections/background/preprocessing/robust_scaler.tex b/report_thesis/src/sections/background/preprocessing/robust_scaler.tex
index 7a5afe7b..028ef232 100644
--- a/report_thesis/src/sections/background/preprocessing/robust_scaler.tex
+++ b/report_thesis/src/sections/background/preprocessing/robust_scaler.tex
@@ -1,8 +1,10 @@
 \subsubsection{Robust Scaler}
 The robust scaler is a normalization technique that removes the median and scales the data according to the quantile range.
-The formula for the robust scaler is given by:
+The robust scaler of a feature vector $\mathbf{x}$ is given by:
+
 $$
-X_{\text{scaled}} = \frac{X - \text{Q1}(X)}{\text{Q3}(X) - \text{Q1}(X)} \: ,
+x'_i = \frac{x_i - \text{median}(\mathbf{x})}{\text{Q3}(\mathbf{x}) - \text{Q1}(\mathbf{x})} \: ,
 $$
-where $X$ is the original data, $\text{Q1}(X)$ is the first quartile of $X$, and $\text{Q3}(X)$ is the third quartile of $X$.
+
+where $x_i$ is the original feature value, $\text{median}(\mathbf{x})$ is the median of the feature vector $\mathbf{x}$, and $\text{Q1}(\mathbf{x})$ and $\text{Q3}(\mathbf{x})$ are its first and third quartiles, respectively.
 This technique can be advantageous in cases where the data contains outliers, as it relies on the median and quantile range instead of the mean and variance, both of which are sensitive to outliers~\cite{Vasques2024}.
\ No newline at end of file
diff --git a/report_thesis/src/sections/background/preprocessing/z-score.tex b/report_thesis/src/sections/background/preprocessing/z-score.tex
index 08b5a502..0cd42120 100644
--- a/report_thesis/src/sections/background/preprocessing/z-score.tex
+++ b/report_thesis/src/sections/background/preprocessing/z-score.tex
@@ -1,12 +1,12 @@
 \subsubsection{Z-score Normalization}
-Z-score normalization, also standardization, transforms data to have a mean of zero and a standard deviation of one.
+Z-score normalization, also known as zero-mean normalization, transforms data to have a mean of zero and a standard deviation of one.
 This technique is useful when the actual minimum and maximum of a feature are unknown or when outliers may significantly skew the distribution.
-The formula for Z-score normalization is given by:
+The z-score normalization of a feature vector \(\mathbf{x}\) is given by:
 $$
-v' = \frac{v - \overline{F}}{\sigma_F},
+x'_i = \frac{x_i - \overline{\mathbf{x}}}{\sigma_\mathbf{x}},
 $$
-where $v$ is the original value, $\overline{F}$ is the mean of the feature $F$, and $\sigma_F$ is the standard deviation of $F$.
+where \(x_i\) is the original value, \(\overline{\mathbf{x}}\) is the mean of the feature vector \(\mathbf{x}\), \(\sigma_\mathbf{x}\) is the standard deviation of the feature vector \(\mathbf{x}\), and \(x'_i\) is the normalized feature value.
 By transforming the data using the Z-score, each value reflects its distance from the mean in terms of standard deviations.
-Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
\ No newline at end of file
+Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
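
The formulas introduced in the `+` lines of this diff can be checked numerically. The NumPy sketch below is an illustrative aside, not part of the thesis source: the function names are our own, the robust scaler is written in the median/IQR convention, and Norm 3 is shown for a single spectrometer block (in the thesis it is applied separately per spectrometer $\gamma$).

```python
import numpy as np

def z_score(x):
    # x'_i = (x_i - mean(x)) / std(x); result has mean 0, std 1
    return (x - x.mean()) / x.std()

def max_abs(x):
    # x'_i = x_i / max(|x|); zero entries stay zero, so sparsity is preserved
    return x / np.max(np.abs(x))

def min_max(x, a=0.0, b=1.0):
    # x'_i = (x_i - min(x)) / (max(x) - min(x)) * (b - a) + a
    return (x - x.min()) / (x.max() - x.min()) * (b - a) + a

def robust(x):
    # median/IQR convention: x'_i = (x_i - median(x)) / (Q3(x) - Q1(x))
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

def norm3(X):
    # each row (one sample's channels for one spectrometer) sums to 1
    return X / X.sum(axis=1, keepdims=True)

x = np.array([1.0, -2.0, 0.0, 4.0, 3.0])
X = np.array([[1.0, 3.0], [2.0, 2.0]])  # toy stand-in for a 2048-channel block
```

Note that `max_abs` rescales the mean along with every other value; it is `z_score` and `robust` that actually center the data (at the mean and median, respectively).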