Merge pull request #152 from chhoumann/kb-184-define-GBR
[KB-184] Defining Gradient Boost regression
Ivikhostrup authored May 28, 2024
2 parents f025980 + e9257ec commit ba494c3
Showing 2 changed files with 79 additions and 28 deletions.
11 changes: 11 additions & 0 deletions report_thesis/src/references.bib
@@ -469,6 +469,17 @@ @article{druckerSVR
author = {Drucker, Harris and Burges, Christopher J C and Kaufman, Linda and Smola, Alex J and Vapnik, Vladimir},
}

@article{gradientLossFunction,
author = {Friedman, Jerome H.},
year = {2001},
pages = {1189--1232},
title = {Greedy Function Approximation: A Gradient Boosting Machine},
volume = {29},
number = {5},
journal = {The Annals of Statistics},
doi = {10.1214/aos/1013203451}
}

@article{YeoJohnson,
abstract = {We introduce a new power transformation family which is well defined on the whole real line and which is appropriate for reducing skewness and to approximate normality. It has properties similar to those of the Box-Cox transformation for positive variables. The large-sample properties of the transformation are investigated in the context of a single random sample.},
author = {Yeo, InKwon and Johnson, Richard A.},
96 changes: 68 additions & 28 deletions report_thesis/src/sections/background.tex
@@ -6,34 +6,34 @@ \subsubsection{Z-score Normalization}
This technique is useful when the actual minimum and maximum of a feature are unknown or when outliers may significantly skew the distribution.
The formula for Z-score normalization is given by:

$$
v' = \frac{v - \overline{F}}{\sigma_F},
$$

where $v$ is the original value, $\overline{F}$ is the mean of the feature $F$, and $\sigma_F$ is the standard deviation of $F$.
By transforming the data using the Z-score, each value reflects its distance from the mean in terms of standard deviations.
Z-score normalization is particularly advantageous in scenarios where data features have different units or scales, or when preparing data for algorithms that assume normally distributed inputs~\cite{dataminingConcepts}.
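The formula above can be sketched in a few lines; this is an illustrative snippet with made-up values, not part of the pipeline described in this thesis:

```python
import numpy as np

def zscore(feature: np.ndarray) -> np.ndarray:
    """Return each value's distance from the mean in standard deviations."""
    return (feature - feature.mean()) / feature.std()

F = np.array([2.0, 4.0, 6.0, 8.0])
v_prime = zscore(F)  # transformed feature has mean 0 and unit standard deviation
```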

\subsubsection{Max Absolute Scaler}
Max absolute scaling is a normalization technique that scales each feature individually so that the maximum absolute value of each feature is 1.
This results in the data being normalized to a range between -1 and 1.
The formula for max absolute scaling is given by:
$$
X_{\text{scaled}} = \frac{x}{\max(|x|)},
$$
where $x$ is the original feature value and $X_{\text{scaled}}$ is the normalized feature value.
This scaling method is useful for data that has been centered at zero or data that is sparse, as max absolute scaling does not center the data.
This maintains the sparsity of the data by not introducing non-zero values in the zero entries of the data~\cite{Vasques2024}.
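A minimal sketch of the scaling formula; note in particular that zero entries remain zero, which is what preserves sparsity (example values are ours):

```python
import numpy as np

def max_abs_scale(x: np.ndarray) -> np.ndarray:
    """Scale a feature so its maximum absolute value is 1."""
    return x / np.max(np.abs(x))

x = np.array([0.0, -5.0, 2.5, 10.0])
x_scaled = max_abs_scale(x)  # values now lie in [-1, 1]; the zero stays zero
```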


\subsubsection{Min-Max Normalization}\label{subsec:min-max}
Min-max normalization rescales the range of features to $[0, 1]$ or $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively.
The goal is to normalize the range of the data to a specific scale, typically 0 to 1.
Mathematically, min-max normalization is defined as:
$$
v' = \frac{v - \min(F)}{\max(F) - \min(F)} \times (b - a) + a,
$$
where $v$ is the original value, $\min(F)$ and $\max(F)$ are the minimum and maximum values of the feature $F$, respectively.

This type of normalization is beneficial because it ensures that each feature contributes equally to the analysis, regardless of its original scale.
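The definition above translates directly into code; a sketch with illustrative data and a configurable target range $[a, b]$:

```python
import numpy as np

def min_max_scale(F: np.ndarray, a: float = 0.0, b: float = 1.0) -> np.ndarray:
    """Rescale feature F to the interval [a, b] (default [0, 1])."""
    return (F - F.min()) / (F.max() - F.min()) * (b - a) + a

F = np.array([10.0, 20.0, 30.0])
F_scaled = min_max_scale(F)            # [0.0, 0.5, 1.0]
F_shifted = min_max_scale(F, -1, 1)    # [-1.0, 0.0, 1.0]
```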

@@ -79,43 +79,43 @@ \subsubsection{Norm 3}
This normalization method results in a total of $3N = 6144$ normalized features for each sample, as each of the three spectrometers contributes 2048 channels.
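As an illustrative sketch only: we assume here that Norm 3 divides each spectrometer's 2048 channels by that spectrometer's total intensity, yielding the $3N = 6144$ features described above (function and variable names are ours):

```python
import numpy as np

def norm3(spectrum: np.ndarray) -> np.ndarray:
    """Normalize each spectrometer's channels by that spectrometer's total intensity."""
    segments = spectrum.reshape(3, 2048)                    # one row per spectrometer
    segments = segments / segments.sum(axis=1, keepdims=True)
    return segments.ravel()                                 # back to 3N = 6144 features

spectrum = np.random.default_rng(0).random(3 * 2048)
normalized = norm3(spectrum)
```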

\subsubsection{Power Transformation}
Power transformations are a class of mathematical functions used to stabilize variance and make data more closely approximate a normal distribution.
They are particularly useful in statistical modeling and data analysis to meet the assumptions of linear models.

One of the first influential power transformation techniques is the Box-Cox power transform, introduced by \citet{BoxAndCox} in 1964.
This is defined for positive data and is aimed at normalizing data or making it more symmetric. The transformation is given by:

$$
\text{BC}(\lambda, x) =
\begin{cases}
\frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\log(x) & \text{if } \lambda = 0
\end{cases}
$$
where $\lambda$ is the transformation parameter and $x$ is the input data.
$\lambda$ determines the extent and nature of the transformation, where positive values of $\lambda$ apply a power transformation and $\lambda = 0$ applies a logarithmic transformation.

To overcome the limitations of the Box-Cox transformation, \citet{YeoJohnson} introduced a new family of power transformations that can handle both positive and negative values.
The Yeo-Johnson power transformation is defined as:

$$
y =
\begin{cases}
\frac{((x + 1)^\lambda - 1)}{\lambda} & \text{for } x \geq 0, \lambda \neq 0 \\
\log(x + 1) & \text{for } x \geq 0, \lambda = 0 \\
-\frac{((-x + 1)^{2 - \lambda} - 1)}{2 - \lambda} & \text{for } x < 0, \lambda \neq 2 \\
-\log(-x + 1) & \text{for } x < 0, \lambda = 2
\end{cases}
$$
where $x$ is the input data, $y$ is the transformed data, and $\lambda$ is the transformation parameter.
For non-negative values, the Yeo-Johnson transformation reduces to the Box-Cox transformation applied to $x + 1$, making the two equivalent in this context.
The key benefit of the Yeo-Johnson transformation is its ability to handle any real number, making it a robust choice for transforming data to achieve approximate normality or symmetry.
This property is particularly beneficial for preparing data for statistical analyses and machine learning models that require normally distributed input data.
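The four Yeo-Johnson cases can be transcribed directly from the piecewise definition above (a sketch; `lam` stands for the parameter $\lambda$, and the sample values are ours):

```python
import numpy as np

def yeo_johnson(x: np.ndarray, lam: float) -> np.ndarray:
    """Yeo-Johnson transform, one branch per case in the piecewise definition."""
    y = np.empty_like(x, dtype=float)
    pos = x >= 0
    if lam != 0:
        y[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        y[pos] = np.log(x[pos] + 1)
    if lam != 2:
        y[~pos] = -(((-x[~pos] + 1) ** (2 - lam)) - 1) / (2 - lam)
    else:
        y[~pos] = -np.log(-x[~pos] + 1)
    return y

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = yeo_johnson(x, lam=1.0)  # for lam = 1 the transform is the identity
```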

\subsubsection{Quantile Transformer}
Quantile transformation is a method that applies a non-linear transformation to map data to a uniform or normal distribution.
This process involves mapping the data $X$ to a set of probabilities $p$ using the \gls{cdf}, which indicates the probability that a random variable will be less than or equal to a specific value in $X$'s original distribution.
Subsequently, the quantile function, which is the inverse of the \gls{cdf} of the desired distribution, is applied to these probabilities $p$ to generate the transformed data.
This method forces the data to conform to the specified distribution regardless of the original distribution's form~\cite{Vasques2024}.
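The two-step procedure above can be sketched as follows: ranks give an empirical CDF mapping to probabilities $p$, and the quantile function of the target distribution (here a standard normal) maps $p$ to the transformed values. The mid-rank offset and sample data are our illustrative choices:

```python
import numpy as np
from statistics import NormalDist

def quantile_transform_to_normal(X: np.ndarray) -> np.ndarray:
    """Map data to an approximately standard-normal distribution."""
    n = X.size
    ranks = X.argsort().argsort()        # rank of each value in sorted order
    p = (ranks + 0.5) / n                # probabilities strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf       # quantile function of N(0, 1)
    return np.array([inv_cdf(pi) for pi in p])

X = np.array([1.0, 10.0, 100.0, 1000.0])   # heavily skewed input
Z = quantile_transform_to_normal(X)        # symmetric, approximately normal
```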

\subsubsection{Principal Component Analysis (PCA)}\label{subsec:pca}
@@ -173,8 +173,8 @@ \subsubsection{Kernel PCA}

Similar to \gls{pca}, as described in Section~\ref{subsec:pca}, the goal of \gls{kernel-pca} is to extract the principal components of the data.
However, unlike \gls{pca}, \gls{kernel-pca} does not compute the covariance matrix of the data directly, as it often is infeasible to compute for high-dimensional datasets.
\gls{kernel-pca} instead leverages the kernel trick to compute the similarities between data points directly in the original space using a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$.
This kernel function implicitly computes the dot product $\Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j)$ in the higher-dimensional feature space without explicitly performing the mapping.
By constructing a kernel matrix $\mathbf{K}$ using these pairwise similarities, \gls{kernel-pca} can perform eigenvalue decomposition to obtain the principal components in the feature space, similar to regular \gls{pca} as described in Section~\ref{subsec:pca}.
However, in \gls{kernel-pca}, the eigenvalue decomposition is performed on the kernel matrix $\mathbf{K}$ rather than the covariance matrix $\mathbf{C}$.
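A minimal sketch of the procedure just described, using an RBF kernel (the kernel choice, $\gamma$, and the toy two-cluster data are our illustrative assumptions):

```python
import numpy as np

def rbf_kernel_pca(X: np.ndarray, gamma: float, n_components: int) -> np.ndarray:
    """Kernel PCA: eigendecompose the centered kernel matrix K, not the covariance."""
    # Kernel matrix from pairwise similarities k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq_dists)

    # Center K in feature space, since the implicitly mapped data is not centered.
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Keep the leading eigenvectors, scaled by the square root of their eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
projection = rbf_kernel_pca(X, gamma=1.0, n_components=2)
```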

@@ -316,7 +316,47 @@ \subsubsection{Extra Trees Regressor (ETR)}
As a tradeoff, \gls{etr} is less interpretable than a single decision tree, as the added randomness can introduce more bias than \gls{rf}.
However, it often achieves better generalization performance, especially in high-dimensional or noisy datasets.

\subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost}
In this section, we introduce \gls{gbr}, based primarily on \citet{James2023AnIS}.
\gls{gbr} is an ensemble learning method that builds models sequentially, with each new model correcting the errors of the previous one using gradient descent and boosting techniques.

To understand \gls{gbr}, it is helpful to build on the concepts of ensemble learning and decision trees.
Ensemble learning is a technique in machine learning where multiple models, known as \textit{weak learners}, are combined to produce more accurate predictions.
Mathematically, ensemble learning can be defined as combining the predictions of $M$ weak learners to form a final prediction $\hat{y}$, such that:
\begin{equation}
\hat{y} = \sum_{m=1}^{M} \alpha_m \hat{y}_m,
\end{equation}
where $\hat{y}_m$ is the prediction of the $m$-th weak learner and $\alpha_m$ is the weight assigned to the $m$-th weak learner.
While there are various choices for weak learners, decision trees are a common choice.
Decision trees are a core component of gradient boosting methods.
They partition the data into subsets based on feature values, aiming to create groups where data points have similar predicted outcomes.

Once optimal splits are identified, the tree is constructed by repeatedly partitioning the data until a stopping criterion is met.
The final model consists of splits that create distinct regions, each with a predicted response value based on the mean of the observations in that region.

This leads us to \textit{gradient boosting}.
Instead of building a single decision tree, gradient boosting constructs multiple trees sequentially, with each new tree correcting the errors of the previous ones.
Each tree is small, with few terminal nodes, preventing large adjustments based on a single tree's predictions.
It also ensures that each tree makes small and simple error corrections, such that each step refines the model's performance more reliably.

Initially, the prediction is set as $\hat{f}^{(0)}(\mathbf{x}) = 0$ and the residuals as $r_i = y_i$ for all $i$ in the training set, where $\mathbf{x}$ represents the vector of input features and $y$ is the true value or target variable.
The model is then iteratively improved over $B$ iterations, where $B$ is a hyperparameter controlling the total number of trees.
With each iteration $b$ from $1$ to $B$, predictions are updated as:
$$
\hat{f}^{(b)}(\mathbf{x}) = \hat{f}^{(b-1)}(\mathbf{x}) + \lambda \hat{f}_b(\mathbf{x}),
$$
where $\hat{f}_b(\mathbf{x})$ is the prediction of the $b$-th tree and $\lambda$ is the learning rate. Residuals are then updated as:
$$
r_i^{(b)} = y_i - \hat{f}^{(b)}(\mathbf{x}_i).
$$
The next tree is then trained on these updated residuals.
The repetitive process of fitting a weak learner to predict the residuals and using its predictions to update the model gives us the final model:
$$
\hat{f}(\mathbf{x}) = \sum_{b=1}^{B} \lambda \hat{f}_b(\mathbf{x}).
$$
In the context of regression, gradient boosting aims to minimize the difference between the predicted values and the actual target values by fitting successive trees to the residuals.
To minimize errors, gradient descent is used to iteratively update model parameters in the direction of the negative gradient of the loss function, thereby following the path of steepest descent~\cite{gradientLossFunction}.
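The boosting loop described above can be sketched with depth-1 trees (stumps) as the weak learners; the data, stump learner, and hyperparameter values are our illustrative choices, not the configuration used in this work:

```python
import numpy as np

def fit_stump(x: np.ndarray, r: np.ndarray):
    """Fit a regression stump to residuals r by minimizing squared error."""
    best = None
    for s in x[:-1]:  # candidate split points (x assumed sorted)
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, left_mean, right_mean = best
    return lambda q: np.where(q <= s, left_mean, right_mean)

def gradient_boost(x, y, B=100, lam=0.1):
    """f^(0) = 0; each stump is fit to the current residuals, then shrunk by lam."""
    trees, r = [], y.copy()
    for _ in range(B):
        tree = fit_stump(x, r)
        trees.append(tree)
        r = r - lam * tree(x)                 # update residuals
    return lambda q: lam * sum(t(q) for t in trees)

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
model = gradient_boost(x, y)
mse = np.mean((model(x) - y) ** 2)            # far below the mean-zero baseline
```

Each stump makes only a small, simple correction; the shrinkage factor `lam` keeps any single tree from dominating, exactly as the text describes.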


\subsubsection{XGBoost}

