From 3c37d686f36751893f2e6f1675e8b516bbe1de2f Mon Sep 17 00:00:00 2001 From: Ivikhostrup Date: Wed, 29 May 2024 13:22:57 +0200 Subject: [PATCH 01/10] wrote introductions --- report_thesis/src/sections/background.tex | 220 +++++++++++----------- 1 file changed, 115 insertions(+), 105 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 447708e2..3be99bbb 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -1,5 +1,12 @@ \section{Background}\label{sec:background} +In this section, we provide an overview of the preprocessing techniques and machine learning models integral to our proposed pipeline. +We discuss various normalization techniques, dimensionality reduction methods, ensemble learning strategies, linear and regularization models, and stacked generalization approaches. +Each technique is examined to highlight its underlying concepts, algorithms, and relevance to our pipeline. +The section begins with an introduction to preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. + \subsection{Preprocessing} +In this section, we introduce the preprocessing techniques used in our proposed pipeline. +The techniques covered include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. \subsubsection{Z-score Normalization} Z-score normalization, also standardization, transforms data to have a mean of zero and a standard deviation of one. @@ -25,7 +32,6 @@ \subsubsection{Max Absolute Scaler} This scaling method is useful for data that has been centered at zero or data that is sparse, as max absolute scaling does not center the data. This maintains the sparsity of the data by not introducing non-zero values in the zero entries of the data~\cite{Vasques2024}. - \subsubsection{Min-Max Normalization}\label{subsec:min-max} Min-max normalization rescales the range of features to $[0, 1]$ or $[a, b]$, where $a$ and $b$ represent the new minimum and maximum values, respectively. The goal is to normalize the range of the data to a specific scale, typically 0 to 1. @@ -178,111 +184,11 @@ \subsubsection{Kernel PCA} By constructing a kernel matrix $\mathbf{K}$ using these pairwise similarities, \gls{kernel-pca} can perform eigenvalue decomposition to obtain the principal components in the feature space, similar to regular \gls{pca} as described in Section~\ref{subsec:pca}. However, in \gls{kernel-pca}, the eigenvalue decomposition is performed on the kernel matrix $\mathbf{K}$ rather than the covariance matrix $\mathbf{C}$. -\subsection{Overview of Core Models} -In this section, we provide an overview and definitions of \gls{pls}, \gls{svr}, \gls{etr}, \gls{gbr}, and \gls{xgboost}. -These models form the basis of the final architecture of our proposed pipeline, detailed further in Section~\ref{sec:methodology}. - -\subsubsection{Partial Least Squares (PLS)} -Having previously introduced \gls{pca}, we now describe \gls{pls} based on \citet{James2023AnIS}. -In order to understand \gls{pls}, it is helpful to first consider \gls{pcr}, as \gls{pls} is an extension of \gls{pcr} that aims to address some of its limitations. - -\gls{pcr} extends \gls{pca} in the context of regression analysis. 
-In \gls{pcr}, the dataset $\mathbf{X}$ is decomposed using PCA as: - -$$ -\mathbf{X} = \mathbf{TV}^T + \mathbf{E}, -$$ - -where $\mathbf{T}$ represents the scores, $\mathbf{E}$ represents the residual matrix, and $\mathbf{V}$ represents the loadings. -\gls{pcr} utilizes these scores $\mathbf{T}$ in a linear regression model to predict the target variable $\mathbf{y}$: - -$$ -\mathbf{y} = \mathbf{Tb} + \mathbf{e}, -$$ - -where $\mathbf{b}$ are the regression coefficients correlating $\mathbf{T}$ to $\mathbf{y}$, and $\mathbf{e}$ is the vector of residuals, capturing the prediction errors. - -One drawback of \gls{pcr} is that it does not consider the target in the decomposition of the features and therefore assumes that smaller components have a weaker correlation with the target than the larger ones. -This assumption does not always hold, which is what \gls{pls} aims to address. - -\gls{pls} uses an iterative method to identify components that maximize the covariance between the features and the target. -These components, $Z$, are linear combinations of the original features, $\mathbf{X}_j$, weighted by coefficients, $\phi_j$, which are specifically calculated to reflect this covariance. -The formula for each component is expressed as: - -$$ - Z = \sum_{j=1}^{p} \phi_j \mathbf{X}_j, -$$ - -where $Z$ represents the component, $\mathbf{X}_j$ is the $j$-th feature, and $\phi_j$ is the weight for the $j$-th feature. -The weights, $\phi_j$, are determined by the formula: - -$$ - \phi_j = \frac{\text{cov}(\mathbf{X}_j, Y)}{\text{var}(\mathbf{X}_j)}. -$$ +\subsection{Ensemble Learning Techniques} -To refine the model iteratively, PLS uses the residuals from the previous components to calculate the next component. -The $m$-th component, for example, is derived from the residuals of the previous $m-1$ components: +\subsubsection{What Is Ensemble Learning?} -$$ - Z_m = \sum_{j=1}^{p} \phi_{jm} \hat{\mathbf{X}}_{j, m-1}. -$$ - -The components are then used to predict the target variable by fitting a linear model via least squares regression. - -\subsubsection{Support Vector Regression (SVR)} -\gls{svr} is a regression technique that extends the principles of \gls{svm} to regression problems. -We therefore provide an overview of \gls{svm}s based on \citet{James2023AnIS} before discussing \gls{svr}s. - -\gls{svm} is a supervised learning algorithm used primarily for classification tasks. -A core concept in \gls{svm} is the \textit{hyperplane}. -Generally, a hyperplane is a subspace of one dimension less than its ambient space. -This means that in a two-dimensional space, a hyperplane is a line, while in a three-dimensional space, it is a plane, and so on. - -\gls{svm} is built on the idea of finding the hyperplane that best separates the data points into different classes. -This hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data point from either class. -The instances right on or inside the margin are called \textit{support vectors}, which are used to 'support' the margin and decision boundary. - -\gls{svr} extends the principles of \gls{svm} to regression problems. -We use our previous discussion of \gls{svm} to introduce \gls{svr} based on \citet{druckerSVR}. - -\gls{svr} aims to fit a function that predicts continuous values rather than finding the hyperplane that best separates data points. 
-Instead of using a hyperplane to separate the data, \gls{svr} uses two parallel hyperplanes to define a margin within which the function should lie, often referred to as the $\epsilon$-\textit{tube}, where $\epsilon$ is a hyperparameter that defines the width of the tube. -The goal is to find a function $f(x)$ that lies within this tube and has the maximum number of data points within the tube. -$f(x)$ is typically defined as a linear function of the form: - -$$ -f(x) = \mathbf{w} \cdot \mathbf{x} + b, -$$ - -where: - -\begin{itemize} - \item $\mathbf{w}$ is the weight vector, - \item $\mathbf{x}$ is the input vector, and - \item $b$ is the bias term. -\end{itemize} - -The two parallel hyperplanes at a distance $\epsilon$ from the hyperplane are defined as: - -$$ -\begin{aligned} - \mathbf{w} \cdot \mathbf{x} + b &= f(\mathbf{x}) + \epsilon, \\ - \mathbf{w} \cdot \mathbf{x} + b &= f(\mathbf{x}) - \epsilon. -\end{aligned} -$$ - -Or, more succinctly: - -$$ -\begin{aligned} - f^+(\mathbf{x}) &= f(\mathbf{x}) + \epsilon, \\ - f^-(\mathbf{x}) &= f(\mathbf{x}) - \epsilon, -\end{aligned} -$$ - -where $f^+(\mathbf{x})$ and $f^-(\mathbf{x})$ are the upper and lower bounds of the $\epsilon$-insensitive tube, respectively. - -The optimization problem in \gls{svr} is to find the coefficients $\mathbf{w}$ and $b$ that minimize the norm of $\mathbf{w}$ (i.e., keep the regression function as flat as possible) while ensuring that most data points lie within the $\epsilon$ margin. +\subsubsection{Decision Trees} \subsubsection{Extra Trees Regressor (ETR)} Before discussing the \gls{etr} model, it is important to first understand the concepts of decision trees and \gls{rf}. @@ -392,7 +298,111 @@ \subsubsection{Natural Gradient Boosting (NGBoost)} \subsubsection{XGBoost} -\subsubsection{Stacked Generalization} +\subsection{Linear and Regularization Models} + +\subsubsection{Partial Least Squares (PLS)} +Having previously introduced \gls{pca}, we now describe \gls{pls} based on \citet{James2023AnIS}. +In order to understand \gls{pls}, it is helpful to first consider \gls{pcr}, as \gls{pls} is an extension of \gls{pcr} that aims to address some of its limitations. + +\gls{pcr} extends \gls{pca} in the context of regression analysis. +In \gls{pcr}, the dataset $\mathbf{X}$ is decomposed using PCA as: + +$$ +\mathbf{X} = \mathbf{TV}^T + \mathbf{E}, +$$ + +where $\mathbf{T}$ represents the scores, $\mathbf{E}$ represents the residual matrix, and $\mathbf{V}$ represents the loadings. +\gls{pcr} utilizes these scores $\mathbf{T}$ in a linear regression model to predict the target variable $\mathbf{y}$: + +$$ +\mathbf{y} = \mathbf{Tb} + \mathbf{e}, +$$ + +where $\mathbf{b}$ are the regression coefficients correlating $\mathbf{T}$ to $\mathbf{y}$, and $\mathbf{e}$ is the vector of residuals, capturing the prediction errors. + +One drawback of \gls{pcr} is that it does not consider the target in the decomposition of the features and therefore assumes that smaller components have a weaker correlation with the target than the larger ones. +This assumption does not always hold, which is what \gls{pls} aims to address. + +\gls{pls} uses an iterative method to identify components that maximize the covariance between the features and the target. +These components, $Z$, are linear combinations of the original features, $\mathbf{X}_j$, weighted by coefficients, $\phi_j$, which are specifically calculated to reflect this covariance. 
+The formula for each component is expressed as: + +$$ + Z = \sum_{j=1}^{p} \phi_j \mathbf{X}_j, +$$ + +where $Z$ represents the component, $\mathbf{X}_j$ is the $j$-th feature, and $\phi_j$ is the weight for the $j$-th feature. +The weights, $\phi_j$, are determined by the formula: + +$$ + \phi_j = \frac{\text{cov}(\mathbf{X}_j, Y)}{\text{var}(\mathbf{X}_j)}. +$$ + +To refine the model iteratively, PLS uses the residuals from the previous components to calculate the next component. +The $m$-th component, for example, is derived from the residuals of the previous $m-1$ components: + +$$ + Z_m = \sum_{j=1}^{p} \phi_{jm} \hat{\mathbf{X}}_{j, m-1}. +$$ + +The components are then used to predict the target variable by fitting a linear model via least squares regression. + +\subsubsection{Support Vector Regression (SVR)} +\gls{svr} is a regression technique that extends the principles of \gls{svm} to regression problems. +We therefore provide an overview of \gls{svm}s based on \citet{James2023AnIS} before discussing \gls{svr}s. + +\gls{svm} is a supervised learning algorithm used primarily for classification tasks. +A core concept in \gls{svm} is the \textit{hyperplane}. +Generally, a hyperplane is a subspace of one dimension less than its ambient space. +This means that in a two-dimensional space, a hyperplane is a line, while in a three-dimensional space, it is a plane, and so on. + +\gls{svm} is built on the idea of finding the hyperplane that best separates the data points into different classes. +This hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data point from either class. +The instances right on or inside the margin are called \textit{support vectors}, which are used to 'support' the margin and decision boundary. + +\gls{svr} extends the principles of \gls{svm} to regression problems. +We use our previous discussion of \gls{svm} to introduce \gls{svr} based on \citet{druckerSVR}. + +\gls{svr} aims to fit a function that predicts continuous values rather than finding the hyperplane that best separates data points. +Instead of using a hyperplane to separate the data, \gls{svr} uses two parallel hyperplanes to define a margin within which the function should lie, often referred to as the $\epsilon$-\textit{tube}, where $\epsilon$ is a hyperparameter that defines the width of the tube. +The goal is to find a function $f(x)$ that lies within this tube and has the maximum number of data points within the tube. +$f(x)$ is typically defined as a linear function of the form: + +$$ +f(x) = \mathbf{w} \cdot \mathbf{x} + b, +$$ + +where: + +\begin{itemize} + \item $\mathbf{w}$ is the weight vector, + \item $\mathbf{x}$ is the input vector, and + \item $b$ is the bias term. +\end{itemize} + +The two parallel hyperplanes at a distance $\epsilon$ from the hyperplane are defined as: + +$$ +\begin{aligned} + \mathbf{w} \cdot \mathbf{x} + b &= f(\mathbf{x}) + \epsilon, \\ + \mathbf{w} \cdot \mathbf{x} + b &= f(\mathbf{x}) - \epsilon. +\end{aligned} +$$ + +Or, more succinctly: + +$$ +\begin{aligned} + f^+(\mathbf{x}) &= f(\mathbf{x}) + \epsilon, \\ + f^-(\mathbf{x}) &= f(\mathbf{x}) - \epsilon, +\end{aligned} +$$ + +where $f^+(\mathbf{x})$ and $f^-(\mathbf{x})$ are the upper and lower bounds of the $\epsilon$-insensitive tube, respectively. 
+ +The optimization problem in \gls{svr} is to find the coefficients $\mathbf{w}$ and $b$ that minimize the norm of $\mathbf{w}$ (i.e., keep the regression function as flat as possible) while ensuring that most data points lie within the $\epsilon$ margin. + +\subsection{Stacked Generalization} Stacked generalization, introduced by \citet{wolpertstacked_1992}, is a method designed to improve the predictive performance of machine learning models by leveraging the strengths of multiple models. In this technique, multiple base models are trained on the original dataset. From 42c2e7a05d67f39c6693c42c43131a42c1035741 Mon Sep 17 00:00:00 2001 From: Ivikhostrup Date: Wed, 29 May 2024 14:07:50 +0200 Subject: [PATCH 02/10] Ready for review --- report_thesis/src/sections/background.tex | 49 +++++++++-------------- 1 file changed, 18 insertions(+), 31 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 3be99bbb..4e9b7c09 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -5,8 +5,7 @@ \section{Background}\label{sec:background} The section begins with an introduction to preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. \subsection{Preprocessing} -In this section, we introduce the preprocessing techniques used in our proposed pipeline. -The techniques covered include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. +The preprocessing techniques used in our proposed pipeline include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. \subsubsection{Z-score Normalization} Z-score normalization, also standardization, transforms data to have a mean of zero and a standard deviation of one. @@ -186,19 +185,24 @@ \subsubsection{Kernel PCA} \subsection{Ensemble Learning Techniques} -\subsubsection{What Is Ensemble Learning?} +\subsubsection{Ensemble Learning} +Ensemble learning is a technique in machine learning where multiple models, known as \textit{weak learners}, are combined to produce more accurate predictions. +Mathematically, ensemble learning can be defined as combining the predictions of $M$ weak learners to form a final prediction $\hat{y}$, such that: +\begin{equation} + \hat{y} = \sum_{m=1}^{M} \alpha_m \hat{y}_m, +\end{equation} +where $\hat{y}_m$ is the prediction of the $m$-th weak learner and $\alpha_m$ is the weight assigned to the $m$-th weak learner. +While there are various choices for weak learners, decision trees are a common choice\cite{James2023AnIS}. \subsubsection{Decision Trees} - -\subsubsection{Extra Trees Regressor (ETR)} -Before discussing the \gls{etr} model, it is important to first understand the concepts of decision trees and \gls{rf}. -We give an overview of decision trees and \gls{rf} based on \citet{James2023AnIS}. -Then, we introduce the \gls{etr} model based on \citet{geurtsERF}. - A decision tree is a supervised learning model that partitions data into subsets based on feature values, creating a tree structure. The goal is to create a tree that predicts the target variable by dividing the data into increasingly homogeneous subsets. Each internal node in the tree represents a decision based on a specific feature, while each leaf node represents a prediction for the target variable. 
-The tree can make predictions for new data points by following a path from the root to a leaf node. +The tree can make predictions for new data points by following a path from the root to a leaf node\cite{James2023AnIS}. + +\subsubsection{Extra Trees Regressor (ETR)} +We give an overview of \gls{rf} based on \citet{James2023AnIS}. +Then, we introduce the \gls{etr} model based on \citet{geurtsERF}. \gls{rf} is an ensemble learning method that improves the accuracy and robustness of decision trees by building multiple trees and combining their predictions. Each tree is trained on a random subset of the data using bootstrap sampling, where samples are drawn with replacement, meaning the same sample can be selected multiple times. @@ -221,26 +225,9 @@ \subsubsection{Extra Trees Regressor (ETR)} \subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost} In this section we introduce \gls{gbr} primarily based on \citet{James2023AnIS}. -\gls{gbr} is an ensemble learning method that builds models sequentially, each one trying to correct the errors of the previous one, using gradient descent and boosting techniques. - -To understand \gls{gbr}, it is helpful to build on the concepts of ensemble learning and decision trees. -Ensemble learning is a technique in machine learning where multiple models, known as \textit{weak learners}, are combined to produce more accurate predictions. -Mathematically, ensemble learning can be defined as combining the predictions of $M$ weak learners to form a final prediction $\hat{y}$, such that: -\begin{equation} - \hat{y} = \sum_{m=1}^{M} \alpha_m \hat{y}_m, -\end{equation} -where $\hat{y}_m$ is the prediction of the $m$-th weak learner and $\alpha_m$ is the weight assigned to the $m$-th weak learner. -While there are various choices for weak learners, decision trees are a common choice. -Decision trees are a core component of gradient boosting methods. -They partition the data into subsets based on feature values, aiming to create groups where data points have similar predicted outcomes. - -Once optimal splits are identified, the tree is constructed by repeatedly partitioning the data until a stopping criterion is met. -The final model consists of splits that create distinct regions, each with a predicted response value based on the mean of the observations in that region. - -This leads us to \textit{gradient boosting}. -Instead of building a single decision tree, gradient boosting constructs multiple trees sequentially, with each new tree correcting the errors of the previous ones. +\gls{gbr} is an ensemble learning method that sequentially construct multiple decision trees, each one correcting the errors of the previous one, using gradient descent and boosting techniques. Each tree is small, with few terminal nodes, preventing large adjustments based on a single tree's predictions. -It also ensures that each tree makes small and simple error corrections, such that each step refines the model's performance more reliably. +This ensures that each tree makes small and simple error corrections, such that each step refines the model's performance more reliably. Initially, the prediction is set as $\hat{f}_0(\mathbf{x}) = 0$ and residuals as $r_i = y_i$ for all $i$ in the training set, where $\mathbf{x}$ represents the vector of input feature(s) and $y$ is the true value or target variable. The model is then iteratively improved over $B$ iterations, where $B$ is a hyperparameter controlling the total number of trees. 
@@ -301,11 +288,11 @@ \subsubsection{XGBoost} \subsection{Linear and Regularization Models} \subsubsection{Partial Least Squares (PLS)} -Having previously introduced \gls{pca}, we now describe \gls{pls} based on \citet{James2023AnIS}. +We now describe \gls{pls} based on \citet{James2023AnIS}. In order to understand \gls{pls}, it is helpful to first consider \gls{pcr}, as \gls{pls} is an extension of \gls{pcr} that aims to address some of its limitations. \gls{pcr} extends \gls{pca} in the context of regression analysis. -In \gls{pcr}, the dataset $\mathbf{X}$ is decomposed using PCA as: +In \gls{pcr}, the dataset $\mathbf{X}$ is decomposed using \gls{pca} as: $$ \mathbf{X} = \mathbf{TV}^T + \mathbf{E}, From 2a27530929b0940c284623b0ed29faa4eeb498cf Mon Sep 17 00:00:00 2001 From: Ivikhostrup Date: Wed, 29 May 2024 14:27:59 +0200 Subject: [PATCH 03/10] more ready now --- report_thesis/src/sections/background.tex | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 4e9b7c09..6c15587c 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -1,8 +1,8 @@ \section{Background}\label{sec:background} -In this section, we provide an overview of the preprocessing techniques and machine learning models integral to our proposed pipeline. -We discuss various normalization techniques, dimensionality reduction methods, ensemble learning strategies, linear and regularization models, and stacked generalization approaches. -Each technique is examined to highlight its underlying concepts, algorithms, and relevance to our pipeline. -The section begins with an introduction to preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. +In this section, we provide an overview of the preprocessing techniques and machine learning models used in our proposed pipeline. +We outline the various normalization techniques and dimensionality reduction methods, followed the ensemble learning and linear and regularization models used. +Finally, we outline stacked generalization. +The section begins with an introduction to the preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. \subsection{Preprocessing} The preprocessing techniques used in our proposed pipeline include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. @@ -183,7 +183,9 @@ \subsubsection{Kernel PCA} By constructing a kernel matrix $\mathbf{K}$ using these pairwise similarities, \gls{kernel-pca} can perform eigenvalue decomposition to obtain the principal components in the feature space, similar to regular \gls{pca} as described in Section~\ref{subsec:pca}. However, in \gls{kernel-pca}, the eigenvalue decomposition is performed on the kernel matrix $\mathbf{K}$ rather than the covariance matrix $\mathbf{C}$. -\subsection{Ensemble Learning Techniques} +\subsection{Ensemble Learning Models} +In this section we introduce the concept of ensemble learning and decision trees, as they are fundamental aspects of the ensemble learning models we discuss. +Following this we outline \gls{etr}, \gls{gbr}, \gls{ngboost}, and \gls{xgboost}. \subsubsection{Ensemble Learning} Ensemble learning is a technique in machine learning where multiple models, known as \textit{weak learners}, are combined to produce more accurate predictions. 
@@ -286,7 +288,7 @@ \subsubsection{Natural Gradient Boosting (NGBoost)} \subsubsection{XGBoost} \subsection{Linear and Regularization Models} - +In this section we outline \gls{pls} and \gls{svr}. \subsubsection{Partial Least Squares (PLS)} We now describe \gls{pls} based on \citet{James2023AnIS}. In order to understand \gls{pls}, it is helpful to first consider \gls{pcr}, as \gls{pls} is an extension of \gls{pcr} that aims to address some of its limitations. From 1fd01ab9be56d07d808a0183ed3f07840739d3a6 Mon Sep 17 00:00:00 2001 From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com> Date: Fri, 31 May 2024 21:54:36 +0200 Subject: [PATCH 04/10] Update report_thesis/src/sections/background.tex Co-authored-by: Christian Bager Bach Houmann --- report_thesis/src/sections/background.tex | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 6c15587c..5835de8e 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -1,6 +1,6 @@ \section{Background}\label{sec:background} In this section, we provide an overview of the preprocessing techniques and machine learning models used in our proposed pipeline. -We outline the various normalization techniques and dimensionality reduction methods, followed the ensemble learning and linear and regularization models used. +We outline the various normalization techniques and dimensionality reduction methods, followed by the ensemble learning, linear models, and regularization models used. Finally, we outline stacked generalization. The section begins with an introduction to the preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. From 6115e69a50715ba5d4a8645d098a9c2b6d6f5e55 Mon Sep 17 00:00:00 2001 From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com> Date: Sat, 1 Jun 2024 11:06:05 +0200 Subject: [PATCH 05/10] Update report_thesis/src/sections/background.tex Co-authored-by: Christian Bager Bach Houmann --- report_thesis/src/sections/background.tex | 1 - 1 file changed, 1 deletion(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 5835de8e..aa625ce1 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -2,7 +2,6 @@ \section{Background}\label{sec:background} In this section, we provide an overview of the preprocessing techniques and machine learning models used in our proposed pipeline. We outline the various normalization techniques and dimensionality reduction methods, followed by the ensemble learning, linear models, and regularization models used. Finally, we outline stacked generalization. -The section begins with an introduction to the preprocessing techniques, followed by detailed descriptions of the machine learning models utilized. \subsection{Preprocessing} The preprocessing techniques used in our proposed pipeline include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. 
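To make the scalers listed in the revised preprocessing paragraph concrete, the following is a minimal sketch of how they could be instantiated with scikit-learn. The data matrix X, its shape, and all parameter values are placeholders chosen for illustration, and Norm 3 is a domain-specific normalization with no off-the-shelf scikit-learn equivalent, so it is omitted here.

import numpy as np
from sklearn.preprocessing import (
    StandardScaler,       # Z-score normalization
    MaxAbsScaler,         # max absolute scaling
    MinMaxScaler,         # min-max normalization
    RobustScaler,         # robust scaling based on median and IQR
    PowerTransformer,     # power transformation (Yeo-Johnson by default)
    QuantileTransformer,  # quantile transformation
)

# Placeholder data standing in for the spectra; sizes are arbitrary.
X = np.abs(np.random.default_rng(0).normal(size=(50, 200)))

scalers = {
    "z-score": StandardScaler(),
    "max-abs": MaxAbsScaler(),
    "min-max": MinMaxScaler(feature_range=(0, 1)),
    "robust": RobustScaler(),
    "power": PowerTransformer(method="yeo-johnson"),
    "quantile": QuantileTransformer(n_quantiles=50, output_distribution="normal"),
}

for name, scaler in scalers.items():
    X_scaled = scaler.fit_transform(X)
    print(f"{name:>8}: mean={X_scaled.mean():.3f}, std={X_scaled.std():.3f}")

In a real pipeline each transformer is fit on the training data only and then applied to held-out data; fitting on all of X here is purely for illustration.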
From 101af6a8ca9a5636685cff975d1e40c47bdbfea9 Mon Sep 17 00:00:00 2001 From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com> Date: Sat, 1 Jun 2024 11:06:22 +0200 Subject: [PATCH 06/10] Update report_thesis/src/sections/background.tex Co-authored-by: Christian Bager Bach Houmann --- report_thesis/src/sections/background.tex | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index aa625ce1..43c4c243 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -4,7 +4,9 @@ \section{Background}\label{sec:background} Finally, we outline stacked generalization. \subsection{Preprocessing} -The preprocessing techniques used in our proposed pipeline include Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. +In this subsection, we discuss the preprocessing methods used in our machine learning pipeline. +We cover various normalization techniques such as Z-score normalization, max absolute scaling, min-max normalization, robust scaling, Norm 3, power transformation, and quantile transformation. +These techniques are essential for standardizing data, handling different scales, and improving the performance of machine learning models. \subsubsection{Z-score Normalization} Z-score normalization, also standardization, transforms data to have a mean of zero and a standard deviation of one. From 2da1d737956d1cb9aeaf0cd253183b5d941d42a0 Mon Sep 17 00:00:00 2001 From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com> Date: Sat, 1 Jun 2024 11:06:43 +0200 Subject: [PATCH 07/10] Update report_thesis/src/sections/background.tex Co-authored-by: Christian Bager Bach Houmann --- report_thesis/src/sections/background.tex | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 43c4c243..578063b4 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -185,8 +185,8 @@ \subsubsection{Kernel PCA} However, in \gls{kernel-pca}, the eigenvalue decomposition is performed on the kernel matrix $\mathbf{K}$ rather than the covariance matrix $\mathbf{C}$. \subsection{Ensemble Learning Models} -In this section we introduce the concept of ensemble learning and decision trees, as they are fundamental aspects of the ensemble learning models we discuss. -Following this we outline \gls{etr}, \gls{gbr}, \gls{ngboost}, and \gls{xgboost}. +In this section we introduce the concept of ensemble learning and decision trees based on \citet{James2023AnIS}, as they are fundamental aspects of the ensemble learning models we discuss. +Following this, we outline \gls{etr}, \gls{gbr}, \gls{ngboost}, and \gls{xgboost}. \subsubsection{Ensemble Learning} Ensemble learning is a technique in machine learning where multiple models, known as \textit{weak learners}, are combined to produce more accurate predictions. 
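As a companion to the ensemble-learning definition above, here is a minimal sketch of the weighted combination of weak learners, using shallow decision trees trained on bootstrap samples. The toy data, the uniform weights, and the hyperparameter values are assumptions made for illustration and are not the configuration used in the pipeline.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data in place of the real dataset.
X, y = make_regression(n_samples=300, n_features=20, noise=0.5, random_state=0)

M = 10  # number of weak learners
rng = np.random.default_rng(0)
learners = []
for m in range(M):
    # Diversify the weak learners by fitting each shallow tree on a bootstrap sample.
    idx = rng.integers(0, len(X), size=len(X))
    learners.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))

# Final prediction y_hat = sum_m alpha_m * y_hat_m, here with uniform weights alpha_m = 1/M.
alphas = np.full(M, 1.0 / M)
y_hat = sum(a * learner.predict(X) for a, learner in zip(alphas, learners))

With uniform weights this reduces to simple averaging, as in bagging and random forests; boosting methods instead build the learners sequentially, each one fitted to the errors of the ensemble so far.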
From 67119843d488db9825f5b146cec84305e19b6b67 Mon Sep 17 00:00:00 2001 From: Ivikhostrup <56341364+Ivikhostrup@users.noreply.github.com> Date: Sat, 1 Jun 2024 11:07:47 +0200 Subject: [PATCH 08/10] Update report_thesis/src/sections/background.tex Co-authored-by: Christian Bager Bach Houmann --- report_thesis/src/sections/background.tex | 2 -- 1 file changed, 2 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 578063b4..044abc41 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -251,8 +251,6 @@ \subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost} To minimize errors, gradient descent is used to iteratively update model parameters in the direction of the negative gradient of the loss function, thereby following the path of steepest descent~\cite{gradientLossFunction}. \subsubsection{Natural Gradient Boosting (NGBoost)} -Having introduced \gls{gbr}, we now give an overview of \gls{ngboost} based on \citet{duan_ngboost_2020}. - \gls{ngboost} is a variant of the gradient boosting algorithm that leverages the concept of natural gradients with the goal of improving convergence speed and model performance. In more complex models, the parameter space can be curved and thus non-Euclidean, making the standard gradient descent less effective. Consequently, using the standard gradient descent can lead to slow convergence and suboptimal performance. From 7063bdc5922c1df234919bd989d95bdbc3f89b31 Mon Sep 17 00:00:00 2001 From: Christian Bager Bach Houmann Date: Sun, 2 Jun 2024 02:33:15 +0200 Subject: [PATCH 09/10] rewrite GBR section --- report_thesis/src/references.bib | 23 ++++++++++++ report_thesis/src/sections/background.tex | 43 +++++++++++++---------- 2 files changed, 47 insertions(+), 19 deletions(-) diff --git a/report_thesis/src/references.bib b/report_thesis/src/references.bib index 9907f3ab..832d47a4 100644 --- a/report_thesis/src/references.bib +++ b/report_thesis/src/references.bib @@ -595,3 +595,26 @@ @misc{duan_ngboost_2020 keywords = {Computer Science - Machine Learning, Statistics - Machine Learning}, annote = {Comment: Accepted for ICML 2020}, } + +@book{hastie_elements, + title = {The Elements of Statistical Learning: Data Mining, Inference, and Prediction}, + author = {Trevor Hastie and Robert Tibshirani and Jerome Friedman}, + series = {Springer Series in Statistics}, + edition = {Second}, + year = {2009}, + publisher = {Springer}, + isbn = {978-0-387-84857-0} +} + +@book{burkovHundredpageMachineLearning2023, + title = {The Hundred-Page Machine Learning Book}, + author = {Burkov, Andriy}, + date = {2023}, + publisher = {Andriy Burkov}, + location = {Orlando, FL}, + isbn = {978-1-77700-547-4}, + langid = {english}, + annotation = {OCLC: 1417057084} +} + + diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 044abc41..02d4645f 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -227,28 +227,33 @@ \subsubsection{Extra Trees Regressor (ETR)} However, it often achieves better generalization performance, especially in high-dimensional or noisy datasets. \subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost} -In this section we introduce \gls{gbr} primarily based on \citet{James2023AnIS}. 
-\gls{gbr} is an ensemble learning method that sequentially construct multiple decision trees, each one correcting the errors of the previous one, using gradient descent and boosting techniques. -Each tree is small, with few terminal nodes, preventing large adjustments based on a single tree's predictions. -This ensures that each tree makes small and simple error corrections, such that each step refines the model's performance more reliably. +In this section, we introduce \gls{gbr} primarily based on \citet{hastie_elements} and \citet{burkovHundredpageMachineLearning2023}. -Initially, the prediction is set as $\hat{f}_0(\mathbf{x}) = 0$ and residuals as $r_i = y_i$ for all $i$ in the training set, where $\mathbf{x}$ represents the vector of input feature(s) and $y$ is the true value or target variable. -The model is then iteratively improved over $B$ iterations, where $B$ is a hyperparameter controlling the total number of trees. -With each iteration $b$ from $1$ to $B$, predictions are updated as: -$$ - \hat{f}^{(b)}(\mathbf{x}) = \hat{f}^{(b-1)}(\mathbf{x}) + \lambda \hat{f}_b(\mathbf{x}), -$$ -where $\hat{f}_b(\mathbf{x})$ is the prediction of the $b$-th tree and $\lambda$ is the learning rate. Residuals are updated as: -$$ - r_i^{(b)} = y_i - \hat{f}^{(b-1)}(\mathbf{x}_i), -$$ -Each tree is then trained on these updated residuals. -The repetitive process of fitting a weak learner to predict the residuals and using its predictions to update the model gives us the final model: +Gradient Boosting is a machine learning technique used for various tasks, including regression and classification. +The fundamental concept involves sequentially adding models to minimize a loss function, where each successive model addresses the errors of the ensemble of preceding models. + +This technique utilizes gradient descent to optimize the loss function, allowing for the selection of different loss functions depending on the specific task. +\gls{gbr} is a specialized application of gradient boosting for regression tasks, where it minimizes a regression-appropriate loss function, such as mean squared error or mean absolute error. + +The process starts with an initial model $f_{0}(x)$ that minimizes the loss function over the entire dataset: $$ - \hat{f}(\mathbf{x}) = \sum_{b=1}^{B} \lambda \hat{f}_b(\mathbf{x}) +f_{0}(x)=\arg\min_{\gamma}\sum^{N}_{i=1}L(y_{i},\gamma) $$ -In the context of regression, gradient boosting aims to minimize the difference between the predicted values and the actual target values by fitting successive trees to the residuals. -To minimize errors, gradient descent is used to iteratively update model parameters in the direction of the negative gradient of the loss function, thereby following the path of steepest descent~\cite{gradientLossFunction}. +where $L(y,\hat{y})$ is the chosen loss function, $N$ is the number of samples, $y_{i}$ are the true values, and $\gamma$ is a constant. + +Then we start the iterative process of adding models to the ensemble. +For each iteration $m$, from $1$ to $M$: + +\begin{enumerate} + \item Compute the residuals of the current model. For regression, this could be the squared error loss, $L(y, \hat{y}) = (y - \hat{y})^2$. The residuals $r_{i}^{(m)}$ for each data point $i$ are calculated as $r_{i}^{(m)} = y_{i} - f_{m-1}(x_{i})$, where $f_{m-1}(x_{i})$ is the prediction of the previous model. + \item Fit a new model $h_{m}(x)$ to the residuals. 
This model aims to correct the errors of the current ensemble by using the residuals instead of ground truth values. Essentially, $h_{m}(x)$ tries to predict the residuals $r_{i}^{(m)}$. + \item Update the ensemble model by adding the predictions of the new model $h_{m}(x)$, multiplied by a learning rate $\eta$. The learning rate $\eta$ controls the contribution of each new model to the ensemble, preventing overfitting by scaling the updates: + $$ + f_{m}(x)=f_{m-1}(x)+\eta h_{m}(x) + $$ +\end{enumerate} + +This iterative process continues until the maximum number $M$ of trees are combined, resulting in the final model $\hat{f}(x) = f_{M}(x)$. \subsubsection{Natural Gradient Boosting (NGBoost)} \gls{ngboost} is a variant of the gradient boosting algorithm that leverages the concept of natural gradients with the goal of improving convergence speed and model performance. From 2922297923cd36cbb41d7ab39164ebf377d3795c Mon Sep 17 00:00:00 2001 From: Christian Bager Bach Houmann Date: Sun, 2 Jun 2024 12:12:29 +0200 Subject: [PATCH 10/10] specifically mention decision trees --- report_thesis/src/sections/background.tex | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/report_thesis/src/sections/background.tex b/report_thesis/src/sections/background.tex index 02d4645f..f484c56e 100644 --- a/report_thesis/src/sections/background.tex +++ b/report_thesis/src/sections/background.tex @@ -234,6 +234,7 @@ \subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost} This technique utilizes gradient descent to optimize the loss function, allowing for the selection of different loss functions depending on the specific task. \gls{gbr} is a specialized application of gradient boosting for regression tasks, where it minimizes a regression-appropriate loss function, such as mean squared error or mean absolute error. +Typically, decision trees are used as the base models in each iteration. The process starts with an initial model $f_{0}(x)$ that minimizes the loss function over the entire dataset: $$ @@ -246,14 +247,14 @@ \subsubsection{Gradient Boosting Regression (GBR)}\label{sec:gradientboost} \begin{enumerate} \item Compute the residuals of the current model. For regression, this could be the squared error loss, $L(y, \hat{y}) = (y - \hat{y})^2$. The residuals $r_{i}^{(m)}$ for each data point $i$ are calculated as $r_{i}^{(m)} = y_{i} - f_{m-1}(x_{i})$, where $f_{m-1}(x_{i})$ is the prediction of the previous model. - \item Fit a new model $h_{m}(x)$ to the residuals. This model aims to correct the errors of the current ensemble by using the residuals instead of ground truth values. Essentially, $h_{m}(x)$ tries to predict the residuals $r_{i}^{(m)}$. - \item Update the ensemble model by adding the predictions of the new model $h_{m}(x)$, multiplied by a learning rate $\eta$. The learning rate $\eta$ controls the contribution of each new model to the ensemble, preventing overfitting by scaling the updates: + \item Fit a new decision tree $h_{m}(x)$ to the residuals. This tree aims to correct the errors of the current ensemble by using the residuals instead of ground truth values. Essentially, $h_{m}(x)$ tries to predict the residuals $r_{i}^{(m)}$. + \item Update the ensemble model by adding the predictions of the new tree $h_{m}(x)$, multiplied by a learning rate $\eta$. 
The learning rate $\eta$ controls the contribution of each new tree to the ensemble, preventing overfitting by scaling the updates: $$ f_{m}(x)=f_{m-1}(x)+\eta h_{m}(x) $$ \end{enumerate} -This iterative process continues until the maximum number $M$ of trees are combined, resulting in the final model $\hat{f}(x) = f_{M}(x)$. +This iterative process continues until the maximum number $M$ of trees is combined, resulting in the final model $\hat{f}(x) = f_{M}(x)$. \subsubsection{Natural Gradient Boosting (NGBoost)} \gls{ngboost} is a variant of the gradient boosting algorithm that leverages the concept of natural gradients with the goal of improving convergence speed and model performance.
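To make the rewritten GBR procedure concrete, the following is a minimal from-scratch sketch for the squared-error loss: a constant initial model f_0 (the target mean), M shallow regression trees fitted to the residuals, and learning-rate-scaled updates. The data, the values of M and eta, and the tree depth are assumptions for illustration; this is not the pipeline's actual implementation, which relies on library estimators.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy regression data in place of the real dataset.
X, y = make_regression(n_samples=500, n_features=20, noise=0.5, random_state=0)

M, eta = 100, 0.1           # number of trees and learning rate (illustrative values)
f0 = y.mean()               # arg min_gamma sum_i (y_i - gamma)^2 is the mean of y
pred = np.full(len(y), f0)  # running ensemble predictions f_{m-1}(x_i) on the training set
trees = []

for m in range(1, M + 1):
    residuals = y - pred                                         # r_i^(m) = y_i - f_{m-1}(x_i)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # h_m fitted to the residuals
    trees.append(tree)
    pred = pred + eta * tree.predict(X)                          # f_m(x) = f_{m-1}(x) + eta * h_m(x)

def predict(X_new):
    """Final model f_M: the constant start plus the learning-rate-scaled sum of all trees."""
    return f0 + eta * sum(tree.predict(X_new) for tree in trees)

The same scheme, with further refinements such as subsampling and regularization, is what library implementations along the lines of sklearn.ensemble.GradientBoostingRegressor(n_estimators=M, learning_rate=eta, max_depth=3) provide.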