\chapter{Solving MLE}
\section{Taylor Series Approximation}
Everything starts from the Taylor series approximation, which describes the local geometry of a function around a given point (say $a$). If the Taylor series is centered at zero ($a=0$), it is also called a Maclaurin series.
\begin{equation}
\begin{split}
f(x) = f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f^{(3)}(a)}{3!}(x-a)^3 + \dotsb = \sum_{k=0}^\infty \frac{f^{(k)}(a)}{k!} (x-a)^k
\end{split}
\end{equation}
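For instance, expanding $e^x$ around $a=0$ gives its Maclaurin series, since every derivative of $e^x$ equals $1$ at zero:
\begin{equation}
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dotsb = \sum_{k=0}^\infty \frac{x^k}{k!}
\end{equation}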
\subsection{Why Least Square Error?}
Any dataset that you are trying to fit is assumed to contain some noise. The noise can be independent of the data, in which case we call it additive. It is generally introduced by human error during labelling and/or sensor inaccuracy \cite{why-lse}.
\begin{equation}
Y = X\beta + Z
\end{equation}
where $Z$ is additive Gaussian noise with density
\begin{equation}
\mathrm{Normal}(z \mid 0,\sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-z^2/(2\sigma^2)}
\end{equation}
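As a small illustration of this data-generating assumption, the following Python/NumPy sketch draws synthetic data from $Y = X\beta + Z$; the coefficient values, noise level, and sample size are arbitrary assumptions chosen only for the example.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth coefficients and noise level (assumed values).
beta_true = np.array([2.0, -1.0, 0.5])   # beta_0 (intercept), beta_1, beta_2
sigma = 0.3

# Design matrix X with an intercept column of ones, N observations.
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])

# Y = X beta + Z, where Z is additive Gaussian noise with mean 0 and std sigma.
Z = rng.normal(loc=0.0, scale=sigma, size=N)
Y = X @ beta_true + Z
\end{verbatim}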
Hence our objective is to minimize this noise, which is exactly what MLE achieves:
\begin{equation}
\begin{split}
\arg\max_{\beta} \sum_{i=1}^{n} \log p( y_i \mid x_i, \beta) & = \arg\max_{\beta} \Big[ \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i- (\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p}))^2}{2\sigma^2}} \Big] \\
& = \arg\min_{\beta} \sum_{i=1}^{n} \big(y_i- (\beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p})\big)^2
\end{split}
\end{equation}
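This equivalence can be checked numerically. The following sketch (an illustrative example with assumed synthetic data, not part of the derivation) evaluates the Gaussian log-likelihood and the sum of squared errors for two candidate coefficient vectors; the candidate with the larger log-likelihood also has the smaller squared error.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data under the additive Gaussian noise model (assumed values).
N, sigma = 200, 0.3
beta_true = np.array([2.0, -1.0])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ beta_true + rng.normal(scale=sigma, size=N)

def log_likelihood(beta):
    # Sum over i of log Normal(y_i - x_i^T beta | 0, sigma^2).
    r = Y - X @ beta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - r**2 / (2 * sigma**2))

def sum_squared_error(beta):
    r = Y - X @ beta
    return np.sum(r**2)

for beta in (np.array([2.0, -1.0]), np.array([1.5, -0.5])):
    print(beta, log_likelihood(beta), sum_squared_error(beta))
# The candidate with the larger log-likelihood has the smaller squared error.
\end{verbatim}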
\subsection{Method}
The method can be explained with the simple example of linear regression.
We know that if we assume the noise is Normally distributed, then maximizing the likelihood is equivalent to minimizing the squared error (the Mean Squared Error, up to a constant factor). The loss function to be minimized is the following expression.
\begin{equation}
\sum_{i=1}^{N}\left(Y_{i}-\sum_{j=1}^{p}X_{ij}\beta_{j}\right)^{2}
\end{equation}
\noindent Step 1: We take partial derivatives of this loss with respect to each parameter and set them to zero.
\begin{equation}
\frac{\partial}{\partial \beta_{k}}\sum_{i=1}^{N}\left(Y_{i}-\sum_{j=1}^{p}X_{ij}\beta_{j}\right)^{2}=0
\end{equation}
\noindent Step 2: This yields $p$ such equations (one for each $\beta_k$, where $p$ is the number of parameters), each set equal to zero. Evaluating the derivative is a simple application of the chain rule \cite{mse-eq}:
\begin{equation}
-2\sum_{i=1}^{N}X_{ik}\left(Y_{i}-\sum_{j=1}^{p}X_{ij}\beta_{j}\right)=0
\end{equation}
\noindent Now we can rewrite the sum inside the bracket as $\sum_{j=1}^{p}X_{ij}\beta_{j}=\bf{x}_{i}^{T}\boldsymbol{\beta}$, so we get:
\begin{equation}
\sum_{i=1}^{N}X_{ik}Y_{i}-\sum_{i=1}^{N}X_{ik}\bf{x}_{i}^{T}\boldsymbol{\beta}=0
\end{equation}
\noindent Now we have $p$ of these equations, and we will ``stack'' them into a column vector. Notice that $X_{ik}$ is the only term which depends on $k$, so we can stack it into the vector $\bf{x}_{i}$, and we get:
\begin{equation}
\sum_{i=1}^{N}\bf{x}_{i}\rm{Y}_{i}=\sum_{i=1}^{N}\bf{x}_{i}\bf{x}_{i}^{T}\boldsymbol{\beta}
\end{equation}
\noindent Now we can take $\boldsymbol{\beta}$ outside the sum (it must stay on the right-hand side of the sum), and then multiply both sides by the inverse:
\begin{equation}
\left(\sum_{i=1}^{N}\bf{x}_{i}\bf{x}_{i}^{T}\right)^{-1}\sum_{i=1}^{N}\bf{x}_{i}\rm{Y}_{i}=\boldsymbol{\beta}
\end{equation}
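\noindent A minimal Python/NumPy sketch of this closed-form estimator is shown below; the synthetic data (true coefficients, noise level, sample size) are assumed values used purely for illustration, and the result is cross-checked against NumPy's built-in least-squares routine.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed ground truth, for illustration only).
N = 500
beta_true = np.array([1.0, 2.0, -3.0])
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ beta_true + rng.normal(scale=0.5, size=N)

# beta_hat = (sum_i x_i x_i^T)^{-1} sum_i x_i Y_i, i.e. (X^T X)^{-1} X^T Y.
XtX = X.T @ X            # sum over i of x_i x_i^T
XtY = X.T @ Y            # sum over i of x_i Y_i
beta_hat = np.linalg.solve(XtX, XtY)   # solving is preferable to an explicit inverse

# Cross-check against numpy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
print(beta_lstsq)
\end{verbatim}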
\noindent\textbf{Why do we need iterative methods when an analytical solution exists?} \cite{iter} \\
\noindent Even in the case of, say, linear models, where an analytical solution exists, it may still be best to use an iterative solver. For example, in linear regression the explicit solution requires inverting a matrix, which has complexity $O(N^3)$; this becomes prohibitive in the context of big data. Moreover, many problems in machine learning are convex, so using gradients ensures that we reach the global optimum. There are also relevant non-convex problems, such as neural networks, where gradient methods (backpropagation) provide an efficient solver; this is especially relevant for deep learning \cite{iter}.
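As a rough sketch of such an iterative solver, the following example runs plain batch gradient descent on the mean squared error for linear regression; the learning rate, iteration budget, and synthetic data are all assumed values, not a prescription.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumed values for illustration).
N = 1000
beta_true = np.array([0.5, -1.5, 2.0])
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ beta_true + rng.normal(scale=0.2, size=N)

# Batch gradient descent on (1/N) * sum_i (Y_i - x_i^T beta)^2.
beta = np.zeros(3)
lr = 0.1                 # assumed learning rate
for _ in range(2000):    # assumed iteration budget
    grad = -(2.0 / N) * X.T @ (Y - X @ beta)   # gradient of the MSE
    beta -= lr * grad

print(beta)   # should be close to the analytical (normal-equations) solution
\end{verbatim}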
\section{Maxima, Minima and Saddle point}
Now that we know how to formulate MLE, we need to compute its maximum or minimum. Depending on the complexity of the function, there can be one or more critical points. A critical point can basically be a maximum, a minimum, or a saddle point.
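As a small illustration (using an assumed toy function, not one from the text), the second-derivative (Hessian) test distinguishes the three cases: all eigenvalues positive indicates a local minimum, all negative a local maximum, and mixed signs a saddle point.
\begin{verbatim}
import numpy as np

# Toy function (assumed for illustration): f(x, y) = x^2 - y^2,
# which has a single critical point at (0, 0) with constant Hessian.
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of the (symmetric) Hessian
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
else:
    kind = "saddle point"
print(eigvals, kind)   # mixed signs -> saddle point
\end{verbatim}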