From 95fd720e1ed20d4267931169a1f019346e6d9f57 Mon Sep 17 00:00:00 2001
From: wassimmazouz
Date: Tue, 2 Jul 2024 10:40:21 +0200
Subject: [PATCH 1/7] init .rst file for L1-reg tutorial

---
 doc/tutorials/fermat_rule_reg.rst | 50 +++++++++++++++++++++++++++++++
 doc/tutorials/tutorials.rst       |  5 ++++
 2 files changed, 55 insertions(+)
 create mode 100644 doc/tutorials/fermat_rule_reg.rst

diff --git a/doc/tutorials/fermat_rule_reg.rst b/doc/tutorials/fermat_rule_reg.rst
new file mode 100644
index 000000000..4697f7ee5
--- /dev/null
+++ b/doc/tutorials/fermat_rule_reg.rst
@@ -0,0 +1,50 @@
+.. _fermat_rule_reg:
+
+======================================================
+Mathematics Behind L1 Regularization and Fermat's Rule
+======================================================
+
+This tutorial presents the mathematics behind solving the optimization problem
+:math:`\min f(x) + \lambda \|x\|_1` and demonstrates why the solution is zero when
+:math:`\lambda` is greater than the infinity norm of the gradient of :math:`f` at zero, therefore justifying the choice in skglm of
+
+.. code-block::
+alpha_max = (popu_X.T @ (1 - popu_Y) / len(popu_Y)).max()
+
+Problem setup
+=============
+
+Consider the optimization problem:
+
+.. math::
+    \min_x f(x) + \lambda \|x\|_1
+
+where:
+
+- :math:`f: \mathbb{R}^d \to \mathbb{R}` is a differentiable function,
+- :math:`\|x\|_1` is the L1 norm of :math:`x`,
+- :math:`\lambda \in \mathbb{R}` is a regularization parameter.
+
+We aim to determine the conditions under which the solution to this problem is :math:`x = 0`.
+
+Theoretical Background
+======================
+
+According to Fermat's rule, the minimum of the function occurs where the subdifferential of the objective function includes zero. For our problem, the objective function is:
+
+.. math::
+    g(x) = f(x) + \lambda \|x\|_1
+
+The subdifferential of :math:`\|x\|_1` at 0 is the L-infinity ball:
+
+.. math::
+    \partial \|x\|_1 |_{x=0} = \{ u \in \mathbb{R}^n : \|u\|_{\infty} \leq 1 \}
+
+
+
+References
+==========
+
+.. _1:
+[1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
+
diff --git a/doc/tutorials/tutorials.rst b/doc/tutorials/tutorials.rst
index 24652a871..fdc442ff4 100644
--- a/doc/tutorials/tutorials.rst
+++ b/doc/tutorials/tutorials.rst
@@ -33,3 +33,8 @@ Get details about Cox datafit equations.
 -----------------------------------------------------------------
 
 Mathematical details about the group Lasso, in particular with nonnegativity constraints.
+
+:ref:`Mathematics Behind L1 Regularization and Fermat's Rule <fermat_rule_reg>`
+-----------------------------------------------------------------
+
+Mathematical context about the choice of the regularization parameter in L1-regularization.

From 9d080af8d1c7bcc8292ee5b4d61cc819684cb9f1 Mon Sep 17 00:00:00 2001
From: wassimmazouz
Date: Tue, 2 Jul 2024 11:45:42 +0200
Subject: [PATCH 2/7] Theoretical background and example

---
 doc/tutorials/fermat_rule_reg.rst | 36 ++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/doc/tutorials/fermat_rule_reg.rst b/doc/tutorials/fermat_rule_reg.rst
index 4697f7ee5..685916154 100644
--- a/doc/tutorials/fermat_rule_reg.rst
+++ b/doc/tutorials/fermat_rule_reg.rst
@@ -9,7 +9,7 @@ This tutorial presents the mathematics behind solving the optimization problem
 :math:`\lambda` is greater than the infinity norm of the gradient of :math:`f` at zero, therefore justifying the choice in skglm of
 
 .. code-block::
-alpha_max = (popu_X.T @ (1 - popu_Y) / len(popu_Y)).max()
+alpha_max = np.max(np.abs(gradient0))
 
 Problem setup
 =============
@@ -40,11 +40,45 @@ The subdifferential of :math:`\|x\|_1` at 0 is the L-infinity ball:
 .. math::
     \partial \|x\|_1 |_{x=0} = \{ u \in \mathbb{R}^n : \|u\|_{\infty} \leq 1 \}
 
+Thus, for :math:`0 \in \partial g(x)` at :math:`x=0`:
+
+.. math::
+    0 \in \nabla f(0) + \lambda \partial \|x\|_1 |_{x=0}
+
+which implies, given that the dual of L1-norm is L-infinity:
+
+.. math::
+    \|\nabla f(0)\|_{\infty} \leq \lambda
+
+If :math:`\lambda > \|\nabla f(0)\|_{\infty}`, then the only solution is :math:`x=0`.
+
+Example
+=======
+
+Consider the loss function for Ordinary Least Squares :math:`f(x) = \frac{1}{2} \|Ax - b\|_2^2`. We have:
+
+.. math::
+    \nabla f(x) = A^T (Ax - b)
+
+At :math:`x=0`:
+
+.. math::
+    \nabla f(0) = -A^T b
+
+The infinity norm of the gradient at 0 is:
+
+.. math::
+    \|\nabla f(0)\|_{\infty} = \|A^T b\|_{\infty}
+
+For :math:`\lambda > \|A^T b\|_{\infty}`, the solution to :math:`\min_x \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1` is :math:`x=0`.
+
 
 
 References
 ==========
 
+The first 5 pages of the following article provide sufficient context for the problem at hand.
+
 .. _1:
 [1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
 

From 70b99ec66ea639d60db5d3c36d3f35cf6989e076 Mon Sep 17 00:00:00 2001
From: wassimmazouz
Date: Tue, 2 Jul 2024 15:03:07 +0200
Subject: [PATCH 3/7] requested changes

---
 doc/tutorials/fermat_rule_reg.rst | 30 ++++++++++++++++--------------
 doc/tutorials/tutorials.rst       |  2 +-
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/doc/tutorials/fermat_rule_reg.rst b/doc/tutorials/fermat_rule_reg.rst
index 685916154..d68180f9e 100644
--- a/doc/tutorials/fermat_rule_reg.rst
+++ b/doc/tutorials/fermat_rule_reg.rst
@@ -1,8 +1,8 @@
-.. _fermat_rule_reg:
+.. _reg_sol_zero:
 
-======================================================
-Mathematics Behind L1 Regularization and Fermat's Rule
-======================================================
+==========================================================
+Critical regularization strength above which solution is 0
+==========================================================
 
 This tutorial presents the mathematics behind solving the optimization problem
 :math:`\min f(x) + \lambda \|x\|_1` and demonstrates why the solution is zero when
 :math:`\lambda` is greater than the infinity norm of the gradient of :math:`f` at zero, therefore justifying the choice in skglm of
 
 .. code-block::
 alpha_max = np.max(np.abs(gradient0))
 
+However, the regularization parameter used at the end should preferably be a fraction of this (e.g. `alpha = 0.01 * alpha_max`).
+
 Problem setup
 =============
@@ -21,9 +23,9 @@ Consider the optimization problem:
 
 where:
 
-- :math:`f: \mathbb{R}^d \to \mathbb{R}` is a differentiable function,
+- :math:`f: \mathbb{R}^d \to \mathbb{R}` is a convex differentiable function,
 - :math:`\|x\|_1` is the L1 norm of :math:`x`,
-- :math:`\lambda \in \mathbb{R}` is a regularization parameter.
+- :math:`\lambda > 0` is a regularization parameter.
 
 We aim to determine the conditions under which the solution to this problem is :math:`x = 0`.
@@ -38,14 +40,14 @@ According to Fermat's rule, the minimum of the function occurs where the subdiff
 The subdifferential of :math:`\|x\|_1` at 0 is the L-infinity ball:
 
 .. math::
-    \partial \|x\|_1 |_{x=0} = \{ u \in \mathbb{R}^n : \|u\|_{\infty} \leq 1 \}
+    \partial \|x\|_1 |_{x=0} = \{ u \in \mathbb{R}^d : \|u\|_{\infty} \leq 1 \}
 
 Thus, for :math:`0 \in \partial g(x)` at :math:`x=0`:
 
 .. math::
     0 \in \nabla f(0) + \lambda \partial \|x\|_1 |_{x=0}
 
-which implies, given that the dual of L1-norm is L-infinity:
+which implies, given that the dual norm of L1-norm is L-infinity:
 
 .. math::
     \|\nabla f(0)\|_{\infty} \leq \lambda
@@ -55,29 +57,29 @@ If :math:`\lambda > \|\nabla f(0)\|_{\infty}`, then the only solution is :math:`
 Example
 =======
 
-Consider the loss function for Ordinary Least Squares :math:`f(x) = \frac{1}{2} \|Ax - b\|_2^2`. We have:
+Consider the loss function for Ordinary Least Squares :math:`f(x) = \frac{1}{2n} \|Ax - b\|_2^2`, where :math:`n` is the number of samples. We have:
 
 .. math::
-    \nabla f(x) = A^T (Ax - b)
+    \nabla f(x) = \frac{1}{n}A^T (Ax - b)
 
 At :math:`x=0`:
 
 .. math::
-    \nabla f(0) = -A^T b
+    \nabla f(0) = -\frac{1}{n}A^T b
 
 The infinity norm of the gradient at 0 is:
 
 .. math::
-    \|\nabla f(0)\|_{\infty} = \|A^T b\|_{\infty}
+    \|\nabla f(0)\|_{\infty} = \frac{1}{n}\|A^T b\|_{\infty}
 
-For :math:`\lambda > \|A^T b\|_{\infty}`, the solution to :math:`\min_x \frac{1}{2} \|Ax - b\|_2^2 + \lambda \|x\|_1` is :math:`x=0`.
+For :math:`\lambda \geq \frac{1}{n}\|A^T b\|_{\infty}`, the solution to :math:`\min_x \frac{1}{2n} \|Ax - b\|_2^2 + \lambda \|x\|_1` is :math:`x=0`.
 
 
 
 References
 ==========
 
-The first 5 pages of the following article provide sufficient context for the problem at hand.
+Refer to the section 3.1 and proposition 4 in particular of the following article for more details.
 
 .. _1:
 [1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
diff --git a/doc/tutorials/tutorials.rst b/doc/tutorials/tutorials.rst
index fdc442ff4..00685c31a 100644
--- a/doc/tutorials/tutorials.rst
+++ b/doc/tutorials/tutorials.rst
@@ -34,7 +34,7 @@ Get details about Cox datafit equations.
 
 Mathematical details about the group Lasso, in particular with nonnegativity constraints.
 
-:ref:`Mathematics Behind L1 Regularization and Fermat's Rule <fermat_rule_reg>`
+:ref:`Critical regularization strength above which solution is 0 <reg_sol_zero>`
 -----------------------------------------------------------------
 
 Mathematical context about the choice of the regularization parameter in L1-regularization.

From b79162f3c2afae5b412fecdf221157136d1f5776 Mon Sep 17 00:00:00 2001
From: mathurinm
Date: Tue, 2 Jul 2024 15:51:07 +0200
Subject: [PATCH 4/7] changes

---
 doc/tutorials/alpha_max.rst       | 93 +++++++++++++++++++++++++++++++
 doc/tutorials/fermat_rule_reg.rst | 86 ----------------------------
 doc/tutorials/tutorials.rst       | 10 ++--
 3 files changed, 98 insertions(+), 91 deletions(-)
 create mode 100644 doc/tutorials/alpha_max.rst
 delete mode 100644 doc/tutorials/fermat_rule_reg.rst

diff --git a/doc/tutorials/alpha_max.rst b/doc/tutorials/alpha_max.rst
new file mode 100644
index 000000000..e1e23a256
--- /dev/null
+++ b/doc/tutorials/alpha_max.rst
@@ -0,0 +1,93 @@
+.. _alpha_max:
+
+==========================================================
+Critical regularization strength above which solution is 0
+==========================================================
+
+This tutorial shows that for :math:`\lambda \geq \lambda_{\text{max}} = || \nabla f(0) ||_{\infty}`, the solution to
+:math:`\min f(x) + \lambda || x ||_1` is 0.
+
+In skglm, we thus frequently use
+
+.. code-block::
+
+    alpha_max = np.max(np.abs(gradient0))
+
+and choose for the regularization strength :math:`\alpha` a fraction of this critical value, e.g. ``alpha = 0.01 * alpha_max``.
+
+Problem setup
+=============
+
+Consider the optimization problem:
+
+.. math::
+    \min_x f(x) + \lambda || x||_1
+
+where:
+
+- :math:`f: \mathbb{R}^d \to \mathbb{R}` is a convex differentiable function,
+- :math:`|| x ||_1` is the L1 norm of :math:`x`,
+- :math:`\lambda > 0` is the regularization parameter.
+
+We aim to determine the conditions under which the solution to this problem is :math:`x = 0`.
+
+Theoretical background
+======================
+
+
+Let
+
+.. math::
+
+    g(x) = f(x) + \lambda || x||_1
+
+According to Fermat's rule, 0 is the minimizer of :math:`g` if and only if 0 is in the subdifferential of :math:`g` at 0.
+The subdifferential of :math:`|| x ||_1` at 0 is the L-infinity unit ball:
+
+.. math::
+    \partial || \cdot ||_1 (0) = \{ u \in \mathbb{R}^d : ||u||_{\infty} \leq 1 \}
+
+Thus,
+
+.. math::
+
+    0 \in \text{argmin} ~ g(x)
+    &\Leftrightarrow 0 \in \partial g(0) \\
+    &\Leftrightarrow
+    0 \in \nabla f(0) + \lambda \partial || \cdot ||_1 (0) \\
+    &\Leftrightarrow - \nabla f(0) \in \lambda \{ u \in \mathbb{R}^d : ||u||_{\infty} \leq 1 \} \\
+    &\Leftrightarrow || \nabla f(0) ||_\infty \leq \lambda
+
+
+We have just shown that the minimizer of :math:`g = f + \lambda || \cdot ||_1` is 0 if and only if :math:`\lambda \geq ||\nabla f(0)||_{\infty}`.
+
+Example
+=======
+
+Consider the loss function for Ordinary Least Squares :math:`f(x) = \frac{1}{2n} ||Ax - b||_2^2`, where :math:`n` is the number of samples. We have:
+
+.. math::
+    \nabla f(x) = \frac{1}{n}A^T (Ax - b)
+
+At :math:`x=0`:
+
+.. math::
+    \nabla f(0) = -\frac{1}{n}A^T b
+
+The infinity norm of the gradient at 0 is:
+
+.. math::
+    ||\nabla f(0)||_{\infty} = \frac{1}{n}||A^T b||_{\infty}
+
+For :math:`\lambda \geq \frac{1}{n}||A^T b||_{\infty}`, the solution to :math:`\min_x \frac{1}{2n} ||Ax - b||_2^2 + \lambda || x||_1` is :math:`x=0`.
+
+
+
+References
+==========
+
+Refer to Section 3.1 and Proposition 4 in particular of [1] for more details.
+
+.. _1:
+[1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
+
diff --git a/doc/tutorials/fermat_rule_reg.rst b/doc/tutorials/fermat_rule_reg.rst
deleted file mode 100644
index d68180f9e..000000000
--- a/doc/tutorials/fermat_rule_reg.rst
+++ /dev/null
@@ -1,86 +0,0 @@
-.. _reg_sol_zero:
-
-==========================================================
-Critical regularization strength above which solution is 0
-==========================================================
-
-This tutorial presents the mathematics behind solving the optimization problem
-:math:`\min f(x) + \lambda \|x\|_1` and demonstrates why the solution is zero when
-:math:`\lambda` is greater than the infinity norm of the gradient of :math:`f` at zero, therefore justifying the choice in skglm of
-
-.. code-block::
-alpha_max = np.max(np.abs(gradient0))
-
-However, the regularization parameter used at the end should preferably be a fraction of this (e.g. `alpha = 0.01 * alpha_max`).
-
-Problem setup
-=============
-
-Consider the optimization problem:
-
-.. math::
-    \min_x f(x) + \lambda \|x\|_1
-
-where:
-
-- :math:`f: \mathbb{R}^d \to \mathbb{R}` is a convex differentiable function,
-- :math:`\|x\|_1` is the L1 norm of :math:`x`,
-- :math:`\lambda > 0` is a regularization parameter.
-
-We aim to determine the conditions under which the solution to this problem is :math:`x = 0`.
-
-Theoretical Background
-======================
-
-According to Fermat's rule, the minimum of the function occurs where the subdifferential of the objective function includes zero. For our problem, the objective function is:
-
-.. math::
-    g(x) = f(x) + \lambda \|x\|_1
-
-The subdifferential of :math:`\|x\|_1` at 0 is the L-infinity ball:
-
-.. math::
-    \partial \|x\|_1 |_{x=0} = \{ u \in \mathbb{R}^d : \|u\|_{\infty} \leq 1 \}
-
-Thus, for :math:`0 \in \partial g(x)` at :math:`x=0`:
-
-.. math::
-    0 \in \nabla f(0) + \lambda \partial \|x\|_1 |_{x=0}
-
-which implies, given that the dual norm of L1-norm is L-infinity:
-
-.. math::
-    \|\nabla f(0)\|_{\infty} \leq \lambda
-
-If :math:`\lambda > \|\nabla f(0)\|_{\infty}`, then the only solution is :math:`x=0`.
-
-Example
-=======
-
-Consider the loss function for Ordinary Least Squares :math:`f(x) = \frac{1}{2n} \|Ax - b\|_2^2`, where :math:`n` is the number of samples. We have:
-
-.. math::
-    \nabla f(x) = \frac{1}{n}A^T (Ax - b)
-
-At :math:`x=0`:
-
-.. math::
-    \nabla f(0) = -\frac{1}{n}A^T b
-
-The infinity norm of the gradient at 0 is:
-
-.. math::
-    \|\nabla f(0)\|_{\infty} = \frac{1}{n}\|A^T b\|_{\infty}
-
-For :math:`\lambda \geq \frac{1}{n}\|A^T b\|_{\infty}`, the solution to :math:`\min_x \frac{1}{2n} \|Ax - b\|_2^2 + \lambda \|x\|_1` is :math:`x=0`.
-
-
-
-References
-==========
-
-Refer to the section 3.1 and proposition 4 in particular of the following article for more details.
-
-.. _1:
-[1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
-
diff --git a/doc/tutorials/tutorials.rst b/doc/tutorials/tutorials.rst
index 00685c31a..d42d8eeaa 100644
--- a/doc/tutorials/tutorials.rst
+++ b/doc/tutorials/tutorials.rst
@@ -25,16 +25,16 @@ Explore how ``skglm`` fits an unpenalized intercept.
 
 
 :ref:`Mathematics behind Cox datafit `
------------------------------------------------------------------
+---------------------------------------------------------
 
 Get details about Cox datafit equations.
 
 :ref:`Details on the group Lasso `
------------------------------------------------------------------
+-------------------------------------------------------
 
 Mathematical details about the group Lasso, in particular with nonnegativity constraints.
 
-:ref:`Critical regularization strength above which solution is 0 <reg_sol_zero>`
------------------------------------------------------------------
+:ref:`Critical regularization strength above which solution is 0 <alpha_max>`
+-----------------------------------------------------------------------------
 
-Mathematical context about the choice of the regularization parameter in L1-regularization.
+How to chose the regularization strength in L1-regularization?

From c68294e1824b3603ae6f3701d77cb21cb42c8ab2 Mon Sep 17 00:00:00 2001
From: wassimmazouz
Date: Tue, 2 Jul 2024 16:09:29 +0200
Subject: [PATCH 5/7] FIX unindent error message

---
 doc/tutorials/alpha_max.rst | 1 +
 doc/tutorials/tutorials.rst | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/tutorials/alpha_max.rst b/doc/tutorials/alpha_max.rst
index e1e23a256..f229d41a3 100644
--- a/doc/tutorials/alpha_max.rst
+++ b/doc/tutorials/alpha_max.rst
@@ -89,5 +89,6 @@ References
 Refer to Section 3.1 and Proposition 4 in particular of [1] for more details.
 
 .. _1:
+
 [1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
 
diff --git a/doc/tutorials/tutorials.rst b/doc/tutorials/tutorials.rst
index d42d8eeaa..d86b58840 100644
--- a/doc/tutorials/tutorials.rst
+++ b/doc/tutorials/tutorials.rst
@@ -37,4 +37,4 @@ Mathematical details about the group Lasso, in particular with nonnegativity con
 :ref:`Critical regularization strength above which solution is 0 <alpha_max>`
 -----------------------------------------------------------------------------
 
-How to chose the regularization strength in L1-regularization?
+How to choose the regularization strength in L1-regularization?

From ec636d4ac9067e18c77fbe33be9b1d5b314e6310 Mon Sep 17 00:00:00 2001
From: wassimmazouz
Date: Tue, 2 Jul 2024 16:59:03 +0200
Subject: [PATCH 6/7] try with eqnarray

---
 doc/tutorials/alpha_max.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/doc/tutorials/alpha_max.rst b/doc/tutorials/alpha_max.rst
index f229d41a3..886215cf9 100644
--- a/doc/tutorials/alpha_max.rst
+++ b/doc/tutorials/alpha_max.rst
@@ -51,12 +51,14 @@ Thus,
 
 .. math::
 
+    \begin{eqnarray}
     0 \in \text{argmin} ~ g(x)
     &\Leftrightarrow 0 \in \partial g(0) \\
     &\Leftrightarrow
     0 \in \nabla f(0) + \lambda \partial || \cdot ||_1 (0) \\
     &\Leftrightarrow - \nabla f(0) \in \lambda \{ u \in \mathbb{R}^d : ||u||_{\infty} \leq 1 \} \\
     &\Leftrightarrow || \nabla f(0) ||_\infty \leq \lambda
+    \end{eqnarray}
 
 

From 181da37b5defe1660028b527125e4a028a72d26e Mon Sep 17 00:00:00 2001
From: Badr-MOUFAD
Date: Tue, 2 Jul 2024 17:06:41 +0200
Subject: [PATCH 7/7] fix math render

---
 doc/tutorials/alpha_max.rst | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/doc/tutorials/alpha_max.rst b/doc/tutorials/alpha_max.rst
index 886215cf9..8c105f87d 100644
--- a/doc/tutorials/alpha_max.rst
+++ b/doc/tutorials/alpha_max.rst
@@ -50,15 +50,18 @@ The subdifferential of :math:`|| x ||_1` at 0 is the L-infinity unit ball:
 Thus,
 
 .. math::
+    :nowrap:
 
-    \begin{eqnarray}
-    0 \in \text{argmin} ~ g(x)
-    &\Leftrightarrow 0 \in \partial g(0) \\
-    &\Leftrightarrow
-    0 \in \nabla f(0) + \lambda \partial || \cdot ||_1 (0) \\
-    &\Leftrightarrow - \nabla f(0) \in \lambda \{ u \in \mathbb{R}^d : ||u||_{\infty} \leq 1 \} \\
-    &\Leftrightarrow || \nabla f(0) ||_\infty \leq \lambda
-    \end{eqnarray}
+    \begin{equation}
+    \begin{aligned}
+    0 \in \text{argmin} ~ g(x)
+    &\Leftrightarrow 0 \in \partial g(0) \\
+    &\Leftrightarrow
+    0 \in \nabla f(0) + \lambda \partial || \cdot ||_1 (0) \\
+    &\Leftrightarrow - \nabla f(0) \in \lambda \{ u \in \mathbb{R}^d : ||u||_{\infty} \leq 1 \} \\
+    &\Leftrightarrow || \nabla f(0) ||_\infty \leq \lambda
+    \end{aligned}
+    \end{equation}
 
 
 We have just shown that the minimizer of :math:`g = f + \lambda || \cdot ||_1` is 0 if and only if :math:`\lambda \geq ||\nabla f(0)||_{\infty}`.
@@ -93,4 +96,3 @@ Refer to Section 3.1 and Proposition 4 in particular of [1] for more details.
 .. _1:
 
 [1] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. 2017. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res. 18, 1 (January 2017), 4671–4703.
-
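
The :math:`\lambda_{\text{max}}` criterion derived in the tutorial above is easy to check numerically. The sketch below is illustrative and not part of the patch series: the data, variable names, and the use of scikit-learn's ``Lasso`` (which uses the same ``1/(2n)`` datafit scaling as the tutorial's example) are assumptions of this sketch, not something the patches prescribe. It computes ``alpha_max`` from the gradient of the quadratic datafit at zero, then fits at and below the critical value.

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import Lasso  # objective: ||y - Xw||^2 / (2n) + alpha * ||w||_1

    rng = np.random.default_rng(0)
    n, d = 50, 30
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    # For f(w) = ||Xw - y||^2 / (2n), the gradient at w = 0 is -X.T @ y / n,
    # so the critical value is alpha_max = ||X.T @ y||_inf / n.
    alpha_max = np.max(np.abs(X.T @ y)) / n

    for frac in (1.0, 0.5):
        w = Lasso(alpha=frac * alpha_max, fit_intercept=False).fit(X, y).coef_
        print(f"alpha = {frac} * alpha_max -> solution is zero: {np.allclose(w, 0)}")

At ``alpha = 1.0 * alpha_max`` every fitted coefficient is zero, while at ``alpha = 0.5 * alpha_max`` at least one coefficient is nonzero, matching the equivalence :math:`0 \in \text{argmin} ~ g \Leftrightarrow \lambda \geq || \nabla f(0) ||_{\infty}` shown in the tutorial.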