<html><head><title>niplav</title>
<link href="./favicon.png" rel="shortcut icon" type="image/png"/>
<link href="main.css" rel="stylesheet" type="text/css"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<!DOCTYPE HTML>
<style type="text/css">
code.has-jax {font: inherit; font-size: 100%; background: inherit; border: inherit;}
</style>
<script async="" src="./mathjax/latest.js?config=TeX-MML-AM_CHTML" type="text/javascript">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
extensions: ["tex2jax.js"],
jax: ["input/TeX", "output/HTML-CSS"],
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
},
"HTML-CSS": { availableFonts: ["TeX"] }
});
</script>
<script>
document.addEventListener('DOMContentLoaded', function () {
// Change the title to the h1 header
var title = document.querySelector('h1')
if(title) {
var title_elem = document.querySelector('title')
title_elem.textContent=title.textContent + " – niplav"
}
});
</script>
</head><body><h2 id="home"><a href="./index.html">home</a></h2>
<p><em>author: niplav, created: 2022-10-19, modified: 2022-11-25, language: english, status: in progress, importance: 2, confidence: likely</em></p>
<blockquote>
<p><strong>Solutions to the textbook “Introduction to Optimization”.</strong></p>
</blockquote><div class="toc"><div class="toc-title">Contents</div><ul><li><a href="#401">4.0.1</a><ul></ul></li><li><a href="#411">4.11</a><ul><li><a href="#i">(i)</a><ul></ul></li><li><a href="#ii">(ii)</a><ul></ul></li><li><a href="#iii">(iii)</a><ul></ul></li><li><a href="#iv">(iv)</a><ul></ul></li></ul></li><li><a href="#431">4.3.1</a><ul></ul></li><li><a href="#433">4.3.3</a><ul><li><a href="#i_1">(i)</a><ul></ul></li><li><a href="#ii_1">(ii)</a><ul></ul></li></ul></li><li><a href="#441">4.4.1</a><ul><li><a href="#i_2">(i)</a><ul></ul></li><li><a href="#ii_2">(ii)</a><ul></ul></li><li><a href="#iii_1">(iii)</a><ul></ul></li><li><a href="#iv_1">(iv)</a><ul></ul></li></ul></li></ul></div>
<h1 id="Solutions_to_Introduction_to_Optimization"><a class="hanchor" href="#Solutions_to_Introduction_to_Optimization">Solutions to “Introduction to Optimization”</a></h1>
<h3 id="401"><a class="hanchor" href="#401">4.0.1</a></h3>
<p>See <a href="./mfis_solutions.html#24">here</a>.</p>
<h3 id="411"><a class="hanchor" href="#411">4.11</a></h3>
<h4 id="i"><a class="hanchor" href="#i">(i)</a></h4>
<p>Some standard numpy code:</p>
<pre><code>import numpy as np

def givec(x, c=10):
    # diagonal matrix C with entries c**(i/(n-1)), i=0…n-1: condition number c
    n=len(x)
    iexp=np.arange(n)
    cdiag=c**(iexp/(n-1))
    C=np.diag(cdiag)
    return C

def fsq(x, c=10):
    # ill-conditioned quadratic x^T C x
    C=givec(x, c=c)
    return x.T@C@x

def fhole(x, c=10, a=0.1):
    # squashed version of fsq, with values in [0,1)
    retval=fsq(x, c=c)
    return retval/(a**2+retval)
</code></pre>
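<p>A quick spot check (my own, not part of the exercise), using the fact
that for <code>$n=2$</code> and <code>$c=10$</code> we get <code>$C=\text{diag}(1,10)$</code>:</p>
<pre><code>x=np.array([1.0, 1.0])
print(fsq(x))   # 1 + 10 = 11.0
print(fhole(x)) # 11/(0.01+11) ≈ 0.9991
</code></pre>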
<p>The derivative of <code>$f_{\text{sq}}=x^{\top}Cx$</code> is <code>$(C+C^{\top})x$</code>
(as per identity 81 from the <a href="https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf">Matrix
Cookbook</a>).</p>
<p>This agrees with identity 2.3 of Maths for Intelligent
Systems (Toussaint 2022), which gives
<code>$\frac{\partial}{\partial x} x^{\top}Cx=x^{\top}C \frac{\partial}{\partial x}x+x^{\top} C^{\top} \frac{\partial}{\partial x} x=x^{\top}C+x^{\top}C^{\top}$</code>;
transposing this row vector yields <code>$C^{\top}x+Cx=(C^{\top}+C)x$</code>,
the same as above (and as identity 97 from the Matrix
Cookbook).</p>
<p>So then</p>
<pre><code>def fsqgrad(x, c=10):
    C=givec(x, c=c)
    return (C.T+C)@x
</code></pre>
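<p>To make sure I got the gradient right, a quick central-difference check
(my own sketch, not from the book):</p>
<pre><code>def numgrad(f, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g=np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e=np.zeros(len(x))
        e[i]=eps
        g[i]=(f(x+e)-f(x-e))/(2*eps)
    return g

x=np.array([1.0, 1.0])
print(numgrad(fsq, x)) # ≈ [2., 20.]
print(fsqgrad(x))      # [2., 20.]
</code></pre>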
<p>Since <code>$f_{\text{hole}}$</code> is a scalar-valued quotient of
<code>$f_{\text{sq}}$</code> and <code>$a^2+f_{\text{sq}}$</code>, one can
use a simple <a href="https://en.wikipedia.org/wiki/Quotient_rule">quotient rule</a> for the gradient:</p>
<div>
$$\nabla f_{\text{hole}}(x)=\\
\frac{(C^{\top}+C)x\cdot(a^2+x^{\top}Cx)-x^{\top}Cx\cdot (C^{\top}+C)x}{(a^2+x^{\top}Cx)^2}= \\
\frac{(C^{\top}+C)x\cdot(a^2+x^{\top}Cx-x^{\top}Cx)}{(a^2+x^{\top}Cx)^2}= \\
\frac{a^2 \cdot (C^{\top}+C)x}{(a^2+x^{\top}Cx)^2}$$
</div>
<p>The implementation is straightforward:</p>
<pre><code>def fholegrad(x, c=10, a=0.1):
    retval_fsq=fsq(x, c=c)
    retval_fsqgrad=fsqgrad(x, c=c)
    return (retval_fsqgrad*(a**2+retval_fsq)-retval_fsq*retval_fsqgrad)/(a**2+retval_fsq)**2
</code></pre>
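<p>Or equivalently, using the simplified form derived above (my own
variant, slightly cheaper):</p>
<pre><code>def fholegrad_simple(x, c=10, a=0.1):
    # a² (C^T+C)x / (a² + x^T C x)², from the simplification above
    return a**2*fsqgrad(x, c=c)/(a**2+fsq(x, c=c))**2
</code></pre>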
<h4 id="ii"><a class="hanchor" href="#ii">(ii)</a></h4>
<p>The function can be pretty directly translated from the pseudocode in
the book to Python (no surprise here):</p>
<pre><code>import math

def graddesc(f, fgrad, alpha=0.001, theta=10e-20, x0=[1,1], c=10, a=0.1):
    x=np.array(x0)
    i=0
    prevcost=-math.inf
    curcost=f(x)
    # stop once the cost barely changes anymore
    while abs(prevcost-curcost)>theta:
        prevcost=curcost
        x=x-alpha*fgrad(x)
        curcost=f(x)
        i=i+1
    print("solution: ", x)
    print("iterations: ", i)
    print("cost: ", f(x))
    return x
</code></pre>
<p>Executing <code>graddesc</code> with <code>fsq</code>/<code>fsqgrad</code> and <code>fhole</code>/<code>fholegrad</code> takes
~9.5k iterations to find the minimum of <code>fsq</code>, and ~185k
iterations for <code>fhole</code>. But it finds the minimum:</p>
<pre><code>print("fsq:")
graddesc(fsq, fsqgrad)
print("fhole:")
graddesc(fhole, fholegrad)
fsq:
solution: [4.98354923e-09 1.65118900e-84]
iterations: 9549
cost: 2.4835762963261225e-17
fhole:
solution: [ 4.13240000e-11 -7.98149419e-19]
iterations: 184068
cost: 1.7076729729684466e-19
</code></pre>
<h4 id="iii"><a class="hanchor" href="#iii">(iii)</a></h4>
<p>Gradient descent with backtracking line search and stepsize adaptation:</p>
<pre><code>def backtrack_graddesc(f, fgrad, alpha=0.05, theta=10e-15, x0=[1,1], c=10, a=0.1, rho_ls=0.01, rho_plus=1.2, rho_minus=0.5, delta_max=math.inf):
    x=np.array(x0)
    i=0
    delta=-fgrad(x)/np.linalg.norm(fgrad(x))
    while np.linalg.norm(alpha*delta)>=theta:
        # backtrack until the step satisfies the Wolfe condition
        while f(x+alpha*delta)>f(x)+rho_ls*fgrad(x)@(alpha*delta):
            alpha=rho_minus*alpha
        x=x+alpha*delta
        alpha=min(rho_plus*alpha, delta_max)
        # recompute the (normalized) steepest-descent direction
        delta=-fgrad(x)/np.linalg.norm(fgrad(x))
        i=i+1
    return x
</code></pre>
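<p>Called the same way as before (my own spot check, parameters left at
their defaults):</p>
<pre><code>print(backtrack_graddesc(fsq, fsqgrad))
print(backtrack_graddesc(fhole, fholegrad))
</code></pre>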
<h4 id="iv"><a class="hanchor" href="#iv">iv)</a></h4>
<p>First we compute the derivative of the objective in <code>$\underset{β}{\text{argmin}}||y-Xβ||^2+λ||β||^2$</code>:</p>
<div>
$$\frac{\partial\; ||y-Xβ||^2+λ||β||^2}{\partial β}= \\
\frac{\partial ||y-Xβ||^2}{\partial β} + \frac{\partial λ||β||^2}{\partial β}= \\
-2X^{\top}(y-Xβ)+2λβ$$
</div>
<p>Setting this to zero gives the closed-form solution
<code>$β=(X^{\top}X+λI)^{-1}X^{\top}y$</code>.</p>
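<p>A sketch of how this slots into <code>graddesc</code> above (the data
<code>X</code>, <code>y</code> and the regularizer <code>lam</code> are made
up for illustration):</p>
<pre><code>rng=np.random.default_rng(0)
X=rng.standard_normal((20, 3))
y=X@np.array([1.0, -2.0, 0.5])+0.1*rng.standard_normal(20)
lam=0.1

def ridge_cost(beta):
    return (y-X@beta)@(y-X@beta)+lam*beta@beta

def ridge_grad(beta):
    return -2*X.T@(y-X@beta)+2*lam*beta

beta_gd=graddesc(ridge_cost, ridge_grad, x0=[0,0,0])
# should agree with the closed-form solution
beta_closed=np.linalg.solve(X.T@X+lam*np.eye(3), X.T@y)
print(beta_closed)
</code></pre>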
<h3 id="431"><a class="hanchor" href="#431">4.3.1</a></h3>
<div>
</div>
<h3 id="433"><a class="hanchor" href="#433">4.3.3</a></h3>
<div>
$$f(x)=\frac{1}{2}x^{\top} A x+b^{\top}x+c=\\
\frac{1}{2} x^{\top} A x=\\
\frac{1}{2} x^{\top} \left [\matrix{2 & 0 \cr 0 & 1}\right ] x$$
</div>
<p>(since <code>$b=\mathbf{0}$</code> and <code>$c=0$</code>). So the negative gradient is</p>
<div>
$$-\nabla f(x)= \\
-\frac{1}{2} x^{\top} (A+A^{\top})=\\
-\frac{1}{2} x^{\top} \left [\matrix{4 & 0 \cr 0 & 2}\right ]$$
</div>
<h4 id="i_1"><a class="hanchor" href="#i_1">(i)</a></h4>
<p><code>$x_0=\left [ \matrix{1 \cr 1} \right ]$</code>, so</p>
<div>
$$δ_0=\\
-\nabla f(x_0)=\\
-\frac{1}{2} \left [ \matrix{1 & 1} \right ] \left [\matrix{4 & 0 \cr 0 & 2}\right ]=\\
-\left [ \matrix{2 & 1} \right ]$$
</div>
<p>Then</p>
<div>
$$f(x+αδ)=\\
\frac{1}{2} (\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])^{\top}\left [\matrix{2 & 0 \cr 0 & 1}\right ](\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])=\\
\frac{1}{2} (\left [ \matrix{1 \cr 1} \right]-α\left [ \matrix{2 \cr 1} \right ])^{\top}(\left [ \matrix{2 \cr 1} \right]-α\left [ \matrix{4 \cr 1} \right ])=\\
\frac{1}{2} (\left [ \matrix{1 & 1} \right]-α\left [ \matrix{2 & 1} \right ])(\left [ \matrix{2 \cr 1} \right]-α\left [ \matrix{4 \cr 1} \right ])=\\
\frac{1}{2}(3-10\cdot α+9 \cdot α^2)=\\
1.5-5\cdot α+4.5 \cdot α^2$$
</div>
<p>The derivative of that is <code>$-5+9 \cdot α$</code>. Setting it to zero gives
<code>$α=\frac{5}{9}$</code>.</p>
<p>The next <code>$x$</code> then is (going by the values we computed)
<code>$\left [ \matrix{1 \cr 1} \right ]-\frac{5}{9}\left [ \matrix{2 \cr 1} \right ]=\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]$</code>.</p>
<p>The new negative gradient then is</p>
<div>
$$g=\\
-\nabla f(x_1)=\\
-\frac{1}{2} \left [ \matrix{-\frac{1}{9} & \frac{4}{9}} \right ] \left [\matrix{4 & 0 \cr 0 & 2}\right ]=\\
\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]$$
</div>
<p>Now computing <code>$β$</code> (where <code>$g'=\left [ \matrix{-2 & -1} \right ]$</code> is the previous negative gradient):</p>
<div>
$$\frac{g^{\top}(g-g')}{g'^{\top}g'}=\\
\frac{\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]^{\top}(\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]-\left [ \matrix{-2 & -1} \right ])}{\left [ \matrix{-2 & -1} \right ]^{\top}\left [ \matrix{-2 & -1} \right ]}=\\
\frac{20/81}{5}=\frac{4}{81}\approx 0.049$$
</div>
<p>which is greater than zero, so we keep it.</p>
<p>Then <code>$δ=g+βδ'=\left [ \matrix{\frac{2}{9} & -\frac{4}{9}} \right ]+\frac{4}{81}\left [ \matrix{-2 & -1} \right ]=\left [ \matrix{\frac{10}{81} & -\frac{40}{81}} \right ]$</code>.</p>
<p>We now go through the whole exercise a second time:</p>
<div>
$$f(x+αδ)=\\
\frac{1}{2} (\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+α\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ])^{\top}\left [\matrix{2 & 0 \cr 0 & 1}\right ](\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+α\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ])=\\
\frac{(9-10\cdot α)^2}{729}$$
</div>
<p>The derivative of that is <code>$\frac{-20 \cdot (9-10\cdot α)}{729}$</code>, which is zero at
<code>$α=\frac{9}{10}$</code>.</p>
<p>Then the new <code>$x$</code> is <code>$\left [ \matrix{-\frac{1}{9} \cr \frac{4}{9}} \right ]+\frac{9}{10}\left [ \matrix{\frac{10}{81} \cr -\frac{40}{81}} \right ]=\left [ \matrix{0 \cr 0} \right ]$</code>:
the exact minimum, reached after two steps, as conjugate gradient on a
two-dimensional quadratic should.</p>
<p>…And we're done with the second round. I'm not going to do this for
<code>$x_0=(-1,2)$</code>, in case you were wondering; I hope I've shown that I <em>can</em>
do this, and it's tedious to type out.</p>
<!--TODO: maybe do other ones too?-->
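<p>For peace of mind, a quick numerical check of the two steps above (my
own sketch, not part of the book):</p>
<pre><code>A=np.diag([2.0, 1.0])
grad=lambda x: A@x  # A symmetric, b=0, so the gradient is Ax

x=np.array([1.0, 1.0])
g=-grad(x)          # negative gradient
d=g                 # first search direction
alpha=(g@d)/(d@A@d) # exact line search: 5/9
x=x+alpha*d         # [-1/9, 4/9]

g_old, g=g, -grad(x)
beta=max(0.0, g@(g-g_old)/(g_old@g_old)) # 4/81
d=g+beta*d
alpha=(g@d)/(d@A@d) # 9/10
x=x+alpha*d
print(x)            # ≈ [0, 0]: the minimum, after two steps
</code></pre>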
<h4 id="ii_1"><a class="hanchor" href="#ii_1">(ii)</a></h4>
<p>Intuitively, this makes sense: we walked all the way to the minimum in
the direction of the current gradient, so the next direction should not
go "back" in the direction we came from, or further than we needed
to go.</p>
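<p>A quick argument: exact line search picks <code>$α$</code> such that</p>
<div>
$$\frac{\partial}{\partial α} f(x+αδ)=\nabla f(x+αδ)^{\top}δ=0$$
</div>
<p>so the gradient at the new point is orthogonal to the search direction
we just walked along.</p>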
<!--TODO: actually prove this-->
<h3 id="441"><a class="hanchor" href="#441">4.4.1</a></h3>
<p>Not gonna show you the sketch, of course. (I also don't have time right
now to learn <a href="https://manim.community">manim</a> for that, though it would
be cool).</p>
<h4 id="i_2"><a class="hanchor" href="#i_2">(i)</a></h4>
<p>The solution is <code>$\frac{-π}{2}$</code>.</p>
<h4 id="ii_2"><a class="hanchor" href="#ii_2">(ii)</a></h4>
<p>The solution is
<code>$\left [ \matrix{-\frac{\sqrt{2}}{2} \cr -\frac{\sqrt{2}}{2}} \right ]$</code>.</p>
<h4 id="iii_1"><a class="hanchor" href="#iii_1">(iii)</a></h4>
<p>Since the gradient is "pulling" to the bottom left, the best one can do
is to set <code>$x_1$</code> to zero. So
<code>$x=\left [ \matrix{0 \cr -1} \right ]$</code>.</p>
<h4 id="iv_1"><a class="hanchor" href="#iv_1">(iv)</a></h4>
<p>The second constraint can be pictured the following way: along the <code>$x_2$</code>
axis there is a parabola, forming a parabolic cylinder which is zero where
<code>$x_2$</code> is zero and grows symmetrically in both directions. At
<code>$x_2≥1$</code> and <code>$x_2≤-1$</code> this cylinder is greater than <code>$x_1$</code>,
which we want to avoid. So we know that <code>$x_2=-1$</code>.</p>
<p>But what is <code>$x_1$</code>? It should be as small as possible. Plugging in
<code>$x_1=0$</code> tells us that this is the best we can do.</p>
</body></html>