class: middle, center, title-slide
Lecture 3: Automatic differentiation
Prof. Gilles Louppe
[email protected]
- Calculus
- Automatic differentiation
- Implementation
- Beyond neural networks
class: middle
.italic[Implementing backpropagation by hand is like programming in assembly language. You will probably never do it, but it is important for having a mental model of how everything works.]
.pull-right[Roger Grosse]
???
Promise for today!
class: middle
- Gradient-based training algorithms are the workhorse of deep learning.
- Deriving gradients by hand is tedious and error-prone, and quickly becomes impractical for complex models.
- Changes to the model require rederiving the gradient.
.footnote[Image credits: Visualizing optimization trajectory of neural nets, Logan Yang, 2020.]
class: middle
A program is defined as a composition of primitive operations that we know how to differentiate individually.
```python
import jax.numpy as jnp
from jax import grad

def predict(params, inputs):
    for W, b in params:
        outputs = jnp.dot(inputs, W) + b
        inputs = jnp.tanh(outputs)
    return outputs

def loss_fun(params, inputs, targets):
    preds = predict(params, inputs)
    return jnp.mean((preds - targets)**2)

grad_fun = grad(loss_fun)
```
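As a possible usage sketch (not from the original slides; the layer sizes, data, and random keys below are made up for illustration), `grad_fun` can be called on concrete parameters and data, and returns gradients with the same nested structure as `params`:

```python
# Hypothetical usage of the snippet above (illustrative shapes and data).
import jax.numpy as jnp
from jax import random

key = random.PRNGKey(0)
k1, k2, k3 = random.split(key, 3)
params = [
    (random.normal(k1, (3, 4)), jnp.zeros(4)),  # layer 1: W (3x4), b (4,)
    (random.normal(k2, (4, 1)), jnp.zeros(1)),  # layer 2: W (4x1), b (1,)
]
inputs = random.normal(k3, (8, 3))
targets = jnp.ones((8, 1))

grads = grad_fun(params, inputs, targets)  # same nested structure as params
```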
class: middle
Modern frameworks support higher-order derivatives.
```python
def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))  # what sorcery is this?!
...
```
???
Will show a demo later on.
class: middle
Automatic differentiation (AD) provides a family of techniques for evaluating the derivatives of a function specified by a computer program.
- $\neq$ symbolic differentiation, which aims at identifying some human-readable expression of the derivative.
- $\neq$ numerical differentiation (finite differences), which may introduce round-off errors (see the sketch below).
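As a minimal sketch (assumed example, not from the slides), the contrast can be made concrete in JAX: a finite-difference estimate of $\sin'(1)$ suffers from truncation and round-off, while `grad` evaluates the derivative up to floating-point precision:

```python
# A minimal sketch (assumed example): finite differences vs automatic
# differentiation on f(x) = sin(x), whose derivative is cos(x).
import jax.numpy as jnp
from jax import grad

def finite_diff(f, x, h=1e-5):
    # Central difference: subject to truncation and round-off errors.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
print(finite_diff(jnp.sin, x))  # approximate
print(grad(jnp.sin)(x))         # automatic differentiation
print(jnp.cos(x))               # analytical reference
```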
class: middle
Let $f : \mathbb{R} \to \mathbb{R}$.

The derivative of $f$ is defined as
$$f'(x) = \frac{\partial f}{\partial x}(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},$$
where
- $f'(x)$ is the Lagrange notation,
- $\frac{\partial f}{\partial x}(x)$ is the Leibniz notation.
class: middle, center
The derivative $f'(x)$ measures the local rate of change of $f$ at $x$: it is the slope of the tangent to the graph of $f$ at $(x, f(x))$.
The gradient of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the vector of its partial derivatives,
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\mathbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^{n}.$$

Applying the definition of the derivative coordinate-wise, we have
$$\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h \mathbf{e}_i) - f(\mathbf{x})}{h},$$
where $\mathbf{e}_i$ denotes the $i$-th basis vector.
???
Note how each coordinate-wise derivative is a directional derivative in the direction $\mathbf{e}_i$.
The Jacobian of a function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$ is the matrix of its first-order partial derivatives,
$$J_\mathbf{f}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(\mathbf{x}) & \ldots & \frac{\partial f_1}{\partial x_n}(\mathbf{x}) \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}(\mathbf{x}) & \ldots & \frac{\partial f_m}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

The gradient's transpose is thus a wide Jacobian ($m = 1$).
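As a minimal sketch (assumed example), these objects map directly to JAX transformations: `grad` for gradients of scalar-valued functions and `jacobian` for Jacobians of vector-valued ones:

```python
# A minimal sketch (assumed example): gradients and Jacobians in JAX.
import jax.numpy as jnp
from jax import grad, jacobian

def f(x):   # f : R^3 -> R
    return jnp.sum(x ** 2)

def g(x):   # g : R^3 -> R^2
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
print(grad(f)(x))       # gradient, shape (3,)
print(jacobian(g)(x))   # Jacobian, shape (2, 3)
```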
class: middle
Let us assume a function $\mathbf{f} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_t}$ defined as a chain composition
$$\mathbf{x}_t = \mathbf{f}(\mathbf{x}_0) = (\mathbf{f}_t \circ \mathbf{f}_{t-1} \circ \ldots \circ \mathbf{f}_1)(\mathbf{x}_0),$$
where $\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1}) \in \mathbb{R}^{n_k}$ for $k = 1, \ldots, t$.
class: middle
By the chain rule, $$ \begin{aligned} \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_0} &= \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \underbrace{\frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{0}}}_{\text{recursive case}} \\ \\ &= \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_{t-1}} \frac{\partial \mathbf{x}_{t-1}}{\partial \mathbf{x}_{t-2}} \ldots \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_0} \end{aligned} $$
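As a minimal sketch (assumed example, with arbitrary elementwise primitives), the chain rule can be checked numerically by multiplying the per-stage Jacobians:

```python
# A minimal sketch (assumed example): the Jacobian of a composition is the
# product of the Jacobians of its stages.
import jax.numpy as jnp
from jax import jacobian

f1 = lambda x: jnp.tanh(x)        # x1 = f1(x0)
f2 = lambda x: jnp.sin(x) * x     # x2 = f2(x1)
g = lambda x: f2(f1(x))           # x2 = f2(f1(x0))

x0 = jnp.array([0.5, -1.0, 2.0])
x1 = f1(x0)

J_chain = jacobian(f2)(x1) @ jacobian(f1)(x0)   # dx2/dx1 dx1/dx0
J_direct = jacobian(g)(x0)                      # dx2/dx0
assert jnp.allclose(J_chain, J_direct, atol=1e-5)
```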
class: middle
class: middle
The time complexities of the forward and reverse accumulations are respectively
$$\mathcal{O}\left(n_0 \sum_{k=1}^{t} n_k n_{k-1}\right) \quad \text{and} \quad \mathcal{O}\left(n_t \sum_{k=1}^{t} n_k n_{k-1}\right).$$
(Prove it!)
.success[If $n\_t \ll n\_0$ (which is typical in deep learning), then .bold[backward accumulation is cheaper]. And vice-versa.]
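As a minimal sketch (assumed example), this asymmetry is directly visible in JAX, where `jacfwd` and `jacrev` implement the two accumulation orders:

```python
# A minimal sketch (assumed example): forward vs reverse accumulation.
import jax.numpy as jnp
from jax import jacfwd, jacrev

def f(x):   # many inputs (n_0 = 1000), one output (n_t = 1)
    return jnp.sum(jnp.tanh(x) ** 2)

x = jnp.ones(1000)
J_fwd = jacfwd(f)(x)   # forward accumulation: one JVP per input dimension
J_rev = jacrev(f)(x)   # reverse accumulation: a single VJP for the scalar output
assert jnp.allclose(J_fwd, J_rev, atol=1e-5)
```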
???
Prove it.
Chain compositions can be generalized to feedforward neural networks of the form
$$\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1}, \theta_k)$$
for $k = 1, \ldots, t$, where $\theta_k$ denote the parameters of the $k$-th layer.
class: middle, center
(whiteboard example)
Let $f$ be a function of $s$ input variables, where
- $\mathbf{x}_1, \ldots, \mathbf{x}_s$ are the input variables,
- $f(\mathbf{x}_1, \ldots, \mathbf{x}_s)$ is implemented by a computer program producing intermediate variables $\mathbf{x}_{s+1}, \ldots, \mathbf{x}_t$,
- $t$ is the total number of variables, with $\mathbf{x}_t$ denoting the output variable,
- $\mathbf{x}_k \in \mathbb{R}^{n_k}$, for $k = 1, \ldots, t$.

The goal is to compute the Jacobians $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}$, for $k = 1, \ldots, s$.
class: middle
A numerical algorithm is a succession of instructions of the form
$$\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_1, \ldots, \mathbf{x}_{k-1})$$
for $k = s+1, \ldots, t$, where $\mathbf{f}_k$ is a primitive operation that effectively depends on only a subset of the previous variables.
class: middle
This computation can be represented by a directed acyclic graph where
- the nodes are the variables $\mathbf{x}_k$,
- an edge connects $\mathbf{x}_i$ to $\mathbf{x}_k$ if $\mathbf{x}_i$ is an argument of $\mathbf{f}_k$.
The evaluation of $f(\mathbf{x}_1, \ldots, \mathbf{x}_s)$ corresponds to a forward traversal of this graph, in topological order.
The forward mode of automatic differentiation consists in computing the Jacobians $\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1}$ of all variables with respect to an input (here $\mathbf{x}_1$), by traversing the graph in topological order.
Set the Jacobians of the input nodes with $$ \begin{aligned} \frac{\partial \mathbf{x}_1}{\partial \mathbf{x}_1} &= 1_{n_1 \times n_1} \\ \frac{\partial \mathbf{x}_2}{\partial \mathbf{x}_1} &= 0_{n_2 \times n_1} \\ \ldots \\ \frac{\partial \mathbf{x}_s}{\partial \mathbf{x}_1} &= 0_{n_s \times n_1} \end{aligned} $$
class: middle
.grid[
.kol-1-2[
For all $k = s+1, \ldots, t$ (in topological order), compute
$$\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1} = \sum_{l \in \text{parents}(k)} \left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right] \frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1},$$
where
- $\left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right]$ denotes the on-the-fly computation of the Jacobian locally associated to the primitive $\mathbf{f}_k$,
- $\frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1}$ is obtained from the previous iterations (in topological order).
]
.kol-1-2[
.width-100[]] ]
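As a minimal sketch (assumed example), each step of this recursion corresponds to a Jacobian-vector product, which JAX exposes as `jax.jvp`:

```python
# A minimal sketch (assumed example): forward mode pushes a tangent vector
# through the program alongside the primal values.
import jax.numpy as jnp
from jax import jvp

def f(x):
    return jnp.tanh(jnp.dot(jnp.ones((2, 3)), x))

x = jnp.array([1.0, 2.0, 3.0])   # primal input
v = jnp.array([1.0, 0.0, 0.0])   # tangent (direction in input space)
y, Jv = jvp(f, (x,), (v,))       # f(x) and (df/dx)(x) v, in one pass
```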
class: middle, center
(whiteboard example)
class: middle
.alert[Forward mode automatic differentiation needs to be repeated for each input variable $\mathbf{x}\_k$ ($k = 1, \ldots, s$), so its cost grows with the number of inputs.]
.success[However, the cost in terms of memory is limited since temporary variables can be freed as soon as their child nodes have all been computed.]
Instead of evaluating the Jacobians $\frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_1}$ with respect to a given input, the backward mode of automatic differentiation computes the Jacobians $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k}$ of the output with respect to all variables, by traversing the graph in reverse topological order.
Set the Jacobian of the output node to
$$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_t} = 1_{n_t \times n_t}.$$
class: middle
.grid[
.kol-1-2[
For all $k = t-1, \ldots, 1$ (in reverse topological order), compute
$$\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_k} = \sum_{m \in \text{children}(k)} \frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m} \left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right],$$
where
- $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m}$ is obtained from previous iterations (in reverse topological order) and is known as the adjoint,
- $\left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$ denotes the on-the-fly computation of the Jacobian locally associated to the primitive $\mathbf{f}_m$.
]
.kol-1-2[
.center.width-100[]] ]
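As a minimal sketch (assumed example), each step of this recursion corresponds to a vector-Jacobian product, which JAX exposes as `jax.vjp`:

```python
# A minimal sketch (assumed example): reverse mode first runs the program
# forward (keeping residuals), then pulls a cotangent vector backward.
import jax.numpy as jnp
from jax import vjp

def f(x):
    return jnp.tanh(jnp.dot(jnp.ones((2, 3)), x))

x = jnp.array([1.0, 2.0, 3.0])
y, f_vjp = vjp(f, x)                    # forward pass, residuals kept in memory
(uJ,) = f_vjp(jnp.array([1.0, 0.0]))    # u^T (df/dx)(x) for the cotangent u
```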
class: middle, center
(whiteboard example)
class: middle
.success[The advantage of backward mode automatic differentiation is that a single traversal of the graph is enough to compute the Jacobians $\frac{\partial \mathbf{x}\_t}{\partial \mathbf{x}\_k}$ with respect to all inputs $\mathbf{x}\_k$, $k = 1, \ldots, s$.]
.alert[However, the cost in terms of memory is significant since all the temporary variables computed during the forward pass must be kept in memory.]
class: middle
class: middle
class: middle
Most automatic differentiation frameworks are defined by a collection of composable primitive operations.
class: middle
Primitive functions are composed together into a graph that describes the computation. The computational graph is either built
- ahead of time, from the abstract syntax tree of the program or using a dedicated API (e.g., TensorFlow 1), or
- just in time, by tracing the program execution (e.g., TensorFlow Eager, JAX, PyTorch).
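As a minimal sketch (assumed example), JAX follows the just-in-time approach: the Python function is traced, and the resulting graph of primitives can be inspected with `jax.make_jaxpr`:

```python
# A minimal sketch (assumed example): tracing a program into a graph of
# primitive operations.
import jax.numpy as jnp
from jax import make_jaxpr

def f(x):
    return jnp.tanh(x) ** 2 + 1.0

print(make_jaxpr(f)(1.0))   # prints the primitive-by-primitive trace of f
```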
class: middle
In the backward recursive update, when $n_t = 1$ (e.g., for a scalar loss), the adjoint $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m}$ is a row vector and the product $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_m} \left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$ is a vector-Jacobian product.
- Therefore, each primitive only needs to define its vector-Jacobian product (VJP), as sketched below. The Jacobian $\left[ \frac{\partial \mathbf{x}_m}{\partial \mathbf{x}_k} \right]$ is never explicitly built: it is usually simpler, faster, and more memory efficient to compute the VJP directly.
- Most reverse mode AD systems compose VJPs backward to compute $\frac{\partial \mathbf{x}_t}{\partial \mathbf{x}_1}$.
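As a minimal sketch (assumed example; `my_tanh` and its helper names are hypothetical), JAX lets a user-defined primitive specify its VJP with `jax.custom_vjp`, without ever materializing the Jacobian:

```python
# A minimal sketch (assumed example): registering a VJP for a primitive.
import jax.numpy as jnp
from jax import custom_vjp, grad

@custom_vjp
def my_tanh(x):
    return jnp.tanh(x)

def my_tanh_fwd(x):
    y = jnp.tanh(x)
    return y, y                       # output and residuals for the backward pass

def my_tanh_bwd(y, u):
    return (u * (1.0 - y ** 2),)      # u^T J, using tanh'(x) = 1 - tanh(x)^2

my_tanh.defvjp(my_tanh_fwd, my_tanh_bwd)

print(grad(lambda x: my_tanh(x).sum())(jnp.ones(3)))
```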
class: middle
Similarly, when $n_1 = 1$, the product $\left[ \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}_l} \right] \frac{\partial \mathbf{x}_l}{\partial \mathbf{x}_1}$ in the forward recursive update is a Jacobian-vector product (JVP). Accordingly, forward mode AD systems only require each primitive to define its JVP.
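As a minimal sketch (assumed example; the names are hypothetical), the forward-mode counterpart registers a JVP with `jax.custom_jvp`:

```python
# A minimal sketch (assumed example): registering a JVP for a primitive.
import jax.numpy as jnp
from jax import custom_jvp, jvp

@custom_jvp
def my_tanh(x):
    return jnp.tanh(x)

@my_tanh.defjvp
def my_tanh_jvp(primals, tangents):
    (x,), (v,) = primals, tangents
    y = jnp.tanh(x)
    return y, (1.0 - y ** 2) * v      # J v, using tanh'(x) = 1 - tanh(x)^2

y, Jv = jvp(my_tanh, (jnp.ones(3),), (jnp.ones(3),))
```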
class: middle
```python
def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))  # what sorcery is this?!
...
```
.alert[The backward pass is itself a composition of primitives. Its execution can be traced, and reverse mode AD can run on its computational graph!]
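As a minimal sketch (an assumed demo, not necessarily the one shown in class), the higher-order derivatives obtained this way can be checked against the analytical formulas:

```python
# A minimal sketch (assumed example): grad of grad, checked analytically.
import jax.numpy as jnp
from jax import grad

def tanh(x):
    y = jnp.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

fp = grad(tanh)
fpp = grad(grad(tanh))

x = 0.5
print(fp(x), 1.0 - jnp.tanh(x) ** 2)                            # f' = 1 - tanh^2
print(fpp(x), -2.0 * jnp.tanh(x) * (1.0 - jnp.tanh(x) ** 2))    # f'' = -2 tanh (1 - tanh^2)
```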
class: middle, center
(demo)
class: middle
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/sq2gPzlrM0g?start=1240" frameborder="0" allowfullscreen></iframe>

You should be using automatic differentiation (Ryan Adams, 2016)
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/YuVdk1b0TVw" frameborder="0" allowfullscreen></iframe>

Differentiable simulation for system identification and visuomotor control (Murthy Jatavallabhula et al., 2021)
class: middle
.center[
Optimizing a wing (Sam Greydanus, 2020)
[Run in browser] ]
class: middle, center
... and plenty of other applications! (See this thread)
- Automatic differentiation is one of the keys that enabled the deep learning revolution.
- Backward mode automatic differentiation is more efficient when the function has more inputs than outputs.
- Applications of AD go beyond deep learning.
class: end-slide, center
count: false
The end.
count: false
Slides from this lecture have been largely adapted from:
- Mathieu Blondel, Automatic differentiation, 2020.
- Gabriel Peyré, Course notes on Optimization for Machine Learning, 2020.