class: middle, center, title-slide
Lecture 7: Auto-encoders and generative models
Prof. Gilles Louppe
[email protected]
???
R: VAE: R: reverse KL https://ermongroup.github.io/cs228-notes/inference/variational/ R: http://paulrubenstein.co.uk/variational-autoencoders-are-not-autoencoders/
Learn a model of the data.
- Auto-encoders
- Generative models
- Variational inference
- Variational auto-encoders
class: middle
class: middle
Many applications such as image synthesis, denoising, super-resolution, speech synthesis or compression require going beyond classification and regression to explicitly model a high-dimensional signal.
This modeling consists of finding .italic["meaningful degrees of freedom"], or .italic["factors of variation"], that describe the signal and are of lower dimension.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle count: false
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
An auto-encoder is a composite function made of
- an encoder $f$ from the original space $\mathcal{X}$ to a latent space $\mathcal{Z}$,
- a decoder $g$ to map back to $\mathcal{X}$,

such that $g \circ f$ is close to the identity on the data.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
Let $p(\mathbf{x})$ be the data distribution over $\mathcal{X}$. Given two parameterized mappings $f(\cdot; \theta_f)$ and $g(\cdot; \theta_g)$, training consists of minimizing an empirical estimate of the reconstruction error,
$$\theta_f, \theta_g = \arg \min_{\theta_f, \theta_g} \frac{1}{N} \sum_{i=1}^N || \mathbf{x}_i - g(f(\mathbf{x}_i; \theta_f); \theta_g) ||^2.$$
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
For example, when the auto-encoder is linear,
$$
\begin{aligned}
f: \mathbf{z} &= \mathbf{U}^T \mathbf{x} \\
g: \hat{\mathbf{x}} &= \mathbf{U} \mathbf{z},
\end{aligned}
$$
with $\mathbf{U} \in \mathbb{R}^{p \times d}$.
In this case, an optimal solution is given by PCA.
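This equivalence can be checked numerically; below is a sketch in plain NumPy (toy data and dimensions are arbitrary) that builds the encoder-decoder pair from the top-$d$ principal directions and compares its reconstruction error to that of a random rank-$d$ projection.

```python
import numpy as np

# Toy data: n points in dimension p, with an approximately d-dimensional structure.
rng = np.random.default_rng(0)
n, p, d = 1000, 20, 3
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, p)) + 0.05 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)            # center the data

# PCA: the top-d right singular vectors of X span the optimal linear subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:d].T                      # p x d, orthonormal columns

Z = X @ U                         # encoder  f: z = U^T x
X_hat = Z @ U.T                   # decoder  g: x_hat = U z

# Reconstruction error of the PCA solution vs. a random rank-d projection.
R, _ = np.linalg.qr(rng.normal(size=(p, d)))
err_pca = np.mean(np.sum((X - X_hat) ** 2, axis=1))
err_rnd = np.mean(np.sum((X - (X @ R) @ R.T) ** 2, axis=1))
print(err_pca, err_rnd)           # the PCA subspace gives the lower error
```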
class: middle
Better results can be achieved with more sophisticated classes of mappings than linear projections, in particular by designing $f$ and $g$ as deep neural networks.
For instance,
- by combining a multi-layer perceptron encoder $f : \mathbb{R}^p \to \mathbb{R}^d$ with a multi-layer perceptron decoder $g: \mathbb{R}^d \to \mathbb{R}^p$ (see the code sketch below),
- by combining a convolutional network encoder $f : \mathbb{R}^{w\times h \times c} \to \mathbb{R}^d$ with a decoder $g : \mathbb{R}^d \to \mathbb{R}^{w\times h \times c}$ composed of the reciprocal transposed convolutional layers.
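As a concrete illustration, a minimal PyTorch sketch of such an MLP encoder-decoder pair trained with the reconstruction loss could look as follows (layer sizes and dimensions are arbitrary):

```python
import torch
from torch import nn

# A minimal MLP auto-encoder (sizes are arbitrary, for illustration only).
p, d = 784, 16   # input dimension, latent dimension

encoder = nn.Sequential(          # f : R^p -> R^d
    nn.Linear(p, 256), nn.ReLU(),
    nn.Linear(256, d),
)
decoder = nn.Sequential(          # g : R^d -> R^p
    nn.Linear(d, 256), nn.ReLU(),
    nn.Linear(256, p),
)

x = torch.randn(32, p)                            # a batch of (fake) inputs
loss = ((x - decoder(encoder(x))) ** 2).mean()    # reconstruction error
loss.backward()                                   # gradients for both theta_f and theta_g
```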
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
To get an intuition of the learned latent representation, we can pick two samples $\mathbf{x}$ and $\mathbf{x}'$ at random and interpolate samples along the line in the latent space.
.center.width-80[![](figures/lec7/interpolation.png)]
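In code, this interpolation amounts to decoding convex combinations of the two latent codes; a sketch, reusing the hypothetical trained encoder-decoder pair from above, with `x` and `x_prime` of shape `(1, p)`:

```python
import torch

# Latent-space interpolation between two inputs x and x_prime of shape (1, p)
# (a sketch; `encoder` and `decoder` are a trained pair as sketched above).
def interpolate(encoder, decoder, x, x_prime, steps=8):
    with torch.no_grad():
        z, z_prime = encoder(x), encoder(x_prime)
        alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
        z_interp = (1 - alphas) * z + alphas * z_prime   # points on the segment
        return decoder(z_interp)                         # decoded interpolants
```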
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
Besides dimension reduction, auto-encoders can capture dependencies between signal components to restore degraded or noisy signals.
In this case, the composition $g \circ f$ is fed with corrupted inputs $\tilde{\mathbf{x}}$ instead of the clean signals $\mathbf{x}$.
The goal is to optimize $f$ and $g$ such that the reconstruction $g(f(\tilde{\mathbf{x}}))$ restores the original signal $\mathbf{x}$.
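A sketch of one training step, using additive Gaussian corruption as an arbitrary choice of perturbation (`encoder`, `decoder` and `optimizer` are assumed to be defined as in the earlier sketch):

```python
import torch

# One training step of a denoising auto-encoder (a sketch; `encoder`,
# `decoder` and `optimizer` are assumed to exist, e.g. as defined above).
def denoising_step(encoder, decoder, optimizer, x, noise_std=0.3):
    x_tilde = x + noise_std * torch.randn_like(x)    # corrupt the input
    x_hat = decoder(encoder(x_tilde))                # reconstruct from x_tilde
    loss = ((x - x_hat) ** 2).mean()                 # compare to the *clean* x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```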
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
A fundamental weakness of denoising auto-encoders is that the posterior $p(\mathbf{x}|\tilde{\mathbf{x}})$ may be multi-modal.
If we train an auto-encoder with the quadratic loss, then the best reconstruction is
$$\hat{\mathbf{x}} = \mathbb{E}\left[\mathbf{x}|\tilde{\mathbf{x}}\right],$$
which may be very unlikely under $p(\mathbf{x}|\tilde{\mathbf{x}})$.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.footnote[Credits: slides adapted from .italic["Tutorial on Deep Generative Models"], Shakir Mohamed and Danilo Rezende, UAI 2017.]
class: middle
A generative model is a probabilistic model $p$ that can be used as a simulator of the data, i.e. from which new synthetic observations $\mathbf{x} \sim p(\mathbf{x})$ can be sampled.
.center[ .width-100[] ] .caption[Generative models have a role in many important problems]
???
Go beyond estimating $p(y|\mathbf{x})$:
- Understand and imagine how the world evolves.
- Recognize objects in the world and their factors of variation.
- Establish concepts for reasoning and decision making.
class: middle
Generating images and video content.
(Gregor et al, 2015; Oord et al, 2016; Dumoulin et al, 2016) ]
class: middle
Generating audio conditioned on text.
(Oord et al, 2016) ]
class: middle
Hierarchical compression of images and other data.
(Gregor et al, 2016) ]
class: middle
Photo-realistic single image super-resolution.
(Ledig et al, 2016) ]
class: middle
Understanding the factors of variation and invariances.
(Higgins et al, 2017) ]
class: middle
Simulate future trajectories of environments based on actions for planning.
.center[ .width-40[] .width-40[]
(Finn et al, 2016) ]
class: middle
Rapid generalization of novel concepts.
(Gregor et al, 2016) ]
class: middle
Generative models for proposing candidate molecules and for improving prediction through semi-supervised learning.
(Gomez-Bombarelli et al, 2016) ]
class: middle
Generative models for applications in astronomy and high-energy physics.
(Regier et al, 2015) ]
The generative capability of the decoder $g$ can be assessed by introducing a (simple) density model $q$ over the latent space $\mathcal{Z}$, sampling there, and mapping the samples into the data space $\mathcal{X}$ with $g$.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
For instance, a factored Gaussian model with diagonal covariance matrix,
$$q(\mathbf{z}) = \mathcal{N}(\hat{\mu}, \hat{\Sigma}),$$
where both $\hat{\mu}$ and $\hat{\Sigma}$ are estimated on training data.
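In code, this amounts to estimating the empirical mean and (diagonal) standard deviation of the latent codes and decoding samples drawn from the resulting Gaussian; a sketch, assuming a trained encoder-decoder pair as before:

```python
import torch

# Sampling from the decoder through a simple latent density model (a sketch;
# `encoder` and `decoder` are assumed to be a trained auto-encoder).
def sample_from_decoder(encoder, decoder, x_train, n_samples=16):
    with torch.no_grad():
        z = encoder(x_train)                      # latent codes of the training data
        mu, sigma = z.mean(dim=0), z.std(dim=0)   # factored Gaussian fit
        z_new = mu + sigma * torch.randn(n_samples, z.shape[1])
        return decoder(z_new)                     # map samples back to data space
```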
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
These results are not satisfactory because the density model on the latent space is too simple and inadequate.
Building a good model in latent space amounts to our original problem of modeling an empirical distribution, although it may now be in a lower-dimensional space.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
class: middle
Consider for now a prescribed latent variable model that relates a set of observable variables $\mathbf{x} \in \mathcal{X}$ to a set of unobserved latent variables $\mathbf{z} \in \mathcal{Z}$.
class: middle
The probabilistic model is given and motivated by domain knowledge assumptions.
Examples include:
- Linear discriminant analysis
- Bayesian networks
- Hidden Markov models
- Probabilistic programs
class: middle
The probabilistic model defines a joint probability distribution $p(\mathbf{x}, \mathbf{z})$, which decomposes as
$$p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z}) p(\mathbf{z}).$$
For a given model $p(\mathbf{x}, \mathbf{z})$, inference consists in computing the posterior
$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{p(\mathbf{x})}.$$
For most interesting cases, this is usually intractable since it requires evaluating the evidence
$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}.$$
Variational inference turns posterior inference into an optimization problem.
- Consider a family of distributions $q(\mathbf{z}|\mathbf{x}; \nu)$ that approximate the posterior $p(\mathbf{z}|\mathbf{x})$, where the variational parameters $\nu$ index the family of distributions.
- The parameters $\nu$ are fit to minimize the KL divergence between $p(\mathbf{z}|\mathbf{x})$ and the approximation $q(\mathbf{z}|\mathbf{x};\nu)$.
class: middle
Formally, we want to minimize
$$\begin{aligned}
KL(q(\mathbf{z}|\mathbf{x};\nu) || p(\mathbf{z}|\mathbf{x})) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[\log \frac{q(\mathbf{z}|\mathbf{x} ; \nu)}{p(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log q(\mathbf{z}|\mathbf{x};\nu) - \log p(\mathbf{x},\mathbf{z}) \right] + \log p(\mathbf{x}).
\end{aligned}$$
For the same reason as before, the KL divergence cannot be directly minimized because of the $\log p(\mathbf{x})$ term, whose evaluation is intractable.
class: middle
However, we can write
$$
KL(q(\mathbf{z}|\mathbf{x};\nu) || p(\mathbf{z}|\mathbf{x})) = \log p(\mathbf{x}) - \underbrace{\mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x};\nu) \right]}_{\text{ELBO}(\mathbf{x};\nu)}
$$
where $\text{ELBO}(\mathbf{x};\nu)$ is the evidence lower bound.
- Since $\log p(\mathbf{x})$ does not depend on $\nu$, it can be considered as a constant, and minimizing the KL divergence is equivalent to maximizing the evidence lower bound, while being computationally tractable.
- Given a dataset $\mathbf{d} = \{\mathbf{x}_i|i=1, ..., N\}$, the final objective is the sum $\sum_{\{\mathbf{x}_i \in \mathbf{d}\}} \text{ELBO}(\mathbf{x}_i;\nu)$.
class: middle
Remark that $$\begin{aligned} \text{ELBO}(\mathbf{x};\nu) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x};\nu) \right] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log p(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x};\nu) \right] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log p(\mathbf{x}|\mathbf{z})\right] - KL(q(\mathbf{z}|\mathbf{x};\nu) || p(\mathbf{z})) \end{aligned}$$ Therefore, maximizing the ELBO:
- encourages distributions to place their mass on configurations of latent variables that explain the observed data (first term);
- encourages distributions close to the prior (second term).
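Both terms can be estimated by Monte Carlo; a sketch for a diagonal Gaussian $q$ and a standard Gaussian prior, where `log_p_x_given_z` is a hypothetical function evaluating $\log p(\mathbf{x}|\mathbf{z})$ for a batch of latent samples:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Monte Carlo estimate of the two ELBO terms for a Gaussian posterior
# approximation and a standard Gaussian prior (a sketch; `log_p_x_given_z`
# is a hypothetical function evaluating log p(x|z)).
def elbo(log_p_x_given_z, x, mu, sigma, n_samples=16):
    q = Normal(mu, sigma)                                    # q(z|x; nu)
    prior = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    z = q.rsample((n_samples,))                              # z ~ q(z|x; nu)
    expected_log_lik = log_p_x_given_z(x, z).mean()          # first term
    kl = kl_divergence(q, prior).sum()                       # second term
    return expected_log_lik - kl
```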
class: middle
We want $$\begin{aligned} \nu^{*} &= \arg \max_\nu \text{ELBO}(\mathbf{x};\nu) \\ &= \arg \max_\nu \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\nu)}\left[ \log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z}|\mathbf{x};\nu) \right]. \end{aligned}$$
We can proceed by gradient ascent, provided we can evaluate $\nabla_\nu \text{ELBO}(\mathbf{x};\nu)$.
In general, this gradient is difficult to compute because the expectation is taken with respect to the distribution $q(\mathbf{z}|\mathbf{x};\nu)$, which itself depends on the parameters $\nu$ we differentiate with respect to.
class: middle
class: middle
So far we assumed a prescribed probabilistic model motivated by domain knowledge. We will now directly learn a stochastic generating process with a neural network.
A variational auto-encoder is a deep latent variable model where:
- The likelihood $p(\mathbf{x}|\mathbf{z};\theta)$ is parameterized with a generative network $\text{NN}_\theta$ (or decoder) that takes as input $\mathbf{z}$ and outputs parameters $\phi = \text{NN}_\theta(\mathbf{z})$ to the data distribution. E.g.,
$$\begin{aligned} \mu, \sigma &= \text{NN}_\theta(\mathbf{z}) \\ p(\mathbf{x}|\mathbf{z};\theta) &= \mathcal{N}(\mathbf{x}; \mu, \sigma^2\mathbf{I}) \end{aligned}$$
- The approximate posterior $q(\mathbf{z}|\mathbf{x};\varphi)$ is parameterized with an inference network $\text{NN}_\varphi$ (or encoder) that takes as input $\mathbf{x}$ and outputs parameters $\nu = \text{NN}_\varphi(\mathbf{x})$ to the approximate posterior. E.g.,
$$\begin{aligned} \mu, \sigma &= \text{NN}_\varphi(\mathbf{x}) \\ q(\mathbf{z}|\mathbf{x};\varphi) &= \mathcal{N}(\mathbf{z}; \mu, \sigma^2\mathbf{I}) \end{aligned}$$
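Both networks can share the same simple architecture, each mapping its input to the mean and log-variance of a diagonal Gaussian; a minimal PyTorch sketch (sizes are arbitrary):

```python
import torch
from torch import nn

# Generative (decoder) and inference (encoder) networks of a VAE
# (a sketch; dimensions and hidden sizes are arbitrary).
x_dim, z_dim, h_dim = 784, 16, 256

class GaussianMLP(nn.Module):
    """Maps an input to the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, out_dim)
        self.log_var = nn.Linear(h_dim, out_dim)

    def forward(self, v):
        h = self.hidden(v)
        return self.mu(h), self.log_var(h)

decoder = GaussianMLP(z_dim, x_dim)   # NN_theta:  z -> parameters of p(x|z)
encoder = GaussianMLP(x_dim, z_dim)   # NN_varphi: x -> parameters of q(z|x)
```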
class: middle
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
As before, we can use variational inference, but this time to jointly optimize the generative and inference network parameters $\theta$ and $\varphi$.
We want $$\begin{aligned} \theta^{*}, \varphi^{*} &= \arg \max_{\theta,\varphi} \text{ELBO}(\mathbf{x};\theta,\varphi) \\ &= \arg \max_{\theta,\varphi} \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \log p(\mathbf{x},\mathbf{z};\theta) - \log q(\mathbf{z}|\mathbf{x};\varphi)\right] \\ &= \arg \max_{\theta,\varphi} \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \log p(\mathbf{x}|\mathbf{z};\theta)\right] - KL(q(\mathbf{z}|\mathbf{x};\varphi) || p(\mathbf{z})). \end{aligned}$$
- Given some generative network $\theta$, we want to put the mass of the latent variables, by adjusting $\varphi$, such that they explain the observed data, while remaining close to the prior.
- Given some inference network $\varphi$, we want to put the mass of the observed variables, by adjusting $\theta$, such that they are well explained by the latent variables.
class: middle
Unbiased gradients of the ELBO with respect to the generative model parameters $\theta$ are simple to obtain, since
$$\nabla_\theta \text{ELBO}(\mathbf{x};\theta,\varphi) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \nabla_\theta \log p(\mathbf{x},\mathbf{z};\theta) \right],$$
which can be estimated with Monte Carlo integration.

However, gradients with respect to the inference model parameters $\varphi$ are more difficult to obtain, since the expectation is taken with respect to a distribution that itself depends on $\varphi$.
class: middle
Let us abbreviate $$\begin{aligned} \text{ELBO}(\mathbf{x};\theta,\varphi) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \log p(\mathbf{x},\mathbf{z};\theta) - \log q(\mathbf{z}|\mathbf{x};\varphi)\right] \\ &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ f(\mathbf{x}, \mathbf{z}; \varphi) \right]. \end{aligned}$$
We have
$$\nabla_\varphi \text{ELBO}(\mathbf{x};\theta,\varphi) = \nabla_\varphi \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ f(\mathbf{x}, \mathbf{z}; \varphi) \right] \neq \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \nabla_\varphi f(\mathbf{x}, \mathbf{z}; \varphi) \right].$$
.grid[ .kol-1-5[] .kol-4-5[.center.width-90[]] ]
We cannot backpropagate through the stochastic node $\mathbf{z}$ to compute $\nabla_\varphi f$.

The reparameterization trick consists in re-expressing the variable $\mathbf{z} \sim q(\mathbf{z}|\mathbf{x};\varphi)$ as some differentiable and invertible transformation of another random variable $\epsilon$, given $\mathbf{x}$ and $\varphi$,
$$\mathbf{z} = g(\varphi, \mathbf{x}, \epsilon),$$
such that the distribution of $\epsilon$ is independent of $\mathbf{x}$ and $\varphi$.
class: middle
.grid[ .kol-1-5[] .kol-4-5[.center.width-90[]] ]
For example, if $q(\mathbf{z}|\mathbf{x};\varphi) = \mathcal{N}(\mathbf{z};\mu(\mathbf{x};\varphi), \sigma^2(\mathbf{x};\varphi)\mathbf{I})$, where $\mu(\mathbf{x};\varphi)$ and $\sigma^2(\mathbf{x};\varphi)$ are the outputs of the inference network, then a common reparameterization is
$$p(\epsilon) = \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}), \quad \mathbf{z} = \mu(\mathbf{x};\varphi) + \sigma(\mathbf{x};\varphi) \odot \epsilon.$$
class: middle
Given such a change of variable, the ELBO can be rewritten as: $$\begin{aligned} \text{ELBO}(\mathbf{x};\theta,\varphi) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ f(\mathbf{x}, \mathbf{z}; \varphi) \right]\\ &= \mathbb{E}_{p(\epsilon)} \left[ f(\mathbf{x}, g(\varphi,\mathbf{x},\epsilon); \varphi) \right] \end{aligned}$$ Therefore, $$\begin{aligned} \nabla_\varphi \text{ELBO}(\mathbf{x};\theta,\varphi) &= \nabla_\varphi \mathbb{E}_{p(\epsilon)} \left[ f(\mathbf{x}, g(\varphi,\mathbf{x},\epsilon); \varphi) \right] \\ &= \mathbb{E}_{p(\epsilon)} \left[ \nabla_\varphi f(\mathbf{x}, g(\varphi,\mathbf{x},\epsilon); \varphi) \right], \end{aligned}$$ which we can now estimate with Monte Carlo integration.
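With automatic differentiation, the trick is a single line: sample $\epsilon$ from a fixed distribution and transform it deterministically, so that gradients flow from $\mathbf{z}$ back into $\mu$ and $\log\sigma^2$. A sketch, reusing the hypothetical Gaussian encoder from before:

```python
import torch

# Reparameterization for a diagonal Gaussian posterior: z is a deterministic,
# differentiable function of (varphi, x, epsilon), so gradients reach mu and
# log_var (a sketch; `encoder` is the hypothetical Gaussian encoder above).
def reparameterize(encoder, x):
    mu, log_var = encoder(x)                     # nu = NN_varphi(x)
    epsilon = torch.randn_like(mu)               # epsilon ~ N(0, I), independent of varphi
    z = mu + torch.exp(0.5 * log_var) * epsilon  # z = g(varphi, x, epsilon)
    return z, mu, log_var
```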
The last required ingredient is the evaluation of the likelihood $q(\mathbf{z}|\mathbf{x};\varphi)$ given the change of variable $g$. As long as $g$ is invertible, we have
$$\log q(\mathbf{z}|\mathbf{x};\varphi) = \log p(\epsilon) - \log \left| \det \frac{\partial \mathbf{z}}{\partial \epsilon} \right|.$$
Consider the following setup:
- Generative model: $$\begin{aligned} \mathbf{z} &\in \mathbb{R}^d \\ p(\mathbf{z}) &= \mathcal{N}(\mathbf{z}; \mathbf{0},\mathbf{I})\\ p(\mathbf{x}|\mathbf{z};\theta) &= \mathcal{N}(\mathbf{x};\mu(\mathbf{z};\theta), \sigma^2(\mathbf{z};\theta)\mathbf{I}) \\ \mu(\mathbf{z};\theta) &= \mathbf{W}_2^T\mathbf{h} + \mathbf{b}_2 \\ \log \sigma^2(\mathbf{z};\theta) &= \mathbf{W}_3^T\mathbf{h} + \mathbf{b}_3 \\ \mathbf{h} &= \text{ReLU}(\mathbf{W}_1^T \mathbf{z} + \mathbf{b}_1)\\ \theta &= \{ \mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2, \mathbf{W}_3, \mathbf{b}_3 \} \end{aligned}$$
class: middle
- Inference model: $$\begin{aligned} q(\mathbf{z}|\mathbf{x};\varphi) &= \mathcal{N}(\mathbf{z};\mu(\mathbf{x};\varphi), \sigma^2(\mathbf{x};\varphi)\mathbf{I}) \\ p(\epsilon) &= \mathcal{N}(\epsilon; \mathbf{0}, \mathbf{I}) \\ \mathbf{z} &= \mu(\mathbf{x};\varphi) + \sigma(\mathbf{x};\varphi) \odot \epsilon \\ \mu(\mathbf{x};\varphi) &= \mathbf{W}_5^T\mathbf{h} + \mathbf{b}_5 \\ \log \sigma^2(\mathbf{x};\varphi) &= \mathbf{W}_6^T\mathbf{h} + \mathbf{b}_6 \\ \mathbf{h} &= \text{ReLU}(\mathbf{W}_4^T \mathbf{x} + \mathbf{b}_4)\\ \varphi &= \{ \mathbf{W}_4, \mathbf{b}_4, \mathbf{W}_5, \mathbf{b}_5, \mathbf{W}_6, \mathbf{b}_6 \} \end{aligned}$$
Note that there is no restriction on the generative and inference network architectures. They could as well be arbitrarily complex convolutional networks.
class: middle
Plugging everything together, the objective can be expressed as:
$$\begin{aligned}
\text{ELBO}(\mathbf{x};\theta,\varphi) &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)}\left[ \log p(\mathbf{x},\mathbf{z};\theta) - \log q(\mathbf{z}|\mathbf{x};\varphi)\right] \\
&= \mathbb{E}_{q(\mathbf{z}|\mathbf{x};\varphi)} \left[ \log p(\mathbf{x}|\mathbf{z};\theta) \right] - KL(q(\mathbf{z}|\mathbf{x};\varphi) || p(\mathbf{z})) \\
&= \mathbb{E}_{p(\epsilon)} \left[ \log p(\mathbf{x}|\mathbf{z}=g(\varphi,\mathbf{x},\epsilon);\theta) \right] - KL(q(\mathbf{z}|\mathbf{x};\varphi) || p(\mathbf{z}))
\end{aligned}
$$
where the KL divergence can be expressed analytically as
$$KL(q(\mathbf{z}|\mathbf{x};\varphi) || p(\mathbf{z})) = -\frac{1}{2} \sum_{j=1}^d \left( 1 + \log(\sigma_j^2(\mathbf{x};\varphi)) - \mu_j^2(\mathbf{x};\varphi) - \sigma_j^2(\mathbf{x};\varphi) \right),$$
which can be evaluated and differentiated without approximation.
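Putting the pieces together for this concrete model, one (single-sample) evaluation of the ELBO can be written as follows, reusing the hypothetical `encoder`, `decoder` and `reparameterize` from the earlier sketches:

```python
import math
import torch

# Single-sample estimate of the ELBO for the Gaussian model above
# (a sketch; `encoder`, `decoder` and `reparameterize` as sketched earlier).
def elbo_step(encoder, decoder, x):
    z, mu, log_var = reparameterize(encoder, x)        # z = g(varphi, x, epsilon)
    mu_x, log_var_x = decoder(z)                        # parameters of p(x|z; theta)
    # log N(x; mu_x, sigma_x^2 I), summed over data dimensions
    log_px = -0.5 * (math.log(2 * math.pi) + log_var_x
                     + (x - mu_x) ** 2 / log_var_x.exp()).sum(dim=-1)
    # KL(q(z|x; varphi) || N(0, I)), in closed form
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=-1)
    return (log_px - kl).mean()                         # average over the batch
```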
class: middle
Consider as data $\mathbf{d}$ the MNIST digit dataset:
class: middle, center
(Kingma and Welling, 2013)
class: middle, center
(Kingma and Welling, 2013)
class: black-slide
.center[ <iframe width="640" height="400" src="https://www.youtube.com/embed/XNZIN7Jh3Sg?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>
Random walks in latent space. (Alex Radford, 2015)
]
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://int8.io/wp-content/uploads/2016/12/output.mp4" frameborder="0" volume="0" allowfullscreen></iframe>Impersonation by encoding-decoding an unknown face.
(Kamil Czarnogórski, 2016) ]
class: middle
.center[
Voice style transfer [demo]
(van den Oord et al, 2017) ]
class: middle, black-slide
.center[
<iframe width="640" height="400" src="https://www.youtube.com/embed/Wd-1WU8emkw?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>(Inoue et al, 2017)
]
class: middle
.center[Design of new molecules with desired chemical properties.
(Gomez-Bombarelli et al, 2016)]
class: end-slide, center count: false
The end.
count: false
- Mohamed and Rezende, "Tutorial on Deep Generative Models", UAI 2017.
- Blei et al, "Variational inference: Foundations and modern methods", 2016.
- Kingma and Welling, "Auto-Encoding Variational Bayes", 2013.