An exploration of diffusion (aka denoising diffusion probabilistic) models.
Suppose one were to wonder how to de-noise an image that was corrupted some small amount. One way to approach this problem is to specify exactly what the noise looks like so that it may be filtered out from an input. But if the type of noise were not known beforehand, it would not be possible to identify exactly which parts of the corrupted image were noise and which were original. We therefore need some way of learning what the noise looks like relative to the original image, given samples of uncorrupted inputs.
As is often the case for poorly defined problems (in this case finding a general de-noiser), another way to address the problem of noise removal is to have a machine learning program learn how to accomplish noise removal in a probabilistic manner. A trained program would ideally be able to remove each element of noise with high likelihood.
One of the most effective ways of removing noise probabilistically is to use a deep learning model as the denoiser. Doing so allows us to refrain from specifying what the noise should look like, even in very general terms, which allows a greater number of possible noise functions to be approximated by the model. Various loss functions may be used for the denoising autoencoder, one of which is mean squared error (MSE),

$$ L = \lVert x - D_\theta(x + \epsilon) \rVert_2^2 $$

where $x$ is the uncorrupted input, $\epsilon$ is the added noise, and $D_\theta(\cdot)$ is the denoising model.
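As a concrete illustration, a single training step on this objective might look like the following sketch; the autoencoder model, the noise scale sigma, and the batch x are stand-ins for illustration rather than names used elsewhere in this article.

    import torch
    import torch.nn.functional as F

    def denoising_step(autoencoder, x, sigma=0.1):
        # corrupt the input with a small amount of Gaussian noise
        noise = sigma * torch.randn_like(x)
        corrupted = x + noise
        # the autoencoder attempts to recover the uncorrupted input
        reconstruction = autoencoder(corrupted)
        # mean squared error between the reconstruction and the original
        return F.mse_loss(reconstruction, x)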
In order to be able to successfully remove noise from an input, it is necessary for a model to learn something about the distribution of all uncorrupted inputs.
Now suppose that instead of recovering an input that has been corrupted slightly, we want to recover an input that was severely corrupted and is nearly indistinguishable from noise. We could attempt to train our denoising autoencoder on samples with larger and larger amounts of noise, but in most cases doing so does not yield even a rough approximation of the desired input. We are asking the autoencoder to perform a very difficult task: most images that one would want to reconstruct are really nothing like a typical noise sample at all.
The property that natural images are very different from noise is the motivation behind denoising diffusion models, also called denoising diffusion probabilistic models or simply diffusion models for short, originally introduced by Sohl-Dickstein and colleagues.
To make the very difficult task of removing a large amount of noise approachable, we can break it into smaller sub-tasks that are more manageable. This is the same approach taken to the optimization of any deep learning algorithm: finding a minimal value via a direct method is intractable, so instead we use gradient descent and the model takes many small steps towards a minimal point in the loss landscape. Here we will teach a denoising autoencoder to remove both small and large amounts of noise from a corrupted input, so that samples may be generated over many steps in order to reconstruct an input from pure noise.
Diffusion inversion is defined on a forward diffusion process,

$$ q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\big) $$

where $x_t$ is the input after $t$ steps of noise addition and $\beta_t$ is the variance of the Gaussian noise added at step $t$, determined by a fixed variance schedule.
The variance schedule is typically linear, with later work finding that a cosine schedule is sometimes more effective. Rather than compute $x_t$ by applying $t$ successive noising steps, we can sample it directly from the original input $x_0$ via

$$ q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, \mathbf{I}\big) $$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
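A sketch of how these quantities might be pre-computed and used to corrupt a batch in a single step is shown below; the schedule endpoints and variable names are illustrative rather than the exact configuration used later in this article.

    import torch

    timesteps = 1000
    # linear variance schedule beta_1 ... beta_T (commonly used endpoints)
    betas = torch.linspace(1e-4, 0.02, timesteps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

    def forward_diffusion(x_0, t):
        # sample x_t directly from x_0 using the closed-form expression above
        noise = torch.randn_like(x_0)
        a_bar = alphas_cumprod[t].reshape(-1, 1, 1, 1)  # broadcast over image dimensions
        x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1.0 - a_bar) * noise
        return x_t, noise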
Inverting the diffusion process is equivalent to starting with pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and removing a small amount of noise at each step. With the (pure noise) starting point $x_T$, each reverse step produces a slightly less noisy $x_{t-1}$ until the fully de-noised $x_0$ is reached.
Training the model $\epsilon_\theta$ to predict the noise present in a corrupted input then consists of minimizing

$$ L = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \big\lVert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\big) \big\rVert^2 \right] $$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is the noise added to the original input $x_0$ and the time-step $t$ is sampled uniformly from $\{1, \dots, T\}$.
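Putting these pieces together, one training iteration on this objective could be sketched as follows, re-using the forward_diffusion helper and schedule from the sketch above; the model and optimizer are assumed to be defined elsewhere.

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, x_0):
        # sample a random time-step for each image in the batch
        t = torch.randint(0, timesteps, (x_0.shape[0],))
        # corrupt the batch and keep the noise that was added
        x_t, noise = forward_diffusion(x_0, t)
        # the model is trained to predict that noise from the corrupted input
        predicted_noise = model(x_t, t)
        loss = F.mse_loss(predicted_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()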
Here input images are expected to be scaled to have a minimum of -1 and a maximum of 1.
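For images loaded in the usual [0, 1] range, this scaling might be applied with a simple transform such as the following; the exact pipeline is illustrative.

    from torchvision import transforms

    # map pixel values from [0, 1] to [-1, 1]
    transform = transforms.Compose([
        transforms.ToTensor(),                   # PIL image -> tensor in [0, 1]
        transforms.Lambda(lambda x: 2 * x - 1),  # [0, 1] -> [-1, 1]
    ])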
To summarize, training a diffusion inversion model as presented by Ho and colleagues consists of teaching the model to predict the noise present in a corrupted image, with the reasoning that the model will then be able to remove noise from a sample of a Gaussian distribution in order to generate new images. Later work finds that switching the prediction target from the added noise $\epsilon$ to the original image $x_0$ (or a combination of the two) can also be effective.
To generate samples, we want to learn the reverse diffusion process,

$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big) $$

which in practice means starting from pure noise $x_T$ and repeatedly subtracting the noise predicted by $\epsilon_\theta$ (along with adding a small amount of fresh noise) until $x_0$ is obtained.
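A sketch of this sampling loop, following the update rule of Ho and colleagues with $\sigma_t^2 = \beta_t$ and re-using the schedule tensors from the earlier sketch:

    import torch

    @torch.no_grad()
    def sample(model, shape):
        # start from pure Gaussian noise x_T
        x = torch.randn(shape)
        for t in reversed(range(timesteps)):
            t_batch = torch.full((shape[0],), t, dtype=torch.long)
            predicted_noise = model(x, t_batch)
            alpha, alpha_bar = alphas[t], alphas_cumprod[t]
            # posterior mean: remove the predicted noise component
            x = (x - (1 - alpha) / torch.sqrt(1 - alpha_bar) * predicted_noise) / torch.sqrt(alpha)
            # add a small amount of fresh noise at every step except the last
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x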
Let's try to generate images of handwritten digits using diffusion inversion. First we need an autoencoder, and for that we can turn to a miniaturized and fully connected version of the well-known U-net architecture introduced by Ronneberger and colleagues. The U-net architecture will be considered in more detail later; for now we will make do with the fully connected version shown below:
This is a variation of a typical undercomplete autoencoder which compresses the input somewhat (to a width of 2k) using fully connected layers, followed by de-compression in which residual connections are added between layers of equal size, forming the distinctive U-shape. It should be noted that layer normalization may be omitted without any noticeable effects. Layer sizes were determined after some experimentation, where it was found that smaller widths (ie starting with 1k) trained very slowly.
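For concreteness, a minimal sketch of this style of fully connected 'U' with residual connections between equal-size layers is given below; the widths and layer count here are illustrative rather than the exact configuration used for the experiments that follow.

    import torch
    import torch.nn as nn

    class FCUnet(nn.Module):
        def __init__(self, input_dim, width=2048):
            super().__init__()
            self.down1 = nn.Sequential(nn.Linear(input_dim, width), nn.LayerNorm(width), nn.ReLU())
            self.down2 = nn.Sequential(nn.Linear(width, width // 2), nn.LayerNorm(width // 2), nn.ReLU())
            self.bottom = nn.Sequential(nn.Linear(width // 2, width // 2), nn.ReLU())
            self.up1 = nn.Sequential(nn.Linear(width // 2, width), nn.LayerNorm(width), nn.ReLU())
            self.output = nn.Linear(width, input_dim)

        def forward(self, x):
            d1 = self.down1(x)
            d2 = self.down2(d1)
            b = self.bottom(d2) + d2  # residual connection between equal-size layers
            u1 = self.up1(b) + d1     # residual connection between equal-size layers
            return self.output(u1)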
Next we choose to take 1000 steps in the forward diffusion process, meaning that our model will learn to de-noise an input over 1k small steps in the diffusion inversion generative process as well. It is now necessary (for learning in a reasonable amount of time) to provide the time-step information to the model in some way: this allows the model to 'know' what time-step we are at and therefore approximately how much noise exists in the input. Ho and colleagues use positional encoding (inspired by the original transformer positional encodings for sequences) and therefore encode the time-step information as a trigonometric function added to the input (after a one-layer trainable feedforward layer). Although this approach would certainly be expected to be effective for small models as well as large ones, we take the simpler approach of encoding time-step information in a single input element that is appended to the others after flattening. Specifically, this time-step element is simply the current time-step divided by the total number of steps, $t/T$, giving a value between 0 and 1.
An aside: it is interesting that time information is required for diffusion inversion at all, given that one might expect an autoencoder trained to estimate the variance of the input noise relative to some slightly less-noisy input to be capable of estimating the accumulated noise as well, and therefore of estimating the time-step value for itself. Removing the time-step information from the diffusion process of either the original U-net or the small autoencoder above yields poor sample generation, but curiously the models are capable of optimizing the objective function without much trouble. It is currently unclear why time-step information is required for sample generation, ie why a model that minimizes the de-noising objective without time-step information nevertheless fails to generate good samples.
Concatenation of a single element time-step value to the input tensor may be done as follows,
import torch
import torch.nn as nn

class FCEncoder(nn.Module):

    def __init__(self, starting_size, channels):
        super().__init__()
        starting = starting_size
        # one extra input element holds the normalized time-step
        self.input_transform = nn.Linear(32*32*3 + 1, starting)
        ...

    def forward(self, input_tensor, time):
        # normalize the time-step to [0, 1]; `timesteps` and `batch_size` are globals
        time_tensor = torch.tensor(time / timesteps).reshape(batch_size, 1)
        # flatten the image and append the time-step element
        input_tensor = torch.flatten(input_tensor, start_dim=1)
        input_tensor = torch.cat((input_tensor, time_tensor), dim=-1)
        ...
and the fully connected autoencoder combined with this one-element time information approach yields decent results after training on non-normalized MNIST data as shown below.
These results are all the more impressive when we consider that generative models typically struggle with un-normalized MNIST data, as most elements in the input are identically zero. Experimenting with different learning rates and input modifications suggests that diffusion inversion is, happily, rather insensitive to the exact parameter configuration, which is a marked difference from GAN training.
For higher-resolution images, the use of unmodified fully connected architectures is typically infeasible due to the very large number of parameters. Certain restrictions can be placed on the transformations between one layer and the next, and here we use both convolutions and attention in our denoising model to produce realistic images of churches.
We employ a modification of the original Unet architecture in which attention has been added at each layer, adapted from Phil Wang's implementation of the model introduced by Ho and colleagues. Here we use a four-deep Unet model with linear attention applied to residual connections and dot-product attention applied to the middle block. The general architecture is as follows:
MLP activations for this model are Sigmoid Linear Units (SiLU), defined as $\text{SiLU}(x) = x \cdot \sigma(x)$, which is a very similar function to GELU, and group norms are applied to each block and attention block (before the self-attention transformation). Time information is encoded into each block via a trainable MLP applied to fixed cosine-sine positional embeddings of the time-step. It is somewhat curious that this encoding of time into each block is beneficial, as positional encodings are usually applied only to the input.
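A minimal sketch of this kind of time embedding, ie a fixed sinusoidal encoding of the time-step followed by a small trainable MLP, is shown below; the dimension and MLP shape are illustrative.

    import math
    import torch
    import torch.nn as nn

    class TimeEmbedding(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.dim = dim
            # small trainable MLP applied to the fixed sinusoidal embedding
            self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, t):
            # fixed cosine-sine positional embedding of the time-step t
            half = self.dim // 2
            freqs = torch.exp(-math.log(10000) * torch.arange(half) / (half - 1))
            args = t[:, None].float() * freqs[None, :]
            embedding = torch.cat((args.sin(), args.cos()), dim=-1)
            return self.mlp(embedding)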
After a couple hundred epochs of training on LSUN churches at