#ComputerScience #AI

Original diffusion paper: @sohl-dicksteinDeepUnsupervisedLearning.

Diffusion models learn the structure of a distribution in an unsupervised manner. They are presented with many samples from the input space with added noise and learn to denoise them. Given enough training data, they become able to denoise even pure random noise, yielding novel samples. 1

Intuitively, diffusion is simply an approach to training a neural model to denoise samples with different levels of added noise. As the model learns the underlying structure of the sample space, it gains the ability to "denoise" pure random noise and thus generate novel images.

Diffusion

The word diffusion derives from the Latin diffundere, "to spread out". It is the process of any physical quantity (particles, energy) spreading from a region of high concentration to a region of lower concentration. It is not caused by any force, but only by the random movement (random walk) of the particles.

For physical particles suspended in a gas or liquid, this random motion is called Brownian Motion, which can only be modeled stochastically for any larger number of particles. Mathematically, it is described by the Wiener Process \(W_t\).

Algorithm

The diffusion model has two parts: a forward diffusion process that transforms the input step-wise into Gaussian noise, essentially by applying a Wiener Process to the input space, and a reverse diffusion process, a generative process capable of denoising an image or even generating completely new ones from random noise.1

Forward Trajectory

The forward diffusion takes an input sample and iteratively applies small amounts of Gaussian noise, governed by a variance schedule. At the final step, the image has been transformed into practically pure noise.1
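As a minimal sketch (assuming images normalized to \([-1, 1]\) and the linear schedule discussed below), the forward trajectory is just a loop of Gaussian perturbations:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x0, betas):
    """Iteratively sample q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    xs = [x0]
    x = x0
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs  # xs[t] is x_t; xs[-1] is practically pure Gaussian noise

betas = np.linspace(1e-4, 0.02, 1000)        # linear variance schedule
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))   # stand-in for a normalized grayscale image
trajectory = forward_diffusion(x0, betas)
```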

Reverse Trajectory

The reverse diffusion tries to remove the Gaussian noise from an image in a step-by-step manner. At each step, the model must predict which part of the image is noise, which is learned during training. By repeatedly removing noise, the original image can be reconstructed from the noise added in the forward pass.1
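A sketch of a single reverse step under the common DDPM parameterization (see #Mathematical Background), assuming a hypothetical trained noise predictor `eps_model(x_t, t)` and fixing the step variance to \(\beta_t\):

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_step(x_t, t, eps_model, betas, alpha_bars):
    """One step of p(x_{t-1} | x_t): turn the predicted noise into a mean, then sample."""
    beta_t = betas[t]
    eps = eps_model(x_t, t)  # the model's prediction of the noise contained in x_t
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean          # no extra noise on the final step
    return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
```

Starting from \(x_T \sim \mathcal{N}(0, I)\) and looping \(t\) from \(T-1\) down to \(0\) yields a novel sample.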

Noise Schedule

The noise schedule defines exactly how much noise is added at each timestep of the forward process. Initially, linear noise schedules were used, where the noise intensity increases linearly. 1 2

OpenAI researchers later found that linear schedules destroy information too rapidly, leaving almost no information for the last several steps. A cosine noise schedule solves both problems and leads to a more useful transition from information to noise. 3
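Both schedules are cheap to precompute. A sketch, following the cosine formula from the Improved DDPM paper (with offset \(s = 0.008\)):

```python
import numpy as np

T = 1000

# Linear schedule: beta grows linearly, destroying information quickly early on.
betas_linear = np.linspace(1e-4, 0.02, T)

# Cosine schedule: define alpha_bar directly via a squared cosine, derive beta from it.
s = 0.008
t = np.arange(T + 1)
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar = f / f[0]
betas_cosine = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```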

Model for Reverse Process

The model used for the reverse trajectory is not predetermined by the algorithm. Often, Variational Autoencoders or U-Nets are used. They can be used to predict

  • the mean of the noise
  • the original image
  • the noise of the image (mean & variance). 2

Classically, a U-Net autoencoder model is used. For the reverse-trajectory prediction, the noisy image and an embedding of the current timestep are passed into the model:

Drawing 2024-04-29 11.03.49.excalidraw#^group=E73oGGy1Glf94Wd3zIcAf

The embedding is passed to each layer, so the model knows how much noise to expect.
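The timestep is typically encoded with a sinusoidal embedding, as in Transformer position encodings; a sketch (the embedding dimension of 128 is an arbitrary choice here):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep t, Transformer-style (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=250, dim=128)  # passed (usually through an MLP) to every block
```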

Improving on the original implementation, OpenAI researchers made several adjustments to beat GANs on image synthesis tasks:

  • Increasing the depth and decreasing the width of the U-Net
  • Adding more attention layers
  • Increasing the number of attention heads
  • Using BigGAN residual blocks
  • Adaptive Group Normalization
  • Classifier Guidance

Adaptive Group Normalization

For Adaptive Group Normalization, the authors add an extra group normalization layer after the convolution layers of the U-Net, project the timestep embedding onto it, and add the class label embedding for conditioning.

Drawing 2024-04-29 11.29.14.excalidraw#^group=M8QZRJOy2exQTWoZe9Ilc
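A minimal PyTorch sketch of such a layer, assuming the paper's formulation \(\text{AdaGN}(h, y) = y_s \cdot \text{GroupNorm}(h) + y_b\), where \(y = (y_s, y_b)\) is projected from the combined timestep/class embedding (module and parameter names here are my own):

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """AdaGN(h, y) = y_s * GroupNorm(h) + y_b, with y projected from the embedding."""
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()  # channels must be divisible by groups
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # predicts scale y_s and shift y_b

    def forward(self, h, emb):            # h: (B, C, H, W), emb: (B, emb_dim)
        scale, shift = self.proj(emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]   # broadcast over the spatial dimensions
        shift = shift[:, :, None, None]
        return scale * self.norm(h) + shift
```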

Training

Training is essentially only done for the reverse trajectory. For the forward process, a variance schedule is chosen and the noise is applied iteratively. Afterward, training of the reversal model begins:1

  1. Model setup: A model like a VAE or U-Net is chosen. It is given a noisy data sample and a timestep as input and tries to predict the noise that was added to the image.
  2. Calculate loss: Thanks to the noise schedule, we always know the actual noise that was added to the image. The difference between the model's prediction and the actual noise is used to calculate the loss (often MSE); see the sketch after this list.
  3. Learning reverse trajectory: By repeatedly applying steps 1 and 2, the model learns to predict the noise in an image without actual knowledge of the noise schedule. It can only use whatever structure remains in the image.
  4. Sampling & Refinement: After many epochs of training, the model can be given completely random noise as input to generate novel samples in accordance with the input distribution. The model can be fine-tuned by adjusting the noise schedule or the rate at which noise is added in the forward pass (the timestep parameter \(\beta\)).
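A minimal PyTorch sketch of one iteration of steps 1-3, assuming a noise-predicting `model(x_t, t)` and the one-step noising formula from the #Noise Adding Function:

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, alpha_bars, optimizer):
    """Noise a clean batch to random timesteps, regress the prediction onto the true noise."""
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # one timestep per sample
    eps = torch.randn_like(x0)                                 # the actual noise added
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # one-step noising
    loss = F.mse_loss(model(x_t, t), eps)  # L = E ||eps - eps_theta(x_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```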

Conditioning and Control

The training process learns the inherent structure of the input space in an unsupervised manner. Consequently, it can generate novel samples, but we have no control over which sample will be generated. However, diffusion models can easily be conditioned on input labels (classes, text descriptions, numbers) during training. They thus learn a conditional distribution for each label, and when applying the reverse process, one can provide a label to guide the image generation. For textual descriptions, labels are usually embeddings generated by models like [[BERT]] and [[GPT]].

Conditioning can be done by concatenating the input label with the noisy training image, so that the model learns to predict the noise differently depending on the label. Another way is to feed the label as a separate input to each layer of the internal model.
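A sketch of the concatenation variant, assuming the label has already been embedded into a vector (the helper below is hypothetical, not from the source):

```python
import torch

def condition_by_concat(x_t, label_emb):
    """Broadcast a label embedding to a feature map and append it as extra channels."""
    b, _, h, w = x_t.shape
    label_map = label_emb[:, :, None, None].expand(b, label_emb.shape[1], h, w)
    return torch.cat([x_t, label_map], dim=1)  # the model's in_channels grows accordingly
```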

The conditioned model is not only controllable but also capable of performing well across different tasks, simply by providing the task label.

Mathematical Background

Notation

\(\{x_0,\ldots,x_T\}\) is one sample trajectory, from no added noise at timestep \(0\) to maximal noise at timestep \(T\).

Noise Adding Function

\(q(x_t|x_{t-1})\) is the noise adding function of the #Forward Trajectory. It is defined as

\[ q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_tI) \]

where

  • \(\mathcal{N}\) is a normal distribution
  • \(x_t\) is the output
  • \(\sqrt{1-\beta_t}x_{t-1}\) is the mean
  • \(\beta_tI\) is the variance.

The \(\beta\) symbol denotes the #Noise Schedule. In the linear schedule, it starts at \(\beta_0=0.0001\) and grows linearly to \(\beta_T=0.02\). The term \(\sqrt{1-\beta_t}\) thus shrinks from roughly 1 to roughly 0.99 over 1000 timesteps. This term scales down the mean of the image, counteracting the increasing variance and keeping values bounded.

With \(\alpha_t=1-\beta_t\) and \(\bar{\alpha_t}=\prod_{s=1}^{t}\alpha_s\), we can use the reparameterization trick to reformulate the noise formula as

\[ \begin{align} q(x_t|x_{t-1}) &= \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_tI)\\ x_t &= \sqrt{1-\beta_t}x_{t-1}+\sqrt{\beta_t}\epsilon && \text{Reparameterization Trick}\\ &= \sqrt{\alpha_t}x_{t-1}+\sqrt{1-\alpha_t}\epsilon\\ &= \sqrt{\alpha_t\alpha_{t-1}}x_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\epsilon\\ &= \sqrt{\alpha_t\alpha_{t-1}\ldots\alpha_{1}}x_{0}+\sqrt{1-\alpha_t\alpha_{t-1}\ldots\alpha_{1}}\epsilon\\ &= \sqrt{\bar{\alpha_t}}x_0+\sqrt{1-\bar{\alpha_t}}\epsilon \end{align} \]

to add the noise for timestep \(t\) in a single step.
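This is what makes training efficient: instead of simulating all \(t\) intermediate steps, we can jump straight to \(x_t\). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Sample x_t directly: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```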

Noise Removal Function

\(p(x_{t-1}|x_t)\) is the noise removal function of the #Reverse Trajectory. It is defined as

\[ p(x_{t-1}|x_{t}) = \mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)) \]

where

  • \(\mathcal{N}\) is a normal distribution
  • \(x_{t-1}\) is the output
  • \(\mu_\theta(x_t,t)\) is the mean, which must be predicted by the #Model for Reverse Process
  • \(\Sigma_\theta(x_t,t)\) is the variance, fixed by the #Noise Schedule.

The mean function is learned by a neural model, with the #Loss Function below used for gradient descent.

Loss Function

Reference: https://www.youtube.com/watch?v=HoKDTa5jHvg

The loss of the prediction model is the negative log-likelihood \(-\log p_\theta(x_0)\). But the probability of \(x_0\) depends on the probabilities of all other \(x_1,\ldots,x_T\). We can bound it as

\[ -\log p_\theta(x_0)\leq-\log p_\theta(x_{0})+D_{KL}(q(x_{1:T}|x_0)\,||\,p_\theta(x_{1:T}|x_0)) \]

where \(D_{KL}\) is the Kullback-Leibler divergence, which is always non-negative. The divergence can be reformulated using basic probability identities:

Drawing 2024-04-29 11.29.14.excalidraw#^group=glkDTbKZh4NvzfjY7HveM

Drawing 2024-04-29 11.29.14.excalidraw#^group=v3EagJMYtL8B8So99NeV6
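The drawings carry out the reformulation; written out, the key step is expanding the divergence and applying \(p_\theta(x_{1:T}|x_0)=p_\theta(x_{0:T})/p_\theta(x_0)\):

\[ \begin{align} D_{KL}(q(x_{1:T}|x_0)\,||\,p_\theta(x_{1:T}|x_0)) &= \mathbb{E}_q\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T}|x_0)}\right]\\ &= \mathbb{E}_q\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right]+\log p_\theta(x_0) \end{align} \]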

Now, the \(\log p_\theta(x_0)\) terms cancel out, giving us the Evidence Lower Bound to minimize:

\[ -\log p_\theta(x_0)\leq\mathbb{E}_q\left[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\right] \]

The quantity \(q(x_{1:T}|x_0)\) is deterministically given by the #Noise Adding Function. The quantity in the denominator factorizes as

\[ p_\theta(x_{0:T})=p(x_{T})\prod\limits_{t=1}^Tp_\theta(x_{t-1}|x_t) \]

where the right-hand term in the product is exactly what is modeled by our #Model for Reverse Process. Applying Bayes' rule to \(q(x_{1:T}|x_0)\) and simplifying further shows that the loss function reduces to a function of the noise:

\[ L=\mathbb{E}_{t,x_0,\epsilon}[||\epsilon-\epsilon_\theta(x_t,t)||^2] \]

Conditioning

The basic diffusion model only generates random samples from random noise, without any means of control. If we want to generate specific samples (e.g. only images of cats), we need to additionally condition the model.

Score-Based Diffusion Model#Conditioning

References & Footnotes