From paper: @songScoreBasedGenerativeModeling2021
MIT Tutorial: https://www.youtube.com/watch?v=wMmqCMwuM2Q
Blog from author: https://yang-song.net/blog/2021/score/
Score-based diffusion models apply Langevin dynamics from Score-Based Generative Modeling to Diffusion Models. The standard approach is to sequentially destroy the input data by adding noise step by step and then to learn a sequence of conditional statistical models that denoise the samples again using reverse-time diffusion. In contrast, score-based diffusion models use a Stochastic Differential Equation (SDE) to model a continuous diffusion of the samples and then use the score (the gradient of the log probability density, see #Score Function) to define a reverse SDE. The score-based approach allows for faster sampling (fewer reverse-diffusion steps), requires only a single conditional model, and supports simple class-conditioning (#Conditioning).
Concept¶
The idea of Diffusion Models is to learn a trajectory between a simple distribution (e.g. Gaussian) and our target distribution (e.g. images) within the sample space. Each point \(t\) along the trajectory represents a distribution closer to the target distribution.
In score-based generative diffusion models (SDM), we train a neural network to predict the score function \(\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)\), the gradient of the log-density of the current distribution \(p_{t}\), which points toward regions of higher probability. This gradient can then be used to move the current \(x_t\) along the trajectory to the target distribution \(p_{x_0}\), using a Stochastic Differential Equation (SDE) described later. Compared to normal diffusion, the score function gives a continuous trajectory over distributions in the sample space.
It is easy to find the trajectory from our target distribution toward the simple distribution: We iteratively destroy the image by adding noise through the SDE described in the #Forward Process. This is similar to the normal diffusion process.
To find our way back along the trajectory, we can now train the model to predict the gradient at each time step \(t\) along the trajectory, the score function. This gives us the means to generate samples of the target distribution \(p_{x_0}\) from samples of the simple distribution \(p_T\). The score function is used in the SDE of the #Reverse Process.
The source distribution can also be a latent space representation of the actual empirical distribution. Those models are then known as Latent Score-Based Generative Models.
Score Function¶

The (Stein) Score Function is the gradient of the log-probability of the modeled distribution.
It describes a vector-field where each vector points in the direction of increasing probability density. Using iterative methods like [[Stein's method]], we can perturb a simple distribution into any target distribution using only the score function. This makes the problem of finding a distribution equivalent to finding its score function, which typically is much easier to do.
Tractability¶
One great advantage of the score function is that the model output does not need to be normalized, while training is still fully based on principled statistical methods.

For comparison with other models:
- Energy-Based Models need to find the normalizing constant \(Z_{\theta}\) so that the model's output is an actual distribution.
- Variational Autoencoders require the latent distribution to be tractable, limiting their expressiveness.
- Generative Adversarial Networks (GANs) don't actually try to capture a distribution.


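The score function, however, trivializes the problem. Because the gradient of a constant is zero, we don't actually need to find \(Z_{\theta}\) to model any distribution. Even better, for a model of the form \(p_{\theta}(x)=e^{f_{\theta}(x)}/Z_{\theta}\), the score function is equal to the gradient of the neural network output \(f_{\theta}\) with respect to its input:
\[\nabla_{x}\log p_{\theta}(x)=\nabla_{x}f_{\theta}(x)-\nabla_{x}\log Z_{\theta}=\nabla_{x}f_{\theta}(x)\]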
So we can simply use [[Backpropagation]] to compute the score function from our neural network. This model is typically denoted as \(s_{\theta}(x)\). While this shows that a neural network can actually model the score function, it still needs to learn it in the first place. This can be done by #Score Matching. The score model is typically chosen to be a U-Net. In state-of-the-art models, they are of similar size to other image-generation models, but still much smaller than [[Large Language Model]]s.
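As a small numerical check of the claim that the normalizing constant drops out, the sketch below compares the score of an unnormalized and a normalized 1D Gaussian log-density. The function names, the Gaussian example, and the finite-difference step (used here as a stand-in for backpropagation) are illustrative assumptions.

```python
import math

def log_unnormalized(x, mu=1.0, sigma=2.0):
    # Log-density without the normalizing constant log Z.
    return -0.5 * ((x - mu) / sigma) ** 2

def log_normalized(x, mu=1.0, sigma=2.0):
    # Same density with log Z = log(sigma * sqrt(2*pi)) subtracted.
    return log_unnormalized(x, mu, sigma) - math.log(sigma * math.sqrt(2 * math.pi))

def score(log_density, x, h=1e-5):
    # Central finite difference of the log-density
    # (a stand-in for backpropagation through a network).
    return (log_density(x + h) - log_density(x - h)) / (2 * h)

x = 0.3
s_unnorm = score(log_unnormalized, x)
s_norm = score(log_normalized, x)
s_exact = -(x - 1.0) / 2.0 ** 2  # analytic Gaussian score at x
```

Both finite-difference scores agree with each other and with the analytic score, since the constant \(\log Z\) has zero gradient.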
Score Matching¶

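In order to compute the loss of an estimate of the score function, we need to be able to compare two score functions, i.e. compare two vector fields. This way, we can compare the score function of the real-world data distribution with the one we model.

A straightforward way is to compute the squared difference between the two vector fields at each point and then to average those differences over the data distribution. This is the Fisher divergence:
\[D_{F}(p_{data}\,\|\,p_{\theta})=\frac{1}{2}\mathbb{E}_{p_{data}(x)}\left[\left\|\nabla_{x}\log p_{data}(x)-s_{\theta}(x)\right\|_{2}^{2}\right]\]
However, we don't know the score function of the empirical distribution \(\nabla_{x}\log p_{data}(x)\). But, using the Divergence Theorem (integration by parts), we can reformulate this term into the Score Matching loss:
\[J(\theta)=\mathbb{E}_{p_{data}(x)}\left[\frac{1}{2}\left\|s_{\theta}(x)\right\|_{2}^{2}+\operatorname{tr}\left(\nabla_{x}s_{\theta}(x)\right)\right]\]
which is equal to the Fisher divergence up to a constant (not relevant when minimizing). This term can then be approximated by taking the empirical mean over the data samples.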
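To make the score matching loss concrete, here is a minimal 1D sketch, assuming Gaussian toy data and a hypothetical linear score model \(s(x)=ax+b\). In one dimension the loss is the empirical mean of \(\frac{1}{2}s(x)^{2}+\frac{ds}{dx}\), and for the linear model its minimizer can be written in closed form. Note that the true data score is never used.

```python
import random

random.seed(0)
mu, sigma = 2.0, 1.5  # parameters of the toy data distribution (assumed)
data = [random.gauss(mu, sigma) for _ in range(20000)]

# Model s(x) = a*x + b. The score matching loss is
#   J(a, b) = mean( 0.5 * (a*x + b)**2 + a ),
# where the second term is the (1x1) Jacobian trace ds/dx = a.
mean_x = sum(data) / len(data)
var_x = sum((x - mean_x) ** 2 for x in data) / len(data)

# Setting dJ/da = dJ/db = 0 gives the minimizer in closed form.
a = -1.0 / var_x
b = mean_x / var_x

def sm_loss(a, b):
    # Empirical score matching loss for the linear model.
    return sum(0.5 * (a * x + b) ** 2 + a for x in data) / len(data)
```

The fitted model \(s(x)=-(x-\hat{\mu})/\hat{\sigma}^{2}\) matches the analytic score of a Gaussian, which illustrates that minimizing this loss recovers the score from samples alone.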
Sliced Score Matching¶

While the score matching approach does work, it has a significant efficiency issue. The first term of the expectation can be computed by a single forward pass, the second term however is described by the [[Jacobian Matrix]] of the score function, which requires backpropagating back to each input neuron separately. This makes naive score matching unusable for large input dimensions.



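One way to circumvent the Jacobian is to use Random Projection. In random projection, we sample a random Gaussian direction \(\mathbf{v}\) and project each vector in the field onto it. This reduces the dimensionality of our vector field to \(1\), while preserving distances pretty well. This gives us the Sliced Fisher Divergence:
\[\frac{1}{2}\mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{data}(x)}\left[\left(\mathbf{v}^{\top}\nabla_{x}\log p_{data}(x)-\mathbf{v}^{\top}s_{\theta}(x)\right)^{2}\right]\]
which can be transformed again using the Gauss theorem to the Sliced Score Matching loss:
\[\mathbb{E}_{p_{\mathbf{v}}}\mathbb{E}_{p_{data}(x)}\left[\mathbf{v}^{\top}\nabla_{x}s_{\theta}(x)\,\mathbf{v}+\frac{1}{2}\left(\mathbf{v}^{\top}s_{\theta}(x)\right)^{2}\right]\]
This term can be computed very efficiently: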

The second term in the expectation is just a forward pass through the network, followed by a single projection neuron.

The first term is just the gradient of the projected network output from the second term, followed by another projection neuron. So we only need to backpropagate once to find this term. This is a massive reduction in computational complexity.
In practice, we would compute the Sliced Score Matching by…
- sampling a minibatch of data points from the empirical distribution \((x_{1},\cdots,x_{n})\sim p_{data}\)
- sampling a minibatch of projection directions \((\mathbf{v}_{1}, \cdots, \mathbf{v}_{n})\sim p_{\mathbf{v}}\)
- estimating the sliced score matching loss from the empirical mean
- applying stochastic gradient descent
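The first three steps above can be sketched as follows. This is only an illustration under stated assumptions: `score_model` is a stand-in for \(s_{\theta}\) (here the exact score of a diagonal Gaussian), and the directional derivative \(\mathbf{v}^{\top}\nabla_{x}s_{\theta}(x)\mathbf{v}\) is approximated with a finite difference along \(\mathbf{v}\) instead of a backward pass. The gradient descent step is omitted.

```python
import random

random.seed(0)
variances = [1.0, 4.0]  # diagonal Gaussian toy data (assumed)

def score_model(x):
    # Stand-in for s_theta(x): exact score of N(0, diag(variances)).
    return [-xi / vi for xi, vi in zip(x, variances)]

def directional_term(x, v, h=1e-4):
    # v^T (ds/dx) v via one finite difference along v; the full
    # Jacobian is never built (autodiff would use one backward pass).
    plus = score_model([xi + h * vi for xi, vi in zip(x, v)])
    minus = score_model([xi - h * vi for xi, vi in zip(x, v)])
    return sum(vi * (p - m) / (2 * h) for vi, p, m in zip(v, plus, minus))

def sliced_sm_loss(batch, directions):
    # Empirical mean of v^T (ds/dx) v + 0.5 * (v^T s(x))^2.
    total = 0.0
    for x, v in zip(batch, directions):
        proj = sum(vi * si for vi, si in zip(v, score_model(x)))
        total += directional_term(x, v) + 0.5 * proj ** 2
    return total / len(batch)

n = 20000
batch = [[random.gauss(0, vi ** 0.5) for vi in variances] for _ in range(n)]
directions = [[random.gauss(0, 1) for _ in variances] for _ in range(n)]
loss = sliced_sm_loss(batch, directions)
trace_estimate = sum(directional_term(x, v) for x, v in zip(batch, directions)) / n
```

With Gaussian directions, the averaged directional term estimates the Jacobian trace (here \(-1/1-1/4=-1.25\)), which is why the sliced loss can replace the trace term of plain score matching.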
Denoising Score Matching¶
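Instead of using a \(1\)-dimensional projection as in #Sliced Score Matching, we can also use a perturbation kernel \(q_{\sigma}(\tilde{x}|x)\) to add noise to our input samples and transform \(p_{data}(x)\) into \(q_{\sigma}(\tilde{x})\) in a controlled way. The denoising score matching loss is derived from the score matching loss and can be formulated as the mean squared difference between the modeled score function and the score function of the perturbation kernel:
\[\frac{1}{2}\mathbb{E}_{p_{data}(x)}\mathbb{E}_{q_{\sigma}(\tilde{x}|x)}\left[\left\|s_{\theta}(\tilde{x})-\nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}|x)\right\|_{2}^{2}\right]\]
Using a Gaussian perturbation kernel \(q_{\sigma}(\tilde{x}|x)=\mathcal{N}(\tilde{x};x,\sigma^{2}I)\), we can simplify even further to
\[\frac{1}{2}\mathbb{E}_{p_{data}(x)}\mathbb{E}_{z\sim\mathcal{N}(0,I)}\left[\left\|s_{\theta}(x+\sigma z)+\frac{z}{\sigma}\right\|_{2}^{2}\right]\]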
In practice, we would compute the Denoising Score Matching by…
- sampling a minibatch of data points from the empirical distribution \((x_{1},\cdots,x_{n})\sim p_{data}\)
- sampling a minibatch of perturbed points \((\tilde{x}_{1}, \cdots, \tilde{x}_{n})\sim q_{\sigma}\) by applying the noise kernel to the input
- estimating the denoising score matching loss
- applying stochastic gradient descent
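A minimal sketch of these steps in 1D, assuming standard normal toy data, a Gaussian noise kernel, and a hypothetical linear score model \(s(\tilde{x})=a\tilde{x}\). For this model the loss is quadratic in \(a\), so least squares replaces the gradient descent step.

```python
import random

random.seed(0)
sigma = 0.5   # noise level of the perturbation kernel (assumed)
n = 50000
clean = [random.gauss(0.0, 1.0) for _ in range(n)]     # x ~ p_data = N(0, 1)
noise = [random.gauss(0.0, 1.0) for _ in range(n)]     # z ~ N(0, 1)
noisy = [x + sigma * z for x, z in zip(clean, noise)]  # x~ = x + sigma * z
targets = [-z / sigma for z in noise]                  # grad log q_sigma(x~ | x)

# Linear score model s(x~) = a * x~, fitted by least squares on the DSM loss.
a = sum(xt * t for xt, t in zip(noisy, targets)) / sum(xt * xt for xt in noisy)

def dsm_loss(a):
    # Empirical denoising score matching loss for the linear model.
    return 0.5 * sum((a * xt - t) ** 2 for xt, t in zip(noisy, targets)) / n
```

The fitted slope approaches \(-1/(1+\sigma^{2})\), the score of the noise-perturbed distribution \(\mathcal{N}(0,1+\sigma^{2})\) rather than of the clean data, which is the limitation discussed next.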
However, there is a significant issue: If we want to estimate the score of the actual data distribution, \(\sigma\) needs to be very small. The variance of the denoising score matching objective is inversely related to the variance of the noise distribution, so decreasing the noise perturbation actually increases the variance of the objective, making denoising score matching not very useful on its own.
Generating Samples¶
The diffusion process corresponds to approximate maximum likelihood training based on Kullback-Leibler divergence between the target distribution and the distribution defined by the #Reverse Process SDE.
Reverse Process¶

To generate samples from the target distribution, we start with points randomly distributed according to the prior distribution and then follow the vector field described by the #Score Function to arrive at our target distribution. As seen in the figure above, this requires adding some noise in each step, as otherwise we would just collapse onto the modes of the empirical distribution. This iterative process is a discretization of a Stochastic Differential Equation (SDE) and is called Langevin Dynamics. With step size \(\epsilon\), it reads:
\[x_{t+1}=x_{t}+\epsilon\nabla_{x}\log p(x_{t})+\sqrt{2\epsilon}\,z_{t}\]
where \(z_{t}\sim\mathcal{N}(0,I)\).
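As an illustration, the sketch below runs Langevin dynamics on a 1D Gaussian target whose score is known analytically. The target parameters, step size, and number of steps are assumptions chosen for the example; a trained score model \(s_{\theta}(x)\) would take the place of `score`.

```python
import math
import random

random.seed(0)
mu, sigma = 2.0, 1.0  # target N(mu, sigma^2), assumed for illustration

def score(x):
    # Analytic score of the Gaussian target.
    return -(x - mu) / sigma ** 2

def langevin_sample(eps=0.05, steps=300):
    x = random.gauss(0.0, 3.0)  # arbitrary broad initialization
    for _ in range(steps):
        z = random.gauss(0.0, 1.0)
        # Follow the score and add noise so samples do not collapse on the mode.
        x = x + eps * score(x) + math.sqrt(2 * eps) * z
    return x

samples = [langevin_sample() for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

The empirical mean and standard deviation of the samples end up close to those of the target. Without the noise term, every chain would converge to the mode at `mu`.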
Noise Conditional Score Model¶

There is an issue with the score function: It is only accurate where there are actual data points, otherwise we have no data to compare the vector fields to. This is the issue of #Scarcity in sample space.

We can increase the support of the empirical data by perturbing the samples using Gaussian noise. The samples are then scattered and the PDF is "fuzzied" out. As seen above, this enables the accurate estimation of the score function across the whole input space. However, we now have another issue: the score function no longer generates real images, but noisy data. This is where score-based modelling merges with Diffusion Models.

Instead of only training on real or perturbed data, we create multiple datasets with different levels of noise (see #Forward Process). Now we take the random input samples and first follow the score function of the highest noise level for a few iterations. This will nudge the random samples slightly towards the empirical distribution. Next, we update the samples with the score function of a slightly lower noise level. We repeat this process with score functions of decreasing noise levels until the samples approximate the real data distribution.

This seemingly requires multiple models, one for each noise level. We can, however, use a single network, conditioned on the noise level, to estimate the score functions for all levels. This works well because all the score functions share a lot of information, as they describe the same underlying data distribution. This model is called the Noise Conditional Score Model \(s_{\theta}(x,\sigma)\). For optimization, we want to balance the loss across all noise levels using weights \(\lambda(\sigma_{i})\). Without them, the low noise levels would dominate the #Score Matching loss function, because the magnitude of the score scales with \(1/\sigma_{i}\). Based on principled analysis, we can set \(\lambda(\sigma_{i})=\sigma_{i}^{2}\), which decouples the loss from the extent of the noise.
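The annealed sampling procedure can be sketched on a toy problem. Everything specific here is an assumption for illustration: a 1D two-mode Gaussian mixture as data, its exact noise-perturbed score as a stand-in for a trained \(s_{\theta}(x,\sigma)\), a geometric noise schedule, and a step size scaled with \(\sigma_{i}^{2}\).

```python
import math
import random

random.seed(0)
weights, means, base_std = [0.2, 0.8], [-4.0, 4.0], 0.5  # toy mixture (assumed)

def score(x, sigma):
    # Exact score of the mixture after adding N(0, sigma^2) noise;
    # stands in for a trained noise conditional model s_theta(x, sigma).
    var = base_std ** 2 + sigma ** 2
    logs = [math.log(w) - 0.5 * (x - m) ** 2 / var for w, m in zip(weights, means)]
    top = max(logs)
    resp = [math.exp(l - top) for l in logs]  # unnormalized responsibilities
    norm = sum(resp)
    return sum(r / norm * (-(x - m) / var) for r, m in zip(resp, means))

num_levels, steps_per_level, eps = 10, 60, 0.002
# Geometric schedule from sigma = 10 down to sigma = 0.1.
sigmas = [10.0 * (0.01 ** (i / (num_levels - 1))) for i in range(num_levels)]

def annealed_langevin():
    x = random.gauss(0.0, sigmas[0])
    for sigma in sigmas:  # high noise -> low noise
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2  # step size for this level
        for _ in range(steps_per_level):
            z = random.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * z
    return x

samples = [annealed_langevin() for _ in range(500)]
right_fraction = sum(s > 0 for s in samples) / len(samples)
```

The high noise levels let the chains move freely between the two modes, so the fraction of samples ending near the right-hand mode should land close to its mixture weight. The low noise levels then only refine the samples locally.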
Time Conditional Score Function¶

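As described in #Forward Process, it is desirable to have an infinite number of noise levels using Stochastic Differential Equations (SDEs). The SDE described for the forward process can be reversed analytically, giving us the reverse-time SDE:
\[dx_{t}=\left[f(x_{t},t)-g(t)^{2}\nabla_{x}\log p_{t}(x_{t})\right]dt+g(t)\,d\bar{\mathbf{w}}\]
where \(\bar{\mathbf{w}}\) is the reverse Brownian motion and \(dt\) is a negative infinitesimal time step. This now requires us to find a Time Conditional Score Model \(s_{\theta}(x,t)\). This changes our training objective from the one above, with a positive weighting function \(\lambda(t)\):
\[\mathbb{E}_{t\sim\mathcal{U}(0,T)}\mathbb{E}_{p_{t}(x)}\left[\lambda(t)\left\|\nabla_{x}\log p_{t}(x)-s_{\theta}(x,t)\right\|_{2}^{2}\right]\]

We can again use #Score Matching methods to compute this loss and train the time conditional network. We can then plug the model into the reverse-time SDE and solve it using the Euler-Maruyama method:
\[\Delta x\leftarrow\left[f(x,t)-g(t)^{2}s_{\theta}(x,t)\right]\Delta t+g(t)\,\mathbf{z},\qquad x\leftarrow x+\Delta x,\qquad t\leftarrow t+\Delta t\]
where \(\mathbf{z}\sim\mathcal{N}(0,|\Delta t|\mathbf{I})\) and \(\Delta t<0\) is a small negative time step.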
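A minimal sketch of Euler-Maruyama sampling from a reverse-time SDE. It assumes a drift-free forward SDE \(dx=\sigma^{t}d\mathbf{w}\) (so \(f=0\), \(g(t)=\sigma^{t}\)) and 1D Gaussian toy data, for which the time-dependent score is available in closed form. The function `score` is a stand-in for a trained \(s_{\theta}(x,t)\), and all names and constants are illustrative.

```python
import math
import random

random.seed(0)
sigma = 25.0                     # noise scale of the assumed SDE dx = sigma^t dw
data_mean, data_std = 3.0, 1.0   # toy data distribution (assumed)

def g(t):
    return sigma ** t

def marginal_var(t):
    # Variance added by the forward SDE up to time t.
    return (sigma ** (2 * t) - 1.0) / (2.0 * math.log(sigma))

def score(x, t):
    # Exact score of p_t = N(data_mean, data_std^2 + marginal_var(t));
    # stands in for a trained s_theta(x, t).
    return -(x - data_mean) / (data_std ** 2 + marginal_var(t))

def reverse_sde_sample(num_steps=500):
    dt = -1.0 / num_steps                                 # negative time step
    x = random.gauss(0.0, math.sqrt(marginal_var(1.0)))   # sample from the prior
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        z = random.gauss(0.0, math.sqrt(abs(dt)))         # z ~ N(0, |dt|)
        x = x + (-g(t) ** 2 * score(x, t)) * dt + g(t) * z
    return x

samples = [reverse_sde_sample() for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

Starting from pure noise, the generated samples end up with a mean and standard deviation close to those of the toy data distribution.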

Forward Process¶
For training the score-based diffusion model, we need to perturb the input samples with varying degrees of noise. As we increase the number of noise levels, we get
- higher quality samples
- exact log-likelihood computations
- controllable generation for inverse problem solving

So ideally, we want to move from discrete noise levels towards a continuous transition of perturbed distributions that starts at the data distribution \(p_{data}(x)\) and ends with maximum noise, i.e. a Gaussian distribution \(\pi(x)\).


We can do so by not perturbing the set of data points discretely, but by describing the perturbation as a Stochastic Differential Equation (SDE). An SDE is composed of two terms:
\[dx_{t}=f(x_{t},t)\,dt+g(t)\,d\mathbf{w}_{t}\]
The first term, \(f(x_{t},t)dt\), is called the deterministic drift and describes the deterministic trend of the stochastic process. The second term, \(g(t)d\mathbf{w}_{t}\), introduces randomness into the process. \(\mathbf{w}_{t}\) is the Brownian motion, so \(d\mathbf{w}_{t}\) is basically an infinitesimally small amount of noise added at each step, scaled by the diffusion coefficient \(g(t)\). For a Gaussian prior distribution, we can reduce this term to:
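The following sketch simulates a forward SDE with Euler-Maruyama steps. It assumes the drift-free choice \(dx=\sigma^{t}d\mathbf{w}\) as an illustrative example (not necessarily the reduced form referred to above); the names `drift`, `diffusion`, and `forward_sde` are hypothetical.

```python
import math
import random

random.seed(0)
sigma = 5.0  # assumed noise scale for the illustrative SDE dx = sigma^t dw

def drift(x, t):
    return 0.0             # f(x, t): no deterministic drift in this choice

def diffusion(t):
    return sigma ** t      # g(t): noise scale grows with time

def forward_sde(x0, num_steps=200, t_end=1.0):
    dt = t_end / num_steps
    x = x0
    for i in range(num_steps):
        t = i * dt
        dw = random.gauss(0.0, math.sqrt(dt))  # Brownian increment
        x = x + drift(x, t) * dt + diffusion(t) * dw
    return x

x0 = 1.0
finals = [forward_sde(x0) for _ in range(2000)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
# Closed-form variance of the perturbation kernel at t = 1 for this SDE.
expected_var = (sigma ** 2 - 1.0) / (2.0 * math.log(sigma))
```

For this choice the perturbed sample stays centered on `x0`, while its variance grows to \((\sigma^{2}-1)/(2\ln\sigma)\) at \(t=1\), so a large \(\sigma\) drowns the data in Gaussian noise.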

Conditioning¶
This approach was originally presented in @dhariwalDiffusionModelsBeat2021 and is called Classifier Guidance. It incorporates the gradient of a separately trained classifier into the trajectory of the reverse diffusion. The #Reverse Process uses the term \(- \nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)\) to pull the trajectory of \(x_t\) toward the target distribution. This term can be extended using the gradient of the classifier. The classifier is trained on samples at different time steps of the diffusion process, giving us \(p(c|x_{t}, t)\). The general idea is that, according to Bayes' formula, we can just add the gradient of the classifier's log-probability to the score of the unconditional distribution and thus move the trajectory toward the conditional distribution \(p(x|c)\):
\[\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|c)=\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})+\nabla_{\mathbf{x}}\log p(c|\mathbf{x}_{t},t)\]
An extension of classifier guidance is soft-label guidance as presented in EmoDiff#Soft-Label Guidance. It allows for controllable intensity of a class label, moving the trajectory only partially toward a given class label. This also allows for mixed classes. A big advantage of this approach is that the guidance is done by a model separate from the #Score Function. So we can train our score model once and then train different classifier models to perform different tasks, like generating images from text prompts, image in-painting or colorization.
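The Bayes-rule identity behind classifier guidance can be checked numerically on a toy problem. The sketch assumes two unit-variance Gaussian classes in 1D and uses the exact Bayes posterior as the "classifier"; gradients are taken by finite differences. The `scale` parameter is a hypothetical guidance strength, with `scale=1.0` corresponding to plain Bayes' rule.

```python
import math

means = {0: -2.0, 1: 2.0}   # class-conditional Gaussians p(x | c), unit variance
prior = {0: 0.5, 1: 0.5}    # class prior p(c)

def log_p_x_given_c(x, c):
    return -0.5 * (x - means[c]) ** 2 - 0.5 * math.log(2 * math.pi)

def log_p_x(x):
    # Unconditional log-density of the mixture.
    return math.log(sum(prior[c] * math.exp(log_p_x_given_c(x, c)) for c in means))

def log_p_c_given_x(x, c):
    # The "classifier": exact Bayes posterior over classes.
    return math.log(prior[c]) + log_p_x_given_c(x, c) - log_p_x(x)

def grad(fn, x, h=1e-5):
    # Central finite difference, standing in for backpropagation.
    return (fn(x + h) - fn(x - h)) / (2 * h)

def guided_score(x, c, scale=1.0):
    # Unconditional score plus (scaled) classifier gradient.
    return grad(log_p_x, x) + scale * grad(lambda y: log_p_c_given_x(y, c), x)

x, c = 0.7, 1
conditional_score = -(x - means[c])  # exact score of p(x | c)
```

With `scale=1.0`, `guided_score(x, c)` reproduces the score of the class-conditional distribution, so the unconditional score model never needs to see the labels.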
Probability Evaluation¶
The Stochastic Differential Equations (SDEs) described in #Forward Process and #Reverse Process#Time Conditional Score Function don't allow for exact log-likelihood computation, due to their inherent randomness. We can, however, replace the SDEs with Ordinary Differential Equations (ODEs). The trajectories of the ODE actually have the same marginal distributions \(p_{t}(x)\) as the SDE, including the endpoints \(p_{data}(x)\) and \(\pi(x)\). The ODE is formulated as
\[dx_{t}=\left[f(x_{t},t)-\frac{1}{2}g(t)^{2}\nabla_{x}\log p_{t}(x_{t})\right]dt\]

We can now observe the change in probability density and use the Instantaneous Change of Variables formula (Theorem 1 in @chenNeuralOrdinaryDifferential2018) to compute the log-likelihood of any data sample. This is basically the approach of Normalizing Flows. Interestingly, this method performs as well as, and sometimes even outperforms, other models that are specifically trained for maximum likelihood.
Score-based models using ODE flow are deterministic in nature, providing another big advantage: If two models are trained separately on the same dataset, they will find the same latent encoding. This is a unique property of these models, as compared to Normalizing Flows, Variational Autoencoders and other Encoder-Decoder models.
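The likelihood computation can be sketched in 1D, assuming the drift-free SDE \(dx=\sigma^{t}d\mathbf{w}\), Gaussian toy data, and the exact score in place of a trained model. The ODE is integrated with plain Euler steps while accumulating the divergence of the velocity field, as in the instantaneous change of variables formula. As a simplification, the exact marginal at \(t=1\) is used as the prior. All names are illustrative.

```python
import math

sigma = 5.0                       # assumed SDE dx = sigma^t dw
data_mean, data_std = 0.5, 1.0    # toy data distribution N(0.5, 1)

def marginal_var(t):
    # Variance of p_t: data variance plus noise added up to time t.
    return data_std ** 2 + (sigma ** (2 * t) - 1.0) / (2.0 * math.log(sigma))

def velocity(x, t):
    # Probability flow ODE with f = 0: dx/dt = -0.5 * g(t)^2 * score(x, t).
    score = -(x - data_mean) / marginal_var(t)
    return -0.5 * sigma ** (2 * t) * score

def divergence(t):
    # d(velocity)/dx; in higher dimensions this would be a trace estimate.
    return 0.5 * sigma ** (2 * t) / marginal_var(t)

def log_likelihood(x0, num_steps=2000):
    dt = 1.0 / num_steps
    x, log_det = x0, 0.0
    for i in range(num_steps):
        t = i * dt
        log_det += divergence(t) * dt  # accumulated change in log-density
        x += velocity(x, t) * dt       # Euler step of the ODE
    var_T = marginal_var(1.0)
    # Log-density of the final point under the (exact) marginal at t = 1.
    log_prior = -0.5 * math.log(2 * math.pi * var_T) - 0.5 * (x - data_mean) ** 2 / var_T
    return log_prior + log_det

x0 = 1.7
exact = -0.5 * math.log(2 * math.pi * data_std ** 2) - 0.5 * ((x0 - data_mean) / data_std) ** 2
```

Because the flow is deterministic, `log_likelihood(x0)` agrees with the exact Gaussian log-density up to discretization error, and the endpoint of the trajectory acts as a unique latent code for `x0`.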
Optimizations¶
Based on @karrasElucidatingDesignSpace (Presentation on YouTube)
Many aspects around DDPMs can be optimized, especially when it comes to designing the sampling process and training the score model.
Sampling¶
Score Network¶
Issues¶
Scarcity in Sample Space¶
According to the Manifold Hypothesis, real-world data tends to concentrate on lower-dimensional manifolds embedded in the high-dimensional data space. This leads to a problem: If \(x\) is embedded on a lower-dimensional manifold, then we will never observe samples for most of the input space and the score \(\nabla_{x}\log{p_{data}(x)}\) will be undefined everywhere except on the manifold. We can circumvent the problem using two tricks:
- Data Perturbation: Increase the support of \(\nabla_{x}\log{p_{data}(x)}\) by perturbing the samples slightly, fuzzing out the modes across the whole input space.
- Simulated Annealing: By starting with large steps and decreasing them over time, we can move towards the manifold region quickly and then converge more slowly in this more detailed region. A similar idea is employed in GANs (Generative Modeling#Noise Injection).
Entanglement of Latent Space¶
Mentioned here: https://youtu.be/wMmqCMwuM2Q?si=AJtgIyZ3QuJ2LOX3&t=4864
Score-based diffusion models do not tend to model a disentangled latent space (Entanglement of Latent Features). The reason is pretty straightforward: In other models, like Generative Adversarial Networks (GANs) and Variational Autoencoders, the latent space is part of the optimization process. So learning a disentangled representation actually improves the loss and the quality of the model's output. In diffusion models, however, the latent code is manually and deterministically created by the #Forward Process.
MasterThesis One consideration: Song mentions that diffusion model have less disentangled space than GANs and VAEs. Seems like VAE disentanglement might pair well with diffusion?¶
Overfitting and Mode Collapse¶
Diffusion models can overfit and collapse on modes. Compared to Generative Adversarial Networks (GANs), those issues depend on the dataset rather than on the model architecture. Having a varied and large enough dataset handles both issues.