# Generative Modeling
Based on @lambBriefIntroductionGenerative2021a

The goal of generative modeling is to generate samples from some empirical distribution \(p(x)\). This target could be images of people's faces, voice recordings (as in Text-to-Speech Synthesis (TTS)), text, or samples from some latent distribution. This is done by training a model \(q_{\theta}(x)\), which should be as close to the real distribution as possible. Typically, this is done by minimizing the divergence ([[Kullback-Leibler divergence]]) between the modelled and the real distribution:
\[\theta^{*}=\arg\min_{\theta}\,D_{KL}(p\,\|\,q_{\theta})\]
This approach is called probabilistic generative modeling. There are exceptions to this formulation, where the goal is to produce samples with novel traits or styles, or to generate samples from a yet unknown latent distribution ([[Variational Autoencoder]]).
## Maximum Likelihood Approaches
Classically, the divergence is measured using the [[Kullback-Leibler divergence]]. The KL divergence measures the expected excess information when drawing samples from \(p\) but assuming the distribution \(q\). It is formulated as:
\[D_{KL}(p\,\|\,q)=\mathbb{E}_{x\sim p}\left[\log\frac{p(x)}{q(x)}\right]\]
The KL divergence can be reformulated into a sum of [[Shannon Entropy]] and [[Cross-Entropy]] terms:
\[D_{KL}(p\,\|\,q)=E(p,q)-H(p)\]
where \(H\) is the entropy of the empirical distribution \(p\) and \(E\) is the [[Cross-Entropy]] between the empirical and the modelled distribution. We see that \(H(p)\) depends only on the empirical distribution \(p\), over which we have no control. So minimizing the KL divergence means minimizing the cross-entropy between both distributions.
We can now formulate the generative problem as maximizing the likelihood of drawing each of our empirical samples from our modelled distribution:
\[\theta^{*}=\arg\max_{\theta}\sum_{i=1}^{N}\log q_{\theta}(x_{i})\]
Since the expected log-likelihood is exactly the negative cross-entropy, \(\mathbb{E}_{x\sim p}[\log q_{\theta}(x)]=-E(p,q_{\theta})\), maximizing the likelihood also minimizes the KL divergence.
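A minimal numerical sketch of this decomposition on a small discrete distribution (the probability values are illustrative assumptions):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.15, 0.10])   # empirical distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # modelled distribution

entropy = -np.sum(p * np.log(p))         # H(p), fixed by the data
cross_entropy = -np.sum(p * np.log(q))   # E(p, q), depends on the model
kl = np.sum(p * np.log(p / q))           # D_KL(p || q)

# D_KL = E(p, q) - H(p): minimizing cross-entropy minimizes the KL divergence.
assert np.isclose(kl, cross_entropy - entropy)
print(f"H(p)={entropy:.4f}  E(p,q)={cross_entropy:.4f}  KL={kl:.4f}")
```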
### Issues with KL Divergence

A key issue with minimizing the KL divergence is its asymmetry ([[Kullback-Leibler divergence#Asymmetry]]). The forward KL divergence \(D_{KL}(p\,\|\,q_{\theta})\) forces the generative model to cover as much of the empirical support as possible, even at the cost of generating implausible samples. If we used the reverse KL divergence \(D_{KL}(q_{\theta}\,\|\,p)\) instead, we would generate plausible samples but drastically reduce the coverage of the empirical distribution.
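A small sketch of this effect on a discretized 1-D space, assuming a bimodal \(p\) and a single-Gaussian candidate family for \(q\) (the grid ranges and the brute-force search are illustrative assumptions):

```python
import numpy as np

xs = np.linspace(-6, 6, 601)

def gauss(x, mu, sigma):
    d = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return d / d.sum()  # normalize on the grid

p = 0.5 * gauss(xs, -2.0, 0.5) + 0.5 * gauss(xs, 2.0, 0.5)  # two modes

def kl(a, b):
    eps = 1e-12
    return np.sum(a * np.log((a + eps) / (b + eps)))

# Grid-search a single Gaussian q(mu, sigma) under both objectives.
candidates = [(mu, sigma) for mu in np.linspace(-3, 3, 61)
                          for sigma in np.linspace(0.3, 4.0, 38)]
fwd = min(candidates, key=lambda c: kl(p, gauss(xs, *c)))  # KL(p || q)
rev = min(candidates, key=lambda c: kl(gauss(xs, *c), p))  # KL(q || p)

print("forward KL picks mu=%.2f sigma=%.2f (mode-covering)" % fwd)
print("reverse KL picks mu=%.2f sigma=%.2f (mode-seeking)" % rev)
```

The forward objective chooses a wide Gaussian spanning both modes; the reverse objective collapses onto a single mode.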
### Algorithms
#### Gaussian Mixture Models
One way of maximizing the likelihood is to use a simple, parameterized distribution and tune its parameters until they best fit the empirical observations. Such a tunable distribution is called a Tractable Distribution, as we can compute the probability of each point analytically. Intractable distributions are not solvable analytically and often require us to sample from them to estimate the probability density.
The [[Gaussian Mixture Model]] uses the weighted sum of multiple parameterized normal distributions to model the empirical distribution:
\[q_{\theta}(x)=\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(x;\mu_{k},\Sigma_{k})\]
As long as the weights \(\pi_{k}\) sum to \(1\), the resulting distribution is guaranteed to be normalized. This method works well for empirical distributions with few modes. But since each mode requires a separate mixture component, it breaks down for high-dimensional data with large numbers of modes, where the computation simply becomes too complex.
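A minimal sketch of evaluating such a mixture density in 1-D (the weights, means and standard deviations are illustrative assumptions; in practice they would be fitted, e.g. with expectation-maximization as in `sklearn.mixture.GaussianMixture`):

```python
import numpy as np

pis = np.array([0.3, 0.7])      # mixture weights, sum to 1
mus = np.array([-1.0, 2.0])     # component means
sigmas = np.array([0.5, 1.0])   # component standard deviations

def gmm_density(x):
    # Weighted sum of normal densities: q(x) = sum_k pi_k N(x; mu_k, sigma_k)
    comps = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return np.sum(pis * comps)

samples = np.array([-1.2, 0.3, 1.8])
log_likelihood = sum(np.log(gmm_density(x)) for x in samples)
print(f"log-likelihood of samples: {log_likelihood:.4f}")
```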
#### Energy-Based Models
A distribution is a function that is non-negative and integrates to \(1\) over its support. [[Energy-Based Models]] use the non-negative function \(E_{\theta}(x)=e^{-f_{\theta}(x)}\), where \(f_{\theta}\) is the energy, and its integral \(Z_{\theta}=\int E_{\theta}(x)\,dx\) to model a distribution:
\[q_{\theta}(x)=\frac{E_{\theta}(x)}{Z_{\theta}}=\frac{e^{-f_{\theta}(x)}}{Z_{\theta}}\]
As we divide a function by its integral, \(q_{\theta}\) is guaranteed to integrate to \(1\). However, for models like neural networks, we can only evaluate \(f_{\theta}\) at discrete points and can't integrate over the input space. If we now look at the log-likelihood, we can do the following reformulation:
\[\log q_{\theta}(x)=-f_{\theta}(x)-\log Z_{\theta}\]
The gradient of the intractable \(\log Z_{\theta}\) term can be formulated in terms of an expectation over the network's outputs:
\[\nabla_{\theta}\log Z_{\theta}=-\,\mathbb{E}_{x\sim q_{\theta}}\left[\nabla_{\theta}f_{\theta}(x)\right]\]
To maximize the likelihood, the model therefore tries to minimize the energy \(f_{\theta}\) for samples from the empirical distribution and maximize it for samples from the model distribution.
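A minimal sketch of one such training step under the convention \(q_{\theta}(x)\propto e^{-f_{\theta}(x)}\). The network, batch shapes and the sampler are illustrative assumptions; real EBMs obtain model samples with MCMC (e.g. Langevin dynamics) rather than this stand-in:

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

real = torch.randn(32, 2)        # stand-in for empirical samples
fake = torch.randn(32, 2) * 3.0  # stand-in for samples from q_theta

# Descending this loss follows the likelihood gradient
# -grad f(real) + E_q[grad f]: energy goes down on data, up on model samples.
loss = f(real).mean() - f(fake).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"contrastive loss: {loss.item():.4f}")
```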
#### Autoregressive Models
Both [[Gaussian Mixture Model]]s and [[#Energy-Based Models]] use likelihood maximization, but have difficulties modelling multimodal, high-dimensional distributions. One way to tackle this problem is to use Autoregressive Models, which factorize a joint distribution into a chain of conditional distributions:
\[q_{\theta}(x)=\prod_{t=1}^{T}q_{\theta}(x_{t}\mid x_{1},\dots,x_{t-1})\]
Autoregressive models are a straightforward formulation of a tractable density, which directly maximizes likelihood and thus minimizes KL divergence. However, they do struggle with compounding prediction errors when predicting from their own past predictions. Many techniques address this, such as scheduled sampling, shortening self-generated sub-sequences, or annealing. The compounding-error effect is especially notable if the model starts from scratch: the loss only looks one step ahead, and if we lack any prior context, it is not a good measure of long-term error.
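A minimal sketch of this factorization on a toy character-level bigram model (the corpus and the uniform start distribution are illustrative assumptions):

```python
import numpy as np

corpus = "abababba"
vocab = sorted(set(corpus))
idx = {c: i for i, c in enumerate(vocab)}

# Estimate conditional distributions q(x_t | x_{t-1}) by counting bigrams.
counts = np.ones((len(vocab), len(vocab)))  # add-one smoothing
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1
cond = counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq):
    # log q(x) = sum_t log q(x_t | x_{t-1}), with a uniform start for x_1
    total = np.log(1.0 / len(vocab))
    for a, b in zip(seq, seq[1:]):
        total += np.log(cond[idx[a], idx[b]])
    return total

print(f"log q('abab') = {log_likelihood('abab'):.4f}")
```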
#### Variational Autoencoder
The VAE introduces a latent distribution \(p(z)\) to represent the joint distribution \(p(x,z)=p_{\theta}(x\mid z)p(z)\). The latent space is chosen to have a simpler structure than the visible space, so learning it should be less complex than learning the empirical distribution itself. For more information, see [[Variational Autoencoder]].
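As a brief sketch of the resulting objective: training maximizes the evidence lower bound (ELBO) on the log-likelihood, where the approximate posterior \(q_{\phi}(z\mid x)\) (the encoder) is a standard ingredient not introduced above:
\[\log p_{\theta}(x)\geq\mathbb{E}_{z\sim q_{\phi}(z\mid x)}\left[\log p_{\theta}(x\mid z)\right]-D_{KL}\left(q_{\phi}(z\mid x)\,\|\,p(z)\right)\]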
## Adversarial Approaches
As an alternative to [[#Maximum Likelihood Approaches]], adversarial approaches use a candidate generative model and utilize the difference between samples from this model and the empirical distribution. This difference is often quantified by the density ratio between the empirical distribution and the generative model:
\[r(x)=\frac{p(x)}{q_{\theta}(x)}\]
The key idea is that training a classifier and training a generative model are very similar: both rely on determining the underlying structure of the real data. Also, the density of the generative model does not have to be represented explicitly. We just train the model to be sensitive to differences between the generated samples and the real samples.
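This works because of a standard identity: a classifier \(D\) trained to distinguish real from generated samples (with equal class priors) is, at its optimum, a function of exactly this density ratio:
\[D^{*}(x)=\frac{p(x)}{p(x)+q_{\theta}(x)}\qquad\Rightarrow\qquad\frac{D^{*}(x)}{1-D^{*}(x)}=\frac{p(x)}{q_{\theta}(x)}=r(x)\]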
### Algorithms
#### Noise Contrastive Estimation
In this approach, a fixed \(q_{\theta}\) is chosen and a classifier \(D_{\theta}\) is learned. Then, importance sampling is used to generate samples:
\[\hat{p}(x)=q_{\theta}(x)\,\frac{D_{\theta}(x)}{1-D_{\theta}(x)}\]
Essentially, we redistribute the probability mass according to the classifier to approximate the real distribution, as the sketch after this list shows. This approach is simple, but has two major issues:
- \(q_{\theta}\) must cover the support of \(p\), otherwise \(\hat{p}\) isn't well defined.
- If \(q_{\theta}\) has small values where \(p\) is large, then \(D_{\theta}\approx1\). In turn, the importance weighting goes to infinity, which leads to very high-variance sampling. This means that \(q_{\theta}\) is limited to already being very similar to \(p\).
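A minimal sketch of the importance weighting on a 1-D grid, assuming access to the exact densities so the optimal classifier can be written in closed form (in practice \(D_{\theta}\) is learned from samples):

```python
import numpy as np

xs = np.linspace(-5, 5, 1001)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(xs, -1.5, 0.7) + 0.5 * gauss(xs, 1.5, 0.7)  # target density
q = gauss(xs, 0.0, 2.0)                                      # fixed proposal

# Optimal classifier between real (p) and generated (q) samples:
D = p / (p + q)
# Importance weights D / (1 - D) = p / q recover the target density:
p_hat = q * D / (1.0 - D)

print("max reconstruction error:", np.max(np.abs(p_hat - p)))
```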
#### Generative Adversarial Networks
The [[Generative Adversarial Network]] (GAN) approach utilizes the classifier approach as well. But instead of using the classifier for importance sampling, it uses it to train the generative model. The generative model is trained to generate real-looking samples by maximizing the output of the classifier. The discriminator model (classifier) is trained to identify generated samples by minimizing its output for those samples. So the two models have adversarial roles and losses.
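A minimal sketch of one generator/discriminator update on toy 2-D data (architectures, latent size and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 2) + 3.0  # stand-in empirical batch
z = torch.randn(32, 8)           # latent noise

# Discriminator step: push D(real) -> 1 and D(G(z)) -> 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool the discriminator, push D(G(z)) -> 1.
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
print(f"loss_d={loss_d.item():.3f}  loss_g={loss_g.item():.3f}")
```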