
[!tip] TODO

Verify and extend this from @kingmaIntroductionVariationalAutoencoders2019.

Based on @kingmaAutoEncodingVariationalBayes2022

Variational Autoencoders (VAEs) are a special form of the Autoencoder (AE) architecture. The VAE consists of a probabilistic encoder \(q_{\phi}(z|x)\) and a probabilistic decoder \(p_{\theta}(x|z)\). Unlike the standard autoencoder, the VAE does not encode the input directly into a latent vector, but into an input-specific latent distribution (by outputting mean and variance vectors). The decoder is trained to decode samples from this latent distribution back into the input space.

AEs can be seen as a special case of the VAE in which the latent distribution has vanishing variance. The VAE offers several advantages over the standard AE:

  • Control: We can control what the latent distribution should look like. Usually it is pushed towards a normal distribution.
  • Disentanglement: VAEs can deal with Entanglement of Latent Features.
  • Efficiency: The latent distribution can reflect the variance of features in the input space. For example, the variance in appearance of cats might be larger than the variance in appearance of apples.

Architecture

Drawing 2024-07-25 13.15.15.excalidraw#^group=UUA739VelYo7bctW4iSex

The encoder is trained to output the conditional mean \(\mu\) and standard deviation \(\sigma\) of the latent distribution given the input. The decoder is trained to decode a sample \(z\) from that distribution so as to reconstruct the input.
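
A minimal sketch of this architecture in PyTorch (my illustration, not from the source; it assumes flattened inputs of size input_dim and predicts \(\log\sigma^{2}\) rather than \(\sigma\) for numerical stability):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs mean and log-variance, the decoder reconstructs x from z."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # conditional mean mu(x)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # conditional log-variance log sigma^2(x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),  # outputs in [0, 1]
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def decode(self, z):
        return self.dec(z)
```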

For a single datapoint \(x^{(i)}\), the ELBO objective is given as:

\[ \mathcal{L}_{{\theta}, {\phi}}(x^{(i)})=\mathbb{E}_{q_{\phi}(z|x^{(i)})}[\log{p_{\theta}(x^{(i)}|z)}]-D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) \]

The left term is very similar to the Autoencoder (AE)#Reconstruction Loss: We try to maximize the log-likelihood of observing \(x^{(i)}\) from our decoder given the latent vector drawn from the encoder distribution. The right term minimizes the Kullback-Leibler divergence between the encoder distribution and the prior distribution of the latent variable \(z\). Usually, the prior is chosen to be a Gaussian. This regularizes the learnt latent distribution, leading to better generalization of the model.

Loss

The VAE maximizes the Evidence Lower Bound (ELBO), thus finding a latent space from which the decoder can reconstruct \(x\) with the highest likelihood, while keeping the latent space as simple as possible. The model assumes a very simple prior latent distribution (usually Gaussian) and maximizes the ELBO (Evidence Lower Bound#^88f689), i.e. it minimizes the negative ELBO as its loss:

\[ \mathcal{L}(\theta,\phi;x^{(i)})=\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log{p_{\theta}(x^{(i)}|z)}\right]-D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) \]

Optimizing this objective trains two sets of parameters at once:

  1. Generative parameters \(\theta\): Maximize the likelihood of reconstructing the input \(x\) from the encoding \(z\).
  2. Variational parameters \(\phi\): Regularize the encoder \(q_{\phi}(z|x)\) towards the simple prior distribution \(p_{\theta}(z)\).
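
A minimal sketch of this loss in code (my illustration; it assumes the VAE module sketched above, a Bernoulli decoder so that binary cross-entropy serves as the reconstruction term, and a standard normal prior, for which the KL term has the closed form derived further below):

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    """Negative ELBO for one minibatch: reconstruction term plus KL regularizer."""
    # E_q[log p(x|z)], approximated with a single latent sample (see the next section),
    # written as a Bernoulli log-likelihood, i.e. binary cross-entropy on the reconstruction.
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```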

Gradient of Reconstruction Term

There is one big issue: the gradient of the first term with respect to \(\phi\) requires differentiating through the sampling distribution \(q_{\phi}(z)\), which is intractable (#^0ef687) and can naively only be approximated using Monte Carlo estimation:

\[ \nabla_{\phi}\mathbb{E}_{q_{\phi}(z)}[f(z)] =\mathbb{E}_{q_{\phi}(z)}\left[f(z)\nabla_{\phi}\log{q_{\phi}(z)}\right] \simeq\frac{1}{L}\sum\limits_{l=1}^{L}f(z^{(l)})\nabla_{\phi}\log{q_{\phi}(z^{(l)})} \]

where \(z^{(l)}\sim q_{\phi}(z)\). This Monte Carlo gradient estimator exhibits very high variance, rendering it impractical for backpropagation. However, we can use the #Reparameterization Trick to externalize the randomness of the sampling and reformulate the first term of the loss as:

\[ \mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log{p_{\theta}(x^{(i)}|z)}\right] \simeq\frac{1}{L}\sum\limits_{l=1}^{L}\log{p_{\theta}(x^{(i)}|g_{\phi}(\epsilon^{(i,l)},x^{(i)}))} \]

where \(\epsilon^{(i,l)}\sim p(\epsilon)\) is drawn from a known, fixed noise distribution and mapped into the latent space by the reparameterization function \(g_{\phi}\). The gradient of this term depends only on the gradient of \(g_{\phi}\), which is tractable (#^8cba6a).

When the loss is summed over a sufficiently large minibatch, we can even use a single latent sample per input sample (\(L=1\)).
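
A minimal sketch of this single-sample (\(L=1\)) estimate for a Gaussian encoder (my illustration; it reuses the module and loss sketched above):

```python
import torch

def reparameterize(mu, logvar):
    """z = g_phi(eps, x) = mu + sigma * eps, with eps ~ N(0, I) sampled outside the graph."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)   # externalized randomness, independent of phi
    return mu + std * eps

# Single-sample Monte Carlo estimate of the reconstruction term:
#   mu, logvar = model.encode(x)
#   z = reparameterize(mu, logvar)
#   x_recon = model.decode(z)
#   loss = negative_elbo(x, x_recon, mu, logvar)
```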

[!tip] Note As we can see, the variational parameters \(\phi\) are ordinary weights and biases that are trained so that \(q_{\phi}(z|x)\) approximates the true posterior \(p_{\theta}(z|x)\). Because of reparameterization, the distribution parameters are not hidden inside the randomness of the sampling, but explicitly represented in \(g_{\phi}\). Hence, we can efficiently perform backpropagation (without high-variance Monte Carlo gradient estimation).

Gradient of Regularization Term

The second term can be computed analytically for many choices of latent distribution, without any sampling: if the prior \(p_{\theta}(z)\) and the encoder distribution \(q_{\phi}(z|x)\) belong to the same simple family (usually Gaussian), the KL divergence is a closed-form function of the distribution parameters \(\mu\) and \(\sigma\). For example, assuming a \(J\)-dimensional Gaussian posterior \(q_{\phi}(z|x)=\mathcal{N}(\mu,\sigma^{2}I)\) and a standard normal prior \(p_{\theta}(z)=\mathcal{N}(0,I)\):

\[ \begin{align} -D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z)) &= \int_{z}q_{\phi}(z|x^{(i)})\left(\log{p_{\theta}(z)}-\log{q_{\phi}(z|x^{(i)})}\right)dz \\ &= \frac{1}{2}\sum\limits_{j=1}^{J}\left(1+\log((\sigma_{j})^2)-(\mu_{j})^{2}-(\sigma_{j})^{2}\right) \end{align} \]
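
For reference (my addition), the per-dimension term can be checked directly from the Gaussian log-densities of \(q=\mathcal{N}(\mu_{j},\sigma_{j}^{2})\) and \(p=\mathcal{N}(0,1)\), using \(\mathbb{E}_{q}[(z-\mu_{j})^{2}]=\sigma_{j}^{2}\) and \(\mathbb{E}_{q}[z^{2}]=\mu_{j}^{2}+\sigma_{j}^{2}\):

\[ \begin{align} D_{KL}(q\,||\,p) &= \mathbb{E}_{q}\left[-\tfrac{1}{2}\log(2\pi\sigma_{j}^{2})-\tfrac{(z-\mu_{j})^{2}}{2\sigma_{j}^{2}}+\tfrac{1}{2}\log(2\pi)+\tfrac{z^{2}}{2}\right] \\ &= -\tfrac{1}{2}\log(\sigma_{j}^{2})-\tfrac{1}{2}+\tfrac{\mu_{j}^{2}+\sigma_{j}^{2}}{2} = -\tfrac{1}{2}\left(1+\log((\sigma_{j})^{2})-(\mu_{j})^{2}-(\sigma_{j})^{2}\right) \end{align} \]

which is exactly the negated summand above.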

Reparameterization Trick

The Problem

There is one big issue when we use random sampling within the model and try to use [[Backpropagation]] to compute the gradient with respect to each model parameter. Backpropagation works by finding the gradient of the expected loss with respect to the parameters and propagating it back through the nodes of the network.

Without random sampling, we can formulate the gradient of the expectation as the expectation of the gradient:

\[ \begin{align} \nabla_{\theta}\mathbb{E}_{p(z)}[f_{\theta}(z)] &= \nabla_{\theta}\int_{z}p(z)f_{\theta}(z)\,dz \\ &= \int_{z}p(z)\left[\nabla_{\theta}f_{\theta}(z) \right]dz \\ &= \mathbb{E}_{p(z)}[\nabla_{\theta}f_{\theta}(z)] \end{align} \]

where

  • \(p(z)\) is a known probability density (input space).
  • \(f_{\theta}(z)\) is the output of our network (e.g. loss function)

Since the gradient is a linear operator and \(p(z)\) does not depend on \(\theta\), the gradient of the expectation is equal to the expectation of the gradient. So we can take the expected gradient of the loss and backpropagate it through the network.

Once we use random sampling in the model, we have a problem: the probability density is now also part of the network and thus parametrized by the same parameters \(\theta\). It cannot be treated as a known constant, because it is itself being learned. So with random sampling, when we try to use the same reformulation as before, we get:

\[ \begin{align} \nabla_{\theta}\mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)] &= \nabla_{\theta}\int_{z}p_{\theta}(z)f_{\theta}(z)\,dz \\ &= \int_{z}\nabla_{\theta}[p_{\theta}(z)f_{\theta}(z)]\,dz \\ &= \int_{z}p_{\theta}(z)\nabla_{\theta}f_{\theta}(z)\,dz + \int_{z}f_{\theta}(z)\nabla_{\theta}p_{\theta}(z)\,dz \\ &= \mathbb{E}_{p_{\theta}(z)}[\nabla_{\theta}f_{\theta}(z)] + \int_{z}f_{\theta}(z)\nabla_{\theta}p_{\theta}(z)\,dz \end{align} \]

where

  • \(p_{\theta}(z)\) is the unknown, parametrized distribution we sample from.

In this case, we can't just pull the gradient inside the expectation, because we are left with the extra term \(\int_{z}f_{\theta}(z)\nabla_{\theta}p_{\theta}(z)dz\). And since sampling is not a differentiable operation, we can't deterministically backpropagate the loss through it. (We could estimate this extra term from the observed samples via the score function, as in the REINFORCE algorithm, but this typically has high variance and makes training unstable). ^0ef687
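
A small numerical illustration of this variance gap (my addition; a toy objective \(\mathbb{E}_{z\sim\mathcal{N}(\theta,1)}[z^{2}]\) whose true gradient is \(2\theta\)):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000            # true gradient of E[z^2] w.r.t. theta is 2 * theta = 3.0

# Score-function (REINFORCE) estimator: f(z) * d/dtheta log N(z; theta, 1) = z^2 * (z - theta)
z = rng.normal(theta, 1.0, n)
score_grads = z**2 * (z - theta)

# Pathwise (reparameterized) estimator: z = theta + eps, so d/dtheta f(z) = 2 * (theta + eps)
eps = rng.normal(0.0, 1.0, n)
path_grads = 2 * (theta + eps)

print("score-function: mean", score_grads.mean(), "variance", score_grads.var())
print("pathwise:       mean", path_grads.mean(), "variance", path_grads.var())
# Both means are close to 3.0, but the score-function estimator's variance is an order of magnitude larger.
```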

The Trick

Drawing 2024-07-25 13.15.15.excalidraw#^group=tT07F78BLDghTQOiZhhul

Our goal is to make the backpropagation deterministic again. We do that by reformulating the unknown distribution \(z \sim p_{\theta}(z)\) as a deterministic function \(z=g_{\theta}(\epsilon)\), where \(\epsilon\) is drawn from a known, fixed distribution. Randomness is now an external input to our model.

Doing so allows us to reformulate the gradient without the extra term:

\[ \begin{align} \nabla_{\theta}\mathbb{E}_{p(\epsilon)}[f_{\theta}(g_{\theta}(\epsilon))] &= \nabla_{\theta}\int_{\epsilon}p(\epsilon)f_{\theta}(g_{\theta}(\epsilon))\,d\epsilon \\ &= \int_{\epsilon}p(\epsilon)\left[\nabla_{\theta}f_{\theta}(g_{\theta}(\epsilon)) \right]d\epsilon \\ &= \mathbb{E}_{p(\epsilon)}[\nabla_{\theta}f_{\theta}(g_{\theta}(\epsilon))] \\ &= \mathbb{E}_{p(\epsilon)}\left[ \frac{\partial f_{\theta}(g_{\theta}(\epsilon))}{\partial g_{\theta}(\epsilon)} \cdot \nabla_{\theta} g_{\theta}(\epsilon) \right] \end{align} \]

So we can move the gradient inside the expectation again. When we now calculate the gradient, we apply the chain rule: we take the derivative of \(f_{\theta}\) with respect to \(g_{\theta}\) and multiply it with the derivative of \(g_{\theta}\) with respect to the parameters. In contrast to \(p_{\theta}\), we can actually compute the gradient of \(g_{\theta}\), as it is a deterministic function. The input to this function may be stochastic, but importantly it is stochastically independent of our parameters, so the sampling distribution no longer depends on them. ^8cba6a
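
A minimal sketch of how this looks in an autograd framework (my illustration; gradients reach the distribution parameters because \(\epsilon\) is sampled outside the computation graph):

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)          # distribution parameters (to be learned)
log_sigma = torch.tensor([-1.0], requires_grad=True)

eps = torch.randn(1)                                   # external noise, independent of the parameters
z = mu + torch.exp(log_sigma) * eps                    # z = g(eps): deterministic in mu and sigma
loss = (z ** 2).sum()                                  # some downstream loss f(z)

loss.backward()                                        # chain rule flows through g to mu and log_sigma
print(mu.grad, log_sigma.grad)

# A non-reparameterized sample, e.g. torch.distributions.Normal(mu, sigma).sample(),
# would cut this gradient path (only .rsample() keeps it).
```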

Disentanglement Methods

VAEs are very useful when it comes to Entanglement of Latent Features.

To reconstruct the input from a lower-dimensional representation, a VAE needs to capture the latent structure of the input space in its latent space. If the input space is clearly separable by a few latent features (e.g. images of cats and dogs), the VAE will likely dedicate neurons to separating those latent features. But we cannot be sure that a set of latent neurons in the VAE corresponds to exactly one feature. For example, the same neurons that separate cats from dogs might also separate sunny from rainy backgrounds. This is a problem if we want to use the latent representation in some kind of generative context: if we ask the model to generate an image of a cat, it will always have a sunny background, and for dogs a rainy background. To circumvent this issue, we can employ disentanglement methods that force the VAE to learn a latent representation in which the latent features of the input space are clearly separated.

Disentanglement methods can be separated into supervised methods, where the latent features are known (e.g. as labels), and unsupervised methods, where the latent features are a priori unknown.

Unsupervised Methods

Unsupervised methods don't require labels, learning the structure of the input space implicitly.

Pasted image 20240826164345.png
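
As one illustration (my addition, not necessarily among the methods shown above): a β-VAE-style objective is perhaps the simplest unsupervised approach, reweighting the KL term with a factor \(\beta>1\) so the posterior is pushed harder towards the factorized prior, which empirically encourages disentangled latent dimensions:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term scaled by beta (beta = 1 recovers the plain VAE)."""
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```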

Supervised Methods

Supervised methods use labels, so the latent variables can be explicitly conditioned to correspond to known latent factors.

Pasted image 20240829161204.png

Entanglement Metrics

Entanglement of Latent Features#Entanglement metrics