Evidence Lower Bound
Based on: https://mbernste.github.io/posts/elbo/
The Evidence Lower Bound (ELBO), also called the Variational Lower Bound, is a lower bound on the log-likelihood of observed data. It is important for tasks such as [[Expectation Maximization]] and [[Variational Inference]].
The VLB is used for problems that involve a hidden (latent) variable. We observe samples \(x\sim p(x)\), but we assume there is another, unobserved random variable \(z\) with distribution \(p(z)\), such that together they form the joint distribution \(p(x,z)\). Since \(z\) remains unobserved, we use a parameterized encoding model \(q_{\Phi}(z|x)\) and decoding model \(p_{\theta}(x|z)\) and try to find the most likely parameters for those models given our observations.
ELBO is also used in Variational Autoencoders. There, we assume a very simple latent prior distribution, like a Gaussian. Then we train the model to learn the variational posterior \(q_{\Phi}(z|x)\) (encoder part) to approximate the theoretical true posterior \(p_{\theta}(z|x)\). The encoder network parameterized by \(\Phi\) maps the input data \(x\) to a distribution over the latent variables \(z\). In parallel, we train the model to maximize the log-likelihood of \(p_{\theta}(x|z)\) for a \(z\) sampled from the encoding \(q_{\Phi}(z|x)\).
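As a concrete sketch of this setup (illustrative only; assuming PyTorch, a Gaussian encoder, a Bernoulli decoder over binary data, and a standard normal prior; the class name `TinyVAE` and all layer sizes are made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_Phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_Phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p_theta(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: one-sample Monte Carlo estimate of -E_q[log p_theta(x|z)]
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_Phi(z|x) || p(z)) against the standard normal prior, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Training minimizes `negative_elbo`, which is the same as maximizing the lower bound derived in the following sections.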
Derivation
The evidence is defined as the marginal log-likelihood of \(x\) for some fixed parameters of our decoding model:
$$
\text{evidence} := \log{p_{\theta}(x)}
$$
We want the modeled distribution \(p_{\theta}(x)\) to be as similar as possible to the true distribution \(p(x)\), so the evidence our model assigns to each observed sample should be as high as possible.
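Stated as an objective (the usual maximum-likelihood reading of this, not spelled out explicitly in the source), for observations \(x^{(1)},\dots,x^{(n)}\) we look for
$$
\theta^{*} = \arg\max_{\theta}\sum_{i=1}^{n}\log{p_{\theta}\left(x^{(i)}\right)}
$$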
1.) Evidence from a Decoding Model
The conditional probability \(p_{\theta}(x|z)\) can be seen as a stochastic function, mapping samples \(z\sim p(z)\) probabilistically onto \(p(x)\) (e.g. using a neural network). The evidence is then the probability density of \(x\) when generated from our model with parameters \(\theta\), marginalized over all possible inputs \(z\).
$$
\begin{align}
\log{p_{\theta}(x)} &= \log{\int_{z}p_{\theta}(x, z)\,dz} \\
&= \log{\int_{z} p_{\theta}(x|z)\,p(z)\,dz}
\end{align}
$$
We have one problem: we don't know the latent space yet, and the integration over \(z\) is intractable in general.
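To make the integral concrete, here is a toy numerical sketch (my own example, not from the source): with prior \(p(z)=\mathcal{N}(0,1)\) and a deliberately trivial "decoder" \(p_{\theta}(x|z)=\mathcal{N}(z,\sigma^{2})\), the evidence can be approximated by naive Monte Carlo over the prior and compared against the closed-form marginal \(\mathcal{N}(x;0,1+\sigma^{2})\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3                          # decoder noise scale and one observed sample

z = rng.standard_normal(100_000)             # z_k ~ p(z)
mc_evidence = norm.pdf(x, loc=z, scale=sigma).mean()   # (1/N) sum_k p_theta(x | z_k)
exact_evidence = norm.pdf(x, loc=0.0, scale=np.sqrt(1 + sigma**2))

print(np.log(mc_evidence), np.log(exact_evidence))     # both approximate log p_theta(x)
```

For a neural decoder there is no closed form, and in high dimensions most prior samples land where \(p_{\theta}(x|z)\) is essentially zero, so this naive estimator needs an absurd number of samples. That is the motivation for the encoding model below.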
2.) Latent Space from an Encoding Model
Since we need to learn the latent space \(p(z)\) from observations in the sample space \(p(x)\), we also introduce a model \(q_{\Phi}(z|x)\) that probabilistically maps samples into the latent space. We can now reformulate the evidence in terms of this encoding model, using [[Jensen's Inequality]]:
$$
\begin{align}
\log{p_{\theta}(x)} &= \log{\int_{z} q_{\Phi}(z|x)\,\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}\,dz} \\
&= \log{\mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}\right]} \\
&\geq \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}}\right] =: \text{ELBO}
\end{align}
$$
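In the same Gaussian toy model (again my own illustration), the bound can be checked numerically: sample from an arbitrary encoder \(q_{\Phi}(z|x)=\mathcal{N}(m,s^{2})\), average \(\log{p_{\theta}(x,z)}-\log{q_{\Phi}(z|x)}\), and compare with the exact log-evidence:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3
m, s = 0.9, 0.5                                  # illustrative encoder parameters

z = rng.normal(m, s, size=100_000)               # z_k ~ q_Phi(z|x)
log_joint = norm.logpdf(x, loc=z, scale=sigma) + norm.logpdf(z)   # log p_theta(x, z_k)
log_q = norm.logpdf(z, loc=m, scale=s)                            # log q_Phi(z_k|x)
elbo = np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(1 + sigma**2))
print(elbo, log_evidence)                        # ELBO stays below log p_theta(x)
```

Whatever \(m\) and \(s\) we pick, the estimate stays below \(\log{p_{\theta}(x)}\); variational inference consists of pushing the bound up by adjusting \(\Phi\) (and \(\theta\)).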
So although we can't calculate the evidence directly, with the help of the encoding model we can find a lower bound on it. Reformulating a bit further gives us:
$$
\text{ELBO} = \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x,z)}\right] + H\left[q_{\Phi}(z|x)\right]
$$
^9dea44
The first term would be maximized if \(q_{\Phi}(z|x)\) were just a Dirac delta at the maximum of \(p_{\theta}(x,z)\). But the differential entropy of a Dirac is negatively infinite, so the second term would then be at its minimum. The goal of maximizing the lower bound is finding an optimized trade-off: a latent space where the distribution is…
- narrow enough that a highly likely \(z\) is mapped onto a highly likely \(x\).
- wide enough that the latent space supports the informational content (entropy) of the sample space (and does not always map to one point); the sketch below makes this trade-off concrete.
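A closed-form sketch of this trade-off in the Gaussian toy model (my own illustration): for an encoder \(q_{\Phi}(z|x)=\mathcal{N}(m,s^{2})\), both ELBO terms can be written analytically, and sweeping the width \(s\) shows the maximum sits at an intermediate value, here exactly the true posterior standard deviation:

```python
import numpy as np

sigma, x = 0.5, 1.3
m = x / (1 + sigma**2)                       # fix m at the true posterior mean
s = np.linspace(0.05, 2.0, 200)              # sweep the encoder width

# E_q[log p_theta(x,z)] = E_q[log p_theta(x|z)] + E_q[log p(z)], both closed-form
exp_log_joint = (-0.5 * np.log(2 * np.pi * sigma**2) - ((x - m)**2 + s**2) / (2 * sigma**2)
                 - 0.5 * np.log(2 * np.pi) - (m**2 + s**2) / 2)
entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)    # H[q_Phi(z|x)] for a Gaussian
elbo = exp_log_joint + entropy

print(s[np.argmax(elbo)])   # ~0.447 = sqrt(sigma^2 / (1 + sigma^2)), the true posterior std
```

Too small an \(s\) sends the entropy term to minus infinity, too large an \(s\) drags down the expected log-joint, and the bound is maximized exactly where \(q_{\Phi}(z|x)\) matches the true posterior.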
A common reformulation of the lower bound looks like this: ^88f689
$$
\text{ELBO} = \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)}\right] - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z)\right)
$$
^1e31b4
We see how maximizing the lower bound increases the likelihood of reconstructing \(x\) from our encoded latent variable \(z\) and forces the approximate latent distribution \(q_{\Phi}(z|x)\) toward the latent prior \(p(z)\).
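Filling in the algebra between the two forms above (a standard manipulation): split the joint as \(p_{\theta}(x,z)=p_{\theta}(x|z)\,p(z)\) and absorb the entropy term into the KL divergence:
$$
\begin{align}
\mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x,z)}\right] + H\left[q_{\Phi}(z|x)\right]
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)} + \log{p(z)} - \log{q_{\Phi}(z|x)}\right] \\
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)}\right] - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z)\right)
\end{align}
$$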
3.) Alternative Formulation: Quantifying the Difference
The ELBO now gives us a lower bound on the marginal log-likelihood of an observation when modeled by our encoding parameters \(\Phi\) and decoding parameters \(\theta\). We can also quantify the difference between this lower bound and the true evidence:
$$
\log{p_{\theta}(x)} - \text{ELBO} = D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
$$
The difference between our lower bound and the actual evidence is exactly the Kullback-Leibler divergence between the approximate posterior of the encoding model and the true posterior of the decoding model. This gives us the alternative formulation:
$$
\text{ELBO} = \log{p_{\theta}(x)} - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
$$
Maximizing the ELBO in this formulation means we maximize the marginal likelihood of \(x\) under our model while driving the encoder's approximate posterior toward the decoder's true posterior (the decoder reconstructs \(x\) from the same \(z\) that the encoder maps \(x\) to).
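The identity behind this can be derived in a few lines (standard, using \(p_{\theta}(z|x)=p_{\theta}(x,z)/p_{\theta}(x)\)):
$$
\begin{align}
D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{\frac{q_{\Phi}(z|x)}{p_{\theta}(z|x)}}\right] \\
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{q_{\Phi}(z|x)} - \log{p_{\theta}(x,z)}\right] + \log{p_{\theta}(x)} \\
&= \log{p_{\theta}(x)} - \text{ELBO}
\end{align}
$$
Since the KL divergence is non-negative, this also shows again that the ELBO never exceeds the evidence, with equality exactly when \(q_{\Phi}(z|x)=p_{\theta}(z|x)\).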