Evidence Lower Bound
Based on: https://mbernste.github.io/posts/elbo/
The Evidence Lower Bound (ELBO), also called the Variational Lower Bound, is a lower bound on the log-likelihood of observed data. It is important for tasks such as [[Expectation Maximization]] and [[Variational Inference]].
The VLB is used for problems that involve a hidden (latent) variable. We observe samples \(x\sim p(x)\), but we assume there is another, unobserved random variable \(z\) with distribution \(p(z)\), such that together they form the joint distribution \(p(x,z)\). Since \(z\) remains unobserved, we use a parameterized encoding model \(q_{\Phi}(z|x)\) and decoding model \(p_{\theta}(x|z)\) and try to find the most likely parameters for those models given our observations.
ELBO is also used in Variational Autoencoders. There, we assume a very simple latent prior distribution, like a Gaussian. Then we train the model to learn the variational posterior \(q_{\Phi}(z|x)\) (encoder part) to approximate the theoretical true posterior \(p_{\theta}(z|x)\). The encoder network parameterized by \(\Phi\) maps the input data \(x\) to a distribution over the latent variables \(z\). In parallel, we train the model to maximize the log-likelihood of \(p_{\theta}(x|z)\) for a \(z\) sampled from the encoding \(q_{\Phi}(z|x)\).
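As a concrete sketch of this setup (illustrative only; assuming PyTorch, a Gaussian encoder, a Bernoulli decoder over binary data, and a standard normal prior; the class name `TinyVAE` and all layer sizes are made up for the example):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q_Phi(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q_Phi(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # logits of p_theta(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: one-sample Monte Carlo estimate of -E_q[log p_theta(x|z)]
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_Phi(z|x) || p(z)) against the standard normal prior, in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Training minimizes `negative_elbo`, which is the same as maximizing the lower bound derived in the following sections.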
Derivation
The evidence is defined as the marginal log-likelihood of \(x\) for some fixed parameters of our decoding model:
$$
\text{evidence} := \log{p_{\theta}(x)}
$$
We want the modeled distribution \(p_{\theta}(x)\) to be as similar as possible to the true distribution \(p(x)\), so the evidence our model assigns to each observed sample should be as high as possible.
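Stated as an objective (the usual maximum-likelihood reading of this, not spelled out explicitly in the source), for observations \(x^{(1)},\dots,x^{(n)}\) we look for
$$
\theta^{*} = \arg\max_{\theta}\sum_{i=1}^{n}\log{p_{\theta}\left(x^{(i)}\right)}
$$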
1.) Evidence from a Decoding Model
The conditional probability \(p_{\theta}(x|z)\) can be seen as a stochastic function, mapping samples \(z\sim p(z)\) probabilistically onto \(p(x)\) (e.g. using a neural network). The evidence is then the probability density of \(x\) when generated from our model with parameters \(\theta\), marginalized over all possible inputs \(z\).
$$
\begin{align}
\log{p_{\theta}(x)} &= \log{\int_{z}p_{\theta}(x, z)\,dz} \\
&= \log{\int_{z} p_{\theta}(x|z)\,p(z)\,dz}
\end{align}
$$
We have one problem: we don't know the latent space yet, and the integration over \(z\) is intractable in general.
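To make the integral concrete, here is a toy numerical sketch (my own example, not from the source): with prior \(p(z)=\mathcal{N}(0,1)\) and a deliberately trivial "decoder" \(p_{\theta}(x|z)=\mathcal{N}(z,\sigma^{2})\), the evidence can be approximated by naive Monte Carlo over the prior and compared against the closed-form marginal \(\mathcal{N}(x;0,1+\sigma^{2})\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3                          # decoder noise scale and one observed sample

z = rng.standard_normal(100_000)             # z_k ~ p(z)
mc_evidence = norm.pdf(x, loc=z, scale=sigma).mean()   # (1/N) sum_k p_theta(x | z_k)
exact_evidence = norm.pdf(x, loc=0.0, scale=np.sqrt(1 + sigma**2))

print(np.log(mc_evidence), np.log(exact_evidence))     # both approximate log p_theta(x)
```

For a neural decoder there is no closed form, and in high dimensions most prior samples land where \(p_{\theta}(x|z)\) is essentially zero, so this naive estimator needs an absurd number of samples. That is the motivation for the encoding model below.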
2.) Latent Space from an Encoding Model
Since we need to learn the latent space \(p(z)\) from observations in the sample space \(p(x)\), we also introduce a model \(q_{\Phi}(z|x)\) that probabilistically maps samples into the latent space. We can now reformulate the evidence in terms of this encoding model, using [[Jensen's Inequality]]:
$$
\begin{align}
\log{p_{\theta}(x)} &= \log{\int_{z} q_{\Phi}(z|x)\,\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}\,dz} \\
&= \log{\mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}\right]} \\
&\geq \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{\frac{p_{\theta}(x,z)}{q_{\Phi}(z|x)}}\right] =: \text{ELBO}
\end{align}
$$
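In the same Gaussian toy model (again my own illustration), the bound can be checked numerically: sample from an arbitrary encoder \(q_{\Phi}(z|x)=\mathcal{N}(m,s^{2})\), average \(\log{p_{\theta}(x,z)}-\log{q_{\Phi}(z|x)}\), and compare with the exact log-evidence:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, x = 0.5, 1.3
m, s = 0.9, 0.5                                  # illustrative encoder parameters

z = rng.normal(m, s, size=100_000)               # z_k ~ q_Phi(z|x)
log_joint = norm.logpdf(x, loc=z, scale=sigma) + norm.logpdf(z)   # log p_theta(x, z_k)
log_q = norm.logpdf(z, loc=m, scale=s)                            # log q_Phi(z_k|x)
elbo = np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(1 + sigma**2))
print(elbo, log_evidence)                        # ELBO stays below log p_theta(x)
```

Whatever \(m\) and \(s\) we pick, the estimate stays below \(\log{p_{\theta}(x)}\); variational inference consists of pushing the bound up by adjusting \(\Phi\) (and \(\theta\)).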
So although we can't calculate the evidence directly, with the help of the encoding model we can find a lower bound on it. Reformulating a bit further gives us:
$$
\text{ELBO} = \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x,z)}\right] + H\left[q_{\Phi}(z|x)\right]
$$
^9dea44
The first term would be maximized if \(q_{\Phi}(z|x)\) were just a Dirac delta at the maximum of \(p_{\theta}(x,z)\). But the differential entropy of a Dirac is negatively infinite, so the second term would then be at its minimum. The goal of maximizing the lower bound is finding an optimized trade-off: a latent space where the distribution is…
- narrow enough that a highly likely \(z\) is mapped onto a highly likely \(x\).
- wide enough that the latent space supports the informational content (entropy) of the sample space (and does not always map to one point); the sketch below makes this trade-off concrete.
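A closed-form sketch of this trade-off in the Gaussian toy model (my own illustration): for an encoder \(q_{\Phi}(z|x)=\mathcal{N}(m,s^{2})\), both ELBO terms can be written analytically, and sweeping the width \(s\) shows the maximum sits at an intermediate value, here exactly the true posterior standard deviation:

```python
import numpy as np

sigma, x = 0.5, 1.3
m = x / (1 + sigma**2)                       # fix m at the true posterior mean
s = np.linspace(0.05, 2.0, 200)              # sweep the encoder width

# E_q[log p_theta(x,z)] = E_q[log p_theta(x|z)] + E_q[log p(z)], both closed-form
exp_log_joint = (-0.5 * np.log(2 * np.pi * sigma**2) - ((x - m)**2 + s**2) / (2 * sigma**2)
                 - 0.5 * np.log(2 * np.pi) - (m**2 + s**2) / 2)
entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)    # H[q_Phi(z|x)] for a Gaussian
elbo = exp_log_joint + entropy

print(s[np.argmax(elbo)])   # ~0.447 = sqrt(sigma^2 / (1 + sigma^2)), the true posterior std
```

Too small an \(s\) sends the entropy term to minus infinity, too large an \(s\) drags down the expected log-joint, and the bound is maximized exactly where \(q_{\Phi}(z|x)\) matches the true posterior.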
A common reformulation of the lower bound looks like this: ^88f689
$$
\text{ELBO} = \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)}\right] - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z)\right)
$$
^1e31b4
We see how maximizing the lower bound increases the likelihood of reconstructing \(x\) from our encoded latent variable \(z\) and forces the approximate latent distribution \(q_{\Phi}(z|x)\) toward the latent prior \(p(z)\).
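Filling in the algebra between the two forms above (a standard manipulation): split the joint as \(p_{\theta}(x,z)=p_{\theta}(x|z)\,p(z)\) and absorb the entropy term into the KL divergence:
$$
\begin{align}
\mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x,z)}\right] + H\left[q_{\Phi}(z|x)\right]
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)} + \log{p(z)} - \log{q_{\Phi}(z|x)}\right] \\
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{p_{\theta}(x|z)}\right] - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p(z)\right)
\end{align}
$$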
3.) Alternative Formulation: Quantifying the Difference
The ELBO now gives us a lower bound on the marginal log-likelihood of an observation when modeled by our encoding parameters \(\Phi\) and decoding parameters \(\theta\). We can also quantify the difference between this lower bound and the true evidence:
$$
\log{p_{\theta}(x)} - \text{ELBO} = D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
$$
The difference between our lower bound and the actual evidence is exactly the Kullback-Leibler divergence between the approximate posterior of the encoding model and the true posterior of the decoding model. This gives us the alternative formulation:
$$
\text{ELBO} = \log{p_{\theta}(x)} - D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
$$
Maximizing the ELBO in this formulation means we maximize the marginal likelihood of \(x\) under our model while driving the encoder's approximate posterior toward the decoder's true posterior (the decoder reconstructs \(x\) from the same \(z\) that the encoder maps \(x\) to).
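The identity behind this can be derived in a few lines (standard, using \(p_{\theta}(z|x)=p_{\theta}(x,z)/p_{\theta}(x)\)):
$$
\begin{align}
D_{KL}\left(q_{\Phi}(z|x)\,\|\,p_{\theta}(z|x)\right)
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{\frac{q_{\Phi}(z|x)}{p_{\theta}(z|x)}}\right] \\
&= \mathbb{E}_{z\sim q_{\Phi}(z|x)}\left[\log{q_{\Phi}(z|x)} - \log{p_{\theta}(x,z)}\right] + \log{p_{\theta}(x)} \\
&= \log{p_{\theta}(x)} - \text{ELBO}
\end{align}
$$
Since the KL divergence is non-negative, this also shows again that the ELBO never exceeds the evidence, with equality exactly when \(q_{\Phi}(z|x)=p_{\theta}(z|x)\).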