source: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Kullback–Leibler divergence
Statistical measure of how different two probability distributions \(P\) and \(Q\) are from each other.
It is the expected excess information content when sampling from distribution \(P\) while assuming distribution \(Q\): \(D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]\) (see Entropy#Relative entropy).
It's not a metric, since it satisfies neither the symmetry nor the triangle inequality required of a metric, but a divergence, a generalization of the squared distance between two distributions.
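The asymmetry is easy to check numerically. A minimal sketch with numpy, using two arbitrary example distributions over three outcomes:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence: sum_i p_i * log(p_i / q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Arbitrary example distributions (must sum to 1, all entries > 0).
P = [0.5, 0.4, 0.1]
Q = [0.3, 0.3, 0.4]

forward = kl_divergence(P, Q)  # D_KL(P || Q)
reverse = kl_divergence(Q, P)  # D_KL(Q || P)
print(forward, reverse)        # the two values differ: KL is not symmetric
```

Both values are non-negative and zero only when the two distributions coincide, but swapping the arguments changes the result.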
The KL divergence (or some form of it) is typically used in models with latent data representations like Variational Autoencoders and Diffusion Models.
Asymmetry

The asymmetry of the KL divergence can be seen in the figure above. The multi-modal distribution \(p\) is to be modeled using the uni-modal distribution \(q\).
When optimizing the forward KL divergence \(D_{KL}(p \| q)\), \(q\) is penalized more heavily for having low values where \(p\) is high than for having high values where \(p\) is low. Thus, \(q\) is forced to have some energy at every mode of \(p\), and is barely penalized for having a lot of energy where \(p\) is near zero. The forward KL divergence is therefore sometimes called mass-covering.
Optimizing the reverse KL divergence \(D_{KL}(q \| p)\) does the opposite: \(q\) is penalized more heavily for having high values where \(p\) is low than for having low values where \(p\) is high. So \(q\) is forced to concentrate all its energy at a single mode of \(p\), even if that leaves no energy at the other modes of \(p\). Thus, the reverse KL divergence is called mode-seeking.
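The two behaviors can be reproduced on a discretized toy problem. A sketch under illustrative assumptions: \(p\) is a bimodal mixture of two Gaussians with modes at \(\pm 2\), \(q\) is a single Gaussian of the same width, and we scan over candidate means for \(q\) to find the minimizer of each divergence.

```python
import numpy as np

x = np.linspace(-5, 5, 2001)  # evaluation grid

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b):
    """Discrete KL divergence between two normalized grids."""
    return float(np.sum(a * np.log(a / b)))

# Bimodal target p: equal mixture of N(-2, 0.5) and N(2, 0.5) (illustrative choice).
p = 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)
p /= p.sum()  # normalize to a discrete distribution on the grid

mus = np.linspace(-3, 3, 121)  # candidate means for the unimodal q
forward = []  # D_KL(p || q): mass-covering
reverse = []  # D_KL(q || p): mode-seeking
for mu in mus:
    q = normal_pdf(x, mu, 0.5)
    q /= q.sum()
    forward.append(kl(p, q))
    reverse.append(kl(q, p))

mu_forward = mus[int(np.argmin(forward))]  # lands between the modes (near 0)
mu_reverse = mus[int(np.argmin(reverse))]  # locks onto one mode (near -2 or 2)
print(mu_forward, mu_reverse)
```

The forward-KL-optimal mean sits between the two modes (covering both with some energy), while the reverse-KL-optimal mean sits on top of one mode and ignores the other, matching the description above.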