Entropy
The entropy of a distribution is the average #Information Content of the events of a random variable that follows that distribution. When using base 2, it measures the smallest average number of bits required to encode events drawn from the distribution.
For a continuous variable \(X\) with density \(p(x)\), the entropy is the expected information content:
$$H(X)=\mathbb{E}_{x\sim p}\left[I(x)\right]=-\int p(x)\log p(x)\,dx$$
Entropy is high for distributions where all events have a similar probability, so no event is particularly unsurprising (uniform distribution, high-variance Gaussian). It is low for sharply peaked distributions, where a few events are very likely and therefore unsurprising to observe (delta function, low-variance Gaussian).
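As a minimal sketch of this (assuming discrete distributions and NumPy; `entropy_bits` is just an illustrative helper), a uniform distribution over 8 events reaches the maximal 3 bits, while a sharply peaked one stays well below:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution p (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(p[nonzero]))

uniform = np.full(8, 1 / 8)              # every event equally likely
peaked  = np.array([0.93] + [0.01] * 7)  # one event dominates

print(entropy_bits(uniform))  # 3.0 bits  -- maximal for 8 events
print(entropy_bits(peaked))   # ~0.56 bits -- far less surprise on average
```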
Information Content
The information content \(I(\omega)\) (also called self-information) quantifies how surprising it is to observe an event \(\omega\). It can also be seen as the number of bits (in base 2) needed to encode that event.
Derivation
The definition of the information content was chosen by Claude Shannon to follow three axioms:
- An event with probability 100% is perfectly unsurprising and yields no information.
- The less probable an event is, the more surprising it is and the more information it yields.
- If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
Writing the information content as a function of the probability, \(I(A)=f(P(A))\), the first axiom requires that \(P(A)=1\iff f(P(A))=I(A)=0\).
To satisfy the second axiom, \(f\) must decrease monotonically with the probability: the smaller \(P(A)\), the larger \(I(A)\).
To satisfy the third axiom, we need a function of the probability that, for independent events \(A\) and \(B\) with \(P(A\cap B)=P(A)\cdot P(B)\), satisfies
$$f\big(P(A)\cdot P(B)\big)=f\big(P(A)\big)+f\big(P(B)\big)=I(A)+I(B)\tag{2}$$
By Cauchy's logarithmic functional equation, the only monotonic functions that satisfy the form \(f(x\cdot y)=f(x)+f(y)\) of equation \((2)\) are logarithms, up to a constant factor.
Putting these properties together, the information content is uniquely defined (up to the choice of base, here base 2 to measure in bits) as
$$I(\omega)=-\log_2 P(\omega)=\log_2\frac{1}{P(\omega)}$$
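As a quick illustration (again assuming NumPy; `information_content` is an illustrative helper, not a library function), certain events carry no information, rare events carry many bits, and independent events add up:

```python
import numpy as np

def information_content(p):
    """Self-information in bits of an event with probability p."""
    return -np.log2(p)

print(information_content(0.5))   # 1.0 bit   -- a fair coin flip
print(information_content(1.0))   # 0.0 bits  -- a certain event is unsurprising
print(information_content(0.01))  # ~6.64 bits -- rare events are very surprising

# Additivity for independent events: P(A and B) = P(A) * P(B)
p_a, p_b = 0.5, 0.25
assert np.isclose(information_content(p_a * p_b),
                  information_content(p_a) + information_content(p_b))
```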
Cross-Entropy
The cross-entropy \(H(P,Q)\) is the expected information content when drawing from the distribution \(P(x)\) but assuming the model \(Q(x)\). It can also be seen as the expected number of bits needed to identify events \(x\) drawn from \(P(x)\) when using an encoding optimized for \(Q(x)\). It is defined as:
$$H(P,Q)=\mathbb{E}_{x\sim P}\left[-\log Q(x)\right]=-\sum_x P(x)\log Q(x)$$
The cross-entropy is always greater than or equal to \(H(P)\), as assuming the wrong model can only add to the uncertainty.
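A small sketch of this (discrete case, NumPy assumed; `cross_entropy_bits` is an illustrative helper) showing that encoding with a mismatched model \(Q\) costs more bits on average than \(H(P)\):

```python
import numpy as np

def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits for discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nonzero = p > 0
    return -np.sum(p[nonzero] * np.log2(q[nonzero]))

p = np.array([0.7, 0.2, 0.1])   # data distribution P
q = np.array([0.4, 0.4, 0.2])   # mismatched model Q

print(cross_entropy_bits(p, p))  # ~1.16 bits: H(P), the cross-entropy with itself
print(cross_entropy_bits(p, q))  # ~1.42 bits: H(P, Q), always >= H(P)
```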
Relative Entropy
We can also look at how much entropy is added by assuming the wrong model, giving us the relative entropy
$$D_{KL}(P\,\|\,Q)=H(P,Q)-H(P)=\sum_x P(x)\log\frac{P(x)}{Q(x)},$$
which is also known as the Kullback-Leibler divergence.
In theory, we want to minimize the relative entropy between the empirical distribution \(P\) and our modeled distribution \(Q\). However, for most machine learning tasks, we don't have any influence on the empirical distribution \(P\), so \(H(P)\) is a fixed constant and minimizing the relative entropy is the same as minimizing the cross-entropy.
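A short sketch (same assumed discrete example and NumPy as above; `kl_divergence_bits` is an illustrative helper) verifying that the relative entropy is exactly the gap \(H(P,Q)-H(P)\), which is why minimizing one minimizes the other when \(H(P)\) is fixed:

```python
import numpy as np

def kl_divergence_bits(p, q):
    """Relative entropy D_KL(P || Q) in bits for discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nonzero = p > 0
    return np.sum(p[nonzero] * np.log2(p[nonzero] / q[nonzero]))

p = np.array([0.7, 0.2, 0.1])   # empirical distribution (fixed by the data)
q = np.array([0.4, 0.4, 0.2])   # model distribution (what we can change)

entropy_p     = -np.sum(p * np.log2(p))  # H(P), constant w.r.t. the model
cross_entropy = -np.sum(p * np.log2(q))  # H(P, Q)

print(kl_divergence_bits(p, q))   # ~0.27 bits, never negative
print(cross_entropy - entropy_p)  # same value: D_KL(P || Q) = H(P, Q) - H(P)
```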
The KL divergence is relevant for models that use latent data representations, like Variational Autoencoders and Diffusion Models (each noise level is a latent representation).