Wasserstein metric

The Earth Mover distance, also called Wasserstein metric, provides a measure of how much work it takes to move the energy of one distribution to the energy of the other distribution. Each distribution can be thought of as a pile of dirt. The Wasserstein distance is the minimum amount of work that it takes to move the one pile over to the other pile.

Formalization¶

Pasted image 20240812220433.png

Given two distributions \(\mu(x)\) and \(\nu(y)\). We can now define an infinite set of joint distributions \(\gamma(x,y)\in\Gamma(\mu,\nu)\) that have \(\mu\) and \(\nu\) as their marginals (Coupling). Each of those joint distributions \(\gamma(x,y)\) is a different transport plan that describes how much probability mass to move from point \(x\) to point \(y\). We can now define the work needed to move the mass by multiplying the mass moved by the distance \(d\) between the points and integrating over all the points:

\[ \int\int d(x,y)\gamma(x,y)\space\text{d}x\space\text{d}y=\int d(x,y)\space\text{d}\gamma(x,y)=\mathbb{E}_{(x,y)\sim\gamma}[d(x,y)] \]

This gives us the total amount of work for each transport plan \(\gamma\). The earth mover distance is now the minimal work needed across all possible transport plans (smallest lower-bound, infimum):

\[ \text{EMD}(\mu,\nu)=\inf_{\gamma\in\Gamma(\mu,\nu)}\mathbb{E}_{(x,y)\sim\gamma}[d(x,y)] \]

The Wasserstein metric is defined exactly the same, but sometimes with respect to the moments of the distributions. If \(\mu\) and \(\nu\) have \(p\) moments, then the Wasserstein distance is given as:

\[ W_{p}(\mu,\nu)=\inf_{\gamma\in\Gamma(\mu,\nu)}\mathbb{E}_{(x,y)\sim\gamma}[d(x,y)^{p}]^{\frac{1}{p}} \]

Dual Representation¶

By representing the Wasserstein distance as a problem in [[Linear Programming]], we can obtain its dual representation:

\[ W_{1}(\mu,\nu)=\frac{1}{K}\sup_{||f||_{L}\leq K}\mathbb{E}_{x\sim\mu}[f(x)]-\mathbb{E}_{y\sim\nu}[f(y)] \]

where \(f\) is a K-Lipschitz function, meaning its absolute slope is bounded by the finite constant \(K\). The Wasserstein distance now is the lowest upper-bound (suprenum) for the difference of the expected evaluation of \(f\) over all values in each distribution, for all functions K-Lipschitz functions \(f\).