Unsupervised Guided VAE

Based on @dingGuidedVariationalAutoencoder2020.

Pasted image 20240826153307.png

From the same paper as Supervised Guided VAE. An unsupervised guided VAE is a variant of the Variational Autoencoder that uses [[Principal Component Analysis]] to guide training towards independent latent features. Its goal is to address the Entanglement of Latent Features.

Idea

During training, the input is reconstructed twice: once as in a standard VAE, and once through a more explicit, guided reconstruction. For the latter, the latent vector \(z\) is split into a deformation vector \(z_{\text{def}}\) and a content vector \(z_{\text{cont}}\).

The deformation vector \(z_{\text{def}}\) is used to define a transformation field. For images, for example, an affine transformation grid can be used, which defines the scaling, rotation, and translation of the reconstructed image. These latent features are thus encoded conditionally independently in the units of \(z_{\text{def}}\).

The content vector \(z_{\text{cont}}\) carries the actual content of the output; its units may be conditionally dependent on other latent features. It is decoded via the PCA model and then warped by the transformation field to reconstruct the input.

Since latent units are explicitly dedicated to the transformation field during training, the model is pressured to use those units to encode features that are affected by the transformation (affine => scaling, rotation, translation).
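The affine transformation field described above can be sketched in PyTorch. This is a hedged illustration, not the authors' implementation: the six-parameter `z_def`, the identity-offset initialisation, and the image shape are all assumptions.

```python
import torch
import torch.nn.functional as F

def affine_warp(z_def: torch.Tensor, content_img: torch.Tensor) -> torch.Tensor:
    """Warp content_img (N, C, H, W) with affine parameters z_def (N, 6)."""
    # Start from the identity transform and let z_def encode the deviation,
    # so an all-zero deformation vector leaves the content untouched.
    identity = torch.tensor([1., 0., 0., 0., 1., 0.], device=z_def.device)
    theta = (identity + z_def).view(-1, 2, 3)  # (N, 2, 3) affine matrices
    grid = F.affine_grid(theta, list(content_img.shape), align_corners=False)
    return F.grid_sample(content_img, grid, align_corners=False)

x = torch.randn(4, 1, 28, 28)
warped = affine_warp(torch.zeros(4, 6), x)  # zero deformation ≈ identity warp
```

With a zero deformation vector the warp reduces to the identity, which is why encoding the deviation from the identity (rather than the raw matrix) is a convenient parameterisation.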

Loss

The model adds a deformable PCA loss to the Evidence Lower Bound of the standard VAE. The objective becomes:

\[ \theta^{*}=\arg\max_{\theta,\phi,B}\left\{\sum\limits_{i=1}^{n}\text{ELBO}(\theta, \phi; x^{(i)})-\mathcal{L}_{\text{DPCA}}(\phi, B)\right\} \]

where

\[ \begin{align} \mathcal{L}_{\text{DPCA}}(\phi,B) &= \sum\limits_{i=1}^{n}\mathbb{E}_{q_{\phi}(z_{\text{def}}, z_{\text{cont}}|x^{(i)})}\left[\lVert x^{(i)}-\tau(z_{\text{def}})\circ(z_{\text{cont}}B^{T})\rVert^{2}\right] & (1) \\ &\quad+\sum\limits_{k}\sum\limits_{j\ne k}(b_{k}^{T}b_{j})^{2} & (2) \end{align} \]

(1) Reconstruction by transformation The term is the mean squared error between the input \(x^{(i)}\) and the reconstruction by transformation \(\hat{x}=\tau(z_{\text{def}})\circ(z_{\text{cont}}B^{T})\). The deformation vector defines a deformation field \(\tau(z_{\text{def}})\); the authors use an affine grid. The content vector is first decoded through the PCA model as \(z_{\text{cont}}B^{T}\), where \(B=(b_{1},\ldots,b_{k})\) are the \(k\) basis vectors of the PCA model. The deformation field is then applied to this decoded content: in the case of the affine grid, the affine transformation is applied and the result is sampled accordingly to obtain the reconstructed input.
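The PCA part of term (1) is a plain linear decoding. A minimal sketch, with illustrative shapes (8 content units, 28×28 pixels) that are assumptions rather than values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 28 * 28                 # k latent content units, d pixels
B = rng.normal(size=(d, k))       # B = (b_1, ..., b_k), one basis vector per column
z_cont = rng.normal(size=(1, k))  # content code for one sample

# Linear PCA reconstruction z_cont B^T: a flat "canonical" image that the
# deformation field tau(z_def) would subsequently warp.
canonical = z_cont @ B.T          # shape (1, d)
```

The warp \(\tau(z_{\text{def}})\circ\cdot\) then acts on `canonical` reshaped to image form; only the linear decoding is shown here.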

(2) PCA loss The second term is the orthogonality loss of the PCA. It pressures the basis vectors \(b_{k}\) to be pairwise orthogonal. The dot product \(b_{k}^{T}b_{j}\) measures how far \(b_{k}\) extends in the direction of \(b_{j}\); if it is \(0\), the two vectors are perfectly orthogonal. The loss therefore minimizes this dot product (squared) over all pairs of distinct basis vectors.
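Term (2) can be computed from the Gram matrix of \(B\) in a few lines; a sketch under the assumption that the basis vectors are the columns of `B`:

```python
import numpy as np

def ortho_loss(B: np.ndarray) -> float:
    """Sum of squared dot products between distinct columns of B."""
    G = B.T @ B                        # Gram matrix, G[k, j] = b_k^T b_j
    off_diag = G - np.diag(np.diag(G)) # zero the self-products b_k^T b_k
    return float((off_diag ** 2).sum())

eye_basis = np.eye(4)[:, :3]           # three orthonormal columns
assert ortho_loss(eye_basis) == 0.0    # orthogonal basis incurs no penalty
```

An orthogonal basis makes every off-diagonal Gram entry zero, so the penalty vanishes exactly when the constraint is satisfied.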

The loss can be combined with beta-TCVAE to achieve even better disentanglement.