
Supervised Guided VAE

Based on @dingGuidedVariationalAutoencoder2020.


From the same paper as Unsupervised Guided VAE. A supervised Guided VAE is a variant of the Variational Autoencoder that uses excitation and inhibition of latent variables to guide training toward independent latent features. Its goal is to address the Entanglement of Latent Features.

Idea

The concept in this model is similar to the one in Unsupervised Guided VAE: dedicate latent variables to explicitly learn latent factors. In the unsupervised case, a subset of the latent variables was trained to be the parameters of a transformation field that manipulates the content encoded in the remaining latent variables by translation, scaling, and rotation.

The supervised case is simpler: split the latent variables into a variable \(z_{t}\) dedicated to the \(t\)-th latent factor and the rest \(z_{t}^{rst}\):

\[ z=(z_{t},z_{t}^{rst}) \]
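As a minimal sketch of this split (in PyTorch, assuming for simplicity that each factor occupies a single latent dimension; in the paper \(z_{t}\) may span a block of dimensions):

```python
import torch

def split_latent(z: torch.Tensor, t: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Split a batch of latent vectors z (shape [B, D]) into the t-th
    factor variable z_t (shape [B, 1]) and the rest z_t^rst (shape [B, D-1])."""
    z_t = z[:, t : t + 1]                                   # dedicated factor variable
    z_rest = torch.cat([z[:, :t], z[:, t + 1 :]], dim=1)    # all remaining dimensions
    return z_t, z_rest
```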

The ground-truth label of the \(t\)-th latent factor of an input \(x_{i}\) is defined as \(y_{t}(x_{i})\in\{-1,+1\}\). For example, if an image shows a cat, then \(y_{cat}=+1\) and \(y_{dog}=y_{goat}=\ldots=-1\) (a small sketch of this labeling follows the list below). We can now define a loss to…

  • maximize the probability of the classifier predicting \(y_{t}(x_{i})\) when seeing \(z_t\)
  • minimize the probability of the classifier predicting \(y_{t}(x_{i})\) when seeing \(z_{t}^{rst}\)
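To make the \(\pm 1\) labels from the example above concrete, here is a small sketch (hypothetical helper, assuming integer class indices are available for each sample):

```python
import torch

def factor_labels(class_idx: torch.Tensor, num_factors: int) -> torch.Tensor:
    """Map integer class indices (shape [B]) to +/-1 factor labels
    (shape [B, T]): +1 for the factor the sample belongs to, -1 otherwise."""
    one_hot = torch.nn.functional.one_hot(class_idx, num_classes=num_factors)
    return one_hot.float() * 2.0 - 1.0  # 1 -> +1, 0 -> -1
```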

Loss

As mentioned above, we need two loss terms: one to excite the \(t\)-th latent variable and one to inhibit all other latent variables with respect to the \(t\)-th factor.

\[ \begin{align} \mathcal{L}_{excitation}(\phi, t) &=\max_{\omega_{t}}\left\{\sum\limits_{i=1}^{n}\mathbb{E}_{q_{\phi}(z_{t}|x_{i})}[\log p_{\omega_{t}}(y=y_{t}(x_{i})|z_{t})]\right\} \\[10pt] \mathcal{L}_{inhibition}(\phi, t) &=\max_{C_{t}}\left\{\sum\limits_{i=1}^{n}\mathbb{E}_{q_{\phi}(z_{t}^{rst}|x_{i})}[\log p_{C_{t}}(y=y_{t}(x_{i})|z_{t}^{rst})]\right\} \end{align} \]

where…

  • \(\omega_{t}\) stands for a classifier making a prediction for the label \(y_{t}\) using the latent variable \(z_{t}\).
  • \(C_t\) stands for a classifier making a prediction for the label \(y_{t}\) using the latent variable \(z_{t}^{rst}\).
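A minimal PyTorch sketch of the two terms, assuming \(\omega_{t}\) and \(C_{t}\) are small classifiers with a single logit output (an illustration, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

def excitation_loss(z_t, y_t, omega_t):
    """log p_{omega_t}(y = y_t | z_t): log-likelihood of the ground-truth
    +/-1 label given only the dedicated factor variable."""
    logits = omega_t(z_t).squeeze(-1)      # classifier on z_t
    targets = (y_t + 1.0) / 2.0            # map {-1, +1} -> {0, 1}
    return -F.binary_cross_entropy_with_logits(logits, targets)

def inhibition_loss(z_rest, y_t, c_t):
    """log p_{C_t}(y = y_t | z_t^rst): same likelihood, but computed
    from the remaining latent variables."""
    logits = c_t(z_rest).squeeze(-1)
    targets = (y_t + 1.0) / 2.0
    return -F.binary_cross_entropy_with_logits(logits, targets)
```

Note that the inner maximizations over \(\omega_{t}\) and \(C_{t}\) mean the classifiers themselves are always trained to predict \(y_{t}\) as well as possible (e.g. in an alternating update); only the encoder sees the inhibition term with a negative sign.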

We can now add the excitation loss to, and subtract the inhibition loss from, our Evidence Lower Bound, and define our objective as:

\[ \theta^{*},\phi^{*}=\max_{\theta,\phi}\left\{\sum\limits_{i=1}^{n}ELBO(\phi,\theta;x_{i})+\sum\limits_{t=1}^{T}[\mathcal{L}_{excitation}(\phi,t)-\mathcal{L}_{inhibition}(\phi,t)]\right\} \]

This way, we force the latent variable \(z_{t}\) to be as informative about the latent factor \(v_{t}\) as possible, while forcing all other latent variables \(z_{t}^{rst}\) to be as uninformative about \(v_{t}\) as possible.
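Putting the pieces together, here is a hedged sketch of evaluating this objective for one batch, reusing the helpers sketched above (encoder, decoder, and classifier modules are hypothetical; the Gaussian reconstruction log-likelihood is approximated by a negative MSE, and the adversarial updates of \(\omega_{t}\) and \(C_{t}\) are left out):

```python
import torch
import torch.nn.functional as F

def training_objective(x, class_idx, encoder, decoder, classifiers_omega, classifiers_c):
    """ELBO + sum_t [excitation - inhibition] for one batch.
    classifiers_omega[t] sees z_t, classifiers_c[t] sees z_t^rst."""
    mu, logvar = encoder(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization trick

    recon = decoder(z)
    recon_ll = -F.mse_loss(recon, x, reduction="sum")        # stand-in log-likelihood
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    elbo = recon_ll - kl

    y = factor_labels(class_idx, num_factors=len(classifiers_omega))
    guidance = 0.0
    for t, (omega_t, c_t) in enumerate(zip(classifiers_omega, classifiers_c)):
        z_t, z_rest = split_latent(z, t)
        guidance = guidance + excitation_loss(z_t, y[:, t], omega_t) \
                            - inhibition_loss(z_rest, y[:, t], c_t)

    return elbo + guidance   # maximize w.r.t. theta (decoder) and phi (encoder)
```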