#ComputerScience #AI¶
The GAN model was introduced by Goodfellow et al. (@goodfellowGenerativeAdversarialNetworks2020). The architecture consists of two adversarial components:
- A generator that transforms a noise vector \(z\) into a sample \(x\) from the target distribution \(X\).
- A discriminator that receives both real and generated samples and must tell which is real and which is fake. Both components improve each other. For an overview, see https://developers.google.com/machine-learning/gan/gan_structure.

Loss¶
The cross-entropy classification loss is given as:
\[
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
\]
where the left term maximizes the classification ability for real data and the right term maximizes the discriminating ability for generated samples. Because the supports of \(p\) and \(q\) don't overlap much in practice, especially when training is initiated, the generator objective is typically reformulated into its non-saturating form:
\[
\max_{G} \; \mathbb{E}_{z \sim p_{z}}\left[\log D(G(z))\right]
\]
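As a rough sketch of how these two objectives look in code (a minimal PyTorch example; the discriminator is assumed to output raw logits, and all tensor names are placeholders):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits):
    # Left term: push D(x) towards 1 for real samples.
    real_loss = F.binary_cross_entropy_with_logits(
        d_real_logits, torch.ones_like(d_real_logits))
    # Right term: push D(G(z)) towards 0 for generated samples.
    fake_loss = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.zeros_like(d_fake_logits))
    return real_loss + fake_loss

def generator_loss(d_fake_logits):
    # Non-saturating reformulation: maximize log D(G(z))
    # instead of minimizing log(1 - D(G(z))).
    return F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
```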
Improving Learning Signal¶
The standard GAN uses the [[Jensen-Shannon Divergence]], which introduces a critical flaw into the training signal: #Vanishing & Exploding Gradients. To mitigate this issue, several approaches try to make the gradients more continuous and informative.
Noise Injection¶
The idea is similar to [[Score-Based Generative Modeling#Scarcity in sample space]]: if we add noise to both the real and the generated distribution, we artificially increase the support to the whole space \(\mathbb{R}^{d}\). Thus, even if \(q\) and \(p\) normally don't overlap, the noise spreads the perturbed distributions so that the gradients of the critic are well-defined and finitely bounded everywhere.
While this helps to alleviate the problem of missing support, adding noise obviously degrades the quality of the generated samples. One way to avoid this is to anneal the noise over the course of training, as the supports of \(q\) and \(p\) increasingly overlap over time. But still, some amount of noise will always be required during training.
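A minimal sketch of annealed noise injection (assuming a PyTorch training loop; the linear schedule and the starting noise scale are arbitrary choices for illustration):

```python
import torch

def add_instance_noise(samples, step, total_steps, sigma_start=0.1):
    # Linearly anneal the noise level to zero over training, since the
    # supports of p and q increasingly overlap as the generator improves.
    sigma = sigma_start * max(0.0, 1.0 - step / total_steps)
    return samples + sigma * torch.randn_like(samples)

# Inside the training loop (D, x_real, x_fake are placeholders):
# d_real = D(add_instance_noise(x_real, step, total_steps))
# d_fake = D(add_instance_noise(x_fake, step, total_steps))
```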
Wasserstein GAN¶
The Wasserstein GAN (WGAN) uses the Wasserstein Loss for training, which is based on the Wasserstein metric. The main motivation for WGANs is to solve the problems of #Vanishing & Exploding Gradients, #Training Instability and #Mode Collapse all at once by providing a better loss function.

A big advantage of using the Wasserstein metric over the [[Jensen-Shannon Divergence]] of the standard GAN is that it provides a good measure of how different two PDFs are, even if their supports don't overlap. This is because the Wasserstein metric is not based on the density ratio of the two distributions (which is not useful when the density of one distribution goes to zero), but actually considers the distance between the distributions.
It uses the [[Wasserstein metric#Dual Representation]] to define the loss:
\[
W(p, q) = \sup_{\|f\|_{L} \leq 1} \; \mathbb{E}_{x \sim p}\left[f(x)\right] - \mathbb{E}_{x \sim q}\left[f(x)\right]
\]
where the discriminator neural network is used as the function \(f\). To enforce Lipschitz continuity of the discriminator, the authors originally just clipped its weights to a finite range. Since weight clipping didn't fully stop the vanishing/exploding gradient problem, later improvements instead penalize the critic's gradient norm for deviating from 1 (Gradient Penalty) to circumvent this issue as well.
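A sketch of such a gradient penalty in the WGAN-GP style (PyTorch; `critic` and the tensors are placeholders, and the penalty weight is a common but arbitrary choice):

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp=10.0):
    # Push the critic's gradient norm towards 1 on random interpolations
    # between real and generated samples.
    batch_size = x_real.size(0)
    eps_shape = [batch_size] + [1] * (x_real.dim() - 1)
    eps = torch.rand(eps_shape, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads, = torch.autograd.grad(
        outputs=critic(x_hat).sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Wasserstein losses with the critic acting as f:
# critic_loss = critic(x_fake).mean() - critic(x_real).mean() \
#               + gradient_penalty(critic, x_real, x_fake)
# generator_loss = -critic(x_fake).mean()
```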
Spectral Normalization¶
In the #Wasserstein GAN, the gradient penalty makes the discriminator less expressive and introduces instability into the training process. But let's consider the goal of the gradient penalty: the gradients of the loss should stay finitely bounded by some constant so they don't explode.
Instead of clipping the weights, we can try to keep the scaling effect they have on their inputs bounded to some constant. The spectral norm \(\sigma\) of a matrix \(A\) gives the maximum scaling factor when \(A\) is applied to a unit vector.
We can use it to normalize the maximum scaling factor of the weights to \(1\):
\[
\bar{W} = \frac{W}{\sigma(W)}
\]
The normalized weights cannot increase the magnitude of the input, only preserve or decrease it. Each layer is now Lipschitz continuous with a Lipschitz constant of \(1\). The gradient of each layer with respect to its input is therefore also bounded by \(1\). When backpropagating, the gradients are summed and multiplied according to the chain rule and therefore stay bounded in a finite network.
Spectral normalization mainly avoids the exploding gradient issue, but since the flow of the gradients is more controlled and stable, the chance of vanishing gradients is also reduced.
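A minimal sketch of the normalization itself via power iteration (PyTorch ships this as `torch.nn.utils.spectral_norm`; the standalone function below is just for illustration and recomputes the singular vectors from scratch instead of caching them across training steps):

```python
import torch
import torch.nn.functional as F

def spectral_normalize(weight, n_power_iterations=5, eps=1e-12):
    # Flatten the weight to a 2D matrix and estimate its largest
    # singular value sigma(W) with power iteration.
    w = weight.reshape(weight.size(0), -1)
    u = F.normalize(torch.randn(w.size(0), device=w.device), dim=0, eps=eps)
    for _ in range(n_power_iterations):
        v = F.normalize(w.t() @ u, dim=0, eps=eps)
        u = F.normalize(w @ v, dim=0, eps=eps)
    sigma = torch.dot(u, w @ v)
    # Rescale so the layer's maximum scaling factor is (approximately) 1.
    return weight / sigma
```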
Jacobian Clamping¶
Disentanglement¶
The standard GAN architecture has a problem with Entanglement of Latent Features. To disentangle the learned features, CycleGAN and StarGAN have emerged.
CycleGAN¶
A CycleGAN consists of two GANs: one GAN learns to generate and discriminate source-to-target, the other learns to do the same target-to-source. The total error is calculated as the GAN error of both networks plus a cycle error, which is the expected absolute deviation between the source-to-target-to-source reconstruction and the actual source sample (see the sketch below).
Green pathway: Source-to-target, red pathway: Target-to-source.
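A sketch of the cycle error on top of the two adversarial losses (PyTorch; `G_st`/`G_ts` are hypothetical names for the source-to-target and target-to-source generators, and the cycle weight is an arbitrary choice):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_st, G_ts, x_source, x_target, lambda_cyc=10.0):
    # source -> target -> source reconstruction compared to the original source
    forward_cycle = F.l1_loss(G_ts(G_st(x_source)), x_source)
    # target -> source -> target reconstruction compared to the original target
    backward_cycle = F.l1_loss(G_st(G_ts(x_target)), x_target)
    return lambda_cyc * (forward_cycle + backward_cycle)
```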

StarGAN¶
The StarGAN architecture only uses a single generator-discriminator pair. This pair is not only given a sample but also a context domain code \(c\). The domain code allows the model to be conditioned on multiple distributions at once.
In addition to the generator and discriminator, a classifier is trained to predict the context domain for a sample. This classifier can be used to build the Classification Loss of the generator.
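A rough sketch of such a classification loss for the generator (PyTorch; assumes single-label domain codes and hypothetical names for the classifier and the generated sample):

```python
import torch.nn.functional as F

def domain_classification_loss(classifier, x_generated, c_target):
    # The generated sample G(x, c_target) should be classified as
    # belonging to the target domain code c_target.
    logits = classifier(x_generated)      # one logit per domain
    return F.cross_entropy(logits, c_target)
```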

Issues¶
Vanishing & Exploding Gradients¶

Using the loss described in #Loss has a big issue: the generator's learning signal depends on the gradients of the discriminator. When training starts, the supports of the empirical distribution \(p_{data}(x)\) and the generated distribution \(p_{\theta}(x)\) won't overlap much. Because we are using the [[Jensen-Shannon Divergence]] as loss, there is no differentiation as to how different two distributions are when their supports don't (or only minimally) overlap: the divergence saturates, the discriminator becomes (near-)perfect, and the gradients passed to the generator vanish.
There are multiple solutions to this problem:
- Least-squares GAN (LSGAN): makes the critic output a real-valued number and penalizes it with a least-squares loss, so the gradient is more linear (see the sketch after this list).
- #Wasserstein GAN, Gradient Penalty, #Spectral Normalization: enforce a Lipschitz-continuous critic (the slope is bounded).
- #Noise Injection: by injecting noise, the supports of both distributions start overlapping, giving non-zero gradients.
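For the first point, a minimal sketch of the least-squares losses (assuming the common target values 0 for fake and 1 for real; `d_real`/`d_fake` are PyTorch tensors of unbounded critic outputs):

```python
def lsgan_discriminator_loss(d_real, d_fake):
    # Quadratic targets: 1 for real samples, 0 for generated samples.
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_generator_loss(d_fake):
    # The generator pushes its samples towards the "real" target 1;
    # the quadratic penalty yields non-saturating, roughly linear gradients.
    return 0.5 * ((d_fake - 1) ** 2).mean()
```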
Training Instability¶
GAN training relies on the losses of the discriminator and the generator staying roughly constant and balanced during training.
If the generator loss is decreasing, the generator is starting to outperform the discriminator. This can be fine in the later stages of training, when the generated samples are already very realistic. But when it happens early in training, it probably means that the generator found a loophole in the discriminator and fools it by producing only samples exploiting that loophole, without them actually being realistic.
If the discriminator loss decreases too strongly, it might recognize each and every sample from the generator as fake. The generator then has no information anymore about what to improve, as it receives the same loss for every generated sample. As a consequence, its gradients become essentially random and it will probably not recover.
Mode Collapse¶
For the GAN model to be useful, one would want it to generate a new random sample, given a random input.
A possible issue arising in GANs is that the generator might start to always produce the same output, regardless of the input, if that output reliably convinces the discriminator. After a while, the discriminator might start flagging the generator's sample after repeatedly receiving a high loss. However, the discriminator does not learn what underlying structure makes it a fake sample; it only learns to reject this specific sample.
Therefore, the generator only has to vary its output slightly for the discriminator to be fooled again. The generating ability of the GAN consequently collapses to one or a very limited range of samples, the so-called mode collapse.