14.5 Generative adversarial networks

A generative adversarial network is two neural networks locked in a contest. The first network, the generator, fabricates data: images, sounds, sentences, whatever the modelling task demands. The second network, the discriminator, sits in judgement: presented with a sample, it must decide whether the sample is real (drawn from the training set) or fake (synthesised by the generator). The two networks train together, in opposition. The generator wins a round by producing a fake convincing enough that the discriminator labels it real. The discriminator wins by spotting the fake. Each side improves under pressure from the other. If everything goes to plan, the contest ends in equilibrium: the generator's output distribution is indistinguishable from the data distribution, and the discriminator can do no better than guess. At that point, the generator has implicitly learned the data distribution without ever writing down a likelihood, an energy, or a partition function. It has learned to mimic.

This idea was published by Ian Goodfellow and colleagues in 2014, allegedly conceived in a Montreal pub. It is one of the most cited papers in modern machine learning, and it has spawned an entire sub-field. The mathematical core is a single minimax game, elegant enough to fit on a beermat. The engineering reality is rather less elegant: GANs are famously difficult to train. They oscillate, collapse, diverge and sulk. A decade of follow-up research has been spent stabilising them. Where VAEs (§14.4) optimise an explicit lower bound on the log-likelihood, this section covers the adversarial alternative, in which the model never writes down a likelihood at all. §14.6 surveys the major GAN variants; §§14.7–14.9 turn to normalising flows, energy-based models and diffusion, which have largely displaced GANs from the centre of generative modelling research.

Symbols Used Here
  • $G$: generator network mapping noise to fake data.
  • $D$: discriminator network mapping data to a probability of being real.
  • $\mathbf{z}$: noise input to $G$, drawn from a simple prior such as a standard Gaussian.
  • $p_{\text{data}}$: true data distribution from which the training set is sampled.
  • $p_g$: distribution implicitly defined by $G$ as $\mathbf{z}$ varies through its prior.

The minimax objective

The whole construction rests on a single value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))].$$

Read it slowly. The discriminator $D$ takes a sample and returns a number in the interval $[0, 1]$; the higher the number, the more confident $D$ is that the sample is real. The first term is the expected log-probability that $D$ assigns to genuine training data; $D$ wants this large, so it pushes $D(\mathbf{x})$ up towards one for real $\mathbf{x}$. The second term is the expected log-probability that $D$ assigns to fake data being fake; $D$ wants this large too, so it pushes $D(G(\mathbf{z}))$ down towards zero for synthesised $G(\mathbf{z})$. Putting both terms together, $D$ is performing ordinary binary classification with a cross-entropy loss, where one class is "real" and the other is "fake".
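The correspondence with cross-entropy can be checked numerically. The following sketch uses made-up discriminator outputs on a small mini-batch; the mini-batch estimate of $V$ is the negative of the binary cross-entropy computed with labels 1 = real, 0 = fake (averaged within each half-batch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discriminator outputs on a mini-batch of 4 real and 4 fake
# samples (made-up numbers purely for illustration).
d_real = rng.uniform(0.5, 0.99, size=4)   # D(x) for real x
d_fake = rng.uniform(0.01, 0.5, size=4)   # D(G(z)) for fakes

# Mini-batch estimate of the two terms of V(D, G).
V = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Standard binary cross-entropy with labels 1 = real, 0 = fake,
# averaged within each half-batch.
probs = np.concatenate([d_real, d_fake])
labels = np.concatenate([np.ones(4), np.zeros(4)])
terms = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
bce = terms[:4].mean() + terms[4:].mean()

assert np.isclose(V, -bce)  # maximising V over D == minimising cross-entropy
```

Maximising $V$ over $D$ and minimising the classifier's cross-entropy are the same optimisation, up to sign.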

The generator $G$ has the opposite incentive. It cannot influence the first term, since it plays no part in producing real samples, so it focuses on the second. It wants $D(G(\mathbf{z}))$ to be high, meaning the discriminator was fooled; equivalently, it wants $\log(1 - D(G(\mathbf{z})))$ to be as negative as possible. Hence the outer minimisation over $G$.

This is a two-player zero-sum game. The discriminator's gain is the generator's loss. The notation $\min_G \max_D$ should be parsed as "for whatever $G$ we settle on, $D$ will respond optimally; we choose $G$ to minimise the value attained at $D$'s best response". In game-theoretic language we are searching for a saddle point: a pair $(G^\star, D^\star)$ such that no unilateral deviation by either player improves their position.

A useful intuition is the counterfeiter and the police. The counterfeiter (generator) prints fake banknotes; the police (discriminator) examine notes and try to spot the forgeries. Each side studies the other and adapts. Over time the counterfeits become indistinguishable from real notes and the police cannot do better than flipping a coin. That equilibrium, perfect counterfeits, is the outcome a GAN seeks. The metaphor is Goodfellow's, and it captures both the mathematical structure and the somewhat menacing flavour of the training dynamics.

A few features deserve emphasis. There is no explicit likelihood anywhere in the objective. We never evaluate $\log p_g(\mathbf{x})$, which is fortunate because computing it would require integrating $G$ over latent space, typically intractable. We only need to sample from $p_g$, which is cheap: draw $\mathbf{z}$, push it through $G$. GANs are implicit generative models: they specify a sampling procedure, not a density. This is what makes them flexible. It is also what makes them slippery: most diagnostic tools for probabilistic models require a tractable likelihood, and GANs simply do not provide one.
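The "sampling is cheap" point can be made concrete with a minimal sketch. The toy generator below is a hypothetical affine map standing in for a deep network; drawing from $p_g$ is one matrix multiply per sample. (For this affine $G$ the density happens to be a tractable Gaussian, but for a general deep, non-invertible $G$ the sampling procedure would be exactly as cheap while $p_g(\mathbf{x})$ would be intractable.)

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(z):
    # A toy "network": an affine map from 2-D noise to 2-D data.
    # A deep non-linear G would be sampled the same way, but its
    # density p_g(x) would no longer be available in closed form.
    W = np.array([[1.0, 0.5], [0.0, 2.0]])
    b = np.array([3.0, -1.0])
    return z @ W.T + b

# Sampling from p_g: draw noise from the prior, push it through G.
z = rng.standard_normal((10_000, 2))
x_fake = generator(z)

print(x_fake.mean(axis=0))  # close to b = [3, -1], since z has mean zero
```

This is the whole of what an implicit generative model provides: a sampler, not a density.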

The optimal discriminator

Suppose for a moment that we fix the generator $G$. The discriminator's task is then a static binary classification problem: distinguish samples drawn from $p_{\text{data}}$ from samples drawn from $p_g$. The objective $V(D, G)$, viewed as a functional of $D$, can be rewritten by combining the two expectations into a single integral:

$$V(D, G) = \int_{\mathbf{x}} \big[\, p_{\text{data}}(\mathbf{x}) \log D(\mathbf{x}) + p_g(\mathbf{x}) \log(1 - D(\mathbf{x})) \,\big]\, d\mathbf{x}.$$

Pointwise, for each $\mathbf{x}$, we want to choose $D(\mathbf{x})$ to maximise $a \log D + b \log(1 - D)$ where $a = p_{\text{data}}(\mathbf{x})$ and $b = p_g(\mathbf{x})$. Differentiating with respect to $D$ gives $a/D - b/(1 - D)$, and setting this to zero yields the familiar result:

$$D^\star(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}.$$

This has a clean Bayesian interpretation. Imagine a coin flip decides whether to draw $\mathbf{x}$ from $p_{\text{data}}$ or from $p_g$, with equal prior probabilities. The optimal classifier reports the posterior probability that the coin landed on "data". When $p_{\text{data}} \gg p_g$ at a point, $D^\star \to 1$; when $p_g \gg p_{\text{data}}$, $D^\star \to 0$; when the two distributions agree, $D^\star = 1/2$, the discriminator is reduced to guessing.
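The pointwise maximisation is easy to verify numerically: fix hypothetical density values $a$ and $b$ at one point, scan $D$ over a fine grid of $(0, 1)$, and compare the argmax with $a/(a+b)$:

```python
import numpy as np

# Hypothetical pointwise densities at a single x.
a, b = 0.7, 0.2          # a = p_data(x), b = p_g(x)

# Scan D(x) over a fine grid of (0, 1) and maximise a*log(D) + b*log(1-D).
d = np.linspace(1e-6, 1.0 - 1e-6, 1_000_001)
objective = a * np.log(d) + b * np.log(1.0 - d)
d_best = d[np.argmax(objective)]

print(round(d_best, 4), round(a / (a + b), 4))  # both 0.7778
```

The grid argmax lands on $a/(a+b)$ to within the grid spacing, as the calculus predicts.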

Now substitute $D^\star$ back into $V$ and ask what objective the generator is implicitly minimising. After a short calculation,

$$V(D^\star, G) = \mathbb{E}_{p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}}{p_{\text{data}} + p_g}\right] + \mathbb{E}_{p_g}\!\left[\log \frac{p_g}{p_{\text{data}} + p_g}\right].$$

Add and subtract $\log 2$ in each expectation, gather terms, and recognise the result as a Jensen–Shannon divergence:

$$V(D^\star, G) = -2 \log 2 + 2\, \mathrm{JSD}(p_{\text{data}} \,\|\, p_g).$$
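In more detail, writing $m = \tfrac{1}{2}(p_{\text{data}} + p_g)$ for the mixture, each expectation rearranges as

$$\mathbb{E}_{p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}}{p_{\text{data}} + p_g}\right] = \mathbb{E}_{p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}}{m}\right] - \log 2 = \mathrm{KL}(p_{\text{data}} \,\|\, m) - \log 2,$$

and similarly for the $p_g$ term. Summing the two and using the definition $\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(q \,\|\, m)$ gives the expression above.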

The Jensen–Shannon divergence is a symmetric, smoothed variant of the Kullback–Leibler divergence. It is non-negative, and it equals zero if and only if the two distributions coincide almost everywhere. So at the global optimum of the GAN objective, the generator distribution $p_g$ matches $p_{\text{data}}$ exactly. The minimax game is, in effect, a Jensen–Shannon-divergence minimisation in disguise, but minimised through a clever reformulation that requires only sampling, not likelihood evaluation.
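The identity can be verified numerically for discrete distributions, where every expectation is a finite sum. A sketch with two made-up distributions over four outcomes:

```python
import numpy as np

# Two hypothetical discrete distributions over 4 outcomes.
p_data = np.array([0.40, 0.30, 0.20, 0.10])
p_g    = np.array([0.25, 0.25, 0.25, 0.25])

def kl(p, q):
    # Kullback-Leibler divergence between discrete distributions.
    return np.sum(p * np.log(p / q))

# Value function at the optimal discriminator D* = p_data / (p_data + p_g);
# note 1 - D* = p_g / (p_data + p_g), matching the second expectation.
d_star = p_data / (p_data + p_g)
V = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Jensen-Shannon divergence via its mixture definition.
m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

assert np.isclose(V, -2.0 * np.log(2.0) + 2.0 * jsd)  # the identity holds
```

Setting `p_g` equal to `p_data` drives the JSD to zero and $V(D^\star, G)$ to its minimum of $-2 \log 2$.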

This existence proof is reassuring. It says that the minimax game has the right answer: an equilibrium where the generator captures the data distribution and the discriminator is reduced to a coin flip. What the proof does not say is that the training procedure will reach this equilibrium. As we shall see, that is a much subtler matter.

Training in practice

In practice we cannot compute either expectation in closed form, and we cannot find $D^\star$ analytically because $p_g$ is not available as a density. We approximate the saddle point by alternating gradient updates:

  1. Sample a mini-batch of real data $\{\mathbf{x}_i\}$ and a mini-batch of noise $\{\mathbf{z}_j\}$. Push the noise through $G$ to obtain fake samples $G(\mathbf{z}_j)$. Update the parameters of $D$ by ascending the gradient of $V$, that is, train $D$ as a binary classifier with cross-entropy loss for one or a few steps.
  2. Sample a fresh mini-batch of noise. Update the parameters of $G$ by descending the gradient of $V$, which in practice means making $D(G(\mathbf{z}))$ larger, encouraging the generator to fool the current discriminator.
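The two alternating steps can be sketched end-to-end in a toy one-dimensional problem, with a one-parameter generator, a logistic discriminator and hand-derived gradients. This is an illustration of the update structure under invented hyperparameters, not a practical recipe; the generator step follows the "make $D(G(\mathbf{z}))$ larger" form of step 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setting: real data is N(4, 0.5^2); the generator is
# G(z) = mu + 0.5*z with a single learnable parameter mu, so matching
# p_data means driving mu towards 4. The discriminator is logistic:
# D(x) = sigmoid(w*x + c).
mu, w, c = 0.0, 0.0, 0.0
lr_d, lr_g, batch = 0.05, 0.01, 64

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for step in range(5000):
    # Step 1: ascend V in the discriminator parameters (hand-derived
    # gradients of mean log D(x_real) + mean log(1 - D(x_fake))).
    x_real = 4.0 + 0.5 * rng.standard_normal(batch)
    x_fake = mu + 0.5 * rng.standard_normal(batch)
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w += lr_d * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    c += lr_d * (np.mean(1 - d_real) - np.mean(d_fake))

    # Step 2: update the generator to make D(G(z)) larger, i.e. ascend
    # mean log D(G(z)); its gradient in mu is mean((1 - D) * w).
    z = rng.standard_normal(batch)
    d_fake = sigmoid(w * (mu + 0.5 * z) + c)
    mu += lr_g * np.mean((1 - d_fake) * w)

    if abs(mu - 4.0) < 0.5:   # stop once the generator mean is close
        break

print(f"step {step}: mu = {mu:.2f}")
```

Even in this one-parameter setting the dynamics oscillate around the equilibrium if left to run, which is why the loop stops at first approach rather than claiming convergence.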

The original paper's generator term, $\log(1 - D(G(\mathbf{z})))$, has a treacherous gradient profile. Early in training, the discriminator easily spots fakes, so $D(G(\mathbf{z}))$ sits near zero, where the curve of $\log(1 - D)$ is nearly flat. The gradient that flows back to $G$ is therefore vanishingly small in precisely the regime where the generator is doing badly and most needs a strong learning signal. Goodfellow's standard fix is the non-saturating loss: instead of minimising $\log(1 - D(G(\mathbf{z})))$, the generator maximises $\log D(G(\mathbf{z}))$. This has the same fixed point but a much healthier gradient when $D(G(\mathbf{z}))$ is small. Almost every modern GAN implementation uses the non-saturating form by default.
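The saturation is easiest to see through the discriminator's logit. Writing $D = \sigma(a)$, the chain rule gives $\partial \log(1 - D) / \partial a = -D$, which vanishes exactly when fakes are confidently rejected, whereas $\partial \log D / \partial a = 1 - D$ stays close to one there:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Early in training the discriminator spots fakes easily: D(G(z)) ~ 0,
# i.e. the discriminator logit a is very negative.
a = np.array([-7.0, -4.0, -1.0, 0.0])
d = sigmoid(a)

# Gradient of each generator loss w.r.t. the logit a (chain rule through
# the sigmoid): saturating log(1-D) gives -D; non-saturating log D gives 1-D.
grad_saturating = -d          # ~0 exactly where G is doing badly
grad_non_saturating = 1 - d   # ~1 exactly where G is doing badly

print(np.round(grad_saturating, 4))       # about [-0.0009, -0.018, -0.2689, -0.5]
print(np.round(grad_non_saturating, 4))   # about [0.9991, 0.982, 0.7311, 0.5]
```

The two losses agree at $D = 1/2$ and share the same fixed point, but only the non-saturating form delivers a usable signal when the generator is losing.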

Even with the non-saturating loss, GAN training is famously twitchy. Three failure modes recur often enough that they have names:

  • Mode collapse. The generator discovers that one or two modes of $p_{\text{data}}$ are particularly hard for $D$ to handle and parks itself there. It produces highly realistic samples but only a tiny slice of the data variety. A face GAN suffering from mode collapse might output thousands of faces that are all variations on the same person. Mode collapse is a failure of coverage, not of sample quality.
  • Vanishing gradients. If $D$ gets too good too fast, or if the supports of $p_{\text{data}}$ and $p_g$ do not overlap (which is generic in high dimensions), some discriminator separates the two perfectly. The JSD then plateaus at $\log 2$, the value function saturates, and the gradient passed back to $G$ is essentially noise. The generator stops learning.
  • Non-convergence. The two players chase each other through parameter space without settling. Losses oscillate; samples improve, then deteriorate, then improve again. Equilibrium analysis assumed simultaneous best responses; alternating gradient steps offer no such guarantee.

Empirical remedies include keeping $D$ slightly worse than optimal (for instance, by training $D$ for fewer steps per $G$ step), using Adam with a reduced momentum parameter $\beta_1 = 0.5$, normalising data and generator outputs to $[-1, 1]$ paired with a $\tanh$ output layer in $G$, and adding small amounts of label noise. None of these is a cure; they are guard rails. Stable GAN training has, more than anything, been an exercise in collected folk wisdom.

Architectural progress and conditional variants

The original 2014 GAN used multilayer perceptrons and could only handle small images. A decade of architectural innovations stabilised training and lifted sample quality dramatically: DCGAN's convolutional backbone, the Wasserstein objective and gradient penalty, progressive growing, the StyleGAN family's mapping-network/AdaIN design, and BigGAN's class-conditional ImageNet-scale recipe. Conditional and cycle-consistent extensions (pix2pix, CycleGAN) carried GANs from unconditional sampling into paired and unpaired image-to-image translation. The next section, §14.6, treats each of these variants in detail; for now it is enough to know that the vanilla GAN above is rarely used as-is, and that almost every practical GAN system inherits one or more of the architectural and objective changes catalogued there.

The status of GANs in 2026

As of 2026, diffusion models have eaten most of GAN territory. For unconditional and text-conditional image synthesis, diffusion (§14.9–14.14) produces higher fidelity, better diversity, easier-to-condition pipelines, and far more stable training. Stable Diffusion, Imagen, DALL·E and their successors are diffusion-based, not GAN-based. Almost every new generative-image paper at major venues uses diffusion or a flow-matching variant.

GANs have not vanished, however. They retain three durable niches:

  • Speed-critical generation. A GAN samples in a single forward pass through $G$. Diffusion typically requires tens to hundreds of denoising steps. For real-time applications (video games, interactive editing, on-device generation), GAN inference can be one to two orders of magnitude faster, even after diffusion-distillation tricks. Consistency models and one-step diffusion are closing this gap, but the gap remains.
  • High-fidelity face manipulation. StyleGAN's latent space has an exceptionally clean structure for editing. Years of tooling have grown around it: GAN inversion, latent direction discovery, fine-grained attribute editing. For purposes such as professional portrait retouching, ageing simulation and identity swap, StyleGAN-family models are still the working tools.
  • Niche modalities. Time-series generation, point clouds, molecular graphs, audio waveforms (the original WaveGAN-family work) and certain scientific simulations continue to use GAN-style adversarial losses, often as an auxiliary term within a larger likelihood-based model.

The era of GAN-as-default-generator is over. The era of GAN-as-useful-tool continues. From a curriculum point of view, GANs remain essential: the minimax framing, the implicit-density idea, and the long debugging history of mode collapse and vanishing gradients shaped a generation of researchers' intuition about what generative modelling actually requires.

What you should take away

  1. A GAN is a two-player game. A generator $G$ produces fakes; a discriminator $D$ tries to distinguish fakes from real data. They train simultaneously, in opposition.
  2. At equilibrium, the generator matches the data distribution. For a fixed $G$, the optimal discriminator is $D^\star(\mathbf{x}) = p_{\text{data}}(\mathbf{x}) / (p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x}))$. Substituted back, the value function reduces to a Jensen–Shannon divergence between $p_{\text{data}}$ and $p_g$, which is zero only when the two coincide.
  3. Training is alternating gradient descent on a saddle point. It works in practice with the non-saturating generator loss and a variety of hard-won engineering tricks, but it suffers from mode collapse, vanishing gradients and non-convergence.
  4. Architectures and objectives matter. DCGAN made GANs trainable; WGAN and WGAN-GP made them theoretically cleaner; progressive growing and StyleGAN made them photorealistic; BigGAN took them to ImageNet scale.
  5. GANs have been displaced, not deleted. Diffusion is the default for image synthesis in 2026. GANs persist where speed matters, where StyleGAN's editable latent space is valuable, and in modality-specific niches. The mathematical lessons of the GAN era (implicit density modelling, adversarial losses as divergences, the difficulty of saddle-point optimisation) outlast the algorithm itself.

This site is currently in Beta. Contact: Chris Paton