Also known as: GAN
A generative adversarial network (GAN), introduced by Ian Goodfellow et al. in 2014, is a generative-modelling framework in which two neural networks are trained together in opposition. The generator G takes random noise z and produces samples G(z); the discriminator D takes either real data or generated samples and tries to distinguish them.
The two networks are trained with opposing objectives: D is trained to maximise log D(x) on real data and log(1 − D(G(z))) on generated samples, i.e. to classify both correctly; G is trained to minimise log(1 − D(G(z))) (or, equivalently in practice, to maximise log D(G(z))), i.e. to make D classify its generated samples as real. At equilibrium, G produces samples indistinguishable from real data and D outputs ½ everywhere.
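These opposing objectives can be written as concrete loss functions. A minimal numpy sketch, where `d_real` and `d_fake` are assumed to be the discriminator's output probabilities on real and generated batches:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximises log D(x) + log(1 - D(G(z))); negated into a loss to minimise.
    return -(np.log(d_real).mean() + np.log(1 - d_fake).mean())

def generator_loss_minimax(d_fake):
    # Original (saturating) objective: G minimises log(1 - D(G(z))).
    return np.log(1 - d_fake).mean()

def generator_loss_nonsaturating(d_fake):
    # Equivalent heuristic: G maximises log D(G(z)); negated into a loss.
    return -np.log(d_fake).mean()

# At the equilibrium D ≡ 1/2, the discriminator loss equals 2 log 2 ≈ 1.386.
d_half = np.full(4, 0.5)
print(discriminator_loss(d_half, d_half))
```

In a real training loop these two losses are minimised in alternation, each with its own optimiser.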
In practice GAN training is famously unstable. The original objective suffers from mode collapse (G produces a single point or small set of points that fool D), training oscillations (G and D chase each other without converging), and vanishing gradients when D wins decisively. A long line of research progressively improved stability: DCGAN (architectural conventions), Wasserstein GAN (a different loss), spectral normalisation, gradient penalties, and two-time-scale update rules.
GANs reshaped generative modelling for nearly a decade. Variants enabled photorealistic image generation (StyleGAN, BigGAN), conditional generation (cGAN, AC-GAN), domain translation (CycleGAN, pix2pix), super-resolution (SRGAN), data augmentation and many other applications. StyleGAN-3 (2021) was arguably the high-water mark of GAN-based image generation.
In 2022 diffusion models (DALL-E 2, Imagen, Stable Diffusion) substantially displaced GANs as the dominant image-generation paradigm, achieving better sample quality and easier training. GANs remain useful for specific applications (real-time generation, super-resolution, certain forms of editing) and as theoretical objects.
Mathematics
A GAN consists of a generator $G_\theta: \mathcal{Z} \to \mathcal{X}$ mapping latent noise to data, and a discriminator $D_\phi: \mathcal{X} \to [0, 1]$ trying to predict whether its input is real. The original objective is the two-player minimax game
$$\min_\theta \max_\phi \mathbb{E}_{x \sim p_\mathrm{data}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_\phi(G_\theta(z)))].$$
Goodfellow et al. (2014) showed that for a fixed generator the optimal discriminator is $D^*(x) = p_\mathrm{data}(x) / (p_\mathrm{data}(x) + p_g(x))$, where $p_g$ is the distribution induced by the generator; substituting $D^*$ back into the objective gives $-\log 4 + 2\,\mathrm{JSD}(p_\mathrm{data} \,\|\, p_g)$, so the global optimum is reached when $p_g = p_\mathrm{data}$, with value $-2 \log 2$ and Jensen–Shannon divergence zero.
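This optimum is easy to check numerically on a toy discrete distribution; here `p_data` and `p_g` are small hypothetical probability vectors chosen for illustration:

```python
import numpy as np

p_data = np.array([0.2, 0.5, 0.3])   # toy data distribution (assumed)
p_g = p_data.copy()                   # generator has matched the data exactly

d_star = p_data / (p_data + p_g)      # optimal discriminator: 1/2 everywhere

# Value of the objective at (G, D*):
# E_{p_data}[log D*(x)] + E_{p_g}[log(1 - D*(x))]
value = (p_data * np.log(d_star)).sum() + (p_g * np.log(1 - d_star)).sum()
print(d_star)   # [0.5 0.5 0.5]
print(value)    # -2 log 2 ≈ -1.3863
```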
In practice the non-saturating loss is preferred for $G$ because the original $\log(1 - D(G(z)))$ saturates when $D$ wins:
$$\mathcal{L}_G = -\mathbb{E}_{z}[\log D_\phi(G_\theta(z))]$$
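The saturation is easy to see numerically: write $D = \sigma(s)$ for the discriminator's pre-activation $s$ and compare the gradient of each generator loss with respect to $s$ when $D$ confidently rejects a fake. A small sketch:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

s = -5.0          # discriminator logit on a fake: D(G(z)) = sigmoid(-5) ≈ 0.0067
d = sigmoid(s)

# d/ds log(1 - sigmoid(s)) = -sigmoid(s): vanishes as D wins.
grad_saturating = -d
# d/ds [-log sigmoid(s)] = sigmoid(s) - 1: stays near -1, so G keeps learning.
grad_nonsaturating = d - 1

print(grad_saturating)      # ≈ -0.0067
print(grad_nonsaturating)   # ≈ -0.9933
```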
Wasserstein GAN (Arjovsky et al., 2017) replaces the Jensen-Shannon divergence with the Earth Mover's distance and the discriminator with a 1-Lipschitz critic $f_\phi$:
$$\min_\theta \max_{f_\phi: \|f\|_L \leq 1} \mathbb{E}_{x \sim p_\mathrm{data}}[f_\phi(x)] - \mathbb{E}_{z}[f_\phi(G_\theta(z))]$$
enforced via weight clipping or a gradient penalty $\lambda \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} f_\phi(\hat{x})\| - 1)^2]$ evaluated at random interpolates $\hat{x}$ between real and generated samples. WGAN is dramatically more stable than the original GAN and produces a training curve (the critic loss) that correlates with sample quality.
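A sketch of the WGAN-GP critic loss, using a hypothetical linear critic $f(x) = w \cdot x$ so the input gradient is just $w$ and the penalty can be computed in closed form; a real implementation would differentiate through the critic network instead:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.5])   # hypothetical linear critic: f(x) = x @ w
lam = 10.0                   # gradient-penalty weight

def critic(x):
    return x @ w

def wgan_gp_critic_loss(real, fake):
    # Critic maximises E[f(real)] - E[f(fake)]; negated into a loss.
    wasserstein = critic(fake).mean() - critic(real).mean()
    # Penalty is evaluated at random interpolates x_hat of real and fake points.
    eps = rng.uniform(size=(len(real), 1))
    x_hat = eps * real + (1 - eps) * fake
    # For a linear critic, grad_x f(x_hat) = w at every interpolate.
    grad_norm = np.linalg.norm(np.broadcast_to(w, x_hat.shape), axis=1)
    penalty = lam * ((grad_norm - 1) ** 2).mean()
    return wasserstein + penalty

real = rng.standard_normal((8, 2)) + 2.0   # toy "data" batch
fake = rng.standard_normal((8, 2))          # toy "generated" batch
print(wgan_gp_critic_loss(real, fake))
```

The penalty pushes the critic's gradient norm toward 1, a soft version of the 1-Lipschitz constraint.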
Subsequent techniques (spectral normalisation, progressive growing, StyleGAN's style-based generator and noise injection) further stabilised training and improved sample quality; diffusion models (DDPM, 2020) ultimately displaced GANs as the leading image-generation paradigm around 2022.
Related terms: ian-goodfellow, Variational Autoencoder, Diffusion Model
Discussed in:
- Chapter 14: Generative Models, Generative Adversarial Networks
- Chapter 14: Generative Models, Generative Models