14.6 GAN variants

The plain vanilla GAN of §14.5 is a simple idea, a generator and a discriminator playing a minimax game, but it is notoriously difficult to train. Practitioners who first sat down to implement Goodfellow's original recipe quickly discovered a litany of pathologies. The discriminator races ahead and saturates, leaving the generator with vanishing gradients. Mode collapse strikes, with the generator producing only a handful of plausible samples while ignoring the rest of the data distribution. Training oscillates rather than converging. Hyperparameters that work on one dataset fail on another. The decade following Goodfellow's 2014 paper produced a long lineage of variants, each addressing one or more of these failure modes, and together they transformed GANs from a fragile research curiosity into a workable tool for high-quality image synthesis.

This section traces the major lineage. We begin with DCGAN, which gave the field a reliable convolutional architecture; move to the Wasserstein GAN, which replaced the original objective with a more stable distance; then to progressive growing and StyleGAN, which produced photorealistic faces; to BigGAN, which scaled class-conditional generation to ImageNet; and finally to conditional and cycle-consistent variants, which extended GANs from unconditional sampling to image-to-image translation. We close with a brief assessment of where GAN variants stand in 2026, an era dominated by diffusion models.

Symbols Used Here
$G, D$: generator, discriminator (or critic)
$\mathbf{z}$: latent noise vector
$\mathbf{w}$: intermediate style vector (StyleGAN)
$y$: conditioning label

DCGAN (2015)

Radford, Metz and Chintala's deep convolutional GAN was the first architecture that reliably trained on natural images. Before DCGAN, most GAN papers used multilayer perceptrons or ad-hoc convolutional designs that diverged on anything beyond MNIST. Radford and colleagues searched a large space of architectural choices and distilled a small set of guidelines that together unlocked stable training:

  1. Replace pooling layers with strided convolutions in the discriminator and fractional-strided (transposed) convolutions in the generator, so the network learns its own spatial up- and down-sampling.
  2. Use batch normalisation everywhere except the generator's output layer and the discriminator's input layer, which keeps activations well-conditioned.
  3. Drop fully-connected hidden layers in deep architectures.
  4. Use ReLU in the generator and LeakyReLU with slope 0.2 in the discriminator.
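For concreteness, here is a minimal PyTorch sketch of a generator and discriminator built to these rules for 64×64 RGB images; the latent size (100) and channel widths are illustrative choices, not the paper's exact configuration.

```python
# A minimal DCGAN-style pair following the Radford et al. design rules.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # Fractional-strided (transposed) convolutions learn upsampling.
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),   # 1x1 -> 4x4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),  # 4 -> 8
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),  # 8 -> 16
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),      # 16 -> 32
            nn.BatchNorm2d(ch), nn.ReLU(True),
            # No batch norm on the output layer; tanh maps to [-1, 1].
            nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False), nn.Tanh(),  # 32 -> 64
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # No batch norm on the input layer; strided convs replace pooling.
            nn.Conv2d(3, ch, 4, 2, 1, bias=False), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch * 4), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 4, ch * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ch * 8), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 8, 1, 4, 1, 0, bias=False),  # 4x4 -> single logit
        )

    def forward(self, x):
        return self.net(x).view(-1)
```

Note that there are no pooling layers and no fully-connected hidden layers: every change of resolution is a learned convolution.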

These rules sound mundane today because they have been absorbed into the standard convolutional vocabulary. In 2015 they were a revelation. DCGAN produced the first convincing GAN-generated bedrooms, faces and album covers, and demonstrated meaningful arithmetic in latent space, the famous "smiling woman minus neutral woman plus neutral man equals smiling man" experiment. It remains a sensible default starting point for anyone learning GANs, and most subsequent variants inherit its convolutional backbone.

WGAN and WGAN-GP

Arjovsky, Chintala and Bottou's 2017 Wasserstein GAN diagnosed the original GAN's instability at the level of the loss function. Goodfellow's minimax objective, when the discriminator is optimal, minimises the Jensen–Shannon divergence between the data distribution $p_{\text{data}}$ and the generator distribution $p_g$. Jensen–Shannon has a nasty property: when the two distributions have disjoint support, which is exactly the situation early in training when the generator outputs noise, the divergence saturates at its maximum constant value, and its gradient with respect to the generator's parameters is zero. The generator gets no useful signal.

The Wasserstein-1 distance, also called the earth-mover's distance, has no such pathology. It measures the minimum cost of transporting probability mass from $p_g$ to $p_{\text{data}}$:

$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]$$

By Kantorovich–Rubinstein duality this primal optimisation problem has a dual form,

$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{p_{\text{data}}}[f(x)] - \mathbb{E}_{p_g}[f(x)],$$

where the supremum is taken over 1-Lipschitz functions. The discriminator, now renamed the critic because it no longer outputs a probability, parametrises $f$ and is trained to maximise the difference of expectations. The generator is trained to minimise the same quantity. The Wasserstein distance is continuous and almost everywhere differentiable in the generator's parameters, even when supports are disjoint, so the generator always receives a useful gradient.
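Expressed as code the dual is almost a one-liner. A minimal sketch, assuming `critic` maps a batch of images to one scalar score each, with both objectives written as losses to minimise:

```python
# WGAN losses under the dual form: the critic maximises
# E[f(real)] - E[f(fake)], so its loss is the negation of that gap.
def critic_loss(critic, real, fake):
    return critic(fake).mean() - critic(real).mean()

def generator_loss(critic, fake):
    return -critic(fake).mean()
```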

The catch is the Lipschitz constraint. The original WGAN enforced it by clipping the critic's weights to a small box, which Arjovsky himself acknowledged was a crude solution: clipping interacts badly with batch normalisation and biases the critic towards simple functions. Gulrajani and colleagues' WGAN-GP (2017) replaced clipping with a gradient penalty:

$$\mathcal{L}_{\text{GP}} = \lambda\, \mathbb{E}_{\hat x \sim p_{\hat x}}\bigl[(\|\nabla_{\hat x} D(\hat x)\|_2 - 1)^2\bigr],$$

where $\hat x = \alpha x + (1 - \alpha)\tilde x$ is sampled along straight lines between real and generated points. The penalty is zero when the gradient norm equals one, the boundary of the 1-Lipschitz constraint, and quadratically punishes deviations. WGAN-GP became the most reliable GAN variant of the past decade. It rarely mode-collapses, its loss curves are interpretable (lower critic loss really does mean better samples), and it remains the baseline against which new GAN ideas are still measured. Spectral normalisation (Miyato et al., 2018) offered a complementary route, dividing every weight matrix by its largest singular value via cheap power iteration to enforce the Lipschitz bound globally rather than only at sampled points.
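In code, the penalty is a dozen lines. A sketch in PyTorch, using the paper's default weight $\lambda = 10$ and assuming `critic` returns one scalar score per image:

```python
# WGAN-GP gradient penalty: interpolate between real and fake samples,
# then penalise deviations of the critic's gradient norm from 1.
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Sample points along straight lines between real and generated images.
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(x_hat)
    # Gradient of the critic's output with respect to the interpolates.
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```

For the spectral-normalisation route, PyTorch ships `torch.nn.utils.spectral_norm`, which wraps a layer and rescales its weight by a power-iteration estimate of the largest singular value at each forward pass.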

Progressive growing and StyleGAN

Karras and colleagues at NVIDIA produced the next leap, beginning with progressive growing in 2018. The idea is pedagogical. Rather than asking a generator to synthesise a 1024×1024 face from scratch, train it first to produce 4×4 thumbnails. Once that is stable, fade in an extra layer that doubles the resolution to 8×8. Continue doubling until the target resolution is reached. At each stage the network only has to learn detail at one scale, and the previous stages provide a warm start for the next. This curriculum dramatically improved both training stability and the resolution achievable, producing the first GAN-generated megapixel faces.
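The fade-in is the heart of the method. A sketch under assumed helper modules: `upsample` doubles spatial resolution, `new_block` is the freshly added convolutional block, and `to_rgb_old` / `to_rgb_new` project features to RGB at the two resolutions; `alpha` ramps linearly from 0 to 1 over the fade-in phase.

```python
# Progressive-growing fade-in: blend the old low-resolution path with the
# newly added higher-resolution block while alpha ramps from 0 to 1.
def faded_output(x, alpha, upsample, new_block, to_rgb_old, to_rgb_new):
    up = upsample(x)                      # low-res features, doubled in size
    skip = to_rgb_old(up)                 # old path: RGB of upsampled features
    full = to_rgb_new(new_block(up))      # new path: through the new block
    return (1 - alpha) * skip + alpha * full
```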

StyleGAN (Karras, Laine and Aila, 2019) reorganised the generator around the idea of style. In a vanilla GAN the latent vector $\mathbf{z}$ is fed at the input and propagated forwards. StyleGAN instead routes $\mathbf{z}$ through an eight-layer mapping network that produces an intermediate latent $\mathbf{w}$ in a learned space $\mathcal{W}$. The synthesis network is a fixed learned constant fed through a stack of convolutional blocks; at each block, $\mathbf{w}$ is broadcast in via adaptive instance normalisation,

$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i},$$

where $y_s$ and $y_b$ are learned affine projections of $\mathbf{w}$. The decoupling has a useful consequence: the mapping network can untangle the factors of variation that a vanilla generator entangles. Coarse-resolution styles control pose, identity and shape; fine-resolution styles control hair texture, micro-expressions and lighting. Mixing experiments (take pose from one image, identity from another) work cleanly. So does latent-space editing: directions in $\mathcal{W}$ correspond to interpretable attributes such as age, gender, beard, glasses and smile, found by simple linear classifiers on labelled samples.
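As a sketch, AdaIN is only a few lines of PyTorch; `style_proj`, the learned affine projection from $\mathbf{w}$ to per-channel scales and biases, is the only parameterised part.

```python
# Adaptive instance normalisation: normalise each feature map per sample,
# then scale and shift it with style parameters projected from w.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, w_dim, channels):
        super().__init__()
        self.style_proj = nn.Linear(w_dim, channels * 2)

    def forward(self, x, w):
        # Per-channel, per-sample statistics over spatial dimensions.
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-8
        y_s, y_b = self.style_proj(w).chunk(2, dim=1)   # scale and bias
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * (x - mu) / sigma + y_b
```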

StyleGAN2 (2020) fixed the characteristic blob artefacts of the original by replacing AdaIN with weight demodulation and removing progressive growing in favour of a skip-connection generator. StyleGAN3 (2021) tackled the texture-sticking problem, the tendency of fine details to remain anchored to image coordinates rather than tracking the underlying object, by treating internal feature maps as continuous signals and applying anti-aliased filtering at every up-sampling step. The result was rotationally and translationally equivariant generation, which mattered for video. For face synthesis specifically, StyleGAN3 held the state of the art until diffusion models surpassed it around 2022, and it remains a respected baseline whenever a fast, controllable face generator is needed.

BigGAN

Brock, Donahue and Simonyan's BigGAN (2019) showed what GANs could do at scale. They took a class-conditional GAN architecture and threw an order of magnitude more compute at ImageNet than any previous GAN had used: batch sizes of 2048, channel widths increased by half (roughly doubling the parameter count), and TPU pods running for days. Class conditioning was provided through class-conditional batch normalisation in the generator and a projection discriminator that takes an inner product between the discriminator's penultimate features and a class embedding.
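A sketch of that projection head, following Miyato and Koyama's 2018 formulation and assuming `phi` is the pooled penultimate feature vector:

```python
# Projection discriminator head: an unconditional linear term plus an
# inner product between the features and a learned class embedding.
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 1)
        self.embed = nn.Embedding(num_classes, feat_dim)

    def forward(self, phi, y):
        # phi: (B, feat_dim) pooled features; y: (B,) integer class labels
        return self.linear(phi).squeeze(1) + (self.embed(y) * phi).sum(1)
```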

Beyond the brute-force scaling, BigGAN introduced two regularisation tricks worth remembering. Orthogonal regularisation softly pushes weight matrices towards orthogonality, stabilising training at large widths. The truncation trick trades diversity for fidelity at sampling time: instead of drawing $\mathbf{z}$ from a standard Gaussian, draw each component from a truncated Gaussian, re-sampling any value whose magnitude exceeds a threshold. A lower threshold produces sharper, more typical samples at the cost of variety. BigGAN was the first model to produce high-resolution, diverse, class-conditional ImageNet samples that a casual observer could mistake for photographs. It held the GAN crown on ImageNet until cascaded diffusion models eclipsed it in 2022.
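The truncation trick itself is tiny. A sketch that re-samples out-of-range components until every one lies within the threshold (the threshold value here is illustrative):

```python
# Truncated sampling for the latent z: re-draw any component whose
# magnitude exceeds the threshold. Smaller thresholds give sharper,
# more typical samples at the cost of diversity.
import torch

def truncated_z(shape, threshold=0.5):
    z = torch.randn(shape)
    out_of_range = z.abs() > threshold
    while out_of_range.any():
        z[out_of_range] = torch.randn(int(out_of_range.sum()))
        out_of_range = z.abs() > threshold
    return z
```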

Conditional GANs

The original conditional GAN (cGAN) of Mirza and Osindero (2014) is a one-line modification of the vanilla GAN: feed a label $y$ to both the generator and the discriminator, so the objective becomes

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x \mid y)] + \mathbb{E}_{\mathbf{z}}[\log(1 - D(G(\mathbf{z} \mid y) \mid y))].$$

This trivial change unlocks a vast range of applications. Pix2pix (Isola et al., 2017) applied the idea to paired image-to-image translation: given paired training examples (a satellite image and its map, an architectural label mask and a photograph, an edge sketch and a finished shoe), train a U-Net generator and a PatchGAN discriminator that classifies overlapping image patches as real or fake. The L1 reconstruction loss is added to the adversarial loss to encourage low-frequency correctness, while the discriminator polices the high-frequency texture. Pix2pix became the workhorse for any task where input–output image pairs were available.
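A sketch of the combined generator objective, assuming the PatchGAN discriminator `D` scores channel-concatenated (input, output) pairs and using the paper's weight $\lambda = 100$ on the L1 term:

```python
# Pix2pix generator loss: adversarial term from a patch-wise conditional
# discriminator plus a weighted L1 reconstruction term.
import torch
import torch.nn.functional as F

def pix2pix_g_loss(D, x, y_real, y_fake, lambda_l1=100.0):
    # D sees the input alongside the candidate output, patch by patch.
    patch_logits = D(torch.cat([x, y_fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(
        patch_logits, torch.ones_like(patch_logits))
    l1 = F.l1_loss(y_fake, y_real)   # enforces low-frequency correctness
    return adv + lambda_l1 * l1
```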

CycleGAN (Zhu et al., 2017) tackled the harder, more interesting setting where paired data are unavailable but two unpaired image collections are at hand: photographs and Monet paintings, horses and zebras, summer and winter scenes. The architecture has two generators, $G : X \to Y$ and $F : Y \to X$, plus two discriminators $D_X$ and $D_Y$. The adversarial losses ensure that $G(x)$ looks like a $Y$-image and $F(y)$ looks like an $X$-image, but adversarial losses alone do not pin down a meaningful translation: a generator could ignore its input entirely and output the same convincing $Y$-image every time. CycleGAN solves this by imposing cycle consistency,

$$\mathcal{L}_{\text{cyc}} = \mathbb{E}_x[\|F(G(x)) - x\|_1] + \mathbb{E}_y[\|G(F(y)) - y\|_1],$$

which forces the round trip to recover the original image. The combined objective produces translations that preserve content while transferring style. CycleGAN spawned an entire sub-genre (UNIT, MUNIT, StarGAN, AttGAN) exploring the same unpaired-translation territory with progressively more flexible style codes and multi-domain support. Its influence reaches well beyond image translation: cycle-consistency losses now appear in unsupervised speech translation, cross-modal retrieval and unpaired domain adaptation.
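The cycle term above translates directly into code. A sketch, writing `G_xy` and `G_yx` for $G$ and $F$ and using the paper's weight $\lambda_{\text{cyc}} = 10$:

```python
# CycleGAN cycle-consistency loss: each round trip X -> Y -> X and
# Y -> X -> Y should reproduce its starting image under an L1 penalty.
import torch.nn.functional as F

def cycle_loss(G_xy, G_yx, x, y, lambda_cyc=10.0):
    forward = F.l1_loss(G_yx(G_xy(x)), x)    # x -> G(x) -> F(G(x)) ~ x
    backward = F.l1_loss(G_xy(G_yx(y)), y)   # y -> F(y) -> G(F(y)) ~ y
    return lambda_cyc * (forward + backward)
```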

Where GAN variants live in 2026

The rise of denoising diffusion since 2020 has displaced GANs from the headline position in image generation. Diffusion models train more stably, scale more predictably and produce more diverse outputs. Most large-scale text-to-image and text-to-video systems (Stable Diffusion, DALL-E 3, Imagen, Sora, Veo) are diffusion-based or score-based. Yet GANs have not vanished. StyleGAN3 remains the go-to choice when fast, single-step, controllable face synthesis is needed, especially for real-time avatar applications where diffusion's iterative sampling is too slow. Pix2pix and CycleGAN persist in niche image-translation pipelines, particularly in domains with limited training data. The GAN objective itself has reappeared as a fine-tuning step on top of diffusion samplers, used to distil multi-step diffusion into single-step generators. Adversarial training, in other words, is still a useful tool, but now usually one tool among several rather than the centrepiece.

What you should take away

  1. Vanilla GANs are unstable; almost every practical system uses one of the variants in this section.
  2. DCGAN gave the field a reliable convolutional architecture; its design rules still inform modern generators.
  3. WGAN-GP replaced the Jensen–Shannon objective with a Wasserstein-based loss, eliminating vanishing gradients and making training diagnosable.
  4. StyleGAN's mapping network and AdaIN injection produced a disentangled latent space, enabling editable, photorealistic face synthesis.
  5. Conditional and cycle-consistent variants extended GANs from unconditional sampling to paired and unpaired image-to-image translation, while BigGAN demonstrated class-conditional generation at ImageNet scale.
