Glossary

Dropout (mathematical detail)

Dropout (Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov, 2014) randomly zeros each activation of a layer with probability $1 - p$ during training, keeping it with probability $p$:

$$\tilde h_i = \begin{cases} h_i / p & \text{with probability } p \\ 0 & \text{with probability } 1 - p \end{cases}$$

The division by $p$ (inverted dropout) keeps the expected activation unchanged, $\mathbb{E}[\tilde h_i] = h_i$, so no rescaling is needed at inference time. (The alternative, original formulation: don't scale during training, and instead multiply activations by $p$ at inference.)
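Writing out the expectation over the two cases makes this explicit:

$$\mathbb{E}[\tilde h_i] = p \cdot \frac{h_i}{p} + (1 - p) \cdot 0 = h_i$$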

Implementation: at training, sample a binary mask $m \in \{0, 1\}^d$ with $m_i \sim \mathrm{Bernoulli}(p)$, compute $\tilde h = h \odot m / p$. At inference, $\tilde h = h$ (no mask).
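A minimal NumPy sketch of this implementation (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def dropout(h, p, training=True, rng=None):
    """Inverted dropout: keep each activation with probability p and scale the
    survivors by 1/p, so the expected activation is unchanged."""
    if not training:
        return h                              # inference: no mask, no rescaling
    if rng is None:
        rng = np.random.default_rng()
    m = rng.random(h.shape) < p               # keep mask, m_i ~ Bernoulli(p)
    return h * m / p

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, p=0.5))                      # e.g. [2. 0. 6. 0.] (random)
print(dropout(h, p=0.5, training=False))      # [1. 2. 3. 4.]
```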

Typical keep probabilities: $p = 0.5$ for fully-connected layers, $p = 0.8$ for input layers, and higher $p$ (less dropout) for convolutional layers, which already get some regularisation from parameter sharing.

Theoretical interpretations:

Approximate model averaging: each forward pass with a random mask is a different sub-network. With $d$ units there are $2^d$ possible sub-networks, and training optimises their shared parameters jointly. At inference, multiplying activations by $p$ approximates the geometric mean of the predictions of these exponentially many sub-networks (Hinton's original interpretation).
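A quick numerical illustration of the averaging view (a sketch with made-up values; for a single linear layer the agreement is exact in expectation, whereas for a full non-linear network the scaled inference pass is only an approximation to the ensemble):

```python
import numpy as np

rng = np.random.default_rng(0)
h, p = np.array([1.0, 2.0, 3.0, 4.0]), 0.5

# Average the outputs of many randomly masked "sub-networks" (inverted-dropout scaling).
samples = np.stack([h * (rng.random(h.shape) < p) / p for _ in range(10_000)])
print(samples.mean(axis=0))  # ~[1. 2. 3. 4.]: matches the single unmasked inference pass
```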

Approximate Bayesian inference (Gal & Ghahramani 2016): dropout is equivalent to variational inference in a Bayesian neural network with a particular variational distribution. Keeping dropout active at test time and averaging over multiple stochastic forward passes yields approximate Bayesian uncertainty estimates (Monte Carlo dropout).
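A minimal Monte Carlo dropout sketch on a toy single-layer "model" (all names and numbers are illustrative); the point is that dropout stays on at test time and the spread across stochastic forward passes serves as the uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 1))            # toy weights of a single linear layer
x = rng.normal(size=(1, 8))            # one test input
p = 0.5                                # keep probability

def mc_forward(x):
    """One stochastic forward pass with dropout left on at test time."""
    mask = (rng.random(x.shape) < p) / p
    return (x * mask) @ W

preds = np.array([mc_forward(x) for _ in range(1000)]).squeeze()
print(preds.mean())   # predictive mean
print(preds.std())    # predictive spread: the (approximate) uncertainty estimate
```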

Implicit regularisation: dropout prevents co-adaptation of features. Each unit must remain useful across many random subsets of the other units, which forces robust, distributed representations.

Variants:

DropConnect: drop individual connections (weights) instead of activations. A finer-grained regulariser; sometimes works better but is harder to implement efficiently.
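A minimal DropConnect sketch (hypothetical function name; for simplicity the weight mask is shared across the whole batch, whereas the original formulation samples a mask per example, which is part of what makes an efficient implementation awkward):

```python
import numpy as np

def dropconnect_linear(x, W, p, training=True, rng=None):
    """Linear layer whose individual weights (not activations) are kept with probability p."""
    if not training:
        return x @ W
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(W.shape) < p) / p   # Bernoulli(p) keep-mask on the weight matrix
    return x @ (W * mask)
```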

Spatial dropout: drop entire feature maps in CNNs, not individual activations. The standard variant for convolutional layers.
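A minimal spatial-dropout sketch for NCHW feature maps (illustrative names): the Bernoulli draw is made once per channel and broadcast over the spatial dimensions, so each feature map is kept or dropped as a whole:

```python
import numpy as np

def spatial_dropout(x, p, training=True, rng=None):
    """x has shape (N, C, H, W); drop whole channels with probability 1 - p."""
    if not training:
        return x
    if rng is None:
        rng = np.random.default_rng()
    n, c = x.shape[:2]
    mask = (rng.random((n, c, 1, 1)) < p) / p   # one Bernoulli draw per (sample, channel)
    return x * mask                              # broadcasts over H and W
```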

Stochastic depth / DropPath: drop entire residual blocks in ResNet/Transformer architectures. The standard regulariser for very deep networks (EfficientNet, modern ViT, large LLMs).
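A minimal stochastic-depth / DropPath sketch (illustrative names): each sample's residual branch is dropped entirely with probability $1 - p$ during training and rescaled by $1/p$ otherwise, while the skip connection is always kept:

```python
import numpy as np

def drop_path(residual, p, training=True, rng=None):
    """Zero the whole residual branch per sample with probability 1 - p (inverted scaling)."""
    if not training:
        return residual
    if rng is None:
        rng = np.random.default_rng()
    n = residual.shape[0]
    gate_shape = (n,) + (1,) * (residual.ndim - 1)   # one gate per sample, broadcast elsewhere
    gate = (rng.random(gate_shape) < p) / p
    return residual * gate

def residual_block(x, f, p, training=True):
    """y = x + DropPath(f(x)); f is the block's transformation."""
    return x + drop_path(f(x), p, training)
```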

Variational dropout (Kingma et al. 2015): learn the dropout rate per parameter from data; rates that are driven towards one effectively prune the corresponding weights.

Concrete dropout: differentiable concrete-distribution relaxation of binary dropout, enabling end-to-end learning of dropout rates.
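A sketch of the concrete (Gumbel-sigmoid) relaxation underlying concrete dropout, assuming `p_drop` denotes the drop probability (i.e. $1 - p$ in the notation above) and is the quantity being learned; in an autodiff framework the soft mask below is differentiable with respect to `p_drop`, and as the temperature goes to zero it approaches a hard Bernoulli mask:

```python
import numpy as np

def concrete_dropout_mask(shape, p_drop, temperature=0.1, rng=None):
    """Relaxed (soft) dropout keep-mask: continuous in (0, 1) rather than a hard 0/1 draw."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-7, 1 - 1e-7, size=shape)                # uniform noise
    logit = (np.log(p_drop) - np.log(1 - p_drop)
             + np.log(u) - np.log(1 - u))
    drop = 1 / (1 + np.exp(-logit / temperature))              # soft "drop" indicator
    return (1 - drop) / (1 - p_drop)                           # soft keep-mask, inverted scaling
```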

Why dropout fell out of favour for some architectures: modern Transformers and large-scale pretraining seem to benefit less from dropout than older fully-connected nets. Possible reasons: massive data + scale + RMSNorm + AdamW provide enough regularisation already, and dropout interferes with BatchNorm/LayerNorm statistics. Modern LLMs use small or zero dropout; modern image classifiers use stochastic depth instead.

Related terms: Dropout, Regularisation, Geoffrey Hinton, Bayesian Inference
