Glossary

Dropout

Dropout, introduced by Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov in 2014 ("Dropout: A Simple Way to Prevent Neural Networks from Overfitting", JMLR), is one of the most influential regularisation techniques in deep learning. The original paper has been cited tens of thousands of times, and dropout in some form appears in a large share of modern neural architectures.

The mechanism

During each training forward pass, each neuron in a designated layer is independently dropped (set to zero) with probability $p$. The surviving activations are scaled by $1/(1-p)$ so that the expected value of each activation remains unchanged. This is inverted dropout, the standard implementation:

$$\tilde{h}_i = \begin{cases} h_i / (1-p) & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$
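As a quick check that the $1/(1-p)$ factor preserves the mean, take the expectation and variance over the random mask, treating $h_i$ as fixed:

$$\mathbb{E}[\tilde{h}_i] = (1-p)\cdot\frac{h_i}{1-p} + p\cdot 0 = h_i, \qquad \operatorname{Var}[\tilde{h}_i] = \frac{h_i^2}{1-p} - h_i^2 = \frac{p}{1-p}\,h_i^2$$

The mean is unchanged for any $p < 1$, while the injected noise grows with $p$.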

At test time, no neurons are dropped; the full network is used and no scaling is applied, because the rescaling was already done during training. Typical dropout rates: $p = 0.5$ for hidden fully-connected layers, and $p = 0.1$ to $0.3$ for input layers and modern transformers.
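The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's implementation; the function name `dropout` and its `training` flag are assumptions made for this example.

```python
import numpy as np

def dropout(h, p, training, rng):
    """Inverted dropout: zero each element with probability p, rescale survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return h                              # test time: identity, no scaling
    keep = rng.random(h.shape) >= p           # True with probability 1 - p
    return h * keep / (1.0 - p)               # survivors scaled by 1 / (1 - p)

rng = np.random.default_rng(0)
h = np.ones((1000, 100))
out_train = dropout(h, p=0.5, training=True, rng=rng)
out_test = dropout(h, p=0.5, training=False, rng=rng)
```

With `p = 0.5` every training-time output is either `0` or `2`, and the average over many units stays close to the original value of `1`.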

Why it works: the ensemble interpretation

Dropout forces the network to develop redundant, distributed representations that are robust to the loss of individual units. No single neuron can be relied upon, so the network must learn features that are useful in many combinations.

The deeper interpretation is that dropout trains an exponentially large ensemble of $2^n$ sub-networks (for $n$ droppable units) that share weights. Each training example in each step sees a different random sub-network, and the final test-time prediction approximates the geometric mean over the ensemble. This combines the variance reduction of ensembling with the memory cost of a single model. The connection to model averaging also gives dropout a Bayesian interpretation (Gal & Ghahramani, 2016: dropout as approximate variational inference in a deep Gaussian process), which is exploited in Monte Carlo dropout for uncertainty estimation.
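The geometric-mean claim can be checked exactly in one special case: a single softmax layer with dropout on its inputs. The sketch below enumerates all $2^n$ masks for a tiny model and compares the renormalised geometric mean of the sub-network predictions with the full test-time network. The sizes and random weights are arbitrary choices for illustration; for deep non-linear networks the match is only approximate.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 4, 3, 0.5               # 4 droppable inputs, 3 classes, drop probability
W = rng.normal(size=(n, k))
b = rng.normal(size=k)
x = rng.normal(size=n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Enumerate all 2^n masks; with p = 0.5 every mask is equally likely.
log_probs = []
for mask in itertools.product([0.0, 1.0], repeat=n):
    x_dropped = np.array(mask) * x / (1.0 - p)    # inverted dropout on the input
    log_probs.append(np.log(softmax(x_dropped @ W + b)))

geo_mean = np.exp(np.mean(log_probs, axis=0))     # geometric mean per class
geo_mean /= geo_mean.sum()                        # renormalise to a distribution

full_net = softmax(x @ W + b)                     # test-time network, no dropout
```

Here `geo_mean` and `full_net` agree up to floating-point error, because averaging the logits over all masks recovers the logits of the undropped input.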

Co-adaptation

Hinton's original motivation framed dropout as a way to break co-adaptation of features. In a fully-connected network without dropout, hidden units can specialise for particular contexts and rely on the presence of other specific units. Dropout breaks these brittle alliances and forces each unit to become useful on its own.

Architectural variants

Dropout is particularly effective in fully connected layers with many parameters. It is less commonly used in modern convolutional architectures, where batch normalisation provides much of the regularisation. In transformers, dropout is typically applied at moderate rates (around $0.1$) inside attention (on the attention probabilities), inside the feed-forward sub-layers, and on the output of each sub-layer before it is added back to the residual connection.
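The transformer placement described above can be sketched as follows. This is a simplified single-head example that omits layer normalisation and multi-head splitting; the function names and weight shapes are assumptions for illustration, and real implementations differ in which of these dropout sites they enable.

```python
import numpy as np

def dropout(a, p, rng, training=True):
    if not training or p == 0.0:
        return a
    return a * (rng.random(a.shape) >= p) / (1.0 - p)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_sublayer(x, Wq, Wk, Wv, Wo, p, rng, training=True):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    probs = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    probs = dropout(probs, p, rng, training)       # attention dropout
    out = (probs @ v) @ Wo
    return x + dropout(out, p, rng, training)      # dropout, then residual add

def ffn_sublayer(x, W1, W2, p, rng, training=True):
    hidden = np.maximum(x @ W1, 0.0)               # ReLU
    hidden = dropout(hidden, p, rng, training)     # dropout inside the feed-forward
    return x + dropout(hidden @ W2, p, rng, training)

rng = np.random.default_rng(0)
T, d = 5, 8                                        # sequence length, model width
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

y_train = ffn_sublayer(attention_sublayer(x, Wq, Wk, Wv, Wo, 0.1, rng),
                       W1, W2, 0.1, rng)
y_eval = ffn_sublayer(attention_sublayer(x, Wq, Wk, Wv, Wo, 0.1, rng, training=False),
                      W1, W2, 0.1, rng, training=False)
```

Note that dropping attention probabilities means a row of `probs` no longer sums to one during training; the $1/(1-p)$ rescaling keeps its expected value unchanged.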

Several variants address specific architectures:

  • DropConnect (Wan et al., 2013): drops weights rather than activations.
  • Spatial Dropout (Tompson et al., 2015): drops entire feature maps in CNNs, preserving spatial coherence.
  • Variational Dropout (Gal & Ghahramani, 2016): uses the same mask across time steps in RNNs, providing principled regularisation for sequence models.
  • DropPath / Stochastic Depth (Huang et al., 2016): drops entire residual blocks, used in EfficientNet and many vision transformers.
  • Cutout / DropBlock (DeVries & Taylor, 2017; Ghiasi et al., 2018): drop contiguous regions of the input image (Cutout) or of feature maps (DropBlock), which is more effective for vision than independent pixel-wise dropout.
  • Attention Dropout and Embedding Dropout: standard in BERT, GPT and successors.
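
Several of the variants above differ mainly in the shape of the random mask and in what it multiplies. The NumPy sketch below makes that concrete. The helper `bernoulli_mask` is hypothetical, and applying the same $1/(1-p)$ rescaling everywhere is a simplification: the original stochastic depth paper, for example, rescales at test time instead, and the original DropConnect draws a separate weight mask per example.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.2
keep = 1.0 - p

def bernoulli_mask(shape):
    """Hypothetical helper: 1/keep with probability keep, otherwise 0."""
    return (rng.random(shape) < keep) / keep

x = rng.normal(size=(8, 16, 32, 32))           # (batch, channels, height, width)

# Standard dropout: one independent draw per activation.
standard = x * bernoulli_mask(x.shape)

# Spatial dropout: one draw per (sample, channel); whole feature maps vanish.
spatial = x * bernoulli_mask((8, 16, 1, 1))

# Stochastic depth: one draw per sample gates an entire residual branch.
branch = np.tanh(x)                             # stand-in for a residual block F(x)
stochastic_depth = x + branch * bernoulli_mask((8, 1, 1, 1))

# DropConnect: the mask multiplies the weights, not the activations.
W = rng.normal(size=(16, 10))
feats = x.mean(axis=(2, 3))                     # (batch, channels)
dropconnect = feats @ (W * bernoulli_mask(W.shape))

# Variational (RNN) dropout: one mask per sequence, reused at every time step.
seq = rng.normal(size=(8, 20, 64))              # (batch, time, features)
variational = seq * bernoulli_mask((8, 1, 64))
```

Broadcasting does the work here: a mask with size-1 axes is shared along those axes, which is what turns element-wise dropout into the channel-wise, sample-wise, or time-shared variants.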

Modern usage

In very large models (LLMs, vision transformers), dropout is often reduced or omitted because (a) the training set is large enough, relative to the number of passes made over it, that overfitting is rare, and (b) the implicit regularisation of stochastic-gradient training is usually considered sufficient. Llama-2 sets dropout to zero. Smaller fine-tuning runs, however, still routinely benefit from dropout rates in the range $0.05$ to $0.1$. Dropout remains a cheap, simple lever for controlling overfitting.


Related terms: Regularisation, Batch Normalisation, Overfitting
