Half of the neurons are silenced randomly, forcing the network to spread information across many paths.
From the chapter: Chapter 10: Training & Optimisation
Glossary: dropout, regularisation
Transcript
A dense layer of activations during a forward pass.
Dropout with rate 0.5 samples a binary mask. On average, half the entries are one and half are zero.
Multiply the activations by the mask. The masked neurons are silenced for this step.
Next forward pass, a fresh mask. Different neurons are silenced.
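A minimal sketch of those steps in NumPy. The 0.5 rate comes from the transcript; the helper name, the random seed, and the array shape are illustrative, not from the source:

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(activations, rate=0.5):
        # Training-time dropout: each mask entry is 1 with probability 1 - rate,
        # 0 otherwise. A fresh mask is drawn on every call, i.e. every forward pass.
        keep_prob = 1.0 - rate
        mask = (rng.random(activations.shape) < keep_prob).astype(activations.dtype)
        return activations * mask, mask

    a = rng.standard_normal((4, 8)).astype(np.float32)   # a toy layer of activations
    dropped, mask = dropout_forward(a)      # roughly half the entries are silenced
    dropped_next, _ = dropout_forward(a)    # next pass: a different mask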
The network cannot lean on any particular neuron, because that neuron might be off next time. It learns redundant, distributed representations.
At test time, no dropout. Every neuron is active. The activations are scaled by the keep probability so their expected magnitude matches what the network saw during training.
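Continuing the sketch above, that scaling is one multiplication. This is the scale-at-test formulation the transcript describes; many frameworks instead use the equivalent inverted dropout, dividing by the keep probability during training so no test-time scaling is needed:

    def dropout_eval(activations, rate=0.5):
        # Test-time: no mask. Scale by the keep probability so the expected
        # magnitude matches what downstream layers saw during training.
        return activations * (1.0 - rate)

    scaled = dropout_eval(a)   # every neuron active, magnitudes matched in expectation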
You can read dropout as ensembling. Each forward pass corresponds to a different sub-network sampled from the full one. There are exponentially many sub-networks, all sharing weights. Test-time evaluation approximates the average of all of them.
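To make the ensembling reading concrete with the sketch above: averaging many masked forward passes converges to the single scaled evaluation. For one layer of fixed activations the match is exact in expectation; in a deep nonlinear network the scaled pass only approximates the average, which is the point of the remark.

    samples = np.stack([dropout_forward(a)[0] for _ in range(10_000)])
    mc_average = samples.mean(axis=0)                     # average over sampled sub-networks
    print(np.abs(mc_average - dropout_eval(a)).max())     # small, shrinks with more samples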
Dropout is a regulariser. It reduces overfitting. It was crucial for the early success of large feed-forward and convolutional nets.
Modern transformers use it lightly, often only on the attention output and inside the feed-forward layers, and it is frequently disabled entirely for the largest models. With enough data, the implicit regularisation of huge batches and weight decay does the same job. But the idea of training with random masks has reappeared in many guises.