9.10 Initialisation

Before training begins, every weight in the network must be set to some number. That choice, what to put there at step zero, turns out to matter enormously. A deep neural network is a long pipeline of multiplications, and any small bias in scale gets compounded layer after layer. If, on average, each layer makes its inputs a bit smaller, then by the tenth layer the signal has been crushed almost to zero; if each layer makes its inputs a bit larger, the values grow geometrically and overflow. Either failure mode kills training: there is no useful gradient when the activations are uniformly zero, and there is nothing finite to learn from when they are infinite. A good initialisation scheme is one that keeps the variance of activations roughly constant as we move forward through the layers, so that no layer becomes a bottleneck of its own making. This section explains why that goal matters, derives the formula that follows from it, and presents the two recipes, Xavier/Glorot and He/Kaiming, that almost every modern network uses.

Each activation function from §9.4 reshapes the variance of its input differently, so the right initialisation must compensate for that reshaping. §9.11 examines the vanishing and exploding gradient problem in more depth and shows that initialisation is one of several tools, alongside normalisation, residual connections, and gradient clipping, used to keep the signal alive in very deep networks.

Symbols Used Here
  • $\mathbf{W}^{(\ell)}$: weight matrix at layer $\ell$
  • $d_\ell$: number of neurons in layer $\ell$
  • $d_{\ell-1}$: fan-in, the number of inputs to a neuron in layer $\ell$
  • $d_\ell$: fan-out, the number of outputs (used in symmetric variants)
  • $\mathrm{Var}(\cdot)$: variance
  • $\mathbb{E}[\cdot]$: expectation
  • $\mathcal{N}(\mu, \sigma^2)$: Gaussian distribution with mean $\mu$ and variance $\sigma^2$
  • $\mathcal{U}(a, b)$: uniform distribution on the interval $[a, b]$
  • $\sigma$, $\mathrm{ReLU}$: activation functions (sigmoid; rectified linear unit)

Why bad initialisation breaks training

There are three classic ways to ruin a network before the first gradient step.

Failure mode 1: weights too small. Suppose every weight is drawn from a Gaussian with standard deviation $10^{-3}$, so a typical weight is around 0.001. A neuron in layer 2 computes a weighted sum of, say, 100 inputs, each multiplied by something tiny. The output is the sum of 100 numbers, each roughly $0.001$ in magnitude, with random signs. Their sum has standard deviation around $\sqrt{100} \times 0.001 = 0.01$. Pass this into a sigmoid and you get values clumped right at $\sigma(0) = 0.5$, with derivative $\sigma'(0) = 0.25$. So far that is not yet catastrophic. But now do it again at layer 3, layer 4, layer 5. Each layer scales the signal by roughly $\sqrt{d_{\ell-1}} \cdot 0.001$, which for $d_{\ell-1} = 100$ is $0.01$. After 10 layers the signal is around $10^{-20}$, indistinguishable from numerical zero. The forward pass gives almost identical outputs for every input, the loss has almost no useful gradient with respect to early weights, and the network simply does not learn. We say the activations vanish.

Failure mode 2: weights too large. Now flip the problem. Every weight is drawn from a Gaussian with standard deviation 10, so a typical weight has magnitude around 10. The same layer-2 neuron now sums 100 inputs each scaled by something around 10, with standard deviation $\sqrt{100} \times 10 = 100$. A ReLU activation just passes positive values through unchanged, so layer 3 receives inputs of size around 100, and its outputs have standard deviation around $\sqrt{100} \times 10 \times 100 = 10\,000$. By layer 5 we are at $10^8$, by layer 10 at $10^{18}$. The forward pass overflows to inf, the backward pass produces NaN, and the entire run dies. We say the activations explode. Sigmoid networks fail differently but just as fatally: the activations saturate at 1 (or 0), where the derivative is essentially zero, so gradients vanish because the values are too big, not too small.

Failure mode 3: identical weights. Suppose, perhaps from a coding bug, every weight is set to the same value, say, all zeros, or all ones. Then every neuron in a given layer receives the same weighted sum of the same inputs. They all compute the same function, produce the same output, receive the same gradient on the backward pass, and update by the same amount. From the network's perspective, you do not have a layer of width 100, you have a layer of width 1, copied 100 times. No amount of training will distinguish the units, because nothing in the dynamics ever breaks the symmetry. The network is rank-deficient by construction.

To make the compounding concrete: imagine a 5-layer fully connected network with input dimension 100, hidden width 100, and weights drawn from $\mathcal{N}(0, 1)$ (variance one, not the small variance we will derive in a moment). With unit-variance inputs, each layer multiplies the variance by $d_{\ell-1} \times 1 = 100$. After 5 layers the output variance is $100^5 = 10^{10}$. The standard deviation is around $10^5$. Try fitting a softmax cross-entropy loss to outputs of that scale and you will get instant NaNs. Pick the opposite extreme, variance $10^{-4}$ per weight, and after 5 layers the variance is $(100 \times 10^{-4})^5 = 10^{-10}$. The signal is dead. The whole point of careful initialisation is to find the variance that makes neither of those things happen, layer after layer, regardless of depth.
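Here is a minimal NumPy sketch of that calculation (the batch size of 1,000 is an arbitrary choice): it pushes a unit-variance batch through five purely linear layers of width 100, once with weight variance 1 and once with $10^{-4}$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))           # unit-variance inputs, width 100

for var_w in (1.0, 1e-4):                      # the two extremes from the text
    a = x
    for layer in range(5):
        W = rng.standard_normal((100, 100)) * np.sqrt(var_w)
        a = a @ W.T                            # linear layers only, no activation
    print(f"weight variance {var_w:g}: output variance ~ {a.var():.3g}")
# roughly 1e10 for variance 1, and roughly 1e-10 for variance 1e-4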

The variance-preservation principle

The fix is to choose the initial weight distribution so that the variance of each layer's pre-activations matches the variance of the previous layer's activations. If we can pull off that trick at layer 1, layer 2, and so on, then the signal scale is invariant in depth, and the network starts in a regime where no layer is dominating or vanishing.

Here is the derivation. Consider a single neuron in layer $\ell$. Its pre-activation is

$$z^{(\ell)}_j = \sum_{i=1}^{d_{\ell-1}} W_{ij}^{(\ell)} \, a_i^{(\ell-1)},$$

where $a_i^{(\ell-1)}$ is the activation of the $i$-th neuron in the previous layer and $W_{ij}^{(\ell)}$ is the weight connecting it to the current neuron. We make four assumptions, all reasonable at initialisation:

  1. The weights $W_{ij}^{(\ell)}$ are independent and identically distributed, with $\mathbb{E}[W_{ij}] = 0$ and $\mathrm{Var}(W_{ij}) = \sigma_w^2$.
  2. The inputs $a_i^{(\ell-1)}$ are independent and identically distributed, with $\mathbb{E}[a_i^{(\ell-1)}] = 0$ and $\mathrm{Var}(a_i^{(\ell-1)}) = \sigma_a^2$.
  3. Weights and activations are independent of each other.
  4. We are at initialisation, so the assumptions in 1–3 are not yet violated by training dynamics.

The variance of a sum of independent zero-mean terms is the sum of their variances. The variance of a product of independent zero-mean random variables $X$ and $Y$ is $\mathrm{Var}(XY) = \mathrm{Var}(X)\mathrm{Var}(Y)$ (because $\mathbb{E}[XY] = 0$, so $\mathrm{Var}(XY) = \mathbb{E}[X^2 Y^2] = \mathbb{E}[X^2]\mathbb{E}[Y^2] = \mathrm{Var}(X)\mathrm{Var}(Y)$). Putting these together,

$$\mathrm{Var}(z^{(\ell)}_j) = \sum_{i=1}^{d_{\ell-1}} \mathrm{Var}(W_{ij}) \mathrm{Var}(a_i^{(\ell-1)}) = d_{\ell-1} \cdot \sigma_w^2 \cdot \sigma_a^2.$$

This is the key equation. The pre-activation variance equals the fan-in times the weight variance times the input variance. To make the pre-activation variance equal the input variance, $\mathrm{Var}(z^{(\ell)}_j) = \sigma_a^2$, we need

$$\boxed{\sigma_w^2 = \frac{1}{d_{\ell-1}}}.$$

This is the variance-preservation rule. It is the single most important formula in initialisation theory, and every modern recipe is a variation on it. In English: draw weights from a distribution whose variance is one over the number of inputs to the neuron.

A worked example. Suppose layer $\ell$ has fan-in $d_{\ell-1} = 256$. Then $\sigma_w^2 = 1/256 \approx 0.0039$, and the standard deviation of each weight is $\sqrt{1/256} = 1/16 = 0.0625$. So we sample each weight from $\mathcal{N}(0, 0.0039)$, giving typical weights around $\pm 0.06$. Plug those weights into the formula: the variance of the pre-activation is $256 \times 0.0039 \times \sigma_a^2 = \sigma_a^2$. Variance preserved.
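A quick numerical check of the same numbers (a sketch; the batch size is arbitrary): draw weights with variance $1/256$, propagate a unit-variance batch, and the pre-activation variance comes out close to 1.

rng = np.random.default_rng(0)
d_in = 256
a_prev = rng.standard_normal((10_000, d_in))               # unit-variance activations
W = rng.standard_normal((d_in, d_in)) * np.sqrt(1.0 / d_in)  # Var(W) = 1 / fan-in
z = a_prev @ W.T
print(z.var())   # close to 1.0: variance preserved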

The variance-preservation rule is incomplete in two ways, and the two famous initialisation schemes, Xavier and He, patch the two gaps.

First, the rule above keeps the forward variance constant but says nothing about gradients flowing backwards. By a symmetric argument, preserving gradient variance through the backward pass requires $\sigma_w^2 = 1/d_\ell$ (one over the fan-out), not $1/d_{\ell-1}$. Glorot and Bengio's compromise is to use the harmonic average. Second, the rule above assumes activations are linear: it ignores the fact that an activation function reshapes the variance of its input. He et al.'s correction accounts for what ReLU does to the variance.

Xavier/Glorot initialisation (sigmoid, tanh)

Glorot and Bengio (2010), in Understanding the difficulty of training deep feedforward neural networks, argued that we want both the forward-pass activations and the backward-pass gradients to have stable variance. The forward condition gives $\sigma_w^2 = 1/d_{\ell-1}$. Running the same calculation on the backward pass, where each layer's gradient is multiplied by $\mathbf{W}^\top$ rather than $\mathbf{W}$, gives $\sigma_w^2 = 1/d_\ell$. We cannot satisfy both at once unless $d_{\ell-1} = d_\ell$, so they proposed splitting the difference using the harmonic mean of the two:

$$\mathrm{Var}(W_{ij}) = \frac{2}{d_{\ell-1} + d_\ell}.$$

This formula is appropriate for activation functions that are roughly linear near $z = 0$, such as $\tanh$ (whose derivative at the origin is exactly 1) and the sigmoid (whose derivative at the origin is 0.25, but which is similarly smooth and close to linear there). For these activations, we want the pre-activation $z$ to stay in a small enough range that the activation does not saturate, so that gradients can still flow through.

In practice you can sample either from a Gaussian or from a uniform distribution with the same variance:

  • Normal: $W_{ij} \sim \mathcal{N}\!\left(0,\, \dfrac{2}{d_{\ell-1} + d_\ell}\right)$
  • Uniform: $W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\dfrac{6}{d_{\ell-1} + d_\ell}},\; +\sqrt{\dfrac{6}{d_{\ell-1} + d_\ell}}\right)$

The factor of 6 in the uniform version is because a uniform distribution on $[-a, a]$ has variance $a^2/3$, so to match the target variance $2/(d_{\ell-1} + d_\ell)$ we need $a^2/3 = 2/(d_{\ell-1} + d_\ell)$, i.e. $a = \sqrt{6/(d_{\ell-1} + d_\ell)}$.

A worked example. A hidden layer with fan-in 100 and fan-out 50 needs

$$\sigma_w^2 = \frac{2}{100 + 50} = \frac{2}{150} \approx 0.01333,$$

so $\sigma_w \approx 0.1155$. The Gaussian recipe is therefore $W \sim \mathcal{N}(0, 0.01333)$. The equivalent uniform recipe is $W \sim \mathcal{U}(-0.2, +0.2)$, since $\sqrt{6/150} = \sqrt{0.04} = 0.2$. Either way, a typical weight is around $\pm 0.12$. Put those numbers into a tanh layer with unit-variance input: the pre-activation has variance $100 \times 0.01333 \times 1 \approx 1.33$, which is close enough to 1 that tanh stays in its linear regime and learning proceeds.

import numpy as np

def xavier_normal(d_in, d_out, rng):
    # Gaussian Xavier/Glorot: variance 2 / (fan-in + fan-out)
    std = np.sqrt(2.0 / (d_in + d_out))
    return rng.standard_normal((d_out, d_in)) * std

def xavier_uniform(d_in, d_out, rng):
    # Uniform Xavier/Glorot: U(-a, a) with a = sqrt(6 / (fan-in + fan-out))
    a = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-a, a, size=(d_out, d_in))

In PyTorch, the same scheme is torch.nn.init.xavier_normal_(W) and torch.nn.init.xavier_uniform_(W). It is the right default for any network whose hidden activations are tanh or sigmoid.
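For example, a minimal usage sketch (the layer sizes are arbitrary, and PyTorch is assumed to be available):

import torch.nn as nn

layer = nn.Linear(100, 50)                  # fan-in 100, fan-out 50
nn.init.xavier_normal_(layer.weight)        # variance 2 / (100 + 50)
nn.init.zeros_(layer.bias)                  # zero biases (see the bias section below)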

He/Kaiming initialisation (ReLU)

ReLU breaks the assumption behind Xavier. The function $\mathrm{ReLU}(z) = \max(0, z)$ throws away every negative input. If $z$ is symmetric around zero, which it is at initialisation, half the activations are exactly zero, and the second moment of the output is exactly half the input variance. He et al. (2015), in Delving deep into rectifiers, did the calculation properly: for $z \sim \mathcal{N}(0, \sigma_z^2)$,

$$\mathbb{E}[\mathrm{ReLU}(z)^2] = \frac{1}{2} \sigma_z^2.$$

This is what propagates through to the next layer's pre-activation, which is the quantity the He derivation needs to keep stable. (The actual variance, $\mathrm{Var}(\mathrm{ReLU}(z)) = \tfrac{1}{2}\sigma_z^2 (1 - 1/\pi)$, is smaller because $\mathrm{ReLU}(z)$ has a non-zero mean of $\sigma_z/\sqrt{2\pi}$; only the second moment is the right input to the variance-preservation argument.) To preserve forward variance through a ReLU layer, the pre-activation variance must be twice what Xavier would give. Compensate by doubling the weight variance:

$$\mathrm{Var}(W_{ij}) = \frac{2}{d_{\ell-1}}.$$

This is the He (or Kaiming, after the first author Kaiming He) initialisation. The factor of 2 in the numerator is precisely the correction for ReLU's variance-halving. In practice:

  • Normal (most common): $W_{ij} \sim \mathcal{N}\!\left(0,\, \dfrac{2}{d_{\ell-1}}\right)$
  • Uniform: $W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\dfrac{6}{d_{\ell-1}}},\; +\sqrt{\dfrac{6}{d_{\ell-1}}}\right)$

A worked example. A hidden ReLU layer with fan-in 100 needs $\sigma_w^2 = 2/100 = 0.02$, so $\sigma_w \approx 0.1414$. Typical weights are around $\pm 0.14$. The pre-activation variance is $100 \times 0.02 \times 1 = 2$, the ReLU halves it to 1, and the output variance is back to 1. Variance preserved through the non-linearity.
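Both the halving of the second moment and the smaller variance mentioned above are easy to verify by Monte Carlo (a sketch; $\sigma_z = 3$ is an arbitrary choice):

z = np.random.default_rng(0).standard_normal(1_000_000) * 3.0   # sigma_z = 3
r = np.maximum(z, 0.0)
print(np.mean(r**2) / 9.0)    # second moment / sigma_z^2  -> about 0.5
print(np.var(r) / 9.0)        # variance / sigma_z^2       -> about 0.5 * (1 - 1/pi), i.e. 0.34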

He initialisation is the default for any ReLU-like activation: vanilla ReLU, Leaky ReLU, Parametric ReLU, ELU, and GELU all use the same recipe (sometimes with a small adjustment for the slope of the leaky region). PyTorch implements it as torch.nn.init.kaiming_normal_(W, nonlinearity='relu') and torch.nn.init.kaiming_uniform_(W, nonlinearity='relu'). The empirical effect is dramatic: He et al.'s paper reports that a 30-layer ReLU network initialised with the Xavier scheme fails to converge, whereas the same network with He initialisation trains stably with standard learning rates.

def he_normal(d_in, d_out, rng):
    # He/Kaiming: variance 2 / fan-in, compensating for ReLU's halving
    std = np.sqrt(2.0 / d_in)
    return rng.standard_normal((d_out, d_in)) * std
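To see the factor of 2 doing its job, here is a sketch (the depth and widths are arbitrary) that pushes a unit-variance batch through 20 ReLU layers, once with the He rule $2/d_{\ell-1}$ and once with the plain $1/d_{\ell-1}$ rule:

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 100))

for label, numerator in (("He, 2/d", 2.0), ("plain, 1/d", 1.0)):
    a = x
    for _ in range(20):
        W = rng.standard_normal((100, 100)) * np.sqrt(numerator / 100)
        a = np.maximum(a @ W.T, 0.0)        # linear layer followed by ReLU
    print(f"{label}: mean-square activation after 20 layers = {(a**2).mean():.2e}")
# The He run stays near 1; the plain rule decays by roughly a factor of 2 per layer.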

Bias initialisation

Biases are usually initialised to zero. There is no symmetry-breaking issue, because the weights break the symmetry: even with all biases starting at zero, every neuron sees a different weighted sum of inputs and computes a different value. Zero biases also avoid arbitrary offsets that the optimiser would have to undo before learning could begin.

There are two common exceptions. First, some practitioners initialise the biases of ReLU layers to a small positive value such as 0.01, in order to ensure that, on average, the pre-activation $z = \mathbf{w}^\top \mathbf{x} + b$ is slightly positive. This guards against the dying ReLU problem, where a unit ends up in a regime where its pre-activation is always negative, so its output is always zero, so its gradient is always zero, and it never recovers. A small positive bias keeps the unit alive long enough for training to find a useful direction. Second, in LSTM cells the bias of the forget gate is often initialised to 1 (or 2), so the forget gate starts close to fully open. This encourages the recurrent state to persist over many time steps, which empirically makes long-range dependencies easier to learn.
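A usage sketch for the LSTM case (the layer sizes are arbitrary; the slice below relies on PyTorch's documented gate ordering of input, forget, cell, output within each bias vector):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64)
H = lstm.hidden_size
with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if "bias" in name:                  # bias_ih_l0 and bias_hh_l0
            bias.zero_()
            bias[H:2 * H].fill_(1.0)        # forget-gate slice; the two biases add,
                                            # so the effective forget bias starts at 2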

Beyond these two cases, zero is fine. (PyTorch's own defaults for nn.Linear and nn.Conv2d actually draw biases from a small uniform range scaled by $1/\sqrt{\text{fan-in}}$ rather than zeroing them; either choice works in practice.)

Initialisation for embedding layers and special architectures

Embedding layers map discrete tokens (words, IDs) to dense vectors. They are usually initialised with a small Gaussian, $W \sim \mathcal{N}(0, \sigma^2)$, either with a fixed $\sigma$ around $0.01$ to $0.02$ or with variance roughly $1/d$ where $d$ is the embedding dimension, so that the squared norm of an embedding vector starts around 1, neither tiny nor exploding. GPT-2 famously uses $\mathcal{N}(0, 0.02^2)$ throughout.

Convolutional layers use He initialisation, but the fan-in is computed slightly differently. For a Conv2d layer with input channels $C_{\mathrm{in}}$ and kernel size $k \times k$, each output value is a sum of $C_{\mathrm{in}} \cdot k \cdot k$ products. So the effective fan-in is $C_{\mathrm{in}} k^2$, and the weight variance is $\sigma_w^2 = 2 / (C_{\mathrm{in}} k^2)$. A 64-channel input with a $3 \times 3$ kernel has fan-in $64 \times 9 = 576$, giving $\sigma_w \approx \sqrt{2/576} \approx 0.0589$.
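As a sketch (the helper name is ours, not a library function), the arithmetic is:

def he_std_conv(c_in, kernel_size):
    # effective fan-in of a kernel_size x kernel_size convolution with c_in input channels
    fan_in = c_in * kernel_size * kernel_size
    return np.sqrt(2.0 / fan_in)

print(he_std_conv(64, 3))   # about 0.0589, matching the numbers above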

Residual connections ($\mathbf{a}^{(\ell+1)} = \mathbf{a}^{(\ell)} + F(\mathbf{a}^{(\ell)})$) add a special twist: the residual branch $F(\cdot)$ is added on top of the identity. If $F$ already has unit-variance output, the sum has variance 2, which compounds across blocks. A common fix is to initialise the last weight matrix in each residual block with a smaller variance, often by scaling the weights by $1/\sqrt{2L}$ for a network with $L$ residual blocks. Related schemes go by the names Fixup and SkipInit.
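A sketch of the downscaling, assuming He initialisation for the block's output projection and $L$ residual blocks (the function name is illustrative):

def residual_output_init(d_in, d_out, n_blocks, rng):
    # He-initialised weights, downscaled by 1/sqrt(2L) so that summing the outputs
    # of all residual branches does not inflate the variance of the residual stream
    W = rng.standard_normal((d_out, d_in)) * np.sqrt(2.0 / d_in)
    return W / np.sqrt(2.0 * n_blocks)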

Transformer layers combine all of the above: token embeddings are $\mathcal{N}(0, 0.02^2)$, attention weight matrices use He or Xavier with a small extra scale factor, and the output projection of each block is downscaled by $1/\sqrt{2L}$ to keep the residual stream from drifting in scale.

Symmetry breaking, in plain words

If you initialise every weight in a layer to the same number, all zeros, all ones, all 0.5, every neuron in that layer is interchangeable. They all see the same inputs, compute the same weighted sum, produce the same output, and on the backward pass receive the same gradient. They update in lockstep. From the optimiser's perspective the layer behaves like one neuron copied a hundred times. No matter how long you train, those neurons remain identical, and the layer remains rank-1 forever. The whole point of having 100 neurons in parallel, that they should learn 100 different features, is lost.

The cure is to make the initial weights random. The randomness need not be elaborate; it just needs to break the tie between neurons. Once two neurons start with even slightly different weights, they will see slightly different gradient updates on the first batch, and from there their trajectories diverge.

Notice the division of labour. The variance of the initial distribution is what practitioners tune carefully; that is what Xavier and He are about. The randomness itself is what is essential for symmetry breaking; that is non-negotiable. A common bug, especially in custom layers, is to forget the random part and end up with a perfectly scaled but perfectly symmetric initialisation that trains as a rank-1 layer. If your loss curve shows a network that learns a tiny amount and then plateaus, with all neurons in some layer producing nearly identical activations, suspect a symmetry-breaking bug.

Practical recipe

  1. Pick the scheme by activation. Use He/Kaiming for any ReLU-family activation (ReLU, Leaky ReLU, GELU, ELU). Use Xavier/Glorot for tanh, sigmoid, or any near-linear activation. For embedding layers, use a small Gaussian like $\mathcal{N}(0, 0.02^2)$.
  2. Initialise biases to zero. If you are training a ReLU network and you observe many dead units after a few iterations, retry with biases initialised to a small positive value such as 0.01. For LSTM forget gates, set the bias to 1.
  3. Trust the framework defaults. PyTorch's nn.Linear already uses Kaiming-uniform initialisation. Unless you have a measured reason to override it (e.g. you are building a deep residual network from scratch or replicating a paper that specifies a different scheme), just use what the library gives you.
  4. Diagnose by inspection. If training stalls in the first few iterations, log the per-layer activation variance and gradient norm (a minimal sketch of the former follows this list). If activations shrink monotonically with depth (vanishing) or grow monotonically (exploding), the initialisation scheme or the activation function is wrong for your architecture. The fix is almost always to switch to He, to add normalisation (§9.13), or to add residual connections.
  5. Modern transformer scaling. For very deep transformers, divide the variance of residual-block output projections by $2L$ (or the standard deviation by $\sqrt{2L}$), where $L$ is the number of layers. This prevents the residual stream from drifting in scale across blocks. A related option is Layer-Sequential Unit-Variance (LSUV) initialisation, which orthogonally initialises each layer and then rescales it to unit output variance using a single forward pass.
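A minimal diagnostic in the spirit of item 4, reusing he_normal from above (the widths, depth, and batch size are arbitrary):

def report_activation_variance(weights, x, act=lambda z: np.maximum(z, 0.0)):
    # Forward one batch and print the activation variance at every layer.
    # A healthy initialisation keeps these roughly constant with depth;
    # monotone shrinkage or growth means the scheme does not match the activation.
    a = x
    for ell, W in enumerate(weights, start=1):
        a = act(a @ W.T)
        print(f"layer {ell}: activation variance = {a.var():.4g}")

rng = np.random.default_rng(0)
weights = [he_normal(100, 100, rng) for _ in range(10)]
report_activation_variance(weights, rng.standard_normal((256, 100)))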

What you should take away

  1. Variance preservation is the single design goal. A good initialisation keeps the variance of activations roughly constant across layers, so signal neither vanishes nor explodes as depth grows.
  2. Sample weights with variance $1/d_{\ell-1}$, give or take a constant. That is the variance-preservation rule. Xavier uses $2/(d_{\ell-1} + d_\ell)$ to balance forward and backward; He uses $2/d_{\ell-1}$ to compensate for ReLU.
  3. Match the scheme to the activation. Xavier for tanh and sigmoid; He for ReLU and its relatives; small Gaussian for embeddings; downscaled projections for residual blocks.
  4. Biases at zero, with two exceptions. Use small positive biases for ReLU layers prone to dying units, and biases of 1 for LSTM forget gates.
  5. Random, not constant, is essential. Identical initial weights make every neuron in a layer compute the same function and stay that way. Symmetry breaking requires real randomness.
