2.11 Numerical stability
A theorem about real numbers does not automatically hold on a real computer. The first ten sections of this chapter took the convenient view of a mathematician: vectors had infinite-precision components, matrix inverses existed when determinants were non-zero, and singular value decompositions cleanly separated signal from noise. Once those formulas leave the page and run on actual silicon, every number is rounded, every long sum drifts a little, and a few perfectly innocent-looking expressions become silent disasters. Every machine-learning practitioner discovers this the painful way at some point, usually at three in the morning, watching a training loss explode to NaN for no reason at all. This section catalogues the most common numerical hazards in linear algebra and the standard responses that tame them.
The same matrix decompositions, the same softmax, the same inner products that we developed in §§2.1–2.10 all reappear here, now wearing a budget of finite precision. Numerical stability governs how mixed-precision training is set up, how large language models are served at inference time, how iterative solvers converge in numerical methods, and how a careless piece of code can quietly throw away half its accuracy without telling you. The defensive techniques are few and well understood; the rest is discipline.
Floating-point basics
Computers cannot store real numbers. They can only store rational numbers with a fixed budget of bits, and the IEEE 754 standard specifies how that budget is spent. A floating-point number is written as
$$x = (-1)^{s} \cdot m \cdot 2^{e}$$
where $s$ is a single sign bit, $m$ is a mantissa (also called the significand), and $e$ is a signed exponent. The mantissa carries the digits of the number; the exponent shifts the binary point. Larger mantissa, more precision; larger exponent field, larger range. The various floating-point formats are simply different ways to slice up a fixed total of bits between these three parts.
Four formats matter for modern machine learning:
- fp32 (single precision): 1 sign bit, 8 exponent bits, 23 mantissa bits, 32 bits total. Machine epsilon $\epsilon_{\text{mach}} \approx 1.2 \times 10^{-7}$, which is about seven decimal digits of precision. Range from roughly $10^{-38}$ to $10^{38}$. The historical default for scientific computing and the safe fallback throughout deep learning.
- fp16 (half precision): 1 sign, 5 exponent, 10 mantissa, 16 bits total. $\epsilon_{\text{mach}} \approx 10^{-3}$, about three decimal digits. Range only about $10^{-5}$ to $6.5 \times 10^{4}$, which is dangerously narrow: gradients in a deep network can underflow to zero or overflow to infinity if you are not careful.
- bf16 (brain floating-point, popularised by Google's TPUs): 1 sign, 8 exponent, 7 mantissa, 16 bits total. The same exponent width as fp32, so the same enormous dynamic range, paid for with even less precision than fp16 (about two to three decimal digits). The dominant choice on modern GPUs because gradients almost never overflow.
- fp64 (double precision): 1 sign, 11 exponent, 52 mantissa, 64 bits total. About 16 decimal digits, range $10^{\pm 308}$. Twice the memory and bandwidth cost of fp32; rare in deep learning but the default in scientific computing and statistical software.
The non-obvious consequence is that most decimal numbers cannot be represented exactly. The fraction $1/10$ is a repeating binary fraction, just as $1/3$ is a repeating decimal in base ten. In fp32 the value $0.1$ is actually stored as $0.10000000149011612\ldots$ The error per number is tiny, about one part in $10^{7}$, but errors accumulate. Sum a billion fp32 ones and the total is not $10^{9}$; it stops growing at $2^{24} = 16{,}777{,}216 \approx 1.7 \times 10^{7}$, because by then the running sum is so large that adding another $1$ falls below the precision threshold and gets rounded away.
The single most important consequence to internalise is this: $1 + \epsilon_{\text{mach}}/2$ in fp32 equals exactly $1$. There is no representable number between $1$ and $1 + \epsilon_{\text{mach}}$. Anywhere your algebra silently relied on a small but non-zero contribution surviving an addition with something much larger, the small contribution has just vanished without warning.
A second consequence is that floating-point addition is not associative. The expression $(a + b) + c$ can differ from $a + (b + c)$ in the last few bits, because the intermediate rounding depends on the order of operations. Most of the time the difference is invisible, but it explains why two implementations of the same mathematical formula sometimes produce slightly different numbers, and why running the same training script twice on a GPU may not give bit-identical results unless you explicitly request determinism. The point is not that floating-point is broken (it is the best practical compromise we have) but that any reasoning which assumes the algebra of real numbers carries over exactly is reasoning on borrowed time.
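These behaviours are easy to reproduce. The NumPy snippet below is a minimal illustration of the facts just described: the stored value of $0.1$, an addition that silently does nothing, a sum whose result depends on the grouping, and the stalled running total; the specific values are chosen only for the demonstration.

```python
import numpy as np

# 0.1 has no exact binary representation; printing enough digits reveals the stored value.
print(f"{np.float32(0.1):.17f}")                   # 0.10000000149011612

# Adding half of machine epsilon to 1.0 is a silent no-op in fp32.
eps = np.finfo(np.float32).eps                     # ~1.19e-07
print(np.float32(1.0) + np.float32(eps / 2) == np.float32(1.0))   # True

# Addition is not associative: the intermediate rounding depends on the grouping.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c, a + (b + c))                    # 1.0 0.0

# A running fp32 sum of ones stalls at 2**24: adding one more no longer changes it.
total = np.float32(2**24)
print(total + np.float32(1.0) == total)            # True
```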
Common numerical hazards
Four hazards account for the vast majority of numerical bugs in machine learning code.
Catastrophic cancellation is the loss of significant digits when subtracting two nearly-equal numbers. Consider $1.0000001 - 1.0$ evaluated in fp32. The first operand is stored as approximately $1.00000011920928\ldots$ (the closest representable value), the second as exactly $1.0$. The difference is roughly $10^{-7}$, but only the first significant digit is meaningful; the rest is rounding noise from the original representation. Six digits of precision have evaporated in a single subtraction. The naive variance formula $\operatorname{Var}(X) = E[X^2] - E[X]^2$ is the textbook example: if the data has small variance and large mean, both terms on the right are large and nearly equal, and the subtraction can produce a negative answer (mathematically impossible) or zero where the true variance is small but positive. Welford's online algorithm computes the variance directly from running deviations and avoids the cancellation entirely.
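A small NumPy comparison makes the failure concrete. The two functions below are sketches, not library code: the first uses the naive formula, the second Welford's running-deviation update, and the synthetic data (large mean, unit variance) is chosen specifically to trigger the cancellation.

```python
import numpy as np

def naive_variance(x):
    # E[X^2] - E[X]^2: two large, nearly equal terms, so the subtraction cancels.
    x = np.asarray(x, dtype=np.float32)
    return np.mean(x * x) - np.mean(x) ** 2

def welford_variance(x):
    # Welford's online algorithm: accumulate deviations from a running mean instead.
    mean = np.float32(0.0)
    m2 = np.float32(0.0)
    for n, value in enumerate(np.asarray(x, dtype=np.float32), start=1):
        delta = value - mean
        mean = mean + delta / np.float32(n)
        m2 = m2 + delta * (value - mean)
    return m2 / np.float32(n)        # population variance

rng = np.random.default_rng(0)
data = 1e6 + rng.standard_normal(100_000)    # large mean, true variance ~1
print(naive_variance(data))     # far from 1, and can even come out negative
print(welford_variance(data))   # close to 1
```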
Loss of significance from accumulation is the cumulative version of the same problem. Sum a billion fp32 values one at a time and you build up rounding error proportional to the number of additions. Sum the same values pairwise (add neighbours, then neighbours of neighbours, and so on) and the error grows only logarithmically in the number of terms. Kahan summation goes further by keeping a running compensation term that captures the rounding error of each addition and folds it back in on the next step. Modern numerical libraries use one of these schemes by default; if you are writing your own reduction, do not accumulate it with a naive left-to-right loop.
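Here is a sketch of Kahan's compensated summation next to a naive left-to-right loop, using the classic exercise of adding $0.1$ a million times in fp32. The exact drift depends on the platform, but the naive total lands visibly away from $100{,}000$.

```python
import numpy as np

def naive_sum(values):
    total = np.float32(0.0)
    for v in values:                # plain left-to-right accumulation
        total += v
    return total

def kahan_sum(values):
    # Compensated summation: recover the rounding error of each add and feed it back in.
    total = np.float32(0.0)
    comp = np.float32(0.0)
    for v in values:
        y = v - comp                # re-inject the error lost on the previous step
        t = total + y               # low-order bits of y are lost here...
        comp = (t - total) - y      # ...but can be reconstructed algebraically
        total = t
    return total

values = np.full(1_000_000, 0.1, dtype=np.float32)
print(naive_sum(values))            # drifts well away from 100000
print(kahan_sum(values))            # ~100000.0
print(values.sum())                 # NumPy's built-in pairwise summation is also accurate
```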
Overflow occurs when a computation produces a number larger than the format can represent. fp16 maxes out at $65{,}504$. A softmax over logits where any single value exceeds about $11$ in fp16 will overflow when exponentiated, because $e^{11} \approx 60{,}000$. In a real language model the logits routinely reach values of $20$ or $50$ before the temperature is applied. The softmax then produces inf / inf, which IEEE 754 defines to be NaN, and the whole training run dies. The fix is the numerically-stable softmax described in the next subsection.
Underflow is the opposite: numbers shrink below the smallest representable value and are silently flushed to zero. $e^{-100}$ in fp16 is $0$. A product of a thousand probabilities, each around $0.1$, evaluates to $0$ in any standard format because the true value of $10^{-1000}$ is unrepresentable. The defensive habit is to work in log space, sum log-probabilities instead of multiplying probabilities, and only exponentiate at the very end, if at all. This trick, combined with the log-sum-exp identity, eliminates almost all overflow and underflow problems in probabilistic computations.
A practical example: a hidden Markov model assigning a likelihood to a long sequence multiplies hundreds or thousands of small probabilities together. Done directly, the answer underflows to zero within a hundred steps and stays there. Done in log space, the same computation is a sum of moderate negative numbers, which an fp32 register handles without difficulty for sequences of any length. The forward and backward algorithms in HMMs and the message-passing rules in graphical models are routinely written in log space for exactly this reason.
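A minimal NumPy sketch of the log-space habit, using synthetic probabilities rather than a real HMM:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.2, size=1000)    # a thousand smallish probabilities

# The direct product underflows to exactly zero long before the sequence ends.
print(np.prod(probs.astype(np.float32)))     # 0.0

# In log space the same quantity is a sum of moderate negative numbers.
print(np.sum(np.log(probs)))                 # a perfectly representable negative number
```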
Numerically-stable softmax
The softmax function turns a vector of real-valued scores into a probability distribution:
$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}.$$
Implemented naively, this overflows the moment any single $z_i$ is large. The fix is simple enough that every deep-learning library applies it silently. Subtract the maximum entry $z_{\max} = \max_k z_k$ from every component before exponentiating:
$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i - z_{\max}}}{\sum_{j} e^{z_j - z_{\max}}}.$$
The two expressions are mathematically identical (the constant factor $e^{-z_{\max}}$ appears in both numerator and denominator and cancels), but the second is numerically safe. The largest exponent in the new computation is exactly $0$, so $e^{0} = 1$ is the largest term, and overflow is impossible. Smaller terms may underflow harmlessly to zero, contributing nothing to the sum (which is exactly what they would do mathematically if the answer is dominated by one entry).
A worked example. Take $\mathbf{z} = (10, 20, 1000)$. The naive numerator $e^{1000}$ is well outside the range of any standard floating-point format and produces inf. Now subtract the maximum: $\mathbf{z}' = (10 - 1000,\, 20 - 1000,\, 1000 - 1000) = (-990, -980, 0)$. The exponentials are $e^{-990} \approx 0$ (underflows safely), $e^{-980} \approx 0$ (underflows safely), and $e^{0} = 1$. The sum is $1$, and the softmax is exactly $(0, 0, 1)$. The numerically-stable form gives the right answer; the naive form gives NaN.
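The worked example takes only a few lines to reproduce. The two functions below are sketches of the naive and max-subtracted forms, not the library implementations:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                   # overflows to inf as soon as any entry is large
    return e / e.sum()

def softmax_stable(z):
    z = z - np.max(z)               # largest exponent is now exactly 0
    e = np.exp(z)
    return e / e.sum()

z = np.array([10.0, 20.0, 1000.0], dtype=np.float32)
print(softmax_naive(z))             # [ 0.  0. nan]   (exp(1000) is inf, and inf/inf is nan)
print(softmax_stable(z))            # [0. 0. 1.]
```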
The same idea generalises to the log-sum-exp function, which appears throughout probabilistic machine learning:
$$\operatorname{lse}(\mathbf{z}) = \log \sum_{j} e^{z_j} = z_{\max} + \log \sum_{j} e^{z_j - z_{\max}}.$$
This identity is the backbone of stable cross-entropy losses, log-likelihoods of mixtures, and any computation involving log-probabilities. PyTorch's torch.logsumexp and torch.nn.functional.cross_entropy are fused, single-pass implementations; never compute cross-entropy as a separate softmax followed by a logarithm, or you will compound the precision loss for no reason.
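In PyTorch the contrast looks like this; the logits are the same toy values as in the worked example above, and the stable calls are the library functions named in the text.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[10.0, 20.0, 1000.0]])   # one sample, three classes
target = torch.tensor([2])

# Unstable composition: softmax overflows first, then the log sees inf/inf garbage.
manual = -torch.log(torch.softmax(logits, dim=-1))[0, target]
print(manual)                                    # nan

# Stable, fused paths: the max-subtraction happens inside these calls.
print(F.cross_entropy(logits, target))           # tensor(0.)
print(torch.logsumexp(logits, dim=-1))           # tensor([1000.])
```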
Condition number
The condition number measures how sensitive the solution of a linear system is to perturbations of its input. For a non-singular matrix $\mathbf{A}$ with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n > 0$, the condition number is
$$\kappa(\mathbf{A}) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_n}.$$
A well-conditioned matrix has $\kappa$ close to $1$: small input errors produce small output errors. An ill-conditioned matrix has $\kappa$ very large: small input errors are amplified, perhaps massively, into large output errors. The rule of thumb is that solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a system with $k$ digits of precision and condition number $10^p$ leaves you with about $k - p$ digits in the answer. In fp32 ($k \approx 7$), a condition number of $10^{6}$ leaves a single trustworthy digit; a condition number of $10^{8}$ leaves nothing.
A worked example shows the danger. Consider
$$\mathbf{A} = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix}.$$
The two rows are nearly parallel. The singular values are approximately $\sigma_1 \approx 2.0$ and $\sigma_2 \approx 5 \times 10^{-5}$, giving $\kappa(\mathbf{A}) \approx 4 \times 10^{4}$. Solve $\mathbf{A}\mathbf{x} = (1, 1)^{\top}$ and you get $\mathbf{x} = (1, 0)^{\top}$. Now perturb the right-hand side by one part in ten thousand and solve $\mathbf{A}\mathbf{x} = (1, 1.0001)^{\top}$: the answer becomes $\mathbf{x} = (0, 1)^{\top}$. A relative perturbation in $\mathbf{b}$ of order $10^{-4}$ has produced a relative swing in $\mathbf{x}$ of order $1$: an amplification of roughly $10^{4}$, in line with the condition number.
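Reproducing the worked example in NumPy takes a few lines:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])

print(np.linalg.svd(A, compute_uv=False))   # singular values ~[2.0, 5.0e-05]
print(np.linalg.cond(A))                    # ~4.0e+04

print(np.linalg.solve(A, [1.0, 1.0]))       # ~[1. 0.]
print(np.linalg.solve(A, [1.0, 1.0001]))    # ~[0. 1.]   (a 1e-4 change in b flips the answer)
```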
This matters for machine learning in two specific places. First, the Hessian of a deep neural network's loss function typically has condition numbers in the thousands or millions: a few directions in parameter space are very sharp, others very flat. Plain gradient descent has to use a learning rate small enough for the sharp directions, which means glacial progress in the flat ones. Adam, RMSProp, and similar adaptive optimisers approximately precondition by scaling each parameter's update by an estimate of its own curvature, which fights ill-conditioning at modest cost. Second, the matrix $\mathbf{X}^{\top}\mathbf{X}$ that arises in the normal equations has condition number $\kappa(\mathbf{X})^{2}$: forming the normal equations squares the conditioning. A design matrix with $\kappa(\mathbf{X}) = 10^{4}$ produces a normal-equations matrix with $\kappa = 10^{8}$, which on fp32 is essentially singular. The remedy is never to form $\mathbf{X}^{\top}\mathbf{X}$ explicitly: solve least squares via the QR decomposition of $\mathbf{X}$ directly, or via the SVD if $\mathbf{X}$ is rank-deficient. NumPy's np.linalg.lstsq uses one of these stable paths under the hood.
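A quick check of the squaring effect, with a synthetic design matrix built (for illustration only) to have condition number about $10^{4}$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Construct X with singular values 1, 1e-1, ..., 1e-4, so kappa(X) = 1e4.
U, _ = np.linalg.qr(rng.standard_normal((200, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
X = U @ np.diag(np.logspace(0, -4, 5)) @ V.T

s = np.linalg.svd(X, compute_uv=False)
print(s[0] / s[-1])                  # ~1e4
print(np.linalg.cond(X.T @ X))       # ~1e8: the normal equations square the conditioning

beta_true = np.ones(5)
y = X @ beta_true
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # SVD-based, works on X directly
print(beta)                          # recovers ~[1 1 1 1 1]
```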
Mixed-precision training
The tension between speed and precision has produced a standard compromise called mixed-precision training. The observation that drives it is simple: matrix multiplications dominate the FLOP count of a training step, but most other operations are cheap. So do the matrix multiplications in low precision, where they run two to four times faster and consume half the memory bandwidth, and do the precision-sensitive operations in fp32. Specifically:
- Store a master copy of the weights in fp32.
- Cast the weights down to bf16 (or fp16) before each matrix multiplication.
- Run the matrix multiplications and most activations in low precision.
- Keep softmax, layer-norm statistics, the optimiser's first and second moment estimates, and the loss accumulation in fp32.
- After the optimiser computes a parameter update, apply it to the fp32 master weights.
PyTorch's torch.cuda.amp.autocast and torch.cuda.amp.GradScaler automate the casts. bf16 is now usually preferred to fp16 for two reasons: it has the same exponent range as fp32 so gradients almost never overflow, and modern Nvidia and Google hardware supports it natively at full speed.
Where fp16 is unavoidable (older hardware, certain inference deployments), loss scaling prevents gradient underflow. Multiply the scalar loss by a large factor, typically a power of two between $2^{8}$ and $2^{16}$, before calling backward(). The chain rule scales every gradient by the same factor, lifting tiny gradients out of the underflow zone. Just before the optimiser step, divide the gradients back down by the same factor and clip them. If any gradient is inf or NaN, skip the step and halve the scale; if you have gone many steps without overflow, double it. Modern frameworks do this automatically.
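A minimal sketch of the fp16 loop with loss scaling, using the PyTorch utilities mentioned above; the model, data, and hyperparameters are placeholders, and a CUDA device is assumed.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()     # stand-in for a real network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = GradScaler()                          # maintains the dynamic loss scale

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    opt.zero_grad(set_to_none=True)
    with autocast():                           # matmuls and most activations run in low precision here
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()              # scale the loss so small gradients clear the underflow zone
    scaler.step(opt)                           # unscales the gradients; skips the step if any are inf/nan
    scaler.update()                            # shrink the scale after an overflow, grow it after a clean streak
```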
Mixed precision is essentially free in accuracy when set up correctly. The few catastrophes that occur tend to come from a custom layer that someone forgot to wrap in autocast, or from a softmax that someone implemented by hand without the maximum-subtraction trick.
Practical guidance
A short list of defensive habits will catch most numerical problems before they bite.
- Use stable forms wherever they exist: log_softmax rather than $\log(\operatorname{softmax}(\cdot))$, logsumexp rather than $\log \sum e^{x}$, fused cross-entropy rather than separate softmax and log.
- Print the loss and the gradient norm at every step early in training. A NaN is far easier to diagnose at step 12 than at step $12{,}000$, and a quietly diverging gradient norm is a red flag long before the loss explodes.
- Default to bf16 over fp16 unless you are running on hardware that does not support bf16. The wider exponent range removes most of the foot-guns.
- Add a small epsilon to any denominator that could be near zero. Write np.sqrt(x + 1e-12) rather than np.sqrt(x); write x / (norm + 1e-8) rather than x / norm.
- Never explicitly form a matrix inverse to solve a linear system. inv(A) @ b is both slower and less stable than solve(A, b). Use LU, QR, or SVD as appropriate.
- Avoid forming $\mathbf{X}^{\top}\mathbf{X}$ for least squares; use QR or SVD on $\mathbf{X}$ directly.
- When iterative algorithms maintain orthogonal vectors (Gram–Schmidt, Lanczos, certain orthogonal RNNs), re-orthonormalise periodically. Modified Gram–Schmidt is more stable than the classical version, and a fresh QR is more stable still.
- If you suspect numerical trouble, check the condition number of any matrix you are inverting or decomposing. np.linalg.cond(A) is one line.
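A few of these habits in NumPy form (the matrices here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

print(np.linalg.cond(A))               # sanity-check conditioning before trusting the solution

x = np.linalg.solve(A, b)              # solve the system; never compute inv(A) @ b

v = rng.standard_normal(100)
unit = v / (np.linalg.norm(v) + 1e-8)  # guard denominators that can approach zero
```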
What you should take away
- Floating-point arithmetic is approximate, and the approximation is non-uniform: $1 + 10^{-8}$ in fp32 is exactly $1$, while $10^{-8} + 10^{-8}$ is exactly $2 \times 10^{-8}$. Algorithms must be designed with this in mind, not in spite of it.
- Catastrophic cancellation, accumulation drift, overflow and underflow are the four horsemen of numerical bugs; each has a standard antidote (reformulation, pairwise or Kahan summation, max-subtraction, log-space arithmetic).
- The numerically-stable softmax (subtract the maximum before exponentiating) is the single most important trick in deep-learning numerics, and the log-sum-exp identity it generalises to powers most stable probabilistic computations.
- The condition number $\kappa(\mathbf{A}) = \sigma_{\max}/\sigma_{\min}$ predicts how many digits of accuracy you will lose when solving a linear system; squaring a matrix (forming $\mathbf{X}^{\top}\mathbf{X}$) squares the condition number, which is why QR and SVD are preferred to the normal equations.
- Mixed-precision training in bf16 with an fp32 master copy of the weights is the modern default; it is essentially free in accuracy and significantly faster, provided you respect the small set of operations that must remain in fp32.