9.19 Common pitfalls

Training a neural network for the first time is a strange experience. The mathematics is clean, the code is short, and the textbook examples all work. Then you point the same code at your own data and nothing happens. The loss refuses to fall, or it falls and then explodes, or the model gets every answer right on the training set and every answer wrong on the test set. None of this is in the equations. It is the gap between the equations and a working system, and that gap is where most beginners spend their first few hundred hours.

This section is a field guide to the most frequent mistakes. It is organised not by cause but by symptom, because a symptom is usually all you have. You stare at a loss curve that does the wrong thing and you need to know what to suspect first. For each pitfall we set out the symptom you observe, the underlying causes that produce it, and the practical fixes that resolve it. None of this is exotic. Every mistake here has been made by every working researcher at least once.

Section 9.14 covered training tips for the case where everything is broadly working and you want to push performance further. This section covers the opposite case: training is not working and you need a systematic way to diagnose why. Treat it as a checklist. Run through it in order before you blame the model.

Symbols used here

$\mathcal{L}$ : loss
$\nabla$ : gradient

The model trains to chance accuracy

Symptom. The training loss starts at the value you would expect for a randomly-initialised network, refuses to budge meaningfully across thousands of steps, and accuracy on both training and validation sets sits at the level you would get by guessing. For ten classes, that is roughly ten per cent; for binary classification, fifty per cent. The loss curve looks flat or noisy but never trends downwards.

Causes. The mistake is almost never in the optimiser. It is structural. The most common single cause is a misalignment between predictions and targets: the model produces logits in one order while the labels are in another, or the labels have been one-hot encoded when the loss expects integer class indices, or the batch dimension has been transposed somewhere in a reshape. The loss is then computed against essentially random targets and there is nothing to learn. A second common cause is corrupted labels: a CSV file read with the wrong delimiter so that every label is shifted by one column, or an integer cast that has silently turned class four into class zero. A third is missing input normalisation: pixel values in the range zero to two hundred and fifty-five rather than zero to one, producing pre-activations so large that the first updates are enormous and can push ReLU units into a permanently dead state, after which little useful gradient flows back. A fourth is an output dimension that does not match the number of classes: a final linear layer with output size nine for a ten-class problem. A fifth is an architectural slip, the most common being applying a softmax inside the model and then again inside the loss, which squashes the logits and flattens the gradients so that learning stalls.

Fix. Before you change anything else, train the model on a tiny subset of the data. Take ten examples. Disable shuffling, disable dropout, set the learning rate to something modest, and run the training loop for a few hundred steps. A correctly-wired neural network will memorise ten examples to near-zero loss and a hundred per cent accuracy. If yours cannot, the problem is structural. The data, the loss, the labels, or the architecture is wrong. Once the tiny subset trains, scale up gradually. This single discipline catches more bugs than any other technique.
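
A minimal sketch of this check, assuming a PyTorch classification setup; model and train_dataset are placeholders for your own objects, and dropout is assumed to be disabled when the model is constructed so memorisation is not impeded:

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, Subset

    tiny_set = Subset(train_dataset, range(10))            # ten fixed examples
    tiny_loader = DataLoader(tiny_set, batch_size=10, shuffle=False)

    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)   # modest learning rate
    model.train()

    for step in range(300):
        for x, y in tiny_loader:
            logits = model(x)
            loss = F.cross_entropy(logits, y)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
        if step % 50 == 0:
            accuracy = (logits.argmax(dim=1) == y).float().mean().item()
            print(f"step {step:3d}  loss {loss.item():.4f}  accuracy {accuracy:.2f}")

    # A correctly wired network reaches near-zero loss and accuracy 1.0 here.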

The loss is NaN within a few steps

Symptom. The loss falls for a few iterations, or perhaps just one, and then prints nan. From that point onwards every gradient is nan, every weight update poisons the network further, and accuracy collapses to chance. Sometimes the loss prints inf first and then nan on the next step.

Causes. Numerical instability has many entrances. The most frequent is a learning rate that is simply too high: a single step pushes the weights into a regime where activations explode, the next forward pass produces astronomical pre-activations, and the loss overflows. A close cousin is exploding gradients in deep or recurrent networks, where the gradient norm grows exponentially with depth and a single update moves the weights past the floating-point range. Undefined operations are the third classic: $\log(0)$ inside cross-entropy when a softmax output underflows to exactly zero in float16, division by $\sqrt{0}$ inside a layer norm whose variance has collapsed and whose $\epsilon$ term is missing, or $1/(1-x)$ in an attention mask when $x$ reaches exactly one. Activation overflow is the fourth: a pre-activation $z = 100$ produces $e^{z}$ outside the float32 range, and any softmax or sigmoid downstream becomes inf and then nan.

Fix. First, reduce the learning rate by a factor of ten and rerun. If the NaN disappears, the original learning rate was too high; bisect downwards. Second, add gradient clipping (torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)) before the optimiser step. Clipping does not hide bugs; it tames them. Third, prefer the numerically stable forms of common operations. Use log_softmax with nll_loss rather than softmax followed by log and a manual cross-entropy; use binary_cross_entropy_with_logits rather than a sigmoid followed by binary_cross_entropy. The identity $\log(\sigma(z)) = -\log(1 + e^{-z})$ is well-behaved for negative $z$ where the naive form is not. Fourth, if you are training in mixed precision (float16 or bfloat16) and the NaN appears repeatedly, cast the loss computation back to float32; the dynamic range of float16 is too narrow for many losses. Print the maximum absolute pre-activation per layer for the first few steps; if any layer is producing values above one hundred, the problem is upstream of the loss.
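
A sketch of these defences combined in a single training step; the function and variable names are illustrative, and model is assumed to exist:

    import torch
    import torch.nn.functional as F

    def training_step(model, optimiser, x, y, max_norm=1.0):
        logits = model(x)

        # cross_entropy applies log_softmax internally, so never apply softmax to
        # the logits yourself; compute the loss in float32 even if the forward
        # pass ran in float16 or bfloat16.
        loss = F.cross_entropy(logits.float(), y)
        if not torch.isfinite(loss):
            raise RuntimeError(f"non-finite loss at this step: {loss.item()}")

        optimiser.zero_grad()
        loss.backward()

        # Clip the global gradient norm before the update to tame exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
        optimiser.step()
        return loss.item()

    # Optional: flag layers whose activations are already enormous in the first steps.
    def report_large_activations(module, inputs, output):
        if torch.is_tensor(output) and output.abs().max() > 100:
            print(f"{module.__class__.__name__}: max |activation| = {output.abs().max().item():.1f}")

    for m in model.modules():
        m.register_forward_hook(report_large_activations)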

The model memorises the training set perfectly but fails on test

Symptom. Training accuracy climbs steadily to one hundred per cent. Validation accuracy plateaus early, then begins to fall. The gap between the two grows with every epoch. By the end of training the model is essentially a lookup table for the training set and a random predictor for everything else.

Causes. The textbook cause is overfitting: the model has more capacity than the dataset can constrain, and instead of learning the underlying pattern it learns the individual training points. A million-parameter network on a thousand-example dataset will overfit on most problems. A second cause, often mistaken for overfitting, is data leakage. Validation examples have somehow ended up in the training set, perhaps because the same image appears under two filenames, or because a temporal split was made randomly rather than chronologically, or because data augmentation was applied before the train/validation split rather than after. A third cause is genuine distribution shift: the training and test sets were drawn from different populations, so the model has correctly learned the training distribution but the test distribution is different.

Fix. Apply regularisation as covered in §9.12: weight decay, dropout, early stopping. These reduce effective capacity and slow the descent into memorisation. If you cannot regularise enough, collect more data; there is no substitute. Data augmentation is the cheapest way to multiply your effective dataset: random crops, flips, and colour jitter for images; back-translation or random masking for text. Audit the train/validation split for leakage: hash every example and check that no hash appears in both splits. If your data is temporal, split chronologically: use the first eighty per cent of dates for training and the last twenty per cent for validation, never random examples. If you suspect distribution shift, plot histograms of summary statistics (per-channel mean for images, sequence length for text) on both splits and compare them. Cross-validation gives a more honest estimate of generalisation than a single split, especially for small datasets. Finally, do not trust a single number: report mean and standard deviation across three or five seeds before claiming an improvement.
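
The leakage audit can be as simple as hashing the raw content of every example on both sides of the split; a sketch, assuming train_dataset and val_dataset yield (input, label) pairs whose inputs convert cleanly to arrays:

    import hashlib
    import numpy as np

    def example_hash(x):
        # Hash the raw bytes of one example (array, tensor, or similar).
        return hashlib.sha256(np.asarray(x).tobytes()).hexdigest()

    train_hashes = {example_hash(x) for x, _ in train_dataset}
    val_hashes = {example_hash(x) for x, _ in val_dataset}

    overlap = train_hashes & val_hashes
    if overlap:
        print(f"Leakage: {len(overlap)} validation examples also appear in training.")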

Training is much slower than expected

Symptom. A single epoch takes hours when the dataset is small enough that it should take minutes. The GPU is mostly idle. Loss decreases but at a glacial pace because each iteration takes seconds.

Causes. Almost always the bottleneck is not the GPU. The most common cause is a single-threaded data loader: every batch is being read from disk, decoded, and preprocessed serially while the GPU sits waiting. A second cause is small batches with low arithmetic intensity, leaving the GPU running at a fraction of its peak throughput. A third is CPU-bound preprocessing, where image decoding or tokenisation dominates and the GPU is starved. A fourth is unnecessary CPU-to-GPU transfers inside the training loop: moving a fixed tensor to the device once per batch instead of once at the start, or calling .cpu() for logging on every step. A fifth, more obscure, is gradient checkpointing left on by accident; checkpointing trades compute for memory and effectively runs the forward pass twice per step.

Fix. Profile before you optimise. nvidia-smi dmon (or nvitop for a friendlier view) will tell you the GPU utilisation; if it sits below fifty per cent, the GPU is not the bottleneck. torch.profiler will show you where time is being spent inside the training step. Increase the batch size until either the GPU is saturated or you run out of memory. Set num_workers in your DataLoader to a value greater than zero (typically four to eight) and pin_memory=True so that host-to-device transfers can overlap with computation. Move expensive preprocessing onto the GPU where possible; image augmentation libraries such as Kornia operate on tensors and run on the device. Audit the loop for stray .cpu() calls; only detach and move when you genuinely need a Python scalar. Use mixed precision (torch.cuda.amp.autocast) where supported; on modern hardware it can roughly halve the time per step. Cache decoded data on disk in a fast format (such as a memory-mapped numpy array or webdataset shards) rather than re-decoding JPEGs every epoch.
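
A sketch of the loader settings and mixed-precision step described above, with illustrative values; model, optimiser, and train_dataset are assumed to exist:

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader

    loader = DataLoader(
        train_dataset,
        batch_size=256,          # increase until the GPU is saturated or memory runs out
        shuffle=True,
        num_workers=8,           # parallel workers for decoding and augmentation
        pin_memory=True,         # lets host-to-device copies overlap with compute
        persistent_workers=True,
    )

    device = torch.device("cuda")
    scaler = torch.cuda.amp.GradScaler()

    for x, y in loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        with torch.cuda.amp.autocast():        # mixed-precision forward pass
            loss = F.cross_entropy(model(x), y)

        optimiser.zero_grad()
        scaler.scale(loss).backward()           # loss scaling avoids float16 underflow
        scaler.step(optimiser)
        scaler.update()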

The validation accuracy bounces wildly between epochs

Symptom. From one evaluation to the next, validation accuracy jumps by ten per cent in either direction. The training loss looks reasonable but the validation curve is a sawtooth.

Causes. The validation set is too small for the noise in the metric to average out. With a hundred validation examples, a single misclassified example moves accuracy by one per cent; with a thousand, by a tenth of a per cent. The learning rate may also be too high, so that consecutive evaluations sample the model in genuinely different states. Mislabelled examples in the validation set push the average around with every prediction change near the decision boundary. Inadequate batch-norm statistics (running means accumulated over too few batches) or evaluation in train mode by mistake can also produce unstable validation numbers.

Fix. Enlarge the held-out set if you can. A validation set of at least a thousand examples per class is a reasonable target for image classification; more for noisier tasks. Lower the learning rate, especially towards the end of training, by using a cosine or step decay schedule. Print per-class accuracy as well as the aggregate; a single noisy class often explains most of the variance. Compute validation metrics over multiple seeds or multiple epochs and report a moving average rather than the latest value. Make sure model.eval() is called before evaluation so that batch norm uses its running statistics rather than the current batch.
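
Per-class accuracy takes only a few lines to compute; a sketch, assuming a standard classification model and data loader:

    import torch

    @torch.no_grad()
    def per_class_accuracy(model, loader, num_classes, device="cuda"):
        model.eval()                    # running batch-norm statistics, dropout off
        correct = torch.zeros(num_classes)
        total = torch.zeros(num_classes)
        for x, y in loader:
            preds = model(x.to(device)).argmax(dim=1).cpu()
            for c in range(num_classes):
                mask = y == c
                total[c] += mask.sum()
                correct[c] += (preds[mask] == c).sum()
        return correct / total.clamp(min=1)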

The model performs well on validation but poorly in production

Symptom. Cross-validated accuracy in the lab is ninety-five per cent. Once deployed, the model is right seventy per cent of the time. Users complain. Engineers blame the model; researchers blame the engineering.

Causes. The lab and production data are not the same distribution. Training data was collected on a fixed date in a controlled setting; production data arrives continuously, in changing conditions. A camera was upgraded, a user interface changed, the population shifted. This is covariate shift. A second cause is preprocessing mismatch: the training pipeline normalises with means and standard deviations computed once over the training set, but the inference pipeline uses values computed differently or not at all, so the model receives inputs that look subtly different from anything it saw during training. A third is label distribution skew: a model trained on a balanced laboratory set encounters a real-world distribution where one class is twenty times more common than another. A fourth is temporal drift: time-of-day effects, seasonal effects, or simply that the world a year after deployment is not the world the data was collected in.

Fix. Treat deployment as the start of monitoring rather than the end of development. Log a sample of production inputs and store them; compare summary statistics weekly to those of the training set, and alert when the distribution shifts beyond a threshold. Build the inference preprocessing pipeline by exporting the exact same code path as training, ideally as a single artefact (a TorchScript module or ONNX graph) that contains the normalisation. Never reimplement preprocessing in a different language for production. Periodically retrain on recent data; the cadence depends on how fast your distribution drifts, but quarterly is a reasonable default for many domains. Use online metrics where possible: not just accuracy but business metrics, calibration, and latency. Before any full rollout, run an A/B test against the previous model and require a confidence interval that excludes regression. When you ship, ship behind a feature flag so you can roll back in minutes rather than days.
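
One way to make the exported artefact carry its own preprocessing; a sketch using TorchScript tracing, with illustrative normalisation constants (the common ImageNet values) and an assumed image input shape:

    import torch
    import torch.nn as nn

    class DeployableModel(nn.Module):
        """Bundles the trained network with its normalisation constants so the
        inference path cannot drift away from the training preprocessing."""

        def __init__(self, model, mean, std):
            super().__init__()
            self.model = model
            # Buffers are saved inside the exported file alongside the weights.
            self.register_buffer("mean", torch.tensor(mean).view(1, -1, 1, 1))
            self.register_buffer("std", torch.tensor(std).view(1, -1, 1, 1))

        def forward(self, x):
            # x arrives as raw float pixels in [0, 255].
            x = (x / 255.0 - self.mean) / self.std
            return self.model(x)

    wrapped = DeployableModel(model, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).eval()
    example = torch.zeros(1, 3, 224, 224)                  # assumed input shape
    torch.jit.trace(wrapped, example).save("model_with_preprocessing.pt")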

Training works on small data but fails at scale

Symptom. Your model trains cleanly on ten thousand examples. You scale to ten million and the loss stalls, or the GPU sits idle, or the run mysteriously diverges after the first epoch.

Causes. Hyperparameters that worked at small scale rarely work at large scale without adjustment. The learning rate that was optimal for batch size 32 is not optimal for batch size 4096. Gradient noise scales differently: at very large batches each step is closer to the true gradient and the optimiser behaves more like full-batch gradient descent, which can require a different schedule. The data pipeline that was fast enough for ten thousand examples is the bottleneck for ten million. Numerical issues from very large gradient accumulations, or from float16 underflow when summing many small numbers, can stall training in ways that small-scale runs never reveal.

Fix. Apply the linear scaling rule as a starting point: when you multiply the batch size by $k$, multiply the learning rate by $k$ as well. This works up to roughly batch size 8000 for many vision tasks. Above that, use optimisers designed for very large batches such as LARS or LAMB, which scale the update per layer by the ratio of weight norm to gradient norm. Use a warmup schedule that ramps the learning rate up over the first few thousand steps before applying any decay; this prevents the early iterations from diverging. Profile and parallelise the data pipeline aggressively: sharded reads, multiple worker processes, prefetching, and on-GPU augmentation. Verify that gradients are summed in float32 even when forward passes run in float16.
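
A sketch of the linear scaling rule with a warmup-then-cosine schedule, using illustrative numbers; model is assumed to exist and base_lr is a value you already trust at the smaller batch size:

    import math
    import torch

    base_lr, base_batch = 0.1, 256       # settings known to work at small scale
    batch_size = 4096

    # Linear scaling rule: scale the learning rate in proportion to the batch size.
    lr = base_lr * batch_size / base_batch
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    warmup_steps, total_steps = 2_000, 100_000

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps           # linear ramp up from zero
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay afterwards

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda)
    # Call scheduler.step() once after every optimiser.step().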

Some classes are never predicted

Symptom. Aggregate accuracy looks acceptable. The confusion matrix tells a different story: one or more classes are never predicted at all. Every example of the rare class is mapped to a more common one.

Causes. Class imbalance in the training data. If ninety-nine per cent of examples are class A and one per cent are class B, a model that always predicts class A achieves ninety-nine per cent accuracy and zero recall on class B. The loss function does not penalise this enough to overcome the imbalance. A second cause is a fixed decision threshold (typically 0.5 for binary classification) that suits the training distribution but not the operating point you actually need. A third is a loss function that is not class-balanced when the data is not.

Fix. Balance the training distribution. The simplest method is oversampling the minority classes: repeat their examples until the batch composition is roughly even. Alternatively, undersample the majority class, accepting that you discard data. Class weights in the loss function (CrossEntropyLoss(weight=class_weights) in PyTorch) provide the same effect without changing the sampler. Focal loss down-weights examples the model already classifies confidently, focusing learning on the hard cases. Tune the decision threshold on the validation set rather than accepting the default. Where possible, collect more examples of the rare class; synthetic augmentation, targeted data collection, and active learning all help.
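
Both the loss-weighting and oversampling routes are short; a sketch, assuming labels is a 1-D tensor of integer class indices for the training set and train_dataset exists:

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    class_counts = torch.bincount(labels)
    class_weights = class_counts.sum() / (len(class_counts) * class_counts.float())

    # Option 1: weight the loss so rare classes contribute more per example.
    criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

    # Option 2: oversample rare classes so each batch is roughly balanced.
    sample_weights = class_weights[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
    loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)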

The trained model is large and slow at inference

Symptom. The model achieves the accuracy you wanted but the file is several gigabytes, inference takes seconds per example, and serving costs are unsustainable. Mobile or browser deployment is impossible.

Causes. This is not a training failure. Modern architectures are designed for accuracy and are routinely too large for production. The fix is post-training compression rather than retraining.

Fix. Knowledge distillation trains a small student network to mimic the outputs of a large teacher; the student often retains most of the teacher's accuracy at a fraction of the size. Quantisation reduces the precision of stored weights from float32 to float16 or int8, cutting memory by a factor of two to four and often running faster on hardware with int8 support. Pruning removes weights with small magnitudes; structured pruning (removing whole channels or attention heads) is preferable to unstructured pruning because it produces real speedups on commodity hardware. Export to a production runtime: ONNX for portability, TensorRT for NVIDIA-specific optimisation, Core ML for Apple devices, TensorFlow Lite for mobile. These runtimes apply graph-level optimisations (operator fusion, constant folding, layout transforms) that are not available in the training framework. Combine techniques rather than picking one: distill, then quantise, then prune the student.
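
As one concrete example of these steps, post-training dynamic quantisation of the linear layers takes a few lines in PyTorch; a sketch, most useful for models dominated by fully-connected or recurrent layers:

    import torch

    # Dynamic quantisation: linear-layer weights stored as int8, activations
    # quantised on the fly at inference time. No retraining required.
    quantised = torch.quantization.quantize_dynamic(
        model.eval(),               # start from the trained float32 model
        {torch.nn.Linear},          # layer types to quantise
        dtype=torch.qint8,
    )
    torch.save(quantised.state_dict(), "model_int8.pt")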

Model behaves very differently in training vs inference

Symptom. Training accuracy is ninety per cent. When you evaluate the same examples in evaluation mode, accuracy drops to seventy per cent, even on the training set itself, which the model has just memorised.

Causes. Several layers behave differently in training and evaluation modes. Dropout zeroes a random subset of activations during training and is a no-op at evaluation time, but only if you remember to switch modes. Batch norm uses the current batch's statistics in training and accumulated running statistics at evaluation; if model.eval() is never called, every evaluation batch is normalised with its own statistics, so the result depends on the batch composition. The running statistics themselves can be stale or poorly estimated if training was unstable. Gradient checkpointing and mixed precision can each cause subtle numerical differences between train and eval, though these are usually small.

Fix. Call model.eval() explicitly before every evaluation, and model.train() before resuming training. Wrap evaluation in with torch.no_grad(): to disable gradient tracking, which both speeds it up and surfaces any code that was secretly relying on autograd. Print the model's training mode flag at the top of your evaluation function as a sanity check. If batch norm statistics are unstable, increase the batch size during the final epochs of training so the running averages become more reliable. For models that are very sensitive to mode (such as those with dropout near the output), evaluate periodically during training and verify that train and eval accuracies on a held-out chunk of the training set are close.
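
A sketch of that last check, comparing train-mode and eval-mode accuracy on the same held-out chunk of training data; model and train_chunk_loader are placeholders:

    import torch

    @torch.no_grad()
    def accuracy(model, loader, device="cuda"):
        correct, total = 0, 0
        for x, y in loader:
            preds = model(x.to(device)).argmax(dim=1)
            correct += (preds == y.to(device)).sum().item()
            total += y.numel()
        return correct / total

    model.train()      # note: this pass still updates batch-norm running statistics
    acc_train_mode = accuracy(model, train_chunk_loader)
    model.eval()
    acc_eval_mode = accuracy(model, train_chunk_loader)
    print(f"train mode {acc_train_mode:.3f}   eval mode {acc_eval_mode:.3f}")
    # A large gap points at dropout near the output or unreliable batch-norm statistics.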

What you should take away

  1. When training fails, fit ten examples first. A model that cannot memorise ten examples has a structural bug, and no amount of tuning will fix it.
  2. Diagnose by symptom, not by guess. Treat the loss curve and the confusion matrix as evidence and work backwards from what you see.
  3. Numerical stability is engineering, not theory. Lower the learning rate, clip gradients, and use the stable forms of common operations as a reflex.
  4. The gap between validation and production is usually preprocessing, distribution shift, or both. Export the inference pipeline as a single artefact and monitor production inputs from day one.
  5. Most pitfalls are mundane: wrong shapes, wrong modes, wrong splits, wrong scales. The exotic explanations rarely apply. Check the boring causes first.
