6.7 No free lunch and inductive bias

In §6.6 we introduced capacity, complexity and regularisation, and saw that a model's success depends on whether its hypothesis class is well matched to the function we are trying to learn. We hinted at, but did not fully confront, a deeper question. If we could choose any algorithm we liked, would there be one that always wins? Could we, with sufficient ingenuity, build a learner that beats every other learner on every problem? The answer, formalised by David Wolpert in the mid-1990s, is no. Every algorithm is a compromise. Every algorithm bets on some structure being present in the data. When the bet pays off, the algorithm performs well; when it does not, the algorithm performs no better, and sometimes worse, than guessing at random. This section is about the bets, what they are called, why they are unavoidable, and how the past decade of deep learning has been a slow refinement of which bets are worth making.

The no-free-lunch theorem

Wolpert's no-free-lunch theorem, published in 1996, states that, averaged uniformly over all possible target functions $f^*: \mathcal{X} \to \mathcal{Y}$, every learning algorithm has the same expected off-training-set error, the error on inputs it never saw during training. There is no universally best algorithm. There is no master regulariser. There is no architectural choice that dominates the rest. If we average over the full space of conceivable mappings from inputs to outputs, every learner, from a one-line linear regression to a frontier transformer, is equally mediocre.

This sounds devastating until we read it carefully. The theorem averages over all possible functions, weighting each equally. Most of those functions are pathological. They are the mappings in which the label of every point is independent of every other, in which the value of $f^*$ at $x = 0.5$ tells us nothing whatever about the value of $f^*$ at $x = 0.50001$. They are mostly noise. They are functions a person could not learn either, because there is nothing to learn.
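
The averaging argument can be made concrete in a few lines of Python. The sketch below is a toy of our own construction, not Wolpert's proof: it enumerates every binary target function on a five-point input space and compares two deliberately opposite learners, one that bets the unseen points follow the majority training label and one that bets the reverse. Averaged uniformly over all thirty-two targets, their off-training-set errors are identical.

    from itertools import product

    X = range(5)                     # the whole (tiny) input space
    train_idx = [0, 1, 2]            # points the learner is shown
    test_idx = [3, 4]                # off-training-set points

    def majority(labels):
        return int(sum(labels) >= len(labels) / 2)

    def learner_a(train_labels):     # bets unseen points follow the majority training label
        return majority(train_labels)

    def learner_b(train_labels):     # bets the exact opposite
        return 1 - majority(train_labels)

    mean_error = {"learner_a": 0.0, "learner_b": 0.0}
    targets = list(product([0, 1], repeat=len(X)))        # all 2^5 = 32 target functions
    for f in targets:
        train_labels = [f[i] for i in train_idx]
        for name, learner in [("learner_a", learner_a), ("learner_b", learner_b)]:
            pred = learner(train_labels)
            mean_error[name] += sum(pred != f[i] for i in test_idx) / len(test_idx)

    for name, total in mean_error.items():
        print(name, total / len(targets))                 # both print 0.5

Because the average runs over every possible assignment of test labels, nothing the learner does with the training data can help; the two opposite strategies come out exactly level.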

The world we actually live in is not drawn from this uniform distribution. The functions that matter, recognising a friend's face, parsing a sentence, predicting tomorrow's weather, deciding whether a CT scan shows a stroke, share strong regularities. They are smooth where it counts, sparse in their dependencies, hierarchical in their composition, invariant to nuisances such as translation or rotation, and well approximated by simple primitives composed in modest depth. The natural world favours structure, and structure is what learning algorithms exploit.

The theorem is therefore not nihilism but a discipline. It tells us that any time a particular algorithm beats another on a particular benchmark, the winning algorithm must be exploiting some feature of the data that the loser does not. There is always a reason. The reason has a name: inductive bias. To say "this model is better" without qualification is to make a category error. The accurate statement is "this model's inductive bias matches this problem's structure better." Different problems, different biases, different winners.

A useful corollary: when someone claims to have found a universally superior method, they have either misunderstood Wolpert or are implicitly comparing on a narrow distribution of tasks. The sentence "X beats Y on average" is meaningful only when we specify which average. The benchmarks that matter, ImageNet, GLUE, MMLU, MIMIC, are themselves narrow slices of function space, curated to reflect tasks human beings care about. A method that wins across that slice is not universal; it is well matched to the kind of problems we choose to study. Progress is real, but it is progress in calibrating our priors to the world we live in, not progress towards a transcendent learner.

Inductive bias

The inductive bias of a learning algorithm is the set of assumptions, explicit or implicit, that it brings to the problem of choosing between hypotheses that fit the training data equally well. With a finite training set, there are always infinitely many functions that pass through every point. The bias decides which of those functions the algorithm prefers, and therefore how it generalises beyond the training data.
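
A minimal illustration, using standard scikit-learn estimators on an invented five-point dataset: both learners below fit the training data perfectly, yet their biases make them disagree far from it.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor

    X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
    y_train = X_train.ravel()             # the target happens to be y = x

    linear = LinearRegression().fit(X_train, y_train)
    nearest = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

    print(linear.predict(X_train))        # both reproduce the training labels exactly...
    print(nearest.predict(X_train))

    X_new = np.array([[10.0]])            # ...but disagree well outside the training range
    print(linear.predict(X_new))          # ~10.0: the linear bias extrapolates the trend
    print(nearest.predict(X_new))         # 4.0: the smoothness bias copies the nearest neighbour

Neither prediction is right in the abstract; which one generalises better depends entirely on whether the true function keeps rising beyond the training range or stays put.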

Every algorithm has one, whether or not it is acknowledged.

  • Linear regression assumes that the target depends linearly on the features (after any chosen feature transformation). When this matches reality, linear regression is unbeatable for sample efficiency.
  • K-nearest neighbours assumes smoothness: that points close together in feature space have similar labels. With the right metric, kNN is excellent on low-dimensional, locally smooth functions; with the wrong metric, it is hopeless.
  • Decision trees assume that the target is approximately constant within axis-aligned regions. When the natural decision boundaries are axis-aligned (as they often are for tabular medical or financial features), trees thrive.
  • Convolutional neural networks assume translation equivariance (a cat is a cat regardless of where it appears in the image) and locality (that nearby pixels are more related than distant ones). Hinton's classic observation that a CNN with weight tying needs orders of magnitude less data than a fully-connected network on the same image task is a direct consequence of this bias.
  • Recurrent neural networks assume sequential, Markov-like structure with stationary dynamics, so the same parameters can be reused at every timestep.
  • Transformers assume that all-to-all interactions among tokens are useful but that the pattern of which interactions matter is content-dependent and should be learned, not hard-coded. Their bias is towards problems where compositional, long-range relationships matter and where data are abundant enough to learn the attention patterns.
  • Graph neural networks assume permutation equivariance over nodes and that the graph topology is itself informative.
  • Diffusion models assume that the data manifold can be reached by gradually denoising Gaussian noise, a strong but, empirically, extremely productive prior on natural images and audio.

A model performs well on problems whose structure matches its inductive bias. It performs badly on problems where the bias is wrong. Choosing an algorithm for a project is, before anything else, choosing which assumptions you are willing to make about your data.
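
The convolutional bias in the list above can be checked numerically. The sketch below is an illustrative toy, not a CNN implementation: it applies one small shared filter at every position of a one-dimensional signal, using wrap-around boundaries so the property holds exactly. Shifting the input and then filtering gives the same result as filtering and then shifting.

    import numpy as np

    def circular_conv(signal, kernel):
        """Apply the same filter at every position, wrapping around the ends."""
        n = len(signal)
        return np.array([
            sum(signal[(i + j) % n] * kernel[j] for j in range(len(kernel)))
            for i in range(n)
        ])

    rng = np.random.default_rng(0)
    x = rng.normal(size=12)               # a toy one-dimensional "image"
    w = rng.normal(size=3)                # one shared, position-independent filter
    shift = 4

    shift_then_filter = circular_conv(np.roll(x, shift), w)
    filter_then_shift = np.roll(circular_conv(x, w), shift)

    print(np.allclose(shift_then_filter, filter_then_shift))   # True: the filter is translation-equivariant

Because the same three weights serve every position, the filter never has to relearn what a pattern looks like at each location, which is exactly where the sample-efficiency advantage on images comes from.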

Why deep learning won

For about thirty years the dominant paradigm in machine learning was hand-engineered features fed to a relatively simple classifier: HOG descriptors and SIFT keypoints into a linear SVM, n-gram counts into logistic regression, Mel-frequency cepstral coefficients into a Gaussian mixture model. The features encoded human intuition about the domain. The classifier did the easy bit.

Deep learning's victory, beginning with AlexNet in 2012, was the discovery that the right learned features, given enough data and compute, beat anything humans could design, but only when the architecture's inductive bias matched the data's structure. CNNs won on images because translation equivariance and locality are the right priors for natural images. Transformers won on text and, eventually, almost everything else, because content-dependent all-to-all attention is the right prior for compositional, long-range structure. Vision Transformers work on images not in spite of dropping the convolutional bias but because, with sufficient data, the attention mechanism can rediscover convolution-like local patterns and go beyond them.

The pattern is consistent. On perceptual data, pixels, waveforms, language, the deep architectures of the past decade encode priors that match how those signals are generated in the natural world. The data are abundant; the priors are right; the models scale.

The pattern breaks down on tabular data. A spreadsheet of patient blood-test results, demographic codes and outcome flags has no translation equivariance, no locality, no compositional sequence structure. The features are heterogeneous, individually meaningful, and relate to the target through a mix of monotone effects, axis-aligned thresholds and pairwise interactions. The right inductive bias is closer to that of decision trees, and gradient boosting frameworks such as XGBoost and LightGBM continue to win Kaggle leaderboards for tabular problems. Throwing a transformer at a spreadsheet does not help; there is no hidden image in there to find.

Scale changes the trade-off but does not abolish it. With enough data, even a weakly biased model can overcome a mismatched prior; this is part of why frontier language models trained on internet-scale corpora cope reasonably with tabular reasoning. But on the small or medium datasets typical of medical research, a well-chosen classical method usually beats a poorly-matched deep one.
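
The tabular argument is easy to test for yourself. The sketch below is our own synthetic construction, and the specific numbers will vary with the seed: it builds a dataset whose outcome is defined by axis-aligned thresholds buried among irrelevant columns, the structure the tree-based bias assumes, and compares gradient boosting with a small neural network at a sample size typical of clinical research.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 600                                       # a modest, research-sized cohort
    X = rng.normal(size=(n, 20))                  # 20 heterogeneous "tabular" features
    # the outcome follows axis-aligned thresholds; the remaining columns are noise
    y = (((X[:, 0] > 0.5) & (X[:, 1] < 0.0)) | (X[:, 2] > 1.0)).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    trees = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)

    print("gradient boosting accuracy:", trees.score(X_te, y_te))
    print("neural network accuracy:   ", net.score(X_te, y_te))

On data of this shape and size the boosted trees usually match or beat the network; change the structure, by making the outcome a smooth function of a pixel grid, say, and the ranking tends to reverse.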

Practical implication

Match the algorithm to the problem. This is the working clinician's version of no free lunch. Before choosing a model, ask what structure the data have, and choose a method whose inductive bias respects that structure.

For images of any kind (radiographs, fundus photographs, histopathology slides, dermatology cameras), start with a convolutional or vision-transformer backbone, ideally pre-trained on a large general-image corpus and then fine-tuned on the clinical task. For free-text (discharge summaries, radiology reports, GP letters), start with a pre-trained language model. For sequential physiological signals such as ECGs or EEGs, a CNN or temporal transformer trained on the raw waveform usually beats hand-crafted features. For tabular electronic-health-record data (comorbidities, medications, vital-sign trends), gradient boosting is the sensible first attempt; deep learning becomes competitive only when the dataset is very large or when there is a sequential structure (such as a longitudinal trajectory) that an architecture can exploit.

Resist the temptation to use whichever method is currently fashionable. The fact that transformers dominate language does not make them the right choice for predicting hospital readmission from a thirty-feature spreadsheet. The fact that random forests are old does not make them the wrong choice for that problem. Model choice is a question about the data, not a question about prestige. A useful diagnostic: if you cannot articulate, in one sentence, which assumption your model is making and why your data plausibly satisfy it, you have not yet chosen a model, you have chosen a brand. Force yourself to write the sentence before you run the experiment, and revisit it whenever the results disappoint.

What you should take away

  1. There is no universally best learning algorithm. Wolpert's no-free-lunch theorem proves that, averaged over all possible target functions, every learner is equally mediocre.
  2. Real-world problems are not drawn from that uniform average. They are smooth, sparse, hierarchical and structured, and learning is possible because of those regularities.
  3. Every algorithm has an inductive bias, a set of assumptions about which hypotheses to prefer when the data underdetermine the choice. The bias is what makes generalisation possible.
  4. Deep architectures won on perceptual data because their inductive biases (translation equivariance, locality, content-dependent attention) match how natural signals are generated. They have not won, and will not generically win, on tabular data.
  5. When you start a new project, ask first what structure your data have, then choose an algorithm whose bias respects that structure. Fashion is a poor guide; fit between assumptions and reality is the only thing that matters.
