7.6 Decision trees
A decision tree is a sequence of yes/no questions about features, ending in a prediction. You start at the root, ask whether some feature lies above or below a threshold, and follow the matching branch. You repeat until you reach a leaf, which carries either a class label or a regression value. Nothing about the model is mysterious: the path you walked is the explanation, and any clinician, auditor or regulator can read it off the page.
This plain readability is one reason trees have endured for forty years while flashier methods have come and gone. They handle continuous and categorical features without fuss. They are immune to the units a feature happens to be measured in: rescaling age from years to decades, or weight from kilograms to grams, changes the numbers printed at the thresholds but leaves the learned partition and its predictions untouched. They notice non-linear relationships and feature interactions without you having to engineer them. They cope, with a little care, with missing values. And, most importantly for modern practice, they are the building block of the two ensemble methods that dominate tabular machine learning in 2026, random forests and gradient boosting.
The catch is that a single tree, grown to its natural depth, overfits aggressively. Each leaf can be made pure by simply continuing to split until it contains one training point, at which stage the tree has memorised noise as eagerly as signal. Pruning helps; ensembles help much more. The remainder of this section unpacks how a tree predicts, how it is grown, how impurity is measured, how pruning controls variance, and how random forests turn a high-variance learner into a low-variance one by averaging.
In §7.5 we met $k$-nearest neighbours, a non-parametric method that defers all reasoning to query time. Decision trees are the other major non-parametric workhorse of classical machine learning, but they invert that philosophy: a tree front-loads the work into training, learning a hierarchy of splits, then predicts in a handful of comparisons. §7.7 returns to a probabilistic model with naive Bayes, but trees lay the ground for §7.10's ensemble methods, where bagged trees become random forests and sequentially boosted trees become gradient boosting machines.
How a tree predicts
Prediction is a walk from root to leaf. At each internal node sits a question of the form "is feature $x_j$ less than or equal to threshold $t$?" If yes, you move to the left child; if no, the right. You keep going until you arrive at a leaf, which holds the prediction for any input that lands there. For classification, the leaf typically stores the majority class of the training points it contains, together with a probability vector estimated from the leaf's class proportions. For regression, the leaf stores the mean of the training targets that ended up in it.
Consider a small clinical example: predicting whether a patient has flu. The root might ask, "does the patient have a fever above 38 degrees?" A no branch could lead to a leaf labelled "not flu" with probability 0.9, since fever is the most informative single sign. A yes branch leads to a second question: "does the patient have a cough?" If yes, perhaps the leaf is "flu" with probability 0.8; if no, a further split on muscle aches might follow. After three or four such questions the tree commits to a prediction.
Two properties of this walk are worth highlighting. First, the cost of a prediction is the depth of the tree, which is logarithmic in the number of leaves for a reasonably balanced tree and entirely independent of the size of the training set, so trees scale very well at inference time. Second, the prediction surface is piecewise constant: the input space is partitioned into axis-aligned rectangles, each labelled with the leaf's prediction. This is why trees struggle to represent smooth diagonal boundaries (they can only approximate them with staircases) and why the boundary they do produce is so easy to interpret. You can literally print the path that led to any given decision and present it to a clinician or a credit officer.
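To make the walk concrete, here is a minimal sketch in Python. The tree is written out by hand to mirror the flu example above (a nested dictionary rather than a fitted model), and the structure, thresholds and leaf probabilities are illustrative rather than estimated from data.

```python
def predict(node, x):
    """Walk from the root to a leaf and return the leaf's stored prediction."""
    while "leaf" not in node:
        # Each internal node asks: is this feature less than or equal to the threshold?
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

# Hand-built tree mirroring the flu example; all numbers are purely illustrative.
flu_tree = {
    "feature": "temperature", "threshold": 38.0,
    "left":  {"leaf": {"class": "not flu", "p": 0.9}},     # no fever
    "right": {                                              # fever: ask about cough
        "feature": "cough", "threshold": 0.5,
        "left":  {"leaf": {"class": "not flu", "p": 0.6}},  # fever but no cough
        "right": {"leaf": {"class": "flu", "p": 0.8}},      # fever and cough
    },
}

print(predict(flu_tree, {"temperature": 39.2, "cough": 1}))  # {'class': 'flu', 'p': 0.8}
```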
How a tree is grown
A tree is grown by recursive partitioning. The procedure is greedy and refreshingly simple:
- At the current node, search every feature and every candidate threshold; pick the (feature, threshold) split that most reduces impurity in the resulting two child sets.
- Recurse on each child with its share of the data.
- Stop when the leaf is pure (one class only), too small to split usefully, or when a maximum depth is reached.
The greedy step never reconsiders past splits, which is why the resulting tree is not guaranteed to be globally optimal; finding the optimal binary tree is NP-hard. In practice the greedy choice works well, especially once ensembles smooth over its mistakes.
A worked example clarifies the mechanics. Suppose you have one hundred emails, fifty spam and fifty ham, and you are choosing the first split. The candidate feature "does the message contain the word free?" partitions the data as follows. The "free = yes" branch ends up with thirty-five spam and five ham; the "free = no" branch with fifteen spam and forty-five ham. Both children are noticeably purer than the parent. Using Gini impurity, the parent has $G = 1 - (0.5^2 + 0.5^2) = 0.5$. The left child has $G_L = 1 - ((35/40)^2 + (5/40)^2) = 0.21875$ and the right has $G_R = 1 - ((15/60)^2 + (45/60)^2) = 0.375$. The weighted child impurity is $(40/100)(0.21875) + (60/100)(0.375) = 0.3125$, so splitting on "free" yields an impurity reduction of $0.5 - 0.3125 = 0.1875$. The algorithm now compares this gain with the gain available from every other candidate feature and threshold, perhaps the presence of the word "lottery", the number of exclamation marks, the time of sending, and accepts the split with the largest gain.
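The arithmetic is easy to check in a few lines of Python, using the counts from the example; `gini` here is a throwaway helper written for this check, not a library function.

```python
def gini(counts):
    """Gini impurity of a node, given the class counts it contains."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = gini([50, 50])                      # 0.5
left, right = gini([35, 5]), gini([15, 45])  # 0.21875 and 0.375
weighted = (40 / 100) * left + (60 / 100) * right
print(parent - weighted)                     # impurity reduction: 0.1875
```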
For continuous features, the standard trick is to sort the values within the node and consider thresholds halfway between consecutive distinct values; only those midpoints can ever change the partition, so the search remains finite. For categorical features with many levels there are more careful schemes, but the principle is the same. The greedy search costs $\mathcal{O}(nd\log n)$ at a node with $n$ points and $d$ features, and the recursion runs over all nodes. Modern implementations cache pre-sorted feature columns and update impurity counts incrementally, so a tree on a million-row table fits in seconds on a laptop.
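A compact sketch of the whole growing procedure might look as follows. The impurity is Gini, the function names are invented for illustration, and the code skips the pre-sorting and incremental-count tricks that real implementations rely on.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Search every feature and every midpoint threshold; return the
    (gain, feature, threshold) triple with the largest impurity reduction."""
    n, d = X.shape
    parent = gini(y)
    best = None
    for j in range(d):
        values = np.unique(X[:, j])
        thresholds = (values[:-1] + values[1:]) / 2   # midpoints between distinct values
        for t in thresholds:
            mask = X[:, j] <= t
            n_left = mask.sum()
            gain = (parent
                    - (n_left / n) * gini(y[mask])
                    - ((n - n_left) / n) * gini(y[~mask]))
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def grow(X, y, depth=0, max_depth=5, min_samples=5):
    """Recursive partitioning with simple pre-pruning rules."""
    def leaf():
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[counts.argmax()]}       # majority class in the node
    if len(np.unique(y)) == 1 or len(y) < min_samples or depth == max_depth:
        return leaf()
    split = best_split(X, y)
    if split is None or split[0] <= 0:                 # no split improves impurity
        return leaf()
    _, j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left":  grow(X[mask],  y[mask],  depth + 1, max_depth, min_samples),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}
```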
Impurity measures
A split is good when its children are purer than its parent. Three impurity functions dominate practice.
For classification, with class proportions $p_1, \ldots, p_K$ at a node:
- Shannon entropy: $H = -\sum_k p_k \log_2 p_k$. This is the number of bits needed, on average, to encode the class label of a point drawn uniformly from the node. It is zero when one class dominates and maximal when the classes are equally probable.
- Gini impurity: $G = 1 - \sum_k p_k^2 = \sum_k p_k (1 - p_k)$. This is the expected error if you predicted by sampling from the node's class distribution rather than committing to the majority. Like entropy, Gini is zero for a pure node and maximal at uniform proportions.
Both quantities behave very similarly across the relevant range of $p$, and in practice Gini and entropy produce trees that agree on most splits and never disagree dramatically. Gini is slightly cheaper because it avoids the logarithm, which is why CART, and therefore scikit-learn, defaults to it.
For regression trees, the analogue is the variance of the target within a node, which corresponds to mean squared error if predictions are leaf means: $\mathrm{MSE} = (1/n) \sum_i (y_i - \bar{y})^2$.
In every case, the quantity actually optimised at a candidate split is the information gain, or, equivalently, impurity reduction, defined as
$$\mathrm{IG} = H(\mathrm{parent}) - \frac{n_L}{n} H(\mathrm{left}) - \frac{n_R}{n} H(\mathrm{right}),$$
where $n_L$ and $n_R$ are the sizes of the two children and $H$ stands for whichever impurity measure is in use (entropy, Gini or variance). The split with the largest information gain is the one chosen. Notice that the children's impurities are weighted by the proportion of data that flows into them: a clean split that affects only a handful of points cannot beat a slightly less clean split that classifies the bulk of the data.
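To pin the measures down, here is a small sketch of each in Python, together with a quick check that entropy and Gini track one another as the class proportions move from uniform towards pure. The helper names are illustrative only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of a vector of class proportions."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def variance(y):
    """Regression impurity: mean squared error around the node mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

# Both classification measures fall towards zero as a node becomes purer,
# and they rarely disagree about which of two candidate splits is better.
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"p = {p}: entropy = {entropy([p, 1 - p]):.3f}, gini = {gini([p, 1 - p]):.3f}")
```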
Pruning
A tree grown until every leaf is pure achieves zero training error and, almost always, terrible test error. The cure is pruning, which trades a little training fit for substantial generalisation.
Pre-pruning, sometimes called early stopping, intervenes during growth. You set a maximum depth, a minimum number of samples per leaf, a minimum number of samples required to attempt a split, or a minimum information gain. When any limit is hit, the recursion stops and a leaf is created. Pre-pruning is fast and simple but has one well-known flaw: it can be myopic. A split that yields a small immediate gain may unlock a much larger gain at the next level, and a strict early-stopping rule will reject the parent and lose the descendant entirely.
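In scikit-learn these pre-pruning rules map onto constructor arguments of `DecisionTreeClassifier`. The particular values below are placeholders; in practice they would be chosen by cross-validation.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                 # stop the recursion at this depth
    min_samples_split=50,        # do not attempt to split smaller nodes
    min_samples_leaf=20,         # every leaf must keep at least this many points
    min_impurity_decrease=1e-3,  # reject splits whose weighted gain falls below this
)
# tree.fit(X_train, y_train) on your own data
```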
Post-pruning avoids that trap by growing the full tree first and then trimming back. The standard algorithm, cost-complexity pruning from CART, optimises the penalised criterion
$$R_\alpha(T) = R(T) + \alpha |T|,$$
where $R(T)$ is the training error of tree $T$ and $|T|$ is the number of leaves. As $\alpha$ rises from zero, whole subtrees collapse into single leaves whenever the saved complexity exceeds the cost of the extra training error. The procedure produces a nested sequence of pruned subtrees, and the right $\alpha$ is chosen by cross-validation. Post-pruning is more expensive than pre-pruning but tends to find better trees and is the recommended default whenever a single tree is the final model.
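A sketch of the procedure with scikit-learn's implementation: compute the pruning path, score each candidate $\alpha$ by cross-validation, and refit at the winner. The dataset here is just a stand-in for your own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The pruning path: the nested sequence of subtrees, indexed by alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate each alpha and keep the best one.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```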
Strengths and weaknesses
Trees have a useful list of practical strengths. They are interpretable: the path from root to leaf is a human-readable rule. They handle mixed feature types natively: continuous, ordinal and categorical features sit happily side by side. They are scale-invariant: monotone transformations of a feature, including standardisation and log transforms, leave the partition unchanged, so you can skip the preprocessing step that linear models demand. They capture non-linear interactions automatically (an interaction effect is just a deeper split) without you needing to engineer cross terms. They tolerate irrelevant features, since uninformative columns simply never get chosen for splits. And with surrogate splits they cope with missing values at both training and inference time.
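The scale-invariance claim is easy to verify empirically. In the sketch below (a synthetic dataset invented for the purpose), a tree fitted to the raw features and a tree fitted to their logarithms make identical predictions on the training points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(500, 3))
y = (X[:, 0] > 40).astype(int)      # the label depends only on the first feature

raw = DecisionTreeClassifier(random_state=0).fit(X, y)
logged = DecisionTreeClassifier(random_state=0).fit(np.log(X), y)  # monotone transform

agreement = (raw.predict(X) == logged.predict(np.log(X))).mean()
print(agreement)                    # expected: 1.0
```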
Their weaknesses are equally real. The largest is high variance. Because the greedy split at the root depends on every data point, a small perturbation to the training set, replacing five points, say, can flip the chosen feature at the root, which cascades through the rest of the tree and produces a structurally different model. Two trees grown on bootstrap samples of the same dataset can look almost unrelated, even when their predictions agree on most points. This sensitivity is why a single tree is rarely the model of choice in production.
A second weakness is poor representation of smooth or linear relationships. To approximate a diagonal boundary, an axis-aligned tree must lay down a staircase of many short splits, each contributing little information gain on its own; with a fixed depth budget the staircase is coarse. A linear model with two features would solve the problem in one weight vector. A third weakness is sensitivity to small data perturbations more generally: minority classes, rare feature values, and outliers can all produce unstable splits, particularly near the leaves where sample sizes shrink. Once again, the cure is to average, to grow many trees and let their disagreements cancel.
Random forests
A random forest takes the high-variance, low-bias single tree and tames it by averaging. The recipe has two ingredients. First, bootstrap aggregation (bagging): train each tree on a bootstrap resample of the original dataset, a sample of size $n$ drawn with replacement, so roughly 63 per cent of the original points appear in each bootstrap and the rest are out-of-bag. Second, random feature subsampling: at each split, instead of considering all $d$ features, restrict the search to a random subset, typically of size $\sqrt{d}$ for classification and $d/3$ for regression. The combination de-correlates the trees: bagging alone would still produce trees that agreed too much, because a single dominant feature would be picked at every root. Forcing each split to choose from a different random subset breaks that correlation and lets the average bring the variance down.
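The two ingredients fit in a dozen lines. The sketch below does the bootstrapping by hand and delegates the per-split feature subsampling to scikit-learn's `max_features` option; the function names and defaults are illustrative, and non-negative integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # bootstrap: n rows drawn with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",            # random sqrt(d) candidate features per split
            random_state=int(rng.integers(2**31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    # Majority vote across trees for each sample (assumes non-negative integer labels).
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```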
Predictions are aggregated by majority vote for classification or the mean for regression. The variance of the average of $B$ identically distributed but correlated estimators with pairwise correlation $\rho$ and individual variance $\sigma^2$ is $\rho\sigma^2 + (1-\rho)\sigma^2/B$; sending $\rho$ towards zero is exactly what feature subsampling does, and the second term shrinks as more trees are added. With $\sigma^2 = 1$, $\rho = 0.1$ and $B = 300$, for instance, the ensemble variance is $0.1 + 0.9/300 \approx 0.103$, roughly a ten-fold reduction on a single tree.
The hyperparameters that matter are the number of trees, typically a hundred to five hundred, where more is monotonically better up to a plateau; the maximum number of features per split, with $\sqrt{d}$ a reliable default for classification and $d/3$ for regression; the maximum depth, often left unlimited because the ensemble controls overfitting; and the minimum samples per leaf, usually one for classification and around five for regression. Random forests are the practical default for tabular data when you do not want to tune carefully. Out of the box, they often beat well-tuned linear models, give a free out-of-bag estimate of generalisation error, expose feature-importance rankings, and parallelise trivially across cores.
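The same recipe through scikit-learn's `RandomForestClassifier`, with the defaults just discussed and the out-of-bag estimate switched on. The dataset is a placeholder for your own table.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=300,       # a few hundred trees; more only helps, up to a plateau
    max_features="sqrt",    # random sqrt(d) candidate features at each split
    min_samples_leaf=1,     # the usual classification default
    oob_score=True,         # free generalisation estimate from out-of-bag points
    n_jobs=-1,              # trees are independent, so grow them in parallel
    random_state=0,
).fit(X, y)

print(forest.oob_score_)            # out-of-bag accuracy
print(forest.feature_importances_)  # impurity-based importance ranking
```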
Where trees are used in 2026
- Tabular data, which still accounts for the majority of business machine learning. From churn prediction and demand forecasting to clinical risk stratification, gradient-boosted trees and random forests remain the default first model.
- Initial feature exploration and importance ranking, where the speed of training and the readable splits make trees an excellent diagnostic tool before committing to a more complex model.
- Production fraud detection and credit scoring, where the combination of accuracy, calibrated probabilities and a defensible audit trail is hard to beat with deep models.
- Baselines against which new deep tabular models are measured. A well-tuned XGBoost or LightGBM run is the line a transformer has to clear before it counts.
What you should take away
- A decision tree is a sequence of yes/no questions about features that ends in a constant prediction inside each axis-aligned region of input space.
- Trees are grown greedily by recursive partitioning, choosing at each node the (feature, threshold) split that most reduces impurity, measured by Gini, entropy or variance.
- Unpruned trees overfit; either pre-prune by limiting depth and leaf size, or post-prune via cost-complexity pruning chosen by cross-validation.
- The chief weakness of a single tree is high variance: small data perturbations produce structurally different trees. The chief weakness of all trees is poor approximation of smooth, diagonal boundaries.
- Random forests train many trees on bootstrap resamples with random feature subsets at each split and average their predictions, dramatically reducing variance and providing the practical default for tabular machine learning.