7.10 Ensemble methods

A single decision tree is unstable: small changes in the training data can produce a very different tree. But combine many trees and the noise largely cancels out. Ensembles are the empirical winners on tabular data; they routinely beat hand-tuned single models and are the engines behind Kaggle competitions, credit scoring, and most production ML on structured data.

7.10.1 Why ensembles work: the bias–variance view

Decompose the expected squared-error loss of a predictor $\hat f$ at a point $\mathbf{x}$:

$$\mathbb{E}[(y-\hat f(\mathbf{x}))^2] = \underbrace{(f^*(\mathbf{x}) - \mathbb{E}\hat f(\mathbf{x}))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat f(\mathbf{x}))}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}}.$$

Average $M$ predictors with the same bias and variance $\sigma_f^2$ but pairwise correlation $\rho$:

$$\text{Var}(\bar f) = \rho\sigma_f^2 + \frac{1-\rho}{M}\sigma_f^2.$$

If $\rho=0$, the variance shrinks like $1/M$; if $\rho=1$, averaging gives no benefit. For example, with $\rho=0.3$ and $M=100$ the variance is $0.3\sigma_f^2 + 0.007\sigma_f^2$: beyond a point, adding more learners barely helps, and only decorrelation does. The ensemble's job is therefore to reduce $\rho$: to make the base learners disagree in useful ways.

7.10.2 Bagging

Bagging (bootstrap aggregating; Breiman, 1996) generates $M$ bootstrap samples $D_1, \ldots, D_M$, each a draw of $n$ examples with replacement from the original training set, and trains a base learner on each. Predictions are averaged (regression) or majority-voted (classification). About $1-(1-1/n)^n\to 1-1/e\approx 63\%$ of unique training examples appear in each bootstrap; the remaining $\sim 37\%$ form the out-of-bag sample, providing a free validation estimate.
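
A two-line simulation confirms the 63% figure:

import numpy as np
rng = np.random.default_rng(0)
n = 100_000
boot = rng.integers(0, n, size=n)      # one bootstrap sample of indices
print(np.unique(boot).size / n)        # ~0.632: fraction of unique training examples drawn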

Bagging shines for high-variance, low-bias base learners, which is exactly what unpruned decision trees are.

7.10.3 Random forests

Random forests (Breiman, 2001) add a second layer of randomness: at each internal node, only a random subset of $m$ features (out of $d$) is considered. This decorrelates the trees, pushing $\rho$ down further. Defaults: $m=\sqrt d$ for classification, $m=d/3$ for regression.

A few practical observations make random forests the most popular off-the-shelf classifier:

  • Adding more trees never hurts (assuming you can afford the compute): the variance keeps shrinking and the bias stays put.
  • They handle mixed feature types (continuous and categorical) without preprocessing.
  • They provide free feature-importance estimates: at each split, attribute the impurity reduction to the chosen feature; sum across trees.
  • Out-of-bag error estimates generalisation without a separate validation set.
  • Good defaults work well; the main tuning knobs are number of trees, $m$, and tree depth.
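
In scikit-learn the whole recipe is a few lines; the snippet below fits a 500-tree forest on the built-in breast-cancer dataset and reads off the out-of-bag accuracy and the most important features:
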
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)
print("Top 5 features:", np.argsort(rf.feature_importances_)[-5:][::-1])

7.10.4 AdaBoost: boosting done right

Boosting builds an additive ensemble sequentially: each new learner focuses on the examples the current ensemble gets wrong.

AdaBoost (Freund and Schapire, 1997) for binary classification with $y\in\{-1,+1\}$:

  1. Initialise weights $w_i^{(1)} = 1/n$.
  2. For $t=1,\ldots,T$:
    • Train weak learner $h_t: \mathcal{X}\to\{-1,+1\}$ on the weighted data.
    • Compute weighted error $\varepsilon_t = \sum_i w_i^{(t)} \mathbf{1}[y_i\neq h_t(\mathbf{x}_i)]$.
    • If $\varepsilon_t \geq 0.5$, stop.
    • Set learner weight $\alpha_t = \tfrac12 \log\frac{1-\varepsilon_t}{\varepsilon_t}$.
    • Update example weights $w_i^{(t+1)} \propto w_i^{(t)} \exp(-\alpha_t y_i h_t(\mathbf{x}_i))$, then renormalise.
  3. Final classifier: $H(\mathbf{x}) = \text{sign}\big(\sum_t \alpha_t h_t(\mathbf{x})\big)$.
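
The loop above translates almost line for line into code. Below is a minimal sketch (the function names are illustrative) that uses depth-one scikit-learn trees as the weak learners:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal AdaBoost for labels y in {-1, +1}, with decision stumps as weak learners."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = w[pred != y].sum()                 # weighted error
        if eps >= 0.5:                           # weak learner no better than chance: stop
            break
        eps = max(eps, 1e-12)                    # guard against a perfect stump (log of zero)
        alpha = 0.5 * np.log((1 - eps) / eps)    # learner weight
        w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct examples
        w = w / w.sum()                          # renormalise
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)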

Derivation. AdaBoost is precisely forward stagewise additive modelling under exponential loss $\ell(y, F(\mathbf{x})) = e^{-y F(\mathbf{x})}$. At round $t$, find $(\alpha_t, h_t)$ minimising

$$\sum_i e^{-y_i (F_{t-1}(\mathbf{x}_i) + \alpha h(\mathbf{x}_i))} = \sum_i w_i^{(t)} e^{-y_i \alpha h(\mathbf{x}_i)}.$$

Splitting into correctly and incorrectly classified terms,

$$\sum_{\text{correct}} w_i^{(t)} e^{-\alpha} + \sum_{\text{wrong}} w_i^{(t)} e^{\alpha} = e^{-\alpha}(1-\varepsilon_t) + e^{\alpha}\varepsilon_t.$$

Differentiating in $\alpha$ and setting to zero gives $\alpha_t = \tfrac12 \log\frac{1-\varepsilon_t}{\varepsilon_t}$, exactly the AdaBoost weight. The training error is bounded by $\prod_t 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$, which decays exponentially fast as long as each weak learner beats chance.

7.10.5 Gradient boosting

If you ask a working data scientist what they reach for first on a tabular dataset (patient records, insurance claims, transaction logs, sensor readings, anything that fits in a spreadsheet), the answer is almost always gradient boosting. For roughly two decades it has been the most successful general-purpose algorithm on structured data. It wins the overwhelming majority of Kaggle competitions on tabular tasks. It powers credit-scoring engines at the major banks, click-through prediction at the advertising platforms, fraud detection in payment networks, ranking functions in search, and risk-stratification scores embedded in clinical decision support. When a deep neural network is reported to "match" gradient boosting on a tabular benchmark, it is usually news; the converse is not. On small and medium tabular datasets, anywhere up to a few million rows with a few hundred features, a well-tuned gradient-boosted ensemble is still the algorithm to beat.

The construction is simple. Train a weak learner, typically a shallow regression tree three to eight levels deep (a depth-one tree is called a stump). Look at where it gets the answer wrong. Train a second weak learner whose job is to predict that error. Add it to the first. Look at the new errors. Train a third learner to predict those. Iterate. After a few hundred or a few thousand rounds, the sum of all those tiny corrections is an accurate predictor. Each individual tree on its own is barely better than the mean; the ensemble of them, weighted by a small learning rate so no single round dominates, is a state-of-the-art model.

Three industrial implementations dominate practice: XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017, from Microsoft), and CatBoost (Prokhorenkova et al., 2018, from Yandex). They differ in engineering details (histogram bins, leaf-wise versus level-wise growth, ordered target statistics for categorical features), but the underlying mathematics is the one Friedman wrote down in 2001. A practitioner can switch between the three in a few lines of code, and on most problems they perform within a percentage point of each other. TabPFN v2 (Hollmann et al., Nature 2025) and TabICL match or beat boosted trees on small tabular datasets (under ten thousand rows); gradient boosting remains the answer for larger data.

§7.10.3 introduced random forests, which combine bagged trees in parallel and rely on decorrelation to reduce variance. §7.10.4 introduced AdaBoost, which builds an additive ensemble sequentially under exponential loss. Gradient boosting is the natural generalisation of AdaBoost: any differentiable loss, not just exponential, and a clean interpretation as gradient descent in function space. This subsection unpacks how it works, why it works, how to tune it, and where its limits lie.

Symbols Used Here

  • $\mathcal{D}$: training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$
  • $F_t$: predictor (a function) after $t$ boosting rounds
  • $h_t$: weak learner added at round $t$
  • $\eta$: learning rate (shrinkage), also written $\nu$
  • $\mathcal{L}$: loss function, differentiable in its second argument

The boosting idea

Gradient boosting builds an additive model, a sum of weak learners, one term at a time. Each new term is fitted to the residual mistakes of the running ensemble. Formally:

  1. Initialise with a constant predictor: $$F_0(\mathbf{x}) = \arg\min_{c} \sum_{i=1}^n \mathcal{L}(y_i, c).$$ For squared error this is the mean of the targets; for log-loss in binary classification it is the log-odds of the empirical positive rate; for absolute error it is the median.

  2. For $t = 1, 2, \ldots, T$:

    • Compute pseudo-residuals by evaluating the negative gradient of the loss with respect to the prediction, at the current ensemble: $$r_i^{(t)} = -\left[\frac{\partial \mathcal{L}(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}}.$$
    • Fit a weak learner $h_t$, almost always a regression tree of fixed maximum depth, to predict the pseudo-residuals from the inputs. The tree is trained as an ordinary regression tree on the dataset $\{(\mathbf{x}_i, r_i^{(t)})\}$.
    • Optionally re-fit each leaf of $h_t$ by exact minimisation of the original loss within that leaf, rather than using the mean residual the tree learned. For squared loss this is a no-op; for log-loss it gives the Newton-style update.
    • Update with shrinkage: $$F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta\, h_t(\mathbf{x}).$$
  3. Return $F_T$ as the final predictor.

For squared loss the pseudo-residuals collapse to ordinary residuals, $r_i = y_i - F_{t-1}(\mathbf{x}_i)$, which is why gradient boosting is sometimes presented as "fit each new tree to the residuals". That description is correct only for the squared-error case. The general formulation, fit to the negative gradient, is what lets the same algorithm handle log-loss for classification, absolute error for robust regression, the Huber loss, the quantile pinball loss for prediction intervals, the ranking losses used in learning-to-rank, the Cox partial likelihood for survival analysis, and so on. Each loss simply changes the formula for $r_i^{(t)}$; the rest of the procedure is identical. This generality is the algorithm's defining strength.
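
Two concrete instances make the dependence on the loss explicit. For binary log-loss with labels $y \in \{0, 1\}$ and a sigmoid link $\sigma$, the pseudo-residual is $r_i^{(t)} = y_i - \sigma(F_{t-1}(\mathbf{x}_i))$, the gap between the label and the currently predicted probability; for the $\tau$-quantile pinball loss it is $\tau$ when $y_i > F_{t-1}(\mathbf{x}_i)$ and $\tau - 1$ otherwise, steering the trees to over- or under-predict in exactly the right proportion.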

Worked example

Take a tiny regression problem with three points: $(x, y) = \{(1, 2), (2, 4), (3, 6)\}$, the targets lying on the line $y = 2x$. We will use squared-error loss, depth-one trees, and a learning rate $\eta = 0.5$.

Initialisation. $F_0$ is the constant minimising the squared error: the mean, $F_0 = (2 + 4 + 6)/3 = 4$. Predictions: $4, 4, 4$. Residuals (which equal the negative gradient for squared loss): $r^{(1)} = (2 - 4, 4 - 4, 6 - 4) = (-2, 0, +2)$.

Round 1. Fit a regression stump to predict the residuals from $x$. With three points, a single split gives two options: split at $x = 1.5$ (left leaf $\{x=1\}$, right leaf $\{x=2,3\}$, residual means $-2$ and $+1$) or split at $x = 2.5$ (left $\{x=1,2\}$, right $\{x=3\}$, residual means $-1$ and $+2$). Both reduce the residual sum of squares to the same value. Suppose the tree picks the second: $h_1(x) = -1$ if $x \le 2$, $+2$ if $x > 2$. Update with $\eta = 0.5$: $$F_1(1) = 4 + 0.5(-1) = 3.5, \quad F_1(2) = 3.5, \quad F_1(3) = 4 + 0.5(+2) = 5.$$ New residuals: $(2 - 3.5, 4 - 3.5, 6 - 5) = (-1.5, +0.5, +1)$. Notice the predictions have crept towards the targets but have not jumped to them: shrinkage is doing its job.

Round 2. Fit another stump to $(-1.5, +0.5, +1)$. The split at $x = 1.5$ now wins: left leaf gives mean residual $-1.5$, right leaf gives mean $0.75$. Update: $$F_2(1) = 3.5 + 0.5(-1.5) = 2.75, \quad F_2(2) = 3.5 + 0.5(0.75) = 3.875, \quad F_2(3) = 5 + 0.5(0.75) = 5.375.$$ The predictor is converging towards $(2, 4, 6)$, but each tree is a small, regularised step. Continue for, say, 200 rounds and the model traces the line almost exactly.

The example is deliberately small enough to follow on paper, and three pedagogical points fall out of it. First, the residuals at each round are themselves a sensible learning signal: if you can predict the current errors at all, the ensemble improves. Second, shrinkage means each round is a fractional move in the right direction, which produces a flexible but stable final model. Third, even depth-one trees produce a piecewise-constant predictor that becomes gradually more refined as more trees are added: the staircase tightens with each round.
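
The arithmetic can be checked in a few lines of numpy by hard-coding the two stumps chosen above (this is a verification of the worked example, not a general implementation):

import numpy as np

x = np.array([1.0, 2.0, 3.0]); y = np.array([2.0, 4.0, 6.0]); eta = 0.5
F = np.full(3, y.mean())                 # F0 = 4
h1 = np.where(x <= 2, -1.0, 2.0)         # round-1 stump, split at x = 2.5
F = F + eta * h1                         # F1 = [3.5, 3.5, 5.0]
h2 = np.where(x <= 1, -1.5, 0.75)        # round-2 stump, split at x = 1.5
F = F + eta * h2                         # F2 = [2.75, 3.875, 5.375]
print(F)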

Why it works

The key insight, due to Friedman (2001), is that boosting is gradient descent in function space. Ordinary gradient descent updates a parameter vector $\theta$ by $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$. Boosting updates a function $F$ by $F \leftarrow F - \eta \nabla_F \mathcal{L}$, where the gradient is taken pointwise: at each training point, the partial derivative of the loss with respect to the prediction at that point. The pseudo-residuals are exactly the components of that functional negative gradient, evaluated at the training inputs.

The catch is that we cannot freely modify a function at finitely many points and call the result a function; it would not generalise. So we project the negative gradient onto our hypothesis class of weak learners by training $h_t$ to predict the pseudo-residuals on the training data. The fitted tree $h_t$ is the closest thing in tree-space to the ideal pointwise direction of steepest descent. Adding $\eta h_t$ to the ensemble is therefore an approximate, regularised step of gradient descent in function space, approximate because the tree only roughly matches the gradient, and regularised because the learning rate $\eta$ is small.

This functional view explains several otherwise-puzzling features of the algorithm. Why does shrinkage help? Because gradient descent with a small step size and many iterations is more stable than a large step size and few iterations. Why do shallow trees work better than deep ones? Because each weak learner is meant to capture a direction, not the whole answer; deep trees overshoot the gradient and overfit. Why does the algorithm not overfit catastrophically as $T$ grows? Because each step is small, the implicit regularisation of the tree class is strong, and stochastic subsampling further damps the variance. In practice the test loss curve is a long, slow U-shape; early stopping on a validation set picks the bottom.

The exponential-loss derivation of AdaBoost in §7.10.4 falls out as a special case: substitute $\mathcal{L}(y, F) = e^{-yF}$ into the recipe and the pseudo-residuals become $y_i e^{-y_i F_{t-1}(\mathbf{x}_i)}$, which is the AdaBoost example weight times the label sign. Friedman's contribution was to liberate the algorithm from that single loss.

Hyperparameters

A working gradient-boosting fit has roughly six knobs that matter and a long tail of secondary ones. The six are:

  • Number of trees $T$. Typically a few hundred to a few thousand. The sensible procedure is to set $T$ large and use early stopping on a validation set: stop adding trees when the validation loss has not improved for, say, fifty rounds.
  • Learning rate $\eta$ (sometimes $\nu$). Almost always between 0.01 and 0.3. There is an $\eta$–$T$ trade-off: halving $\eta$ roughly doubles the number of trees needed for the same fit, but reduces variance and usually improves generalisation slightly. A common production setting is $\eta = 0.05$ with $T = 1000$ and early stopping.
  • Maximum tree depth. Usually 3 to 8. Depth one (stumps) gives an additive model with no feature interactions; depth two gives pairwise interactions; depth $k$ allows $k$-way interactions. Most tabular datasets have meaningful interactions of order two to four, which is why depth 4 to 6 is the typical sweet spot.
  • Row subsample fraction (stochastic gradient boosting). Each round trains the tree on a random fraction, say 0.5 to 0.8, of the training rows. This decorrelates the trees, reduces variance, and adds a small bias-like regularisation effect.
  • Feature subsample fraction. At each tree, or at each split, consider only a random subset of features. As in random forests, this further reduces the correlation between trees.
  • L2 regularisation on leaf values (XGBoost's lambda, reg_lambda). Penalises the squared leaf-output magnitudes, shrinking each leaf towards zero and stabilising the Newton updates.

XGBoost adds two technical refinements that account for much of its empirical edge over the textbook algorithm. First, second-order updates: instead of taking a step in the direction of the negative gradient with a fixed leaf-value rule, XGBoost uses both the gradient and the Hessian (the second derivative of the loss with respect to each prediction) to compute Newton-style optimal leaf values. This is essentially the leaf re-fitting step described earlier and converges significantly faster. Second, explicit regularisation in the objective: the loss minimised at each split includes a penalty $\gamma T_{\text{leaves}} + \tfrac{1}{2}\lambda \sum_j w_j^2$ on the leaf count and the squared leaf weights. The split-finding routine therefore only takes a split when its loss reduction exceeds $\gamma$, giving principled tree pruning.
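
Concretely, writing $g_i$ and $h_i$ for the first and second derivatives of the loss with respect to the current prediction for example $i$ (not to be confused with the weak learner $h_t$), and $G_j, H_j$ for their sums over the examples falling in leaf $j$, the regularised objective is minimised by the leaf value
$$w_j^* = -\frac{G_j}{H_j + \lambda},$$
and the gain of splitting a node into left and right children is
$$\text{gain} = \tfrac12\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma,$$
so a split is accepted only when its loss reduction exceeds the $\gamma$ cost of adding a leaf.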

LightGBM swaps the level-wise tree-growth strategy for a leaf-wise one: at each step it splits the leaf with the largest loss reduction, regardless of depth, and it uses gradient-based one-side sampling to speed up training on large datasets. CatBoost handles categorical features natively with ordered target statistics, avoiding the leakage that naive target encoding suffers. All three implementations accept the standard hyperparameters above; switching between them is usually a matter of changing the import.

A workable starting recipe for a fresh tabular problem is: $\eta = 0.05$, $T = 2000$ with early stopping at 50 rounds, depth 6, row and feature subsample 0.8, $\lambda = 1$. Then tune depth and learning rate via cross-validation. This recipe is good enough to win modest Kaggle competitions before any clever feature engineering.
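
As an illustrative sketch of that recipe with the xgboost scikit-learn wrapper (the exact home of the early-stopping argument has moved between the constructor and fit() across library versions, so check your installed version):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=2000,          # T large; early stopping picks the effective number of trees
    learning_rate=0.05,         # eta
    max_depth=6,
    subsample=0.8,              # row subsample per tree
    colsample_bytree=0.8,       # feature subsample per tree
    reg_lambda=1.0,             # L2 on leaf values
    early_stopping_rounds=50,   # stop when the validation loss stalls for 50 rounds
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("trees actually used:", model.best_iteration)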

Strengths

Gradient boosting is the algorithm that earned its place by accumulating practical wins, not by being the most theoretically elegant.

  • State of the art on tabular data. Across the standard benchmarks (UCI suites, Kaggle competition datasets, the Shwartz-Ziv and Armon (2022) tabular study), gradient-boosted trees match or beat carefully-tuned deep neural networks on the majority of problems with fewer than a few million rows. Deep learning catches up only when the data is large, the features are perceptual, or there is meaningful structure (sequence, image, graph) the network can exploit.
  • Handles mixed feature types without preprocessing. Continuous, ordinal, and (with CatBoost or one-hot encoding) categorical features can be fed in directly. There is no need to standardise, normalise, or transform.
  • Robust to feature scaling. Trees split on order, not magnitude, so multiplying a feature by a thousand or taking its log changes nothing about the fit.
  • Captures interactions automatically. A depth-$k$ tree can express any $k$-way interaction. The user does not have to specify which features should interact, as in classical statistics; the trees discover useful interactions during fitting.
  • Provides feature importance. Three importance measures fall out naturally: gain (total loss reduction attributable to a feature across all splits), cover (number of training rows touched by splits on that feature), and frequency (number of splits using that feature). For more reliable importance, SHAP values can be computed efficiently for tree ensembles via TreeSHAP.
  • Handles missing data. XGBoost's sparsity-aware split picks a default direction at each node so that rows with missing values are routed sensibly without imputation.
  • Engineering maturity. XGBoost, LightGBM, and CatBoost are all heavily optimised, GPU-accelerated, distributed, and battle-tested in production. They have stable APIs, scikit-learn compatibility, and excellent documentation.

Weaknesses

The same shape that makes gradient boosting strong on tabular data limits it elsewhere.

  • Slow to train compared to a single tree or a linear model. A modern boosted ensemble on a large dataset can take hours; the same problem might take seconds with logistic regression.
  • Many hyperparameters to tune. The interaction between learning rate, tree count, depth, and subsample fractions means a careful cross-validation sweep is usually required to extract the last few percentage points of performance.
  • Less interpretable than a single tree. A model with two thousand depth-six trees is not a flowchart you can read. Feature importance and SHAP help, but the model itself is a black box at the level of decisions about individual cases.
  • Does not extrapolate beyond the range of the training targets. Trees are piecewise constant; the maximum prediction is the largest leaf value, the minimum is the smallest. Linear models extrapolate; trees do not (see the snippet after this list).
  • Does not help much on perceptual data (images, audio, raw text, video). Convolutional and transformer architectures encode the locality and translation structure that trees cannot exploit. Gradient boosting on raw pixels is hopeless.
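
A short demonstration of the extrapolation point, using scikit-learn's GradientBoostingRegressor on the toy relationship $y = 2x$:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()                    # targets range from 0 to 20
gb = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3).fit(X_train, y_train)
print(gb.predict([[5.0], [20.0], [100.0]]))      # ~10 in range; out-of-range inputs plateau near 20, not 40 or 200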

What you should take away

  1. Gradient boosting fits an additive ensemble of weak learners, almost always shallow regression trees, by sequentially fitting each new learner to the negative gradient of the loss with respect to the current predictions. This is gradient descent in function space.
  2. The three production-grade libraries are XGBoost, LightGBM, and CatBoost. They share the underlying mathematics and differ in engineering details; pick whichever your team already uses.
  3. The principal hyperparameters are number of trees, learning rate, tree depth, row and feature subsample fractions, and L2 regularisation on leaf values. A good starting point is $\eta = 0.05$, depth 6, subsample 0.8, $T$ large with early stopping.
  4. On tabular data of small to medium size, gradient boosting is the algorithm to beat. On perceptual data (images, audio, language) it is the wrong tool.
  5. Use a held-out validation set with early stopping rather than trying to pick the right $T$ by guesswork. Inspect feature importance or SHAP values to sanity-check that the model is using sensible signals before deploying.

7.10.6 A from-scratch gradient-boosting stub

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GBRegressor:
    def __init__(self, n_estimators=200, lr=0.05, max_depth=3):
        self.n_estimators, self.lr, self.max_depth = n_estimators, lr, max_depth

    def fit(self, X, y):
        self.f0 = y.mean()
        F = np.full_like(y, self.f0, dtype=float)
        self.trees = []
        for _ in range(self.n_estimators):
            r = y - F                                              # negative gradient of squared error, i.e. the residual
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, r)
            F += self.lr * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.f0, dtype=float)
        for tree in self.trees:
            F += self.lr * tree.predict(X)
        return F
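
A quick sanity check of the stub against scikit-learn's own implementation on a synthetic regression task (the comparison setup below is illustrative):

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ours = GBRegressor(n_estimators=200, lr=0.05, max_depth=3).fit(X_tr, y_tr)
ref = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3).fit(X_tr, y_tr)
print("stub MSE:   ", mean_squared_error(y_te, ours.predict(X_te)))
print("sklearn MSE:", mean_squared_error(y_te, ref.predict(X_te)))   # the two should be close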

7.10.7 Stacking

Stacking trains a meta-learner on the out-of-fold predictions of several diverse base models. To avoid leakage, the base models predict via cross-validation: for each fold, train on the other folds and predict the held-out fold. These out-of-fold predictions become the meta-training features. A logistic regression (for classification) or ridge regression (for regression) meta-learner is usually enough: the base learners do the heavy lifting; the meta-learner just learns how to combine them.
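
In scikit-learn, StackingClassifier handles the out-of-fold bookkeeping automatically (the choice of base models below is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold predictions become the meta-features
)
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())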
