Gradient boosting (Friedman 2001) is a generalisation of AdaBoost to arbitrary differentiable loss functions. Builds an additive model
$$F_M(x) = \sum_{m=1}^M \gamma_m h_m(x)$$
by sequentially adding weak learners $h_m$, each fitted to the negative gradient of the loss with respect to the current ensemble's predictions.
Algorithm:
Initialise $F_0(x) = \arg\min_c \sum_n L(y_n, c)$ (a constant, the optimal constant prediction).
For $m = 1, \ldots, M$:
a. Compute pseudo-residuals (negative gradients of the loss):
$$r_{nm} = -\!\left[\frac{\partial L(y_n, F(x_n))}{\partial F(x_n)}\right]_{F = F_{m-1}}$$
b. Fit a weak learner $h_m$ (typically a small decision tree) to $\{(x_n, r_{nm})\}$ by minimising squared error.
c. Find the optimal step size:
$$\gamma_m = \arg\min_\gamma \sum_n L(y_n, F_{m-1}(x_n) + \gamma h_m(x_n))$$
d. Update: $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$, with shrinkage $\nu \in (0, 1]$.
For squared error, the pseudo-residuals are simply $r_{nm} = y_n - F_{m-1}(x_n)$. For cross-entropy, they are $y_n - p_n$ (the prediction error in probability space). In each case, gradient boosting trains the next weak learner to predict the current residuals.
Modern gradient-boosted-tree libraries dominate tabular-data competitions:
XGBoost (Chen & Guestrin 2016), adds second-order Taylor expansion of the loss for more accurate split decisions:
$$L_{\mathrm{XGBoost}}^{(m)} = \sum_n [g_n h_m(x_n) + \tfrac{1}{2} h_n h_m(x_n)^2] + \Omega(h_m)$$
where $g_n, h_n$ are first and second derivatives of the loss, and $\Omega$ is a tree-complexity regulariser. XGBoost engineering: histogram-based split finding, sparsity-aware splits, parallel column-block construction.
LightGBM (Microsoft 2017), gradient-based one-side sampling, exclusive feature bundling. Faster than XGBoost on most benchmarks.
CatBoost (Yandex 2018), ordered boosting (uses only earlier examples for split-finding to reduce target leakage), native categorical-feature handling.
Practical strengths:
- State-of-the-art on tabular data, almost universally outperforms neural networks on small-to-medium tabular problems.
- Handles mixed types, missing values, monotonic constraints.
- Feature importance and SHAP values for interpretability.
- Few hyperparameters to tune that matter much; sensible defaults work well.
Hyperparameters (XGBoost/LightGBM/CatBoost have ~30 each, but the important ones):
- Number of trees $M$ (typically 100–10000).
- Learning rate $\nu$ (typically 0.01–0.3).
- Max depth of each tree (typically 4–10).
- Min child weight / min samples per leaf.
- L1, L2 regularisation on leaf weights.
- Subsample and colsample_bytree, bagging-style randomisation per tree.
Gradient-boosted trees remain the algorithm of choice for most tabular ML problems. On Kaggle, they win the majority of structured-data competitions and are often the production choice over deep learning.
Video
Related terms: Boosting, AdaBoost, Decision Tree
Discussed in:
- Chapter 7: Supervised Learning, Supervised Learning