Gradient Boosting, Glossary, Textbook of AI

Gradient boosting (Friedman 2001) is a generalisation of AdaBoost to arbitrary differentiable loss functions. Builds an additive model

$$F_M(x) = \sum_{m=1}^M \gamma_m h_m(x)$$

by sequentially adding weak learners $h_m$, each fitted to the negative gradient of the loss with respect to the current ensemble's predictions.

Algorithm:

Initialise $F_0(x) = \arg\min_c \sum_n L(y_n, c)$ (a constant, the optimal constant prediction).
For $m = 1, \ldots, M$:

a. Compute pseudo-residuals (negative gradients of the loss):

$$r_{nm} = -\!\left[\frac{\partial L(y_n, F(x_n))}{\partial F(x_n)}\right]_{F = F_{m-1}}$$

b. Fit a weak learner $h_m$ (typically a small decision tree) to $\{(x_n, r_{nm})\}$ by minimising squared error.

c. Find the optimal step size:

$$\gamma_m = \arg\min_\gamma \sum_n L(y_n, F_{m-1}(x_n) + \gamma h_m(x_n))$$

d. Update: $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$, with shrinkage $\nu \in (0, 1]$.

For squared error, the pseudo-residuals are simply $r_{nm} = y_n - F_{m-1}(x_n)$. For cross-entropy, they are $y_n - p_n$ (the prediction error in probability space). In each case, gradient boosting trains the next weak learner to predict the current residuals.

Modern gradient-boosted-tree libraries dominate tabular-data competitions:

XGBoost (Chen & Guestrin 2016), adds second-order Taylor expansion of the loss for more accurate split decisions:

$$L_{\mathrm{XGBoost}}^{(m)} = \sum_n [g_n h_m(x_n) + \tfrac{1}{2} h_n h_m(x_n)^2] + \Omega(h_m)$$

where $g_n, h_n$ are first and second derivatives of the loss, and $\Omega$ is a tree-complexity regulariser. XGBoost engineering: histogram-based split finding, sparsity-aware splits, parallel column-block construction.

LightGBM (Microsoft 2017), gradient-based one-side sampling, exclusive feature bundling. Faster than XGBoost on most benchmarks.

CatBoost (Yandex 2018), ordered boosting (uses only earlier examples for split-finding to reduce target leakage), native categorical-feature handling.

Practical strengths:

State-of-the-art on tabular data, almost universally outperforms neural networks on small-to-medium tabular problems.
Handles mixed types, missing values, monotonic constraints.
Feature importance and SHAP values for interpretability.
Few hyperparameters to tune that matter much; sensible defaults work well.

Hyperparameters (XGBoost/LightGBM/CatBoost have ~30 each, but the important ones):

Number of trees $M$ (typically 100–10000).
Learning rate $\nu$ (typically 0.01–0.3).
Max depth of each tree (typically 4–10).
Min child weight / min samples per leaf.
L1, L2 regularisation on leaf weights.
Subsample and colsample_bytree, bagging-style randomisation per tree.

Gradient-boosted trees remain the algorithm of choice for most tabular ML problems. On Kaggle, they win the majority of structured-data competitions and are often the production choice over deep learning.

Video

Related terms: Boosting, AdaBoost, Decision Tree

Discussed in:

Chapter 7: Supervised Learning, Supervised Learning

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.