Gradient Boosting generalises the boosting framework by casting ensemble construction as gradient descent in function space. At each iteration, a new weak learner—typically a shallow decision tree—is fitted to the negative gradient of the loss function evaluated at the current ensemble's predictions. These negative gradients are the pseudo-residuals. For squared-error loss, they are simply the residuals themselves, recovering a natural "fit the errors" interpretation. For other losses (log-loss, Huber, quantile), the pseudo-residuals change accordingly, so each new tree is automatically aimed at the direction of steepest descent for that loss.
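The core loop can be sketched in a few dozen lines. The stump learner, function names, and toy data below are illustrative, not taken from any particular library; squared-error loss is used, so each round's pseudo-residuals are plain residuals:

```python
# Minimal gradient-boosting sketch for squared-error loss (illustrative code,
# not any library's API). Weak learner: a depth-1 regression stump.

def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def fit_gbm(x, y, n_trees=200, lr=0.1):
    """Each round fits a stump to the negative gradient (here, the residuals)."""
    base = sum(y) / len(y)                # F_0: best constant prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # -dL/dF for squared error
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]  # shrunken step
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

# Toy 1-D regression data (illustrative)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
model = fit_gbm(x, y)
```

Because the stumps over all thresholds span every piecewise-constant function on the training points, the shrunken greedy loop drives the training residuals toward zero; swapping the residual computation for another loss's negative gradient changes nothing else in the loop.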
The modern implementations—XGBoost, LightGBM, and CatBoost—incorporate second-order gradient information, histogram-based split finding, and sophisticated regularisation (L1/L2 penalties on leaf weights, depth limits, subsampling of both rows and columns). They are the dominant choice for tabular data, routinely winning Kaggle competitions and delivering state-of-the-art results across structured data problems in finance, insurance, advertising, and the sciences.
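As a sketch of the second-order machinery, XGBoost-style boosters set each leaf's weight by a Newton step on the regularised objective: for a leaf collecting gradients g_i and Hessians h_i with sums G and H and L2 penalty lam, the per-leaf objective G*w + 0.5*(H + lam)*w^2 is minimised by w* = -G / (H + lam). The helper below is illustrative, assuming squared-error loss (g_i = pred - y, h_i = 1):

```python
# Second-order (Newton-style) leaf weight with an L2 penalty on leaf weights,
# as in XGBoost-type boosters. Illustrative helper, not a library API.

def leaf_weight(grads, hess, lam=1.0):
    """Minimise G*w + 0.5*(H + lam)*w**2 over the leaf weight w."""
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

# A leaf holding three points with residuals (y - pred) of 2.0, 1.0, 3.0:
grads = [-2.0, -1.0, -3.0]   # g_i = pred - y for squared error
hess = [1.0, 1.0, 1.0]       # h_i = 1 for squared error
w_reg = leaf_weight(grads, hess, lam=1.0)    # shrunk toward zero by lam
w_unreg = leaf_weight(grads, hess, lam=0.0)  # plain mean residual, 2.0
```

With lam = 0 the weight reduces to the leaf's mean residual; a positive lam shrinks every leaf weight toward zero, which is exactly the L2 regularisation mentioned above.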
Gradient boosting navigates the bias-variance tradeoff effectively: shallow trees keep bias moderate, while sequential fitting and regularisation control variance. Learning rate (shrinkage) plus early stopping provide additional implicit regularisation. Unlike random forests, gradient boosting requires careful hyperparameter tuning—depth, learning rate, number of trees, regularisation strength—but when well-tuned, it is hard to beat on tabular data.
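The shrinkage-plus-early-stopping recipe can be illustrated with a deliberately trivial weak learner (the mean training residual), which keeps the sketch self-contained; the data, patience value, and function names are illustrative:

```python
# Early-stopping sketch (illustrative): boost a trivial weak learner and stop
# once the validation loss has not improved for `patience` rounds. In practice
# one would keep only the first `best_round` trees of the ensemble.

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def boost_with_early_stopping(y_train, y_val, lr=0.3, max_rounds=200, patience=5):
    f_train = [0.0] * len(y_train)
    f_val = [0.0] * len(y_val)
    best_loss, best_round = float("inf"), 0
    for r in range(1, max_rounds + 1):
        # weak learner: the mean residual on the training set
        step = sum(t - f for t, f in zip(y_train, f_train)) / len(y_train)
        f_train = [f + lr * step for f in f_train]   # shrunken update
        f_val = [f + lr * step for f in f_val]
        loss = mse(f_val, y_val)
        if loss < best_loss - 1e-12:
            best_loss, best_round = loss, r
        elif r - best_round >= patience:
            break                                    # validation loss plateaued
    return best_round, best_loss

best_round, best_val_loss = boost_with_early_stopping(
    [1.0, 2.0, 3.0, 4.0],   # toy training targets
    [1.5, 2.5, 3.5],        # toy validation targets
)
```

Here the validation loss falls geometrically toward its floor (the validation variance around the training mean), so the loop halts long before max_rounds; with a richer weak learner the validation loss would eventually turn upward, and early stopping would catch the overfitting point instead of a plateau.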
Related terms: Ensemble Methods, Decision Tree
Discussed in:
- Chapter 7: Supervised Learning — Ensemble Methods
Also defined in: Textbook of AI, Textbook of Medical AI