Glossary

AdaBoost

AdaBoost (Adaptive Boosting), introduced by Yoav Freund and Robert Schapire in 1995, was the first practical boosting algorithm and won them the 2003 Gödel Prize. It iteratively trains weak learners on reweighted versions of the data and combines them into a weighted majority vote, producing a strong classifier from learners that need only do slightly better than chance.

The algorithm

Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ with $y_i \in \{-1, +1\}$, initialise uniform weights $w_i^{(1)} = 1/N$. For round $t = 1, \ldots, T$:

  1. Train weak learner $h_t$ on data weighted by $w_i^{(t)}$.
  2. Compute its weighted error: $\varepsilon_t = \sum_i w_i^{(t)} \, \mathbb{1}[h_t(\mathbf{x}_i) \neq y_i]$.
  3. Compute its vote: $\alpha_t = \tfrac{1}{2} \log\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$.
  4. Update weights: $w_i^{(t+1)} \propto w_i^{(t)} \exp(-\alpha_t y_i h_t(\mathbf{x}_i))$ and renormalise.

The final classifier is $H(\mathbf{x}) = \mathrm{sign}\!\left( \sum_t \alpha_t h_t(\mathbf{x}) \right)$.

The reweighting causes subsequent weak learners to focus on examples earlier ones got wrong: misclassified points have their weights multiplied by $e^{\alpha_t} > 1$, correctly classified ones by $e^{-\alpha_t} < 1$. A weak learner with error $\varepsilon_t = 0.5$ (random guessing) gets vote $\alpha_t = 0$; an error rate below $0.5$ gives a positive vote.
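
The loop below is a minimal sketch of these four steps in Python, assuming depth-1 decision trees (stumps) as the weak learners and using scikit-learn only for the stump fitting; the function names, the round count `T=50`, and the clipping of $\varepsilon_t$ are illustrative choices, not part of the algorithm itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train T weighted decision stumps; y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # step 1: train on weighted data
        pred = stump.predict(X)
        eps = w[pred != y].sum()                  # step 2: weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)      # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)     # step 3: vote
        w *= np.exp(-alpha * y * pred)            # step 4: up-weight mistakes ...
        w /= w.sum()                              # ... and renormalise
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final classifier: sign of the alpha-weighted sum of weak predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```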

Theoretical guarantee

Schapire and Freund showed that AdaBoost drives the training error to zero exponentially fast: if every weak learner achieves edge $\gamma_t = 0.5 - \varepsilon_t > 0$, the training error after $T$ rounds is at most $\exp\!\left(-2 \sum_t \gamma_t^2\right)$. More surprisingly, test error often continues to fall even after training error reaches zero, a phenomenon that motivated the development of margin-based generalisation bounds: AdaBoost tends to keep increasing the margins of the training examples even after they are all correctly classified.
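
As a rough illustration of how fast the bound shrinks (the edge and round count below are arbitrary), a constant edge of $\gamma = 0.1$ over $T = 200$ rounds gives a bound of $\exp(-2 \cdot 200 \cdot 0.01) = e^{-4} \approx 0.018$:

```python
import math

gamma, T = 0.1, 200                       # illustrative values, not from any dataset
bound = math.exp(-2 * T * gamma ** 2)     # training-error bound exp(-2 * sum_t gamma_t^2)
print(f"bound after {T} rounds with edge {gamma}: {bound:.4f}")   # ~0.0183
```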

AdaBoost can also be viewed as forward stage-wise additive modelling with the exponential loss $L(y, f(\mathbf{x})) = e^{-y f(\mathbf{x})}$, a unification due to Friedman, Hastie and Tibshirani (2000) that opened the door to gradient boosting.
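
A brief sketch of why the stage-wise view reproduces the vote in step 3: with $f_{t-1}$ the model built so far and $w_i^{(t)} \propto e^{-y_i f_{t-1}(\mathbf{x}_i)}$, choosing $\alpha$ to minimise the exponential loss of $f_{t-1} + \alpha h_t$ gives

$$
\sum_i w_i^{(t)} e^{-\alpha y_i h_t(\mathbf{x}_i)} = (1 - \varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha},
\qquad
\frac{\partial}{\partial \alpha} = 0 \;\Longrightarrow\; \alpha = \tfrac{1}{2} \log\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right),
$$

which is exactly the AdaBoost vote; the weight update in step 4 is the corresponding update of $e^{-y_i f_t(\mathbf{x}_i)}$.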

Impact

AdaBoost was widely deployed in industrial machine vision and information retrieval. The Viola–Jones face detector (2001), an AdaBoost cascade over Haar-like features, made real-time face detection feasible on cheap hardware and shipped in many consumer digital cameras of the 2000s.

Gradient boosting (Friedman, 2001) generalises the algorithm to arbitrary differentiable loss functions, replacing the multiplicative reweighting with gradient-based residual fitting. Modern gradient-boosted-tree libraries, XGBoost (2014), LightGBM (2017) and CatBoost (2018), are direct descendants and currently dominate Kaggle competitions and industrial tabular-data pipelines.

Related terms: Boosting, Gradient Boosting, Random Forest, Bagging
