Bagging (Bootstrap AGGregatING), introduced by Leo Breiman in 1996, trains many models on bootstrap resamples of the training data and aggregates their predictions: an average for regression, a majority vote for classification. It is the prototypical variance-reduction ensemble technique and the foundation of random forests.
Algorithm
Given training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$:
- For $b = 1, \ldots, B$:
  - Draw a bootstrap sample $\mathcal{D}^{(b)}$ of size $N$ from $\mathcal{D}$ with replacement.
  - Train model $f_b$ on $\mathcal{D}^{(b)}$.
- Aggregate: $\hat{f}(\mathbf{x}) = \frac{1}{B} \sum_b f_b(\mathbf{x})$ for regression, or $\mathrm{mode}\{f_b(\mathbf{x})\}$ for classification (a code sketch of the procedure follows below).
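As an illustration, here is a minimal from-scratch sketch of this procedure for regression, using NumPy and a scikit-learn decision tree as the base learner; the helper names (`fit_bagged_ensemble`, `predict_bagged`) and the synthetic data are purely for demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_ensemble(X, y, B=100, random_state=0):
    """Train B trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(random_state)
    N = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)   # N index draws with replacement
        tree = DecisionTreeRegressor()     # high-variance base learner
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models

def predict_bagged(models, X):
    """Aggregate by averaging the individual predictions (regression)."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
ensemble = fit_bagged_ensemble(X, y, B=100)
print(predict_bagged(ensemble, X[:5]))
```

For classification, the aggregation step would instead take the mode (majority vote) of the $B$ predicted labels.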
In expectation, each bootstrap sample contains a fraction $1 - (1 - 1/N)^N$ of the distinct training points, which tends to $1 - 1/e \approx 63.2\%$ as $N$ grows; the remaining $\approx 36.8\%$ are out-of-bag for that model.
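As a quick sanity check of this figure, the following snippet (illustrative only) compares the closed form with a simulated bootstrap:

```python
import numpy as np

N = 1000
rng = np.random.default_rng(0)

closed_form = 1 - (1 - 1 / N) ** N   # approaches 1 - 1/e ~= 0.632
simulated = np.mean([
    len(np.unique(rng.integers(0, N, size=N))) / N   # fraction of unique indices
    for _ in range(500)
])
print(f"closed form: {closed_form:.4f}, simulated: {simulated:.4f}")
```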
Why it works
Bagging reduces variance without changing bias. For an ensemble of $B$ models with individual variance $\sigma^2$ and pairwise correlation $\rho$:
$$\mathrm{Var}[\hat{f}(\mathbf{x})] = \rho \, \sigma^2 + \frac{1 - \rho}{B} \sigma^2$$
As $B \to \infty$, only the $\rho \sigma^2$ term survives. The variance reduction is therefore limited by how correlated the base learners are: identical models give no benefit, while independent ones give the full $\sigma^2 / B$ averaging. Bootstrap resampling decorrelates the models, but only modestly; random forests push the correlation lower by adding a second source of randomness, random feature subsets at each split.
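A short simulation makes the formula concrete. This is a sketch with arbitrary choices ($\rho = 0.3$, $\sigma = 1$): it constructs equicorrelated Gaussian "predictions" from a shared component plus independent noise and shows that the variance of their average approaches the $\rho \sigma^2$ floor as $B$ grows.

```python
import numpy as np

def ensemble_variance(B, rho, sigma=1.0, n_trials=100_000, seed=0):
    """Empirical variance of the mean of B equicorrelated Gaussian predictions."""
    rng = np.random.default_rng(seed)
    shared = np.sqrt(rho) * sigma * rng.normal(size=(n_trials, 1))     # common part -> pairwise correlation rho
    indiv = np.sqrt(1 - rho) * sigma * rng.normal(size=(n_trials, B))  # independent part
    preds = shared + indiv                                             # each column has variance sigma^2
    return preds.mean(axis=1).var()

rho, sigma = 0.3, 1.0
for B in (1, 10, 100):
    theory = rho * sigma**2 + (1 - rho) / B * sigma**2
    print(f"B={B:4d}  simulated={ensemble_variance(B, rho, sigma):.4f}  theory={theory:.4f}")
```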
The bias-variance decomposition is most informative for squared-error loss; for the $0$-$1$ classification loss the analysis is subtler, but the qualitative conclusion holds.
Out-of-bag estimation
Because each base model has not seen $\approx 37\%$ of the data, we can predict each training point using only the models that did not see it. The resulting out-of-bag (OOB) error is a nearly unbiased estimate of generalisation error, obtained for free without a separate validation set or cross-validation.
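In scikit-learn, for example, `BaggingClassifier` exposes this directly through `oob_score=True`; here is a minimal sketch on a built-in toy dataset (the hyperparameters are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bag 200 decision trees (the default base estimator); each training point is
# then scored using only the trees whose bootstrap sample did not contain it.
clf = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)

print(f"OOB accuracy estimate: {clf.oob_score_:.3f}")  # 1 - OOB error
```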
When bagging helps
Bagging works best on high-variance, low-bias base learners; deep, unpruned decision trees are the classic example. The variance of a fully grown tree is large (small data perturbations change the tree topology dramatically), so averaging many of them yields large gains. Random forests are the most influential application.
Stable, low-variance learners (linear regression, naive Bayes, $k$-nearest neighbours with large $k$) gain little from bagging, because variance was small to begin with.
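One way to see the contrast is to compare how much bagging changes cross-validated performance for a deep tree versus a linear model on the same data. This is a sketch, not a benchmark: the synthetic data and hyperparameters are arbitrary, and the `estimator` keyword assumes a recent scikit-learn release (older versions call it `base_estimator`).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

for name, base in [("deep tree", DecisionTreeRegressor(random_state=0)),
                   ("linear regression", LinearRegression())]:
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingRegressor(estimator=base, n_estimators=100, random_state=0),
        X, y, cv=5,
    ).mean()
    print(f"{name}: single R^2 = {single:.3f}, bagged R^2 = {bagged:.3f}")
```

The high-variance tree typically gains noticeably from bagging, while the linear model changes little.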
Relationship to boosting
Bagging and boosting are the two great families of ensemble methods. Bagging trains models in parallel on resampled data to reduce variance; boosting trains models sequentially on reweighted data to reduce bias. Stacking learns a meta-model on top of base predictions and is complementary to both.
Related terms: Random Forest, Boosting, AdaBoost, Decision Tree, Bias-Variance Tradeoff
Discussed in:
- Chapter 6: ML Fundamentals, Ensemble Methods