A random forest combines bagging (bootstrap aggregating) with random feature selection at each tree split.
Algorithm:
For $b = 1, \ldots, B$ (number of trees):
a. Draw a bootstrap sample $\mathcal{D}^{(b)}$ of size $N$ from the training data with replacement (about 63% of the original points appear; the rest are out-of-bag).
b. Grow a decision tree on $\mathcal{D}^{(b)}$ to full depth (no pruning), with the modification: at each internal node, randomly select $m$ features from the full $d$ features and consider only those for splitting. Typical choices: $m = \sqrt{d}$ for classification, $m = d/3$ for regression.
Predict by aggregating across trees:
- Classification: majority vote of trees' predictions.
- Regression: average of trees' predictions.
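A minimal sketch of this procedure for classification, assuming scikit-learn's `DecisionTreeClassifier` as the base learner; the function names and the `B`/`seed` parameters are illustrative, not from a particular library:

```python
# Random-forest sketch: bootstrap sampling + per-split feature subsampling.
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    trees = []
    for _ in range(B):
        # a. Bootstrap sample of size N, drawn with replacement.
        idx = rng.integers(0, N, size=N)
        # b. Unpruned tree; max_features="sqrt" considers m = sqrt(d)
        #    randomly chosen features at each internal node.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Majority vote across trees (use the mean instead for regression).
    votes = np.stack([t.predict(X) for t in trees])
    return stats.mode(votes, axis=0, keepdims=False).mode
```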
Why two randomisations matter:
- Bagging alone gives correlated trees (they all see roughly the same data), so the variance reduction is limited.
- Random feature selection decorrelates the trees by forcing them to consider different features at each split, which dramatically improves the variance reduction.
The combination gives random forests their famously strong generalisation: for $B$ identically distributed trees with pairwise correlation $\rho$ and per-tree variance $\sigma^2$, the ensemble variance is approximately $\rho \sigma^2 + (1 - \rho) \sigma^2 / B$. Lower $\rho$ means more benefit from adding trees, since the $\rho \sigma^2$ floor sits lower.
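A quick numeric check of this formula (the values of $\rho$ and $\sigma^2$ are chosen purely for illustration):

```python
# Ensemble variance as a function of tree correlation and ensemble size.
def ensemble_variance(rho, sigma2, B):
    return rho * sigma2 + (1 - rho) * sigma2 / B

for rho in (0.9, 0.5, 0.1):
    print(f"rho={rho}: B=10 -> {ensemble_variance(rho, 1.0, 10):.3f}, "
          f"B=500 -> {ensemble_variance(rho, 1.0, 500):.3f}")
# rho=0.9: B=10 -> 0.910, B=500 -> 0.900   (floor set by correlation)
# rho=0.1: B=10 -> 0.190, B=500 -> 0.102   (decorrelation pays off)
```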
Out-of-bag (OOB) estimation: each tree's bootstrap excludes ~37% of the data. To estimate generalisation error, predict each training point using only the trees whose bootstrap excluded it; aggregate as for normal prediction. The OOB error closely approximates cross-validation error and requires no separate held-out data.
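In scikit-learn the OOB estimate comes essentially for free via `oob_score=True`; the synthetic dataset below is just a stand-in:

```python
# OOB error estimate without any held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```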
Feature importance:
Mean decrease in impurity (MDI): sum the impurity reduction across all splits using each feature, weighted by sample count and averaged over trees. Fast but biased toward high-cardinality features.
Permutation importance: for each feature, permute its values and measure the drop in OOB accuracy. Slower, but free of the bias toward high-cardinality features.
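Both measures are available in scikit-learn, continuing from the OOB example above; note that its `permutation_importance` permutes whatever data you pass it (a held-out set is the usual choice) rather than the OOB samples described here:

```python
# MDI vs. permutation importance, side by side.
from sklearn.inspection import permutation_importance

mdi = rf.feature_importances_                 # mean decrease in impurity
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for i in range(5):                            # first few features only
    print(f"feature {i}: MDI={mdi[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```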
Splitting criteria:
- Classification: Gini impurity $\sum_k p_k(1 - p_k)$ or entropy $-\sum_k p_k \log p_k$ (information gain). The two often give identical splits; Gini is faster.
- Regression: variance reduction (equivalent to MSE).
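Both classification criteria take a few lines to compute directly; this sketch uses base-2 logs for the entropy, though any base yields the same split ranking:

```python
# Gini impurity and entropy of a label vector.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # p = (3/8, 5/8)
print(gini(y), entropy(y))                # ~0.469, ~0.954
```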
Hyperparameters (typically little tuning needed):
- n_estimators $B$: more trees → lower variance, diminishing returns past ~500.
- max_features $m$: $\sqrt{d}$ classification, $d/3$ regression are standard defaults.
- max_depth, min_samples_leaf, min_samples_split: control individual tree complexity (mostly fine to leave at defaults; random forests are robust to these settings).
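A typical instantiation using scikit-learn names; the values shown are the standard defaults discussed above, not tuned settings:

```python
# Standard random-forest configuration for classification.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # B: past a few hundred, returns diminish
    max_features="sqrt",   # m = sqrt(d), the classification default
    max_depth=None,        # grow trees to full depth (no pruning)
    min_samples_leaf=1,    # sklearn default; rarely worth tuning
    n_jobs=-1,             # trees are independent -> train in parallel
    random_state=0,
)
```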
Computational cost: $O(B \cdot N \log N \cdot \sqrt{d})$ training (taking $m = \sqrt{d}$), embarrassingly parallel across trees. Inference: $O(B \cdot \log N)$ per prediction (one root-to-leaf walk per tree).
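A rough way to see the parallelism in practice; timings depend entirely on hardware, so treat any numbers as indicative only:

```python
# Serial vs. parallel training of the same forest.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
for jobs in (1, -1):
    t0 = time.perf_counter()
    RandomForestClassifier(n_estimators=200, n_jobs=jobs,
                           random_state=0).fit(X, y)
    print(f"n_jobs={jobs}: {time.perf_counter() - t0:.1f}s")
```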
Strengths:
- Robust to outliers and irrelevant features.
- No feature scaling needed.
- Handles mixed feature types; many implementations also handle missing values gracefully.
- Provides feature importances essentially for free.
- Strong baseline that often beats more complex methods.
Random forests remain the standard choice for many tabular problems, particularly when interpretability and robustness matter more than raw accuracy. Gradient-boosted trees often edge them out on accuracy, but random forests are simpler, more parallelisable, and harder to misuse.
Related terms: Random Forest, Leo Breiman, Bagging, Decision Tree
Discussed in:
- Chapter 7: Supervised Learning