A random forest combines bagging (bootstrap aggregating) with random feature selection at each tree split.
Algorithm:
For $b = 1, \ldots, B$ (number of trees):
a. Draw a bootstrap sample $\mathcal{D}^{(b)}$ of size $N$ from the training data with replacement (about 63% of the original points appear; the rest are out-of-bag).
b. Grow a decision tree on $\mathcal{D}^{(b)}$ to full depth (no pruning), with the modification: at each internal node, randomly select $m$ features from the full $d$ features and consider only those for splitting. Typical choices: $m = \sqrt{d}$ for classification, $m = d/3$ for regression.
Predict by aggregating across trees:
- Classification: majority vote of trees' predictions.
- Regression: average of trees' predictions.
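A minimal sketch of this procedure for classification, assuming scikit-learn's `DecisionTreeClassifier` as the base learner; the function names and the `B`/`seed` parameters are illustrative, not from a particular library:

```python
# Random-forest sketch: bootstrap sampling + per-split feature subsampling.
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    trees = []
    for _ in range(B):
        # a. Bootstrap sample of size N, drawn with replacement.
        idx = rng.integers(0, N, size=N)
        # b. Unpruned tree; max_features="sqrt" considers m = sqrt(d)
        #    randomly chosen features at each internal node.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Majority vote across trees (use the mean instead for regression).
    votes = np.stack([t.predict(X) for t in trees])
    return stats.mode(votes, axis=0, keepdims=False).mode
```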
Why two randomisations matter:
- Bagging alone gives correlated trees (they all see roughly the same data), so the variance reduction is limited.
- Random feature selection decorrelates the trees by forcing them to consider different features at each split, which dramatically improves the variance reduction.
The combination gives random forests their famously strong generalisation: for $B$ identically distributed trees with pairwise correlation $\rho$ and per-tree variance $\sigma^2$, the ensemble variance is approximately $\rho \sigma^2 + (1 - \rho) \sigma^2 / B$. Lower $\rho$ means more benefit from adding trees, since the $\rho \sigma^2$ floor sits lower.
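A quick numeric check of this formula (the values of $\rho$ and $\sigma^2$ are chosen purely for illustration):

```python
# Ensemble variance as a function of tree correlation and ensemble size.
def ensemble_variance(rho, sigma2, B):
    return rho * sigma2 + (1 - rho) * sigma2 / B

for rho in (0.9, 0.5, 0.1):
    print(f"rho={rho}: B=10 -> {ensemble_variance(rho, 1.0, 10):.3f}, "
          f"B=500 -> {ensemble_variance(rho, 1.0, 500):.3f}")
# rho=0.9: B=10 -> 0.910, B=500 -> 0.900   (floor set by correlation)
# rho=0.1: B=10 -> 0.190, B=500 -> 0.102   (decorrelation pays off)
```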
Out-of-bag (OOB) estimation: each tree's bootstrap excludes ~37% of the data. To estimate generalisation error, predict each training point using only the trees whose bootstrap excluded it; aggregate as for normal prediction. The OOB error closely approximates cross-validation error and requires no separate held-out data.
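In scikit-learn the OOB estimate comes essentially for free via `oob_score=True`; the synthetic dataset below is just a stand-in:

```python
# OOB error estimate without any held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```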
Feature importance:
Mean decrease in impurity (MDI): sum the impurity reduction across all splits using each feature, weighted by sample count and averaged over trees. Fast but biased toward high-cardinality features.
Permutation importance: for each feature, permute its values and measure the drop in OOB accuracy. Slower, but free of the bias toward high-cardinality features.
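Both measures are available in scikit-learn, continuing from the OOB example above; note that its `permutation_importance` permutes whatever data you pass it (a held-out set is the usual choice) rather than the OOB samples described here:

```python
# MDI vs. permutation importance, side by side.
from sklearn.inspection import permutation_importance

mdi = rf.feature_importances_                 # mean decrease in impurity
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for i in range(5):                            # first few features only
    print(f"feature {i}: MDI={mdi[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```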
Splitting criteria:
- Classification: Gini impurity $\sum_k p_k(1 - p_k)$ or entropy $-\sum_k p_k \log p_k$ (information gain). The two often give identical splits; Gini is faster.
- Regression: variance reduction (equivalent to MSE).
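Both classification criteria take a few lines to compute directly; this sketch uses base-2 logs for the entropy, though any base yields the same split ranking:

```python
# Gini impurity and entropy of a label vector.
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # p = (3/8, 5/8)
print(gini(y), entropy(y))                # ~0.469, ~0.954
```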
Hyperparameters (typically little tuning needed):
- n_estimators $B$: more trees → lower variance, diminishing returns past ~500.
- max_features $m$: $\sqrt{d}$ classification, $d/3$ regression are standard defaults.
- max_depth, min_samples_leaf, min_samples_split: control individual tree complexity (mostly fine to leave at defaults; random forests are robust to these settings).
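A typical instantiation using scikit-learn names; the values shown are the standard defaults discussed above, not tuned settings:

```python
# Standard random-forest configuration for classification.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # B: past a few hundred, returns diminish
    max_features="sqrt",   # m = sqrt(d), the classification default
    max_depth=None,        # grow trees to full depth (no pruning)
    min_samples_leaf=1,    # sklearn default; rarely worth tuning
    n_jobs=-1,             # trees are independent -> train in parallel
    random_state=0,
)
```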
Computational cost: $O(B \cdot N \log N \cdot \sqrt{d})$ training (taking $m = \sqrt{d}$), embarrassingly parallel across trees. Inference: $O(B \cdot \log N)$ per prediction (one root-to-leaf walk per tree).
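A rough way to see the parallelism in practice; timings depend entirely on hardware, so treat any numbers as indicative only:

```python
# Serial vs. parallel training of the same forest.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
for jobs in (1, -1):
    t0 = time.perf_counter()
    RandomForestClassifier(n_estimators=200, n_jobs=jobs,
                           random_state=0).fit(X, y)
    print(f"n_jobs={jobs}: {time.perf_counter() - t0:.1f}s")
```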
Strengths:
- Robust to outliers and irrelevant features.
- No feature scaling needed.
- Handles mixed feature types; many implementations also handle missing values gracefully.
- Provides feature importances essentially for free.
- Strong baseline that often beats more complex methods.
Random forests remain the standard choice for many tabular problems, particularly when interpretability and robustness matter more than raw accuracy. Gradient-boosted trees often edge them out on accuracy, but random forests are simpler, more parallelisable, and harder to misuse.
Related terms: Random Forest, Leo Breiman, Bagging, Decision Tree
Discussed in:
- Chapter 7: Supervised Learning