Glossary

Random Forest (mathematical detail)

A random forest combines bagging (bootstrap aggregating) with random feature selection at each tree split.

Algorithm:

  1. For $b = 1, \ldots, B$ (number of trees):

    a. Draw a bootstrap sample $\mathcal{D}^{(b)}$ of size $N$ from the training data with replacement (about 63% of the original points appear; the rest are out-of-bag).

    b. Grow a decision tree on $\mathcal{D}^{(b)}$ to full depth (no pruning), with one modification: at each internal node, randomly select $m$ of the $d$ features and consider only those for splitting. Typical choices: $m = \sqrt{d}$ for classification, $m = d/3$ for regression.

  2. Predict by aggregating across trees:

    • Classification: majority vote of trees' predictions.
    • Regression: average of trees' predictions.
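The algorithm above can be sketched with scikit-learn (assumed available); the per-node feature subsampling from step 1b is controlled by `max_features`, and the dataset here is synthetic, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data: N = 500 points, d = 16 features.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# B = 100 unpruned trees; each split considers m = sqrt(d) random features.
forest = RandomForestClassifier(
    n_estimators=100,      # B
    max_features="sqrt",   # m = sqrt(d) at each split
    bootstrap=True,        # bagging: draw N points with replacement per tree
    random_state=0,
)
forest.fit(X, y)

# Prediction is a majority vote over the 100 trees.
print(forest.predict(X[:3]))
```

Because the trees are grown to full depth, training accuracy is typically near 100%; generalisation is better judged by OOB or cross-validation error.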

Why two randomisations matter:

  • Bagging alone gives correlated trees (they all see roughly the same data), so variance reduction is limited.
  • Random feature selection decorrelates the trees by forcing them to consider different features at each split, which dramatically improves variance reduction.

The combination gives random forests' famously strong generalisation: ensemble variance is approximately $\rho \sigma^2 + (1 - \rho) \sigma^2 / B$ where $\rho$ is the correlation between trees and $\sigma^2$ is per-tree variance. Lower $\rho$ means more benefit from adding trees.
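Plugging illustrative (assumed) numbers into the variance formula shows why decorrelation matters more than adding trees once $B$ is large:

```python
# Ensemble variance rho*sigma^2 + (1 - rho)*sigma^2 / B.
def ensemble_variance(rho, sigma2, B):
    return rho * sigma2 + (1 - rho) * sigma2 / B

sigma2 = 1.0  # per-tree variance (illustrative value)
for rho in (0.9, 0.5, 0.1):
    v = ensemble_variance(rho, sigma2, B=500)
    print(f"rho={rho}: ensemble variance = {v:.3f}")
```

At $B = 500$ the second term is nearly gone; the floor is set almost entirely by $\rho \sigma^2$, so halving $\rho$ roughly halves the achievable variance.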

Out-of-bag (OOB) estimation: each tree's bootstrap excludes ~37% of the data. To estimate generalisation error, predict each training point using only the trees whose bootstrap excluded it; aggregate as for normal prediction. The OOB error closely approximates cross-validation error and requires no separate held-out data.
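Both claims can be checked empirically, assuming numpy and scikit-learn: a bootstrap sample contains about $1 - 1/e \approx 63\%$ distinct points, and scikit-learn exposes the OOB estimate via `oob_score=True` (the data here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fraction of distinct points in one bootstrap sample of size N.
rng = np.random.default_rng(0)
N = 10_000
sample = rng.integers(0, N, size=N)          # draw N indices with replacement
unique_frac = len(np.unique(sample)) / N
print(round(unique_frac, 3))                  # close to 1 - 1/e = 0.632

# OOB accuracy from the forest itself, no held-out set required.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)
print(round(forest.oob_score_, 3))
```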

Feature importance:

Mean decrease in impurity (MDI): sum the impurity reduction across all splits using each feature, weighted by sample count and averaged over trees. Fast but biased toward high-cardinality features.

Permutation importance: for each feature, permute its values and measure the drop in OOB accuracy. Slower but less biased, though it can still be misleading when features are strongly correlated.
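Both measures are available in scikit-learn (assumed here): MDI is stored on the fitted model, while permutation importance is computed after the fact. In this synthetic setup the first three columns are the informative ones, so both rankings should favour them:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# 3 informative features among 10; shuffle=False keeps them in columns 0-2.
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# MDI: summed impurity decreases per feature, normalised to sum to 1.
mdi = forest.feature_importances_

# Permutation importance: shuffle each column, measure the accuracy drop.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

print(np.argsort(mdi)[::-1][:3])                    # top-3 features by MDI
print(np.argsort(perm.importances_mean)[::-1][:3])  # top-3 by permutation
```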

Splitting criteria:

  • Classification: Gini impurity $\sum_k p_k(1 - p_k)$ or entropy $-\sum_k p_k \log p_k$ (information gain). The two usually choose the same or very similar splits; Gini is slightly faster to compute.
  • Regression: variance reduction (equivalent to MSE).
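The two classification criteria above can be written directly from their formulas (natural log used for entropy; a sketch, not a library implementation):

```python
import numpy as np

def gini(p):
    """Gini impurity sum_k p_k (1 - p_k) for class probabilities p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy -sum_k p_k log p_k, skipping empty classes (0 log 0 := 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A pure node has zero impurity under both criteria.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))
# A 50/50 two-class node is maximally impure: Gini 0.5, entropy ln 2.
print(round(gini([0.5, 0.5]), 3), round(entropy([0.5, 0.5]), 3))
```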

Hyperparameters (typically little tuning needed):

  • n_estimators $B$: more trees → lower variance, diminishing returns past ~500.
  • max_features $m$: $\sqrt{d}$ classification, $d/3$ regression are standard defaults.
  • max_depth, min_samples_leaf, min_samples_split: control individual tree complexity (usually fine at their defaults; random forests are robust to these settings).
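The "little tuning needed" claim can be illustrated by comparing scikit-learn's defaults against lightly adjusted settings (the values and data here are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "default": RandomForestClassifier(random_state=0),
    "tuned": RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    min_samples_leaf=2, random_state=0),
}

# 5-fold cross-validated accuracy for each configuration.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The two scores typically land within a few percentage points of each other, which is the practical sense in which the defaults are hard to beat by much.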

Computational cost: $O(B \cdot N \log N \cdot \sqrt{d})$ training, embarrassingly parallel across trees. Inference: $O(B \cdot \log N)$ per prediction.

Strengths:

  • Robust to outliers and irrelevant features.
  • No feature scaling needed.
  • Handles mixed feature types and missing values gracefully.
  • Provides feature importances essentially for free.
  • Strong baseline that often beats more complex methods.

Random forests remain the standard choice for many tabular problems, particularly when interpretability and robustness matter more than raw accuracy. Gradient-boosted trees often edge them out on accuracy; random forests are simpler, more parallelisable, and harder to misuse.

Related terms: Leo Breiman, Bagging, Decision Tree
