Glossary

Random Forest

A random forest, introduced by Leo Breiman in 2001, is an ensemble of decision trees grown with two randomisations: (1) bagging, in which each tree is trained on a bootstrap resample of the data; and (2) random feature selection, in which only a random subset of features is considered at each split. The forest's prediction is the average of the individual trees' predictions for regression, or the majority vote for classification.
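The two randomisations and the majority vote can be sketched in a few dozen lines of plain Python. This is a minimal illustration for classification only, not Breiman's reference implementation: the function names (`fit_forest`, `grow_tree`, `best_split`, `predict_forest`), the Gini criterion, the depth limit and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, feature_ids):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(X, y, n_sub, rng, depth=0, max_depth=5):
    """Grow one tree; leaves are labels, internal nodes are 4-tuples.
    Labels must therefore not be tuples themselves."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    # Randomisation 2: only a random subset of features is tried at this split.
    feature_ids = rng.sample(range(len(X[0])), n_sub)
    split = best_split(X, y, feature_ids)
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (
        f, t,
        grow_tree([X[i] for i in left], [y[i] for i in left], n_sub, rng, depth + 1, max_depth),
        grow_tree([X[i] for i in right], [y[i] for i in right], n_sub, rng, depth + 1, max_depth),
    )

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=25, n_sub=1, seed=0):
    """Train n_trees trees, each on its own bootstrap resample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Randomisation 1: bagging (sample len(X) rows with replacement).
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_sub, rng))
    return forest

def predict_forest(forest, row):
    """Classification: majority vote over the trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in two features.
X_toy = [[0, 1], [1, 3], [2, 0], [3, 2], [4, 4],
         [10, 12], [11, 10], [12, 14], [13, 11], [14, 13]]
y_toy = [0] * 5 + [1] * 5
toy_forest = fit_forest(X_toy, y_toy)
print(predict_forest(toy_forest, [-1, -1]), predict_forest(toy_forest, [100, 100]))
```

For regression, the same structure would apply with a variance-based split criterion and `predict_forest` returning the mean of the trees' predictions instead of the most common label.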

The randomisation decorrelates the trees, which would otherwise be highly correlated when trained on the same data. Averaging many weakly correlated trees reduces the variance of the ensemble, which makes random forests much less prone to overfitting than individual decision trees, while preserving their ability to capture non-linear feature interactions and to handle mixed data types with little preprocessing.
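The role of decorrelation can be made concrete with a standard identity for the variance of an average. It is stated here under a simplifying assumption that is not part of the entry itself: the $B$ tree predictions $T_b(x)$ are identically distributed, each with variance $\sigma^2$, and every pair has correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \frac{1}{B^{2}}\left[B\,\sigma^{2} + B(B-1)\,\rho\,\sigma^{2}\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks only the second term, so the variance of the ensemble cannot fall below $\rho\,\sigma^{2}$. Lowering $\rho$ through bootstrap resampling and random feature selection is what lowers that floor.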

Random forests provide several practically valuable byproducts: an out-of-bag estimate of generalisation error (each example is predicted using only the trees whose bootstrap sample did not include it), feature importance scores (computed by permuting a feature's values and measuring the resulting drop in accuracy), and proximity scores that can be used for clustering and outlier detection. These properties, together with minimal hyperparameter tuning and robust performance, made random forests the default machine-learning algorithm for many tabular problems.
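As a hedged illustration of the first two byproducts, the sketch below uses scikit-learn, which is an assumption of this example and is not named in the entry. The synthetic dataset and all parameter values are arbitrary. Two differences from the description above are worth noting: `oob_score_` reports out-of-bag accuracy rather than error, and `permutation_importance` permutes features on whatever dataset it is given (a held-out split here) rather than on each tree's out-of-bag examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic tabular data: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True asks the forest to score each training example using
# only the trees whose bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print("Out-of-bag accuracy:", forest.oob_score_)
print("Out-of-bag error:   ", 1.0 - forest.oob_score_)
print("Held-out accuracy:  ", forest.score(X_test, y_test))

# Permutation importance: shuffle one feature at a time and record
# the mean drop in accuracy over n_repeats shuffles.
result = permutation_importance(
    forest, X_test, y_test, n_repeats=10, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean accuracy drop {result.importances_mean[i]:.3f}")
```

The out-of-bag figure comes for free from training, so it can stand in for a separate validation split when data are scarce; comparing it with the held-out accuracy is a quick sanity check on both.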

Despite the rise of deep learning, random forests remain the algorithm of choice for many tabular tasks, particularly those with small to medium-sized datasets, missing values, or mixed feature types where deep learning's advantages are minimal.

Interactive

An ensemble of decision trees votes. Each tree is trained on a different bootstrap sample and considers a different random feature subset at each split; the trees then vote on the prediction.

Related terms: Leo Breiman, Bagging, Decision Tree, Ensemble Methods
