A random forest, introduced by Leo Breiman in 2001, is an ensemble of decision trees grown with two sources of randomisation: (1) bagging, in which each tree is trained on a bootstrap resample of the data; and (2) random feature selection, in which only a random subset of features is considered at each split. The forest's prediction is the average of the individual trees' predictions for regression, or the majority vote for classification.
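The two randomisations can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`fit_forest`, `grow_tree`, and so on), the Gini split criterion, the depth limit, and the square-root rule for the feature-subset size are all assumptions made for the example, and class labels are assumed not to be tuples.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, n_features, rng):
    """Search a random subset of features for the split with the lowest weighted Gini."""
    best = None  # (score, feature, threshold)
    for f in rng.sample(range(len(X[0])), n_features):  # random feature selection
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(X, y, n_features, rng, depth=0, max_depth=5):
    """Grow a tree recursively: leaves are class labels, internal nodes are tuples."""
    leaf = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or depth == max_depth:
        return leaf
    split = best_split(X, y, n_features, rng)
    if split is None:  # the sampled features were all constant here
        return leaf
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    if not left or not right:
        return leaf
    children = [
        grow_tree([X[i] for i in side], [y[i] for i in side],
                  n_features, rng, depth + 1, max_depth)
        for side in (left, right)
    ]
    return (f, t, children[0], children[1])

def predict_tree(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=25, seed=0):
    """Train each tree on its own bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_features = max(1, int(p ** 0.5))  # assumed subset size for this sketch
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_features, rng))
    return forest

def predict_forest(forest, row):
    """Classification: majority vote over the trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy data: two well-separated clusters, 20 points per class.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1) + 5 * label, data_rng.uniform(0, 1) + 5 * label]
     for label in (0, 1) for _ in range(20)]
y = [label for label in (0, 1) for _ in range(20)]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [5.5, 5.5]))
```

For regression, the same structure would apply with a variance-based split criterion and `predict_forest` returning the mean of the trees' outputs instead of a vote.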
The randomisation decorrelates the trees, whose predictions would otherwise be highly correlated when trained on the same data. Averaging many weakly correlated trees reduces variance, which makes random forests far less prone to overfitting than individual decision trees, while preserving their ability to capture non-linear feature interactions and to handle mixed data types with little preprocessing.
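The link between correlation and variance can be made explicit with a standard identity. The symbols are introduced here for illustration and do not come from the text above: assume $B$ trees $T_1, \dots, T_B$ whose predictions at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Under these assumptions, adding trees shrinks only the second term, so the variance of the average cannot fall below $\rho\,\sigma^2$. Lowering $\rho$ through bootstrap resampling and random feature selection is what lowers that floor.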
Random forests provide several practically valuable byproducts: an out-of-bag estimate of generalisation error (each example is scored only by the trees whose bootstrap sample did not contain it), feature importance scores (computed, for example, by permuting a feature's values and measuring the resulting drop in accuracy), and proximity scores that can be used for clustering and outlier detection. These properties, together with minimal hyperparameter tuning and robust performance, made random forests a default machine-learning algorithm for many tabular problems.
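The out-of-bag estimate and permutation importance can be sketched as follows. To keep the example short, bagged one-split trees (stumps) stand in for a full forest, and importance is measured by shuffling a column across the whole dataset and re-scoring the out-of-bag predictions, a simplification of the per-tree procedure. All function names and the toy data are assumptions made for this illustration.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y):
    """One-split tree: pick the (feature, threshold) with the fewest training errors."""
    best = (len(y) + 1, 0, float("inf"), majority(y), majority(y))  # constant fallback
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_bagged(X, y, n_trees=30, seed=0):
    """Return (stump, in_bag_indices) pairs so out-of-bag rows can be identified later."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        models.append((fit_stump([X[i] for i in idx], [y[i] for i in idx]), set(idx)))
    return models

def oob_accuracy(models, X, y):
    """Score each example using only the trees whose bootstrap sample excluded it."""
    correct = scored = 0
    for i, (row, label) in enumerate(zip(X, y)):
        votes = [predict_stump(s, row) for s, in_bag in models if i not in in_bag]
        if votes:  # skip examples that appeared in every bootstrap sample
            scored += 1
            correct += majority(votes) == label
    return correct / scored if scored else float("nan")

def permutation_importance(models, X, y, feature, seed=0):
    """Drop in out-of-bag accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return oob_accuracy(models, X, y) - oob_accuracy(models, X_perm, y)

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1) + 5 * label, data_rng.uniform(0, 1)]
     for label in (0, 1) for _ in range(20)]
y = [label for label in (0, 1) for _ in range(20)]

models = fit_bagged(X, y)
oob = oob_accuracy(models, X, y)
importances = [permutation_importance(models, X, y, f) for f in range(2)]
print("OOB accuracy:", oob)
print("Importance per feature:", importances)
```

On this toy data the informative feature should receive a clearly positive score and the noise feature a score of zero, because no stump ever splits on it. With real data the scores are noisier, and averaging over several shuffles is a common refinement.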
Despite the rise of deep learning, random forests remain the algorithm of choice for many tabular tasks, particularly those with small to medium-sized datasets, missing values, or mixed feature types where deep learning's advantages are minimal.
Related terms: Leo Breiman, Bagging, Decision Tree, Ensemble Methods
Discussed in:
- Chapter 7: Supervised Learning