A Decision Tree partitions the feature space into axis-aligned rectangular regions, assigning a prediction to each region. Construction proceeds greedily: at each internal node, choose the feature and threshold that best separate the target values. The result is a tree-shaped graph in which each path from root to leaf encodes a conjunction of simple rules, making decision trees one of the most interpretable models in machine learning.
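To make the "conjunction of simple rules" reading concrete, here is a minimal sketch of a fitted depth-2 tree written out as nested conditionals. The feature names, thresholds, and class labels are hypothetical (loosely modeled on the classic iris setup), not drawn from any particular fitted model:

```python
def predict(petal_length: float, petal_width: float) -> str:
    """Each root-to-leaf path is a conjunction of axis-aligned rules."""
    if petal_length <= 2.45:          # region: petal_length <= 2.45
        return "setosa"
    else:
        if petal_width <= 1.75:       # region: petal_length > 2.45 AND petal_width <= 1.75
            return "versicolor"
        else:                         # region: petal_length > 2.45 AND petal_width > 1.75
            return "virginica"
```

Each `if` tests one feature against one threshold, so every leaf corresponds to an axis-aligned rectangle in feature space.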
Splitting criteria measure the "purity" of the resulting child nodes. For classification, Gini impurity $G = \sum_k p_k(1 - p_k)$ reaches zero when a node contains only one class; information gain instead measures the reduction in Shannon entropy from parent to children. For regression, variance reduction plays the analogous role. An unrestricted tree can grow until every leaf contains a single training example, achieving zero training error but almost certainly overfitting. Pruning—either pre-pruning via depth limits and minimum leaf sizes, or post-pruning via cost-complexity pruning—is essential.
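The Gini criterion and the greedy threshold search can be sketched in a few lines of pure Python. This is an illustrative single-feature version (function names are my own), scoring each candidate threshold by the size-weighted Gini of the two children:

```python
from collections import Counter

def gini(labels):
    """Gini impurity G = sum_k p_k * (1 - p_k); zero for a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Greedy search over candidate thresholds on one feature,
    minimizing the size-weighted Gini impurity of the children."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

A full implementation would run this search over every feature at every node and recurse on the two children until a stopping criterion (depth limit, minimum leaf size) is met.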
Trees handle mixed feature types, require no scaling, and accommodate missing values via surrogate splits. Their principal weaknesses are high variance (small data perturbations produce very different trees) and axis-aligned bias (diagonal or curved boundaries require many splits). These weaknesses are dramatically mitigated by ensemble methods: random forests reduce variance through bagging and feature subsampling, while gradient boosting reduces bias by sequentially fitting residuals. Gradient-boosted decision trees (XGBoost, LightGBM, CatBoost) are the dominant approach to tabular data, routinely winning machine-learning competitions.
Related terms: Random Forest, Gradient Boosting, Ensemble Methods, Entropy
Discussed in:
- Chapter 7: Supervised Learning — Decision Trees
Also defined in: Textbook of AI