7.13 When to use what: a cheat sheet
When you sit down at a fresh problem with a fresh dataset, the question "which algorithm should I use?" is rarely answered by a single name. The short answer is "try several, then choose by held-out performance." But you cannot try everything; you must start somewhere; and the starting point matters because it shapes how you frame the problem, how you split the data, and which baselines you carry through the rest of the pipeline. This section is the practical cheat sheet, the one-page table you consult when a colleague drops a CSV on your desk on a Monday morning.
The right first try depends on a small number of structural facts about the problem: the type of the data (tabular, image, text, sequence, graph); the size of the dataset (hundreds, thousands, millions of rows); the interpretability requirements (regulatory, clinical, audit); the latency constraints (microseconds at inference, or seconds, or batched overnight); and the quality of the features (clean, scaled, missing values, imbalanced classes). §7.1–§7.12 introduced specific algorithms one at a time, with derivations and worked examples; §7.13 steps back and tells you which one to reach for first. §7.14 then walks through a side-by-side comparative experiment so you can see the differences in practice rather than just on paper.
The big table
The table below condenses thousands of Kaggle competitions, internal benchmarks, and applied-ML postmortems into a single first-try recommendation per problem class, with a fallback in the column to its right. Use it as a starting point, not a verdict. The "best first try" column is your default baseline; the "if that fails" column is what to escalate to when the baseline disappoints. Failure here means a held-out metric that is meaningfully worse than what a competent practitioner would expect for that problem class, not noise on a single fold.
| Problem | Best first try | If that fails |
|---|---|---|
| Tabular regression, $n > 1000$ | Gradient boosting (XGBoost / LightGBM) | Random forest, then linear with regularisation |
| Tabular classification, $n > 1000$ | Gradient boosting | Random forest, logistic regression |
| Tabular, $n < 100$ | Linear / logistic regression | $k$-NN with cross-validated $k$ |
| Image classification | Pretrained CNN or ViT, fine-tuned | Train from scratch with strong augmentation |
| Text classification | Pretrained BERT / RoBERTa, fine-tuned | Naive Bayes for very small data |
| Time series | Gradient boosting on lagged features | LSTM or Temporal Fusion Transformer |
| Anomaly detection | Isolation forest or autoencoder | Density estimation |
| Multi-class, many classes | Hierarchical classifier or softmax with class balancing | Per-class binary one-vs-rest |
A few notes on reading this table.
- Gradient boosting dominates the tabular rows because it tolerates mixed feature types, missing values, unscaled inputs, and class imbalance better than any alternative; for tabular data with $n$ between roughly $10^3$ and $10^7$ it is the most reliable single bet.
- Pretrained foundation models dominate the unstructured rows (image, text, audio) because the cost of training from scratch is now far higher than the cost of fine-tuning, and the pretrained representations transfer well across domains.
- Small datasets invert the priority order: with $n < 100$, a linear model with a handful of carefully chosen features will beat a 200-tree boosted ensemble nearly every time, because the boosted ensemble has more capacity than the data can constrain.
- Time series sit awkwardly between tabular and sequence: if you can engineer good lag features, gradient boosting is hard to beat; if the temporal dynamics are intricate (long-range dependencies, multiple seasonalities, irregular sampling), reach for a recurrent or attention-based architecture.
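To make the first tabular row concrete, here is a minimal sketch of the "first try plus simple baseline" pattern. It uses scikit-learn's histogram gradient boosting as a stand-in for XGBoost/LightGBM, and a synthetic dataset purely to keep the example self-contained; neither choice is prescribed by the table.

```python
# A "first try plus simple baseline" sketch for the tabular classification row.
# HistGradientBoostingClassifier stands in for XGBoost/LightGBM; the synthetic
# dataset exists only to keep the example self-contained.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y,
                                            random_state=0)

models = {
    "gradient boosting (first try)": HistGradientBoostingClassifier(random_state=0),
    "logistic regression (baseline)": make_pipeline(StandardScaler(),
                                                    LogisticRegression(max_iter=1000)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"{name}: held-out AUC = {auc:.3f}")
```

Carrying the linear baseline alongside the first try is the point: the gap between the two lines tells you how much the extra capacity is actually buying.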
Decision criteria
Beyond the data-type table, five decision criteria narrow the choice further. Treat them as filters: each one eliminates options rather than picking one outright.
Interpretability is mandatory. Regulated domains (medicine, credit, criminal justice, insurance) often require that every prediction be explainable to a human reviewer. Use linear and logistic regression, single decision trees, generalised linear models, and simple rule-based scorers. These let you read off coefficients or follow a tree path to a leaf. Avoid deep ensembles and neural networks unless paired with a faithful post-hoc explainer (SHAP, LIME, integrated gradients) and a documented justification. Note that post-hoc explanations are not interpretations; they are approximations, and they can disagree with each other.
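A minimal sketch of "interpretable by construction": a standardised logistic regression whose coefficients can be read directly as log-odds changes per standard deviation of each feature. The dataset is a stock scikit-learn example, used only so the sketch runs end to end.

```python
# Interpretable-by-construction: a standardised logistic regression whose
# coefficients read directly as log-odds changes per standard deviation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
clf = LogisticRegression(max_iter=1000).fit(X, data.target)

# Each coefficient: change in log-odds for a one-SD increase in that feature,
# holding the others fixed. Print the five largest in magnitude.
pairs = sorted(zip(data.feature_names, clf.coef_[0]), key=lambda p: -abs(p[1]))
for name, coef in pairs[:5]:
    print(f"{name:25s} {coef:+.2f} log-odds per SD")
```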
Latency is critical. If predictions must return in microseconds (ad ranking, high-frequency trading, on-device inference), choose linear models, very shallow trees, or distilled networks. Avoid large ensembles and unquantised deep networks. Batch inference relaxes the constraint dramatically; offline scoring jobs can afford the heaviest model the data justifies.
Few features, lots of data. This is the regime where deep learning shines. Images (millions of pixels are not really $10^6$ independent features; convolutional structure makes the effective dimensionality far smaller), audio waveforms, and dense token streams all benefit from learned representations rather than hand-engineered ones. Use CNNs, ViTs, transformers, and their variants.
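For the image case, the "pretrained, fine-tuned" recommendation usually reduces to swapping in a new classification head. A minimal sketch, assuming PyTorch and torchvision are available; the architecture (resnet18), the 10-class head, and the frozen-backbone choice are illustrative rather than prescriptive, and older torchvision versions use `pretrained=True` instead of `weights="DEFAULT"`.

```python
# Transfer learning for image classification: load a pretrained backbone,
# freeze its weights, and train only a new classification head.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="DEFAULT")
for p in backbone.parameters():
    p.requires_grad = False                      # freeze the pretrained features

backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # new, trainable head

optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=3e-4, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()
# A standard training loop over (image, label) batches goes here; only the
# head's parameters receive gradient updates.
```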
Many features, few data. The opposite regime, high-dimensional tabular or wide genomics-style data with $d \gg n$, needs heavy regularisation. Use linear models with $\ell_1$ or $\ell_2$ penalties, or gradient boosting with low tree depth and strong shrinkage. Avoid deep networks, which will overfit unless the prior is exceptional.
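A minimal sketch of the $d \gg n$ regime: an $\ell_1$-penalised logistic regression with the penalty strength chosen by cross-validation, on synthetic data with far more features than samples. The shapes are placeholders for a real wide dataset.

```python
# d >> n: 80 samples, 2000 features. The L1 penalty keeps only a handful of
# coefficients non-zero; its strength (C) is chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=80, n_features=2000, n_informative=10,
                           random_state=0)
clf = LogisticRegressionCV(Cs=10, penalty="l1", solver="saga", cv=5,
                           max_iter=5000, random_state=0).fit(X, y)
kept = int(np.sum(clf.coef_ != 0))
print(f"features retained by the L1 penalty: {kept} of {X.shape[1]}")
```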
Mixed numeric and categorical features. Tree-based methods (random forest, gradient boosting) handle mixed types natively, with no need to one-hot encode high-cardinality categoricals. CatBoost in particular is purpose-built for this. Avoid pure neural approaches without careful embedding tables.
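A minimal sketch of native categorical handling, assuming the lightgbm package is installed: LightGBM splits directly on pandas `category` columns, so the categorical feature needs no one-hot encoding. The frame, column names, and target are synthetic placeholders.

```python
# Mixed numeric and categorical features handled natively by a tree ensemble.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=1000),
    "income": rng.normal(50_000, 15_000, size=1000),
    "city": pd.Categorical(rng.choice(["london", "paris", "berlin"], size=1000)),
})
y = ((df["income"] + (df["city"] == "paris") * 10_000) > 55_000).astype(int)

clf = LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(df, y)          # the 'city' column is used directly as a categorical split
```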
A useful sixth question: do I need calibrated probabilities? If yes (clinical risk scores, expected-value decisions, two-stage pipelines), prefer logistic regression, calibrated boosting (with Platt scaling or isotonic regression on a held-out set), or Bayesian models. Random forests and SVMs without calibration produce scores that look like probabilities but behave badly when fed to downstream calculations: a forest's vote share is not a probability, and an SVM's signed distance to the margin is not a probability either, no matter how convenient it would be if they were.
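A minimal calibration sketch: wrap the forest in scikit-learn's `CalibratedClassifierCV` with isotonic regression (`method="sigmoid"` gives Platt scaling) and compare Brier scores on held-out data. The dataset and the choice of base model are illustrative.

```python
# Calibrating forest scores and checking the effect with the Brier score.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200, random_state=0),
                             method="isotonic", cv=5).fit(X_tr, y_tr)

for name, model in [("raw forest", raw), ("calibrated forest", cal)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_te, p):.4f}")
```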
A seventh: what is the cost of a wrong prediction? If false positives and false negatives carry asymmetric costs (missed cancers, fraud that slips through, ads shown to the wrong person), the decision threshold matters more than the model. Two models with identical AUC can produce wildly different operational outcomes once you set a threshold. Choose the threshold on a validation set with the cost structure baked in, and revisit it whenever the cost structure changes.
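A minimal sketch of cost-aware threshold selection on a validation set. The 20:1 false-negative-to-false-positive cost ratio and the synthetic labels and scores are placeholders for a real cost structure and a real model's predicted probabilities.

```python
# Choose the decision threshold by minimising expected cost on validation data,
# rather than defaulting to 0.5.
import numpy as np

COST_FP, COST_FN = 1.0, 20.0           # assumed: a missed positive is 20x worse

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=2000)                                # stand-in labels
p_val = np.clip(0.55 * y_val + rng.normal(0.25, 0.2, 2000), 0, 1)    # stand-in scores

def expected_cost(threshold: float) -> float:
    pred = (p_val >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_val == 0))
    fn = np.sum((pred == 0) & (y_val == 1))
    return (COST_FP * fp + COST_FN * fn) / len(y_val)

thresholds = np.linspace(0.01, 0.99, 99)
costs = np.array([expected_cost(t) for t in thresholds])
best = thresholds[costs.argmin()]
print(f"chosen threshold: {best:.2f}  expected cost per example: {costs.min():.3f}")
```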
Hyperparameter starting points
For each of the workhorse algorithms, here is a reasonable default that will usually land within a few per cent of a fully tuned model. Use these as the starting point of a coarse-to-fine search, not as the final answer.
- XGBoost / LightGBM: `n_estimators=200`, `learning_rate=0.05`, `max_depth=6`, `subsample=0.8`, `colsample_bytree=0.8`, early stopping with 20-round patience on a validation split. Once the baseline is established, tune `learning_rate` (downwards) and `n_estimators` (upwards) jointly, then `max_depth` and `min_child_weight`, then the regularisation parameters `reg_alpha` and `reg_lambda`. A wired-up example follows this list.
- Random forest: `n_estimators=200`, `max_depth=None`, `max_features='sqrt'` for classification or 1/3 of the features for regression, `min_samples_leaf=1`. Forests are unusually robust to default settings; tuning rarely buys more than a percentage point.
- Logistic regression: `C=1.0`, `penalty='l2'`, `solver='lbfgs'` (or `'saga'` for $\ell_1$ on large data). Standardise features first. Tune `C` on a log scale across roughly six orders of magnitude.
- SVM: `kernel='rbf'`, `C=1.0`, `gamma='scale'`. Standardise features. Be aware that training scales as $O(n^2)$ to $O(n^3)$: fine for $n < 10^4$, painful beyond.
- $k$-NN: $k=5$, Euclidean distance, standardised features. Choose $k$ by cross-validation; odd values for binary classification to avoid ties.
- Neural networks: AdamW with `lr=3e-4`, weight decay $10^{-2}$, cosine schedule, warmup over the first 5 % of steps, batch size as large as memory allows. Andrej Karpathy's "$3 \times 10^{-4}$ is the best learning rate for Adam, hands down" is a meme but a useful starting point.
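As referenced in the boosting item above, here is a sketch of those defaults wired into XGBoost's scikit-learn interface with early stopping. Note that `early_stopping_rounds` moved from `fit()` into the constructor around XGBoost 1.6, so adjust for your version; the synthetic data exists only to keep the sketch self-contained.

```python
# The boosting starting point from the list above, with 20-round early stopping
# on a validation split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, n_features=40, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=20,      # 20-round patience on the validation split
    eval_metric="logloss",
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("boosting rounds actually used:", model.best_iteration + 1)
```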
What not to do
A short, blunt list of mistakes that recur in every applied-ML team and that good practice avoids by default.
- Do not compare models on the training set. Training accuracy tells you almost nothing about generalisation. Always use a held-out validation set, and keep a separate test set that you touch only at the end.
- Do not tune hyperparameters on the test set. Tuning is a form of fitting; if you tune on the test set, the test set is no longer held out, and your reported number is optimistically biased. Use nested cross-validation if data is scarce.
- Do not ignore class imbalance. With 99 % negatives, a constant predictor is 99 % accurate and useless. Use class-weighted losses, resampling, or threshold tuning, and report precision, recall, and the precision–recall curve rather than accuracy alone.
- Do not deploy without monitoring. The moment a model goes to production, the data distribution starts to drift. Log predictions, log inputs (subject to privacy constraints), set up dashboards for input statistics and output distributions, and define a retraining trigger.
- Do not over-engineer the baseline. Spending three weeks tuning XGBoost when a logistic regression would have answered the business question in an afternoon is a common form of self-harm. Always run the simple baseline first; it tells you how hard the problem actually is.
- Do not leak the future into the past. When constructing features for time-series problems, every feature at time $t$ must be computable strictly from information available before $t$. Lag features, rolling means, and aggregated counts are the usual culprits; any of them computed across the full history at training time will silently leak the test labels and produce flattering numbers that collapse on deployment (see the sketch after this list).
- Do not chase one-fold improvements. A single 0.3 % gain on one cross-validation fold is noise. Repeat the comparison with different seeds, average the results, and report the standard deviation. If the improvement does not survive averaging, it does not exist.
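As referenced in the leakage item above, a minimal sketch of leak-free lag features: shifting before rolling guarantees that every feature at time $t$ uses only observations strictly before $t$. The daily sales series and column names are illustrative.

```python
# Leak-free lag features: shift(1) before rolling() makes each window end the
# day *before* the prediction date.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sales = pd.Series(rng.poisson(100, size=365),
                  index=pd.date_range("2024-01-01", periods=365, freq="D"),
                  name="sales")

features = pd.DataFrame({
    "lag_1": sales.shift(1),                             # yesterday's value
    "rolling_7_mean": sales.shift(1).rolling(7).mean(),  # 7-day window ending yesterday
})
# Leaky version (do NOT do this): sales.rolling(7).mean() includes the current
# day's value, i.e. information not available when the prediction is made.
```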
What you should take away
- Start with the data type. Tabular → gradient boosting; image → pretrained CNN/ViT; text → pretrained transformer; time series → boosting on lagged features. This single decision rules out 90 % of the alternatives.
- Scale the model to the data. With $n < 100$, use linear models. With $n$ in the thousands, use boosting or shallow networks. With $n$ in the millions and unstructured data, use deep networks. Capacity must match constraint.
- Let interpretability and latency veto choices. Regulatory or sub-millisecond requirements eliminate ensembles and deep nets regardless of accuracy. Decide these constraints up front, not after building the model.
- Default hyperparameters are usually within a few per cent of optimal. Get the baseline working end-to-end first; tune second. Premature tuning is a leading cause of project delay.
- Always validate on held-out data, monitor in production, and run a simple baseline alongside the fancy one. The simple baseline tells you how hard the problem is and protects you from catastrophic regressions when the fancy one breaks.