7.12 The practical workflow
A real supervised learning pipeline rarely begins or ends with the model. Most of the engineering is upstream and downstream.
Feature scaling. Algorithms based on distances (KNN, SVM, clustering), gradient descent (logistic regression, neural networks), and regularisation (ridge, lasso) need features on comparable scales.
- Standardisation: $\tilde x = (x - \mu)/\sigma$, zero mean, unit variance.
- Min-max scaling: $\tilde x = (x-\min)/(\max-\min)$, bounded to $[0,1]$.
- Robust scaling: subtract median, divide by IQR, robust to outliers.
Tree-based methods (decision trees, random forests, gradient boosting) depend only on the ordering of feature values, so they are scale-invariant and need no scaling.
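The three scalers above map directly onto scikit-learn transformers. A minimal sketch (the tiny array with one outlier is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature with one outlier, to show how the scalers differ.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # squashed into [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # centred on median, scaled by IQR
```

Note how the outlier compresses the min-max output of the remaining points towards 0, while the robust scaler leaves them well spread.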
Encoding categorical features.
- One-hot encoding: one column per category. Sparse but interpretable. Explodes with high-cardinality features (e.g., postcode, user-id).
- Ordinal encoding: integer codes. Only valid if the categories have a true order.
- Target encoding: replace each category with the mean target value in that category, computed out-of-fold to prevent leakage. Effective for high cardinality.
- Embedding: learn a dense vector per category, the standard for neural networks.
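A hedged sketch of the first two encoders with scikit-learn; the toy frame and its categories are illustrative. (Recent scikit-learn versions also ship sklearn.preprocessing.TargetEncoder, which cross-fits in fit_transform to avoid the leakage mentioned above.)

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["S", "M", "L", "M"],
                   "city": ["Leeds", "York", "Leeds", "Hull"]})

# One-hot: one indicator column per observed category;
# handle_unknown="ignore" maps unseen categories to all zeros at predict time.
onehot = OneHotEncoder(handle_unknown="ignore")
print(onehot.fit_transform(df[["city"]]).toarray())
print(onehot.get_feature_names_out())

# Ordinal: integer codes, valid here only because S < M < L has a genuine order.
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]])
print(ordinal.fit_transform(df[["size"]]))
```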
Missing data. Strategies in increasing order of sophistication:
- Drop: discard rows or columns with missing values. Wasteful, and can bias the sample unless values are missing completely at random.
- Mean / median / mode imputation: simple, distorts variance.
- Regression imputation: predict missing values from observed features.
- Iterative imputation (MICE): repeatedly predict each variable from the others.
- Indicator variable: add a binary "is-missing" feature alongside imputation, to let the model learn missingness patterns.
- Native handling: XGBoost / LightGBM / surrogate splits in CART can handle missing values directly.
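A minimal sketch of indicator-augmented imputation and MICE-style iterative imputation with scikit-learn; the tiny array is illustrative, and IterativeImputer is still marked experimental, hence the explicit enabling import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# Median imputation plus a binary is-missing indicator per affected column.
simple = SimpleImputer(strategy="median", add_indicator=True)
print(simple.fit_transform(X))

# MICE-style: each column with missing values is repeatedly regressed on the others.
mice = IterativeImputer(max_iter=10, random_state=0)
print(mice.fit_transform(X))
```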
Class imbalance. Solutions span the model and the data:
- Class weights: scale the loss by inverse class frequency (class_weight='balanced' in scikit-learn).
- Oversampling minority: random oversampling or SMOTE (synthetic minority oversampling, interpolate between minority points and their neighbours).
- Undersampling majority: random or informed (e.g., Tomek links).
- Threshold tuning: don't default to 0.5; tune the decision threshold on a validation set to optimise the metric you care about.
- Cost-sensitive learning: bake misclassification costs directly into the loss.
- Anomaly-detection framing: when imbalance is extreme (1 in $10^6$), recast as one-class classification.
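A minimal sketch of class weighting plus threshold tuning on a synthetic 99:1 problem, assuming scikit-learn; the dataset, split, and F1 objective are illustrative (in practice tune the threshold on a validation split, not the final test set), and SMOTE / Tomek-link resampling live in the separate imbalanced-learn package:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 99% negatives, 1% positives.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: the loss is scaled by inverse class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: sweep candidate thresholds instead of defaulting to 0.5.
probs = clf.predict_proba(X_te)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_te, probs >= t))
print(f"best F1 at threshold {best:.2f}")
```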
Feature engineering. Where the rubber meets the road.
- Polynomial and interaction features for linear models.
- Log-transforms for skewed positive variables.
- Date features (day of week, hour, season).
- Lag features for time series.
- Aggregations (mean, count, max, percentile) over groups.
- Text features (TF-IDF, n-grams, embeddings).
- Domain-driven features that encode physics, biology, or business knowledge.
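A few of these transforms sketched in pandas; the table and its column names (timestamp, customer_id, amount) are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical transaction table.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 17:45", "2024-01-07 11:10"]),
    "customer_id": [1, 1, 2],
    "amount": [20.0, 350.0, 15.0],
})

# Log-transform a skewed positive variable.
df["log_amount"] = np.log1p(df["amount"])

# Date features.
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["hour"] = df["timestamp"].dt.hour

# Group aggregations: per-customer mean and count.
agg = (df.groupby("customer_id")["amount"]
         .agg(cust_mean="mean", cust_count="count")
         .reset_index())
df = df.merge(agg, on="customer_id")

# Lag feature: previous transaction amount per customer.
df["prev_amount"] = df.sort_values("timestamp").groupby("customer_id")["amount"].shift(1)
print(df)
```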
Pipeline hygiene. Always wrap preprocessing and modelling in a sklearn.pipeline.Pipeline so that scalers, encoders, and imputers are fitted on the training data only, and their learned parameters are then applied unchanged to validation and test data. This single discipline prevents data leakage through preprocessing, the most common cause of "great in development, broken in production" results.
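A minimal end-to-end sketch of that discipline, assuming scikit-learn; the column names are illustrative, and X, y stand for a DataFrame with those columns and its label vector:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]      # illustrative column names
categorical = ["city", "plan"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# cross_val_score refits the whole pipeline inside every fold, so the imputers,
# scaler, and encoder never see the held-out portion while being fitted.
# scores = cross_val_score(model, X, y, cv=5)
```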