6.11 Feature engineering vs representation learning

For the first half-century of ML, the dominant paradigm was feature engineering: a domain expert designs a pipeline that transforms raw data into a fixed-length vector of meaningful features, on top of which a relatively simple model (linear model, SVM, random forest) is trained. A sketch of such a pipeline follows the list below.

  • For text: bag-of-words, TF-IDF, n-gram counts, hand-curated lexica.
  • For images: histograms of oriented gradients (HOG), SIFT keypoints, GIST descriptors, colour histograms.
  • For speech: mel-frequency cepstral coefficients (MFCCs), pitch contours, formant frequencies.
  • For tabular data: ratios, polynomial expansions, target encodings, lagged features, time-of-day buckets.
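
A minimal sketch of such a pipeline for the text case, assuming scikit-learn; the toy spam corpus, labels, and hyperparameters are purely illustrative. The representation (TF-IDF over word n-grams) is fixed by the practitioner, and only the weights of a simple linear classifier are learned from data.

```python
# Classic feature engineering: hand-specified TF-IDF n-gram features
# feeding a simple linear model. Corpus and labels are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "cheap meds online now", "win a free prize today",
    "meeting moved to 3 pm", "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (toy data)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # features chosen by hand
    LinearSVC(),                                             # simple model on top
)
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # expected: [1]
```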

Good features encoded human knowledge and dominated competitions. The Pascal VOC challenge, the leading object-recognition benchmark of the late 2000s, was won repeatedly by hand-crafted SIFT-plus-bag-of-words pipelines.

The 2012 pivot

Then in 2012, AlexNet (Krizhevsky et al., 2012) won ImageNet by a large margin using a deep convolutional network trained end-to-end on raw pixels. The features were learned, not designed. Within a few years, hand-engineered visual features had been almost entirely replaced. The same revolution swept through speech recognition (deep RNNs, then Transformers), NLP (word embeddings, then BERT, then GPT), and now reaches into structural biology (AlphaFold) and chemistry (graph neural networks).

We call this paradigm representation learning. The model jointly learns the features and the prediction function from data. The cost is that representation learning requires very large datasets and substantial compute. The benefit is that the features adapt to the task and exploit structure no human would have thought to encode.
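
A minimal sketch of the contrast, assuming PyTorch; the layer sizes and the fake batch are invented for illustration. The convolutional network below consumes raw pixels, and a single backward pass updates the feature extractor and the classification head together, so the representation itself adapts to the task.

```python
import torch
import torch.nn as nn

# Features and predictor live in one module and are trained jointly.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                       # 10-way classification head
)

images = torch.randn(8, 3, 32, 32)           # fake batch of raw RGB pixels
labels = torch.randint(0, 10, (8,))

optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()                              # gradients reach the conv filters...
optimiser.step()                             # ...so the learned features shift too
```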

When to engineer, when to learn

Despite the dominance of representation learning, feature engineering is far from dead.

  • In tabular ML (fraud detection, credit scoring, supply-chain forecasting), gradient-boosted decision trees on hand-engineered features still beat deep networks on many real-world datasets, because tabular data lacks the spatial or temporal structure that deep models exploit. XGBoost (Chen & Guestrin, 2016) and LightGBM remain competitive and are often the right answer; a sketch follows this list.
  • In low-data regimes, where training a deep model from scratch is infeasible, hand-crafted features paired with a simple regulariser are often superior.
  • In safety-critical or regulated domains (clinical decision support, lending), features must be auditable. "We learned them" is rarely an acceptable explanation to a regulator.
  • In multimodal pipelines, one almost always combines raw signals (handled by a deep encoder) with engineered side features (geographic codes, time-of-day, user IDs); a second sketch below illustrates this pattern.
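
A sketch of the tabular case above, using scikit-learn's HistGradientBoostingClassifier as a stand-in for XGBoost/LightGBM; the transaction data, column names, and engineered features are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "amount": rng.lognormal(3, 1, n),
    "account_mean_amount": rng.lognormal(3, 0.5, n),
    "timestamp": pd.date_range("2024-01-01", periods=n, freq="h"),
    "is_fraud": rng.integers(0, 2, n),        # toy labels
})

# Hand-engineered features: a spend ratio and a coarse time-of-day bucket,
# domain knowledge the trees do not have to rediscover from raw columns.
df["amount_ratio"] = df["amount"] / df["account_mean_amount"]
df["hour_bucket"] = df["timestamp"].dt.hour // 6   # night/morning/afternoon/evening

X = df[["amount", "amount_ratio", "hour_bucket"]]
y = df["is_fraud"]
model = HistGradientBoostingClassifier().fit(X, y)
```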

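And a sketch of the multimodal pattern, again assuming PyTorch; the module sizes, the side features, and the HybridModel name are invented. A deep encoder handles the raw image while two engineered side features are simply concatenated before the prediction head.

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """Learned encoder for the raw signal plus engineered side features."""

    def __init__(self, n_side_features: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(              # learned representation of raw pixels
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16 + n_side_features, n_classes)

    def forward(self, image, side):
        return self.head(torch.cat([self.encoder(image), side], dim=1))

model = HybridModel(n_side_features=2, n_classes=3)
image = torch.randn(4, 3, 64, 64)                  # raw signal for the deep encoder
side = torch.tensor([[14.0, 2.0]] * 4)             # engineered: hour of day, region code
print(model(image, side).shape)                    # torch.Size([4, 3])
```
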
The right question is not "engineer or learn" but "what's the cheapest path to enough signal."
