A conditional random field (CRF), introduced by John Lafferty, Andrew McCallum and Fernando Pereira in 2001, is a discriminative undirected graphical model for structured prediction. Whereas a generative hidden Markov model (HMM) models the joint distribution $p(\mathbf{x}, \mathbf{y})$, a CRF directly models the conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ of structured outputs $\mathbf{y}$ given inputs $\mathbf{x}$, freeing it from the need to model the input distribution and allowing arbitrary, overlapping features of $\mathbf{x}$.
The most common variant, the linear-chain CRF, models the conditional probability of a label sequence $\mathbf{y} = (y_1, \ldots, y_n)$ given an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ as
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left( \sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),$$
where $f_k$ are user-specified feature functions (binary or real-valued indicators that fire for particular patterns) and $\lambda_k$ are learnable weights. The partition function
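For concreteness, here is a minimal Python sketch of two such feature functions; the feature names, words and tags are illustrative, not taken from any particular toolkit:

```python
# Two illustrative linear-chain CRF feature functions with the signature
# f_k(y_prev, y_t, x, t) from the equation above.

def f_capitalised_noun(y_prev, y_t, x, t):
    """Fires when a capitalised word is labelled NOUN (an emission feature)."""
    return 1.0 if x[t][0].isupper() and y_t == "NOUN" else 0.0

def f_det_noun_transition(y_prev, y_t, x, t):
    """Fires on a DET -> NOUN transition, regardless of the words."""
    return 1.0 if y_prev == "DET" and y_t == "NOUN" else 0.0

# Each f_k gets a weight lambda_k; the unnormalised score of a labelling
# is sum over t and k of lambda_k * f_k(y_{t-1}, y_t, x, t).
```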
$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t,k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)$$
is computed efficiently in $O(n |\mathcal{Y}|^2)$ time, where $\mathcal{Y}$ is the label set, by the forward pass of the forward--backward algorithm; the most likely label sequence is found by the Viterbi algorithm.
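A minimal NumPy sketch of both computations, assuming the weighted features at each position have already been collapsed into log-space emission scores `emit[t, y]` and transition scores `trans[y_prev, y]` (array names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emit, trans):
    """Forward algorithm: log Z(x) in O(n * |Y|^2)."""
    n, num_labels = emit.shape
    alpha = emit[0].copy()  # log alpha_1(y); no special start transition here
    for t in range(1, n):
        # alpha_t(y) = logsumexp_{y'}(alpha_{t-1}(y') + trans[y', y]) + emit[t, y]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
    return logsumexp(alpha)

def viterbi(emit, trans):
    """Most likely label sequence under the same scores (max replaces sum)."""
    n, num_labels = emit.shape
    delta = emit[0].copy()
    backptr = np.zeros((n, num_labels), dtype=int)
    for t in range(1, n):
        scores = delta[:, None] + trans          # rows: y_prev, cols: y
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emit[t]
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):                # follow back-pointers
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```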
CRFs avoid the label-bias problem of maximum-entropy Markov models (MEMMs): because the partition function normalises over the whole sequence rather than at each position independently, states with few outgoing transitions cannot artificially attract probability mass. The parameters are learned by maximum (conditional) likelihood with gradient methods. The gradient of the log-likelihood has the clean form
$$\frac{\partial \log p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_k} = \tilde{\mathbb{E}}[f_k] - \mathbb{E}_{p(\mathbf{y}'\mid\mathbf{x})}[f_k],$$
the difference between the empirical feature counts and the model's expected feature counts; the expectations follow from the per-position marginals that forward--backward computes. L-BFGS with $\ell_2$ regularisation is the standard batch optimiser; stochastic gradient methods scale to larger datasets. CRFs over general graphs require approximate inference (loopy belief propagation, mean field, MCMC).
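As an illustration, training with L-BFGS and $\ell_2$ regularisation might look like the following sketch, assuming the third-party sklearn-crfsuite package (a Python wrapper around CRFsuite); the feature template and toy data are invented for the example:

```python
import sklearn_crfsuite

def word_features(sentence, t):
    """An illustrative per-token feature template."""
    word = sentence[t]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "prev.word.lower": sentence[t - 1].lower() if t > 0 else "<BOS>",
    }

# X: one feature dict per token per sentence; y: matching label sequences.
train_sents = [["John", "lives", "in", "Berlin"]]
X = [[word_features(s, t) for t in range(len(s))] for s in train_sents]
y = [["B-PER", "O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",    # L-BFGS optimiser
    c2=0.1,               # ell_2 regularisation strength
    max_iterations=100,
)
crf.fit(X, y)
print(crf.predict(X))     # e.g. [['B-PER', 'O', 'O', 'B-LOC']]
```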
CRFs were the workhorse of sequence-labelling tasks throughout the 2000s: part-of-speech tagging, named-entity recognition (NER), shallow parsing (chunking), gene/protein mention tagging in biomedical text, optical character recognition, and image segmentation (in computer vision, CRFs unify pixel-wise classification with smoothness priors, e.g. DenseCRF applied as a post-processing step). Toolkits such as CRF++, MALLET and CRFsuite made them accessible to practitioners.
Neural sequence-labelling models displaced hand-engineered CRFs in the 2010s, but the CRF idea survives as the output layer of many neural taggers. The BiLSTM-CRF architecture (Huang, Xu and Yu, 2015; Lample et al., 2016) became the standard for NER, and BERT-CRF further improved results by replacing the BiLSTM with a transformer encoder. In these hybrids, the neural network produces rich contextual emission scores while the CRF layer enforces global label-sequence consistency through learned transition scores, preventing illegal label sequences (e.g. I-PER immediately following B-LOC in BIO tagging schemes). CRFs are also widely used in structured prediction for parsing, alignment and segmentation tasks where global consistency matters.
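A minimal sketch of such a hybrid, assuming the third-party pytorch-crf package for the CRF layer (the `CRF` class and `decode` method follow that package; the dimensions and class name are illustrative):

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, num_tags)  # contextual emission scores
        self.crf = CRF(num_tags, batch_first=True)   # learned transition scores

    def loss(self, tokens, tags):
        emissions, _ = self.lstm(self.embed(tokens))
        return -self.crf(self.emit(emissions), tags)  # negative log-likelihood

    def decode(self, tokens):
        emissions, _ = self.lstm(self.embed(tokens))
        return self.crf.decode(self.emit(emissions))  # Viterbi-best tag paths
```

During training, the transition matrix inside the CRF layer typically learns strongly negative scores for implausible tag bigrams, which is how illegal sequences like B-LOC followed by I-PER are suppressed at decoding time.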
Related terms: Hidden Markov Model
Discussed in:
- Chapter 6: ML Fundamentals, Machine Learning
- Chapter 8: Unsupervised Learning, Natural Language Processing