12.13 Connectionist temporal classification

Many sequence tasks, especially in speech recognition, produce output sequences shorter than their input sequences, with no a priori alignment. An audio recording sampled at 100 frames per second produces 1000 frames for a 10-second utterance, but the transcription might be only 30 characters. We need a loss that handles this length mismatch without requiring frame-by-frame target labels (which would be expensive to annotate). Connectionist Temporal Classification (CTC), introduced by Graves et al. (2006), is the standard solution.

12.13.1 The blank symbol and the alignment formalism

Augment the output vocabulary with a special blank symbol $\varnothing$. At each input frame $t = 1, \ldots, T$, the network predicts a probability distribution over $\mathcal{V} \cup \{\varnothing\}$. A frame-level path $\pi = (\pi_1, \ldots, \pi_T)$ is any sequence of frame outputs.

Define the collapsing function $\mathcal{B}$ that converts a path into a label sequence by (a) merging consecutive identical labels and (b) removing all blanks. So $\mathcal{B}(\text{a a } \varnothing \text{ b b}) = \text{ab}$, $\mathcal{B}(\text{a } \varnothing \text{ a b b } \varnothing) = \text{aab}$ (the blank between the two "a"s preserves them as separate). The probability of a target label sequence $\ell$ is the sum over all paths that collapse to it:

$$P(\ell \mid x) = \sum_{\pi : \mathcal{B}(\pi) = \ell} \prod_{t=1}^{T} P(\pi_t \mid x).$$
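The collapsing function $\mathcal{B}$ can be sketched in a few lines of Python, writing the blank as `"-"` (the function name `collapse` is our own):

```python
def collapse(path, blank="-"):
    """Apply B: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:  # new symbol that is not a blank
            out.append(sym)
        prev = sym
    return "".join(out)

print(collapse("aa-bb"))   # -> "ab"
print(collapse("a-abb-"))  # -> "aab" (the blank keeps the two a's separate)
```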

There can be exponentially many such paths, so we cannot enumerate them. The CTC forward–backward algorithm computes the sum efficiently using dynamic programming.
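For a toy problem the path sum can still be checked by brute force, which also confirms a useful sanity property: since every path collapses to exactly one label sequence, the probabilities $P(\ell \mid x)$ over all labelings sum to 1. A sketch with made-up frame posteriors over a two-letter vocabulary and $T = 3$ frames:

```python
import itertools
from collections import defaultdict

def collapse(path, blank="-"):
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

symbols = ["a", "b", "-"]
# Made-up P(symbol | x_t) for T = 3 frames; each row sums to 1.
probs = [
    {"a": 0.6, "b": 0.1, "-": 0.3},
    {"a": 0.3, "b": 0.4, "-": 0.3},
    {"a": 0.1, "b": 0.7, "-": 0.2},
]

# Enumerate all 3^3 = 27 paths and group their probabilities by labeling.
label_prob = defaultdict(float)
for path in itertools.product(symbols, repeat=3):
    p = 1.0
    for t, sym in enumerate(path):
        p *= probs[t][sym]
    label_prob[collapse(path)] += p

print(round(sum(label_prob.values()), 10))  # -> 1.0
print(round(label_prob["ab"], 6))           # -> 0.531
```

This enumeration is exponential in $T$, which is exactly why the dynamic program below is needed.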

12.13.2 Forward–backward algorithm

Construct an extended label sequence $\ell'$ by inserting a blank between every label and at the start and end: if $\ell = (a, b, c)$ then $\ell' = (\varnothing, a, \varnothing, b, \varnothing, c, \varnothing)$, with length $L' = 2L + 1$.

Define the forward variable

$$\alpha_t(s) = \sum_{\substack{\pi_{1:t} \\ \mathcal{B}(\pi_{1:t}) \text{ matches } \ell'_{1:s}}} \prod_{t'=1}^{t} P(\pi_{t'} \mid x),$$

i.e. the total probability of all length-$t$ paths that produce the first $s$ symbols of $\ell'$. With initial conditions $\alpha_1(1) = P(\varnothing \mid x_1)$, $\alpha_1(2) = P(\ell_1 \mid x_1)$ and zero otherwise, the recurrence is

$$\alpha_t(s) = \left( \alpha_{t-1}(s) + \alpha_{t-1}(s-1) + [\ell'_s \ne \varnothing \text{ and } \ell'_s \ne \ell'_{s-2}] \cdot \alpha_{t-1}(s-2) \right) P(\ell'_s \mid x_t).$$

The third term, only included when $\ell'_s$ is not blank and is not equal to $\ell'_{s-2}$, handles transitions from non-blank to non-blank that skip over a blank.

The total likelihood is $P(\ell \mid x) = \alpha_T(L') + \alpha_T(L' - 1)$, summing over the two valid termination states. The backward variable $\beta$ is defined symmetrically. The CTC loss is $-\log P(\ell \mid x)$; gradients with respect to the per-frame predictions can be computed in closed form using the standard forward–backward identity $\alpha_t(s) \beta_t(s) / P(\ell \mid x)$.
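The forward recurrence translates directly into code. A minimal sketch (function name and frame posteriors are our own; blanks written as `"-"`):

```python
def ctc_forward(probs, label, blank="-"):
    """Compute P(label | x) via the CTC forward recurrence.

    probs[t][s] is P(symbol s | x_t) for t = 0..T-1.
    """
    T = len(probs)
    # Extended sequence l': a blank between every label and at both ends.
    ext = [blank]
    for c in label:
        ext.extend([c, blank])
    S = len(ext)  # = 2 * len(label) + 1

    alpha = [[0.0] * S for _ in range(T)]
    # Initial conditions: start in the leading blank or the first label.
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # Skip transition: allowed only into a non-blank that differs
            # from the label two positions back.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    # Sum over the two valid end states: last label or trailing blank.
    tail = alpha[T - 1][S - 2] if S > 1 else 0.0
    return alpha[T - 1][S - 1] + tail

probs = [
    {"a": 0.6, "b": 0.1, "-": 0.3},
    {"a": 0.3, "b": 0.4, "-": 0.3},
    {"a": 0.1, "b": 0.7, "-": 0.2},
]
print(round(ctc_forward(probs, "ab"), 6))  # -> 0.531
```

The cost is $O(T L')$ rather than exponential in $T$; a production implementation would work in log space to avoid underflow on long utterances.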

12.13.3 Decoding

CTC decoding at test time has two main variants. Greedy (best-path) decoding picks the most probable symbol at each frame and then collapses the result; beam search maintains a beam of label sequences together with their summed-over-path probabilities. Beam search with a CTC-aware update rule (which separately tracks "ends in blank" and "ends in non-blank" path probabilities) is the standard production approach.
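The greedy variant is a few lines; a sketch with made-up frame posteriors (greedy decoding is fast but approximate, since it scores single paths rather than summing over all paths for a labeling):

```python
def greedy_decode(probs, blank="-"):
    """Best-path decoding: argmax per frame, then collapse."""
    best = [max(p, key=p.get) for p in probs]  # most probable symbol per frame
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

probs = [
    {"a": 0.6, "b": 0.1, "-": 0.3},
    {"a": 0.3, "b": 0.4, "-": 0.3},
    {"a": 0.1, "b": 0.7, "-": 0.2},
]
print(greedy_decode(probs))  # -> "ab"
```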

CTC remains a workhorse for speech recognition (alongside RNN-Transducer and attention-based encoder–decoders) and for any task where input frames vastly outnumber output labels.
