Glossary

CTC Loss

Connectionist Temporal Classification (CTC) is a loss function introduced by Graves, Fernández, Gomez, and Schmidhuber (2006) for training networks that map an input frame sequence $x_{1:T}$ to an output label sequence $y_{1:U}$ with $U \le T$, where the alignment between frames and labels is unknown. It became the workhorse loss for end-to-end speech recognition, handwriting recognition, and OCR, removing the need for pre-computed frame-level alignments such as hand-crafted phoneme segmentations.

Setup. The output alphabet is augmented with a special blank symbol $\varnothing$, giving $|\mathcal{V}| + 1$ classes. At each frame $t$ the network produces a distribution $p(\pi_t \mid x)$ over this extended alphabet via a softmax. A path $\pi \in (\mathcal{V} \cup \{\varnothing\})^T$ is a frame-level label sequence; the collapsing function $\mathcal{B}$ removes consecutive duplicates and then deletes blanks, e.g. $\mathcal{B}(a\,a\,\varnothing\,b\,\varnothing\,b) = a\,b\,b$.
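
For concreteness, a minimal Python sketch of the collapsing function; the sentinel string "-" for the blank and the function name are illustrative choices, not from the paper:

```python
# A minimal sketch of the collapsing function B: remove consecutive
# duplicates, then delete blanks. Blank is the illustrative sentinel "-".
from itertools import groupby

BLANK = "-"

def collapse(path):
    """Apply B to a frame-level path given as a list of symbols."""
    deduped = [symbol for symbol, _ in groupby(path)]   # merge repeats
    return [symbol for symbol in deduped if symbol != BLANK]

# Matches the example above: B(a a - b - b) = a b b
assert collapse(list("aa-b-b")) == ["a", "b", "b"]
```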

Loss. The probability of label sequence $y$ given $x$ marginalises over all paths that collapse to $y$:

$$P(y \mid x) = \sum_{\pi: \mathcal{B}(\pi) = y} \prod_{t=1}^{T} p(\pi_t \mid x).$$

Frame-level outputs are assumed conditionally independent given $x$, a strong simplification: it makes each path's probability a product of per-frame terms, which is what allows the sum to be computed efficiently by dynamic programming. The training objective is $\mathcal{L} = -\log P(y \mid x)$.
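
The marginalisation can be checked directly for tiny inputs. The sketch below (function name and toy posteriors are illustrative) enumerates every path and sums those that collapse to $y$; it is exponential in $T$ and useful only as a reference:

```python
# Brute-force P(y | x): sum the probability of every path pi whose
# collapse B(pi) equals y. Feasible only for very small T and K.
from itertools import product
import numpy as np

def ctc_prob_bruteforce(probs, y, blank):
    """probs: (T, K) array of per-frame posteriors; blank: blank index."""
    T, K = probs.shape
    total = 0.0
    for path in product(range(K), repeat=T):            # all K^T paths
        deduped = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
        if [s for s in deduped if s != blank] == list(y):   # B(pi) == y
            total += np.prod([probs[t, path[t]] for t in range(T)])
    return total

probs = np.array([[0.6, 0.1, 0.3],   # p(. | x_t) for t = 1..4, blank index 2
                  [0.2, 0.5, 0.3],
                  [0.1, 0.7, 0.2],
                  [0.3, 0.3, 0.4]])
print(ctc_prob_bruteforce(probs, y=[0, 1], blank=2))
```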

Forward-backward. Direct enumeration is intractable ($(|\mathcal{V}|+1)^T$ paths). CTC uses dynamic programming on an extended label sequence $y' = \varnothing\,y_1\,\varnothing\,y_2\,\ldots\,y_U\,\varnothing$ of length $2U+1$. Define the forward variable $\alpha_t(s)$ as the total probability of all length-$t$ path prefixes that end at position $s$ of $y'$: $\alpha_t(s) = \sum_{\pi_{1:t}:\, \pi_t = y'_s,\; \mathcal{B}(\pi_{1:t}) = y_{1:\lfloor s/2 \rfloor}} \prod_{t'=1}^{t} p(\pi_{t'} \mid x)$. The recursion is

$$\alpha_t(s) = p(y'_s \mid x_t) \cdot \big[\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \mathbb{1}[y'_s \ne \varnothing,\, y'_s \ne y'_{s-2}]\,\alpha_{t-1}(s-2)\big],$$

with a symmetric backward variable $\beta_t(s)$ over suffixes, defined so that $\alpha_t(s)\,\beta_t(s)$ is the total probability of all valid paths passing through node $(t, s)$ of the lattice. Then $P(y \mid x) = \alpha_T(2U+1) + \alpha_T(2U)$, and the gradient with respect to the logit $u_k^t$ (the pre-softmax score of class $k$ at frame $t$) is

$$\frac{\partial \mathcal{L}}{\partial u_k^t} = p(k \mid x_t) - \frac{1}{P(y \mid x)} \sum_{s \in \text{lab}(y', k)} \alpha_t(s)\, \beta_t(s),$$

where $\text{lab}(y', k)$ denotes the set of positions in $y'$ that carry symbol $k$.

The recursion runs in $\mathcal{O}(TU)$ time.
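
A sketch of the forward pass in plain probability space; real implementations work in log space for numerical stability, and the names here are illustrative:

```python
# Forward recursion for P(y | x) in O(T * U) time.
import numpy as np

def ctc_forward(probs, y, blank):
    """probs: (T, K) per-frame posteriors; y: label sequence; blank: blank index."""
    T = probs.shape[0]
    y_ext = [blank]                        # y' = blank, y1, blank, ..., yU, blank
    for label in y:
        y_ext += [label, blank]
    S = len(y_ext)                         # 2U + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, y_ext[0]]       # start at the leading blank ...
    if S > 1:
        alpha[0, 1] = probs[0, y_ext[1]]   # ... or at the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                          # stay on y'_s
            if s >= 1:
                a += alpha[t - 1, s - 1]                 # advance from y'_{s-1}
            # skip from y'_{s-2}, unless y'_s is blank or repeats y'_{s-2}
            if s >= 2 and y_ext[s] != blank and y_ext[s] != y_ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, y_ext[s]]

    # Paths must end at the final blank (s = 2U+1) or the final label (s = 2U).
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```

For a quick check, `ctc_forward(probs, [0, 1], blank=2)` should match the brute-force sum above. In training one would call a library implementation such as `torch.nn.functional.ctc_loss`, which computes the same quantity in log space together with its gradient.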

Decoding. Greedy CTC decoding picks $\arg\max_k p(k \mid x_t)$ per frame and applies $\mathcal{B}$. Better results come from prefix-beam search that merges paths collapsing to the same prefix, optionally fused with an external language model via shallow fusion.
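
A sketch of the greedy decoder under the same illustrative conventions as above (prefix-beam search is substantially longer and omitted):

```python
# Greedy (best-path) CTC decoding: per-frame argmax, then apply B.
from itertools import groupby
import numpy as np

def greedy_decode(probs, blank):
    """probs: (T, K) per-frame posteriors; returns the collapsed label sequence."""
    best_path = probs.argmax(axis=1)                   # argmax_k p(k | x_t)
    deduped = [int(k) for k, _ in groupby(best_path)]  # merge consecutive repeats
    return [k for k in deduped if k != blank]          # then delete blanks
```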

Strengths and weaknesses. CTC is monotonic (no reordering) and streaming-friendly (forward-only encoders work). Its posteriors are characteristically peaky, with each label's probability mass concentrated in one or two frames surrounded by blanks, which simplifies decoding but makes frame-level timings unreliable. The conditional-independence assumption hurts it on tasks needing long-range output dependencies; the RNN-Transducer removes this assumption while remaining streamable, and attention-based encoder-decoders remove it at the cost of streaming.

Related terms: RNN-Transducer, Cross-Entropy Loss, LSTM, Whisper
