The Conformer (Gulati et al., Conformer: Convolution-augmented Transformer for Speech Recognition, Interspeech 2020) is the dominant encoder for modern automatic speech recognition. It hybridises self-attention (good at modelling global context, e.g. long-range coarticulation) with depthwise convolution (good at modelling local patterns, e.g. formant transitions) inside a single block, achieving state-of-the-art WER on LibriSpeech with far fewer parameters than pure Transformer baselines.
Block structure. A Conformer block applies four sub-modules in a Macaron arrangement:
$$\tilde{x} = x + \tfrac{1}{2}\,\text{FFN}(x), \quad x' = \tilde{x} + \text{MHSA}(\tilde{x}),$$ $$x'' = x' + \text{Conv}(x'), \quad y = \text{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\text{FFN}(x'')\right).$$
The two half-step feed-forward networks bracket the attention and convolution, following Macaron-Net. Each FFN applies pre-LayerNorm, a linear expansion to $4\,d_{\text{model}}$, Swish activation, dropout, and a linear projection back to $d_{\text{model}}$.
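The residual wiring above can be sketched in NumPy. This is a toy single-head, dropout-free sketch with illustrative sizes and weight names (all assumptions, not the paper's implementation); the convolution sub-module is reduced to its depthwise core here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # toy frame count and model dim; Conformer-L uses d_model = 512

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))

def ffn(x, w1, w2):
    # pre-LayerNorm feed-forward: d -> 4d -> d with Swish (dropout omitted)
    return swish(layer_norm(x) @ w1) @ w2

def mhsa(x, wq, wk, wv):
    # single-head stand-in for multi-head attention (relative positions omitted)
    q, k, v = layer_norm(x) @ wq, layer_norm(x) @ wk, layer_norm(x) @ wv
    z = q @ k.T / np.sqrt(d)
    a = np.exp(z - z.max(-1, keepdims=True))
    return (a / a.sum(-1, keepdims=True)) @ v

def conv(x, dw, k=31):
    # depthwise conv along time as a stand-in for the full convolution module
    xp = np.pad(layer_norm(x), ((k // 2, k // 2), (0, 0)))
    return np.stack([np.convolve(xp[:, c], dw[:, c], mode="valid")
                     for c in range(x.shape[1])], axis=1)

# toy weights
w1a, w2a = rng.normal(0, .1, (d, 4 * d)), rng.normal(0, .1, (4 * d, d))
w1b, w2b = rng.normal(0, .1, (d, 4 * d)), rng.normal(0, .1, (4 * d, d))
wq, wk, wv = rng.normal(0, .1, (3, d, d))
dw = rng.normal(0, .1, (31, d))  # kernel size 31, one filter per channel

x = rng.normal(size=(T, d))
x = x + 0.5 * ffn(x, w1a, w2a)   # first half-step FFN
x = x + mhsa(x, wq, wk, wv)      # self-attention sub-module
x = x + conv(x, dw)              # convolution sub-module
y = layer_norm(x + 0.5 * ffn(x, w1b, w2b))  # second half-step FFN + final LayerNorm
print(y.shape)  # (6, 8)
```

Note that every sub-module sits on a residual branch, so the block preserves the $(T, d_{\text{model}})$ shape throughout.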
Multi-head self-attention. Standard scaled dot-product attention with relative positional encoding (Shaw et al.; Transformer-XL style), critical because absolute position embeddings generalise poorly across utterance lengths in speech.
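The relative-position idea can be illustrated with a deliberately simplified variant: one learned scalar bias per relative offset $j - i$, added to the attention logits. This is simpler than the Transformer-XL-style decomposition the paper actually uses, and all sizes and names below are illustrative, but it shows why the scores depend only on relative distance and so transfer across utterance lengths.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 5, 4  # toy sizes

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

q, k, v = rng.normal(size=(3, T, d))
rel_bias = rng.normal(size=2 * T - 1)  # one learned scalar per offset j - i

# offsets[i, j] = (j - i) shifted into the index range [0, 2T - 2]
offsets = np.arange(T)[None, :] - np.arange(T)[:, None] + (T - 1)
logits = q @ k.T / np.sqrt(d) + rel_bias[offsets]  # content score + position score
out = softmax(logits) @ v
print(out.shape)  # (5, 4)
```

Because `rel_bias` is indexed only by $j - i$, the same parameters apply unchanged to sequences of any length up to the maximum offset, unlike an absolute position embedding table.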
Convolution module. The most novel component:
- LayerNorm.
- Pointwise 1-D conv $\to 2 d_{\text{model}}$ channels followed by GLU (gated linear unit), halving back to $d_{\text{model}}$.
- Depthwise 1-D conv with kernel size 31 (LibriSpeech), operates per-channel along time, capturing local temporal structure with $\mathcal{O}(d_{\text{model}} \cdot k)$ parameters rather than $\mathcal{O}(d_{\text{model}}^2 \cdot k)$.
- BatchNorm + Swish.
- Pointwise 1-D conv back to $d_{\text{model}}$.
- Dropout.
Depthwise separable convolution is the parameter-efficient factorisation popularised by MobileNet; it lets the Conformer add local modelling at minimal cost.
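The bullet sequence above can be sketched end-to-end in NumPy. As labeled assumptions for the sketch: BatchNorm is replaced by LayerNorm to keep it stateless, dropout is omitted, and the pointwise convolutions are written as per-frame matrix multiplies (which is exactly what a kernel-size-1 convolution is); sizes and weight names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, k = 10, 8, 5  # toy sizes; the LibriSpeech models use d_model = 512, k = 31

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def swish(x):
    return x / (1.0 + np.exp(-x))

def glu(x):
    # gated linear unit: split channels in half, gate one half with the other
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def conv_module(x, w_in, dw, w_out):
    h = layer_norm(x) @ w_in            # pointwise conv: d -> 2d channels
    h = glu(h)                          # GLU halves back to d
    hp = np.pad(h, ((k // 2, k // 2), (0, 0)))
    h = np.stack([np.convolve(hp[:, c], dw[:, c], mode="valid")
                  for c in range(d)], axis=1)  # depthwise conv along time
    h = swish(layer_norm(h))            # BatchNorm in the paper; LayerNorm here
    return h @ w_out                    # pointwise conv back to d

w_in = rng.normal(0, .1, (d, 2 * d))
dw = rng.normal(0, .1, (k, d))          # one length-k filter per channel
w_out = rng.normal(0, .1, (d, d))
x = rng.normal(size=(T, d))
out = conv_module(x, w_in, dw, w_out)
print(out.shape)  # (10, 8)
```

The parameter saving is visible directly: the depthwise stage costs $d \cdot k$ weights (`dw`), whereas a full time convolution mixing all channel pairs would cost $d^2 \cdot k$, a factor of $d$ more.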
Front-end. A two-layer convolutional subsampling reduces the 10 ms-frame mel-spectrogram input by $4\times$ along time (typical: kernels 3, stride 2, twice). SpecAugment (frequency and time masking) is the standard regulariser.
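The $4\times$ reduction follows from applying the standard convolution output-length formula twice; a quick check, with an illustrative 10-second utterance and no padding assumed:

```python
def conv_out_len(n, kernel=3, stride=2, padding=0):
    # standard 1-D convolution output-length formula
    return (n + 2 * padding - kernel) // stride + 1

frames = 1000                         # 10 s of audio at one frame per 10 ms
after_one = conv_out_len(frames)      # first stride-2 conv
after_two = conv_out_len(after_one)   # second stride-2 conv
print(frames, after_one, after_two)   # 1000 499 249
```

Each block therefore attends over roughly a quarter as many positions as there are input frames, which also cuts the quadratic attention cost by about $16\times$.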
Results. Conformer-L (118 M parameters) achieves 2.1% / 4.3% WER on LibriSpeech test-clean / test-other without an external language model, and 1.9% / 3.9% with an external language model, surpassing prior Transformers and ContextNet. The medium and small variants (30 M and 10 M) trade WER for latency on mobile devices.
Variants. Conformer-Transducer (a Conformer encoder trained with the RNN-T loss) is the de facto streaming ASR architecture. Squeezeformer (Kim et al., 2022) simplifies the block ordering and adds a temporal U-Net that downsamples the sequence mid-network and restores it before the output. Efficient Conformer uses progressive downsampling. The original block has also been adopted beyond ASR, for keyword spotting, speech enhancement, and music transcription.
Related terms: Transformer, Attention Mechanism, Convolution, Convolutional Neural Network, RNN-Transducer, wav2vec 2.0
Discussed in:
- Chapter 12: Sequence Models, Speech Recognition