A Transformer Encoder is a stack of identical layers that maps an input sequence to a sequence of contextualised representations. Each layer contains two sub-layers: multi-head self-attention (allowing every position to attend to every other position) and a position-wise feed-forward network (a two-layer MLP applied independently to each position). Each sub-layer is wrapped in a residual connection and followed by layer normalisation.
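The two sub-layers described above, each with a residual connection and layer normalisation, can be sketched in plain NumPy. This is a minimal illustration, not a framework implementation; the function names, parameter dictionary, and shapes (`d`, `d_ff`, `n_heads`) are chosen here for clarity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d = x.shape
    dh = d // n_heads
    # Project to queries/keys/values and split into heads: (n_heads, seq, dh).
    q = (x @ Wq).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention; no causal mask, so attention is bidirectional.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ v
    # Merge heads back to (seq, d) and apply the output projection.
    return out.transpose(1, 0, 2).reshape(seq, d) @ Wo

def encoder_layer(x, p, n_heads=2):
    # Sub-layer 1: multi-head self-attention + residual + layer norm.
    x = layer_norm(x + multi_head_self_attention(x, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads))
    # Sub-layer 2: position-wise two-layer MLP (ReLU) + residual + layer norm.
    f = np.maximum(0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(x + f)

rng = np.random.default_rng(0)
d, d_ff, seq = 8, 16, 5
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, d_ff), "b1": (d_ff,), "W2": (d_ff, d), "b2": (d,)}
params = {k: rng.standard_normal(s) * 0.1 for k, s in shapes.items()}
x = rng.standard_normal((seq, d))
y = encoder_layer(x, params)
print(y.shape)  # (5, 8): same sequence length, contextualised representations
```

Note that the feed-forward network acts on each position independently, so only the attention sub-layer mixes information across positions.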
The encoder is bidirectional: no causal mask restricts attention, so each position's representation draws on context from both directions. This makes the encoder suitable for tasks where the entire input is available at once and bidirectional context is useful: classification, named entity recognition, question answering, and representation learning. BERT is the canonical encoder-only model, pretrained with masked language modelling, in which a fraction of input tokens are randomly masked and the model learns to predict them from the surrounding bidirectional context.
Transformer encoders appear in many modern architectures beyond pure NLP. Vision Transformers apply the encoder to sequences of image patches. The encoder in encoder-decoder models (T5, BART, translation transformers) processes the source sequence before the decoder generates the target. Most text embedding models (Sentence-BERT, E5, BGE) are encoder-based. The encoder's parallel processing and global receptive field make it efficient and powerful for any task where the goal is to produce rich representations from a variable-length input.
Related terms: Transformer, BERT, Self-Attention, Multi-Head Attention
Discussed in:
- Chapter 13: Attention & Transformers — The Transformer
Also defined in: Textbook of AI