BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova at Google in 2018, is an encoder-only Transformer pre-trained with two unsupervised objectives:
- Masked Language Modeling (MLM): randomly mask 15% of the input tokens and train the model to predict them from the surrounding context. Critically, attention is bidirectional: each masked position sees both left and right context.
- Next Sentence Prediction (NSP): given two sentences, predict whether the second follows the first in the original text. (Later models, including RoBERTa, dropped NSP after finding that it added little value.)
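The two objectives can be sketched in a few lines of Python. This is a minimal illustration, not the original training code: the function names, the toy vocabulary and the whitespace-level "tokens" are all assumptions made for the example. The 80/10/10 split (replace with [MASK], replace with a random token, leave unchanged) follows the recipe described in the BERT paper; it is a detail not stated above.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels); labels[i] is None where no prediction is needed."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with prob. 15%.
        if tok in (CLS, SEP) or rng.random() >= select_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

def make_nsp_example(sent_a, sent_b, is_next):
    """Pack a sentence pair as [CLS] A [SEP] B [SEP] with segment ids and an NSP label."""
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, int(is_next)

# Toy usage (hypothetical vocabulary and sentences).
rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens, segment_ids, nsp_label = make_nsp_example(
    ["the", "cat", "sat"], ["on", "the", "mat"], is_next=True
)
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted, labels, nsp_label)
```

Positions where `labels` is `None` contribute nothing to the MLM loss; only the selected positions are predicted.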
The pre-trained BERT can be fine-tuned for classification, sequence-labelling and similar tasks by adding a small task-specific head and continuing training on labelled data. The pre-train-then-fine-tune paradigm BERT demonstrated became the template for an entire generation of NLP models. On its release in 2018, BERT set new state-of-the-art results on eleven NLP tasks, including the GLUE benchmark and SQuAD question answering, often by a substantial margin.
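To make "a small task-specific head" concrete, here is a sketch of a classification head in plain Python: a single linear layer plus softmax applied to the encoder's output vector for the [CLS] position. The class name, the random stand-in for the encoder output and the initialisation scale are assumptions for illustration; in real fine-tuning the head and the encoder weights are both updated by gradient descent, which is omitted here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class ClassificationHead:
    """Linear layer + softmax over the encoder's [CLS] vector (illustrative only)."""

    def __init__(self, hidden_size, num_labels, rng):
        # Small random weights; the 0.02 scale is an assumption for this sketch.
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(num_labels)
        ]
        self.bias = [0.0] * num_labels

    def __call__(self, cls_vector):
        logits = [
            sum(w * x for w, x in zip(row, cls_vector)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        return softmax(logits)

def cross_entropy(probs, gold_label):
    """Loss minimised during fine-tuning for a single labelled example."""
    return -math.log(probs[gold_label])

# Toy usage: a random vector stands in for the encoder's [CLS] output.
rng = random.Random(0)
hidden_size, num_labels = 768, 3
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
head = ClassificationHead(hidden_size, num_labels, rng)
probs = head(cls_vector)
loss = cross_entropy(probs, gold_label=1)
print(probs, loss)
```

The head adds only `hidden_size * num_labels + num_labels` parameters, which is tiny next to the encoder itself.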
BERT comes in two sizes: BERT-base has 12 layers, a hidden size of 768, 12 attention heads and ~110M parameters; BERT-large has 24 layers, a hidden size of 1024, 16 attention heads and ~340M parameters. Text is tokenised into subwords with WordPiece. The special [CLS] token at the start of every sequence serves as a classification anchor; [SEP] separates the two sentences of a pair, as in the NSP objective.
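The parameter counts can be checked with a little arithmetic. The sketch below assumes values that are not stated above: a 30,522-entry WordPiece vocabulary (as in the released uncased English checkpoints), 512 position embeddings, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooling layer on top of [CLS]. It leaves out the pre-training output heads. The function name is made up for the example.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, segment_types=2):
    """Approximate encoder parameter count under the assumptions stated in the text."""
    ffn = 4 * hidden          # feed-forward inner width (assumed 4x hidden)
    layer_norm = 2 * hidden   # scale and shift vectors

    # Token, position and segment embedding tables, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm

    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base:,}")   # 109,482,240
print(f"BERT-large: {large:,}")  # 335,141,888
```

Note that the number of attention heads does not appear: splitting the hidden dimension into 12 or 16 heads changes how the projections are used, not how many weights they contain. Under these assumptions the totals land close to the ~110M and ~340M figures usually quoted.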
BERT remains widely used for embedding, retrieval and classification tasks, particularly through the Sentence-BERT variant and successors such as E5 and BGE. As autoregressive decoder models (GPT) came to dominate generation tasks, BERT-style encoders became the natural choice for representation learning, with the two architectural traditions complementing rather than competing.
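For embedding and retrieval, an encoder's per-token outputs are typically collapsed into one vector per text and compared by cosine similarity. The sketch below shows mean pooling over non-padding tokens, a pooling strategy commonly used with Sentence-BERT-style models. The tiny 2-dimensional vectors are invented stand-ins for real encoder outputs, and the function names are assumptions for the example.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy usage: the last vector of the query is padding and is masked out.
query_tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
doc_tokens = [[2.0, 3.0], [2.0, 3.0]]
doc_mask = [1, 1]

query_vec = mean_pool(query_tokens, query_mask)
doc_vec = mean_pool(doc_tokens, doc_mask)
print(query_vec, doc_vec, cosine_similarity(query_vec, doc_vec))
```

In a retrieval setting, document vectors would be computed once and stored, and each incoming query vector would be scored against them in this way.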
Related terms: Jacob Devlin, Transformer, GPT
Discussed in:
- Chapter 13: Attention & Transformers, Attention and Transformers