BERT's pre-training objective combines two losses:
Masked Language Modeling (MLM). Given an input sequence $x = (x_1, \ldots, x_n)$, randomly select 15% of the token positions to form a set $\mathcal{M}$. Of these, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. The model is trained to predict the original tokens at the selected positions:
$$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{x, \mathcal{M}}\!\left[\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid \tilde x)\right]$$
where $\tilde x$ is the masked sequence. The model output at each masked position is a softmax over the vocabulary.
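As a concrete illustration, here is a minimal PyTorch sketch of the 80/10/10 corruption step. The function name, the `special_mask` argument (marking positions such as [CLS], [SEP], and padding that must never be selected), and the use of `-100` as the loss ignore index are illustrative assumptions, not the reference implementation:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_mask, mlm_prob=0.15):
    """BERT-style 80/10/10 corruption. `special_mask` is True at positions
    that must never be selected ([CLS], [SEP], padding)."""
    inputs = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of eligible positions as prediction targets.
    probs = torch.full(inputs.shape, mlm_prob)
    probs.masked_fill_(special_mask, 0.0)
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100  # positions ignored by the cross-entropy loss

    # 80% of selected positions are replaced with [MASK].
    masked = torch.bernoulli(torch.full(inputs.shape, 0.8)).bool() & selected
    inputs[masked] = mask_token_id

    # 10% (half of the remaining 20%) are replaced with a random token.
    randomised = torch.bernoulli(torch.full(inputs.shape, 0.5)).bool() & selected & ~masked
    inputs[randomised] = torch.randint(vocab_size, inputs.shape)[randomised]

    # The final 10% keep their original token.
    return inputs, labels
```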
Next Sentence Prediction (NSP). For sentence pairs $(A, B)$, predict whether $B$ follows $A$ in the original text or is a random sentence:
$$\mathcal{L}_{\mathrm{NSP}} = -\mathbb{E}_{A, B}\!\left[\log P_\theta(y \mid [\mathrm{CLS}], A, [\mathrm{SEP}], B)\right]$$
where $y \in \{\mathrm{IsNext}, \mathrm{NotNext}\}$ is the true label, predicted from the final [CLS] representation.
The combined loss is $\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}$. Subsequent work (RoBERTa) showed that NSP adds little and dropped it; modern BERT-style models are trained with MLM only.
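Both terms are standard cross-entropies and are simply summed. A minimal sketch of the combined objective (the label conventions, e.g. `-100` for unselected positions and `0` for IsNext, are assumptions for illustration):

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Combined pre-training objective: L = L_MLM + L_NSP.

    mlm_logits: (batch, seq_len, vocab_size); mlm_labels: (batch, seq_len),
    with -100 at positions not selected for prediction.
    nsp_logits: (batch, 2); nsp_labels: (batch,), 0 = IsNext, 1 = NotNext.
    """
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),  # flatten to (batch * seq_len, vocab)
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss
```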
Architecture details (collected in a configuration sketch after this list):
- BERT-base: 12 Transformer encoder layers, $d = 768$, 12 attention heads, 110M parameters.
- BERT-large: 24 layers, $d = 1024$, 16 heads, 340M parameters.
- WordPiece tokeniser with 30,522 subwords.
- Position embeddings are learned (modern descendants use RoPE or ALiBi).
- Special tokens: [CLS] (classification anchor at position 0) and [SEP] (segment separator).
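For reference, these hyperparameters can be assembled into a BERT-base model; a sketch assuming the Hugging Face `transformers` library (the feed-forward width of $4d = 3072$ and the 512-position limit are the original paper's values):

```python
from transformers import BertConfig, BertModel

# BERT-base hyperparameters from the list above.
config = BertConfig(
    vocab_size=30522,             # WordPiece vocabulary size
    hidden_size=768,              # d
    num_hidden_layers=12,         # Transformer encoder layers
    num_attention_heads=12,
    intermediate_size=3072,       # feed-forward width, 4 * d
    max_position_embeddings=512,  # learned position embeddings
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # ~110M
```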
Fine-tuning for downstream tasks adds a small task-specific head on top of the BERT representations and trains the entire model end-to-end with a task-specific loss (a minimal classification head is sketched after the list below):
- Classification: linear layer over the [CLS] representation, cross-entropy loss.
- Sequence labelling (NER, POS): linear layer per token, cross-entropy loss.
- Sentence-pair tasks (NLI, similarity): both sentences encoded together, classification head over [CLS].
- Question answering: predict the start and end positions of the answer span via two linear layers.
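A minimal sketch of the classification case, assuming PyTorch and the Hugging Face `transformers` library (the class name and structure are illustrative; the library's own `BertForSequenceClassification` wraps the same idea):

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertClassifier(nn.Module):
    """Linear classification head over the [CLS] representation."""

    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] sits at position 0
        logits = self.head(cls)
        if labels is None:
            return logits
        return F.cross_entropy(logits, labels), logits
```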
The pre-train-then-fine-tune paradigm BERT established was the dominant NLP recipe from 2018 to ~2022, when zero-shot and few-shot prompting of large autoregressive models began to displace fine-tuning for many tasks.
Related terms: BERT, Jacob Devlin, Transformer, Cross-Entropy Loss
Discussed in:
- Chapter 13: Attention and Transformers