BERT's pre-training objective combines two losses:
Masked Language Modeling (MLM). Given an input sequence $x = (x_1, \ldots, x_n)$, randomly select 15% of the token positions to form a set $\mathcal{M}$. Of these, 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged. The model is trained to predict the original tokens at the selected positions:
$$\mathcal{L}_{\mathrm{MLM}} = -\mathbb{E}_{x, \mathcal{M}}\!\left[\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid \tilde x)\right]$$
where $\tilde x$ is the masked sequence. The model output at each masked position is a softmax over the vocabulary.
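As a concrete illustration, here is a minimal PyTorch sketch of the 80/10/10 corruption step. The function name, the `special_mask` argument (marking positions such as [CLS], [SEP], and padding that must never be selected), and the use of `-100` as the loss ignore index are illustrative assumptions, not the reference implementation:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_mask, mlm_prob=0.15):
    """BERT-style 80/10/10 corruption. `special_mask` is True at positions
    that must never be selected ([CLS], [SEP], padding)."""
    inputs = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of eligible positions as prediction targets.
    probs = torch.full(inputs.shape, mlm_prob)
    probs.masked_fill_(special_mask, 0.0)
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100  # positions ignored by the cross-entropy loss

    # 80% of selected positions are replaced with [MASK].
    masked = torch.bernoulli(torch.full(inputs.shape, 0.8)).bool() & selected
    inputs[masked] = mask_token_id

    # 10% (half of the remaining 20%) are replaced with a random token.
    randomised = torch.bernoulli(torch.full(inputs.shape, 0.5)).bool() & selected & ~masked
    inputs[randomised] = torch.randint(vocab_size, inputs.shape)[randomised]

    # The final 10% keep their original token.
    return inputs, labels
```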
Next Sentence Prediction (NSP). For sentence pairs $(A, B)$, predict whether $B$ follows $A$ in the original text or is a random sentence:
$$\mathcal{L}_{\mathrm{NSP}} = -\mathbb{E}_{A, B}\!\left[\log P_\theta(y \mid [\mathrm{CLS}], A, [\mathrm{SEP}], B)\right]$$
where $y \in \{\mathrm{IsNext}, \mathrm{NotNext}\}$ is the true label, predicted from the final [CLS] representation.
The combined loss is $\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}$. Subsequent work (RoBERTa) showed that NSP adds little and dropped it; modern BERT-style models are trained with MLM only.
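Both terms are standard cross-entropies and are simply summed. A minimal sketch of the combined objective (the label conventions, e.g. `-100` for unselected positions and `0` for IsNext, are assumptions for illustration):

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Combined pre-training objective: L = L_MLM + L_NSP.

    mlm_logits: (batch, seq_len, vocab_size); mlm_labels: (batch, seq_len),
    with -100 at positions not selected for prediction.
    nsp_logits: (batch, 2); nsp_labels: (batch,), 0 = IsNext, 1 = NotNext.
    """
    mlm_loss = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),  # flatten to (batch * seq_len, vocab)
        mlm_labels.view(-1),
        ignore_index=-100,
    )
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss
```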
Architecture details (collected in a configuration sketch after this list):
- BERT-base: 12 Transformer encoder layers, $d = 768$, 12 attention heads, 110M parameters.
- BERT-large: 24 layers, $d = 1024$, 16 heads, 340M parameters.
- WordPiece tokeniser with 30,522 subwords.
- Position embeddings are learned (modern descendants use RoPE or ALiBi).
- Special tokens: [CLS] (classification anchor at position 0) and [SEP] (segment separator).
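For reference, these hyperparameters can be assembled into a BERT-base model; a sketch assuming the Hugging Face `transformers` library (the feed-forward width of $4d = 3072$ and the 512-position limit are the original paper's values):

```python
from transformers import BertConfig, BertModel

# BERT-base hyperparameters from the list above.
config = BertConfig(
    vocab_size=30522,             # WordPiece vocabulary size
    hidden_size=768,              # d
    num_hidden_layers=12,         # Transformer encoder layers
    num_attention_heads=12,
    intermediate_size=3072,       # feed-forward width, 4 * d
    max_position_embeddings=512,  # learned position embeddings
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")  # ~110M
```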
Fine-tuning for downstream tasks adds a small task-specific head on top of the BERT representations and trains the entire model end-to-end with a task-specific loss (a minimal classification head is sketched after the list below):
- Classification: linear layer over the [CLS] representation, cross-entropy loss.
- Sequence labelling (NER, POS): linear layer per token, cross-entropy loss.
- Sentence-pair tasks (NLI, similarity): both sentences encoded together, classification head over [CLS].
- Question answering: predict the start and end positions of the answer span via two linear layers.
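A minimal sketch of the classification case, assuming PyTorch and the Hugging Face `transformers` library (the class name and structure are illustrative; the library's own `BertForSequenceClassification` wraps the same idea):

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertClassifier(nn.Module):
    """Linear classification head over the [CLS] representation."""

    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] sits at position 0
        logits = self.head(cls)
        if labels is None:
            return logits
        return F.cross_entropy(logits, labels), logits
```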
The pre-train-then-fine-tune paradigm BERT established was the dominant NLP recipe from 2018 to ~2022, when zero-shot and few-shot prompting of large autoregressive models began to displace fine-tuning for many tasks.
Related terms: BERT, Jacob Devlin, Transformer, Cross-Entropy Loss
Discussed in:
- Chapter 13: Attention and Transformers