13.11 BERT: masked LM pretraining and fine-tuning

In October 2018, a small team at Google Research released a paper with a deliberately understated title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019). The paper announced Bidirectional Encoder Representations from Transformers, and within weeks it had broken records on every major natural language benchmark. The lesson it taught the field was simple, and once stated it sounds almost obvious. If you take the encoder half of the Transformer from §13.10, train it on enormous quantities of plain text using a single self-supervised objective, and then fine-tune it briefly on whatever downstream task you care about, you will outperform every bespoke task-specific architecture that came before. There is no longer any need to hand-design a parser, a sentiment classifier, or a question-answering system. One pretrained encoder, one short fine-tuning run, one new state-of-the-art result.

This was the moment NLP collapsed into transfer learning. Computer vision had already learnt this trick in 2012 with ImageNet pretraining for convolutional networks; now language had its equivalent. A few months before BERT, OpenAI had released the original GPT, which used the decoder half of the Transformer with a unidirectional objective; we cover GPT in §13.12. BERT is the bidirectional counterpart, and in 2026 it remains the dominant choice for retrieval, reranking, sentence embedding, and cheap classification, even as decoder-only models have eclipsed it for general capability. This section covers what BERT is, how it is pretrained, how it is fine-tuned, and where its descendants live in the modern stack.

Symbols Used Here
  • $\mathbf{x} = (x_1, \ldots, x_T)$ : input tokens (WordPiece sub-words)
  • $\mathbf{h}_t \in \mathbb{R}^{d}$ : contextual embedding at position $t$
  • $d$ : model dimension (768 for BERT-base, 1024 for BERT-large)
  • $\texttt{[CLS]}, \texttt{[SEP]}, \texttt{[MASK]}$ : special tokens
  • $\mathcal{M} \subset \{1, \ldots, T\}$ : masked positions, $|\mathcal{M}| \approx 0.15 T$

Pretraining tasks

BERT is trained jointly on two self-supervised tasks. Both run on the same forward pass, so the encoder learns to satisfy both at once.

Masked language modelling (MLM). Take a sentence. Sample 15 percent of its tokens uniformly at random. For each sampled token, decide what the network actually sees:

  • 80 percent of the time, replace the token with the special symbol [MASK].
  • 10 percent of the time, replace it with a random other token from the vocabulary.
  • 10 percent of the time, leave it unchanged.

The model's job is to predict the original identity of every sampled token, regardless of what was substituted at the input. The loss is the standard cross-entropy over the vocabulary, summed across $\mathcal{M}$. Why mask? A plain auto-encoder, where the input equals the target, has a trivial solution: copy the input. Hiding 15 percent of the tokens forces the network to use the surrounding context, both to the left and to the right, to fill the gaps. That is precisely the bidirectional capability the encoder was built to provide. The 80/10/10 split is not arbitrary either. If only [MASK] were ever used, the model would learn to ignore non-masked positions entirely, since they would never be queried; and at fine-tuning time, where [MASK] does not appear, performance would collapse. The random-token substitution and the leave-unchanged cases force every position to remain potentially predictive, smoothing the gap between pretraining and fine-tuning distributions.
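
A minimal sketch of the corruption step in PyTorch, assuming the token ids, the mask-token id, and the vocabulary size come from whatever tokenizer the pipeline uses; the logic mirrors the 80/10/10 split described above:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption for a batch of token ids.

    Returns (corrupted_ids, labels); labels are -100 at positions that were
    not selected, so they are ignored by the cross-entropy loss.
    """
    labels = input_ids.clone()

    # Select ~15% of positions to predict. (A real pipeline also excludes
    # [CLS], [SEP], and padding positions from selection.)
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~selected] = -100

    corrupted = input_ids.clone()

    # 80% of selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    corrupted[to_mask] = mask_token_id

    # 10% of selected positions -> a random vocabulary token (half of the remaining 20%)
    to_randomise = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                    & selected & ~to_mask)
    corrupted[to_randomise] = torch.randint(vocab_size, input_ids.shape)[to_randomise]

    # The final 10% are selected but left unchanged.
    return corrupted, labels
```

The $-100$ label convention matches the usual PyTorch cross-entropy `ignore_index`, so only the selected positions contribute to the loss.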

Next sentence prediction (NSP). Each training example is a pair of segments $A$ and $B$, separated by [SEP]. With probability one half, $B$ is the actual sentence that follows $A$ in the corpus; with probability one half, $B$ is a random sentence drawn from elsewhere in the corpus. A binary classifier on top of the [CLS] embedding predicts which case it is, and the loss is binary cross-entropy. The two losses, MLM and NSP, are summed and optimised jointly.

The intent of NSP was to teach BERT relations between sentences, on the grounds that question answering and natural language inference both need them, and a single sentence's masked tokens give no signal about discourse structure. In practice, NSP turned out to be too easy: random sentences usually come from a different document and therefore differ in topic, so the model can solve the binary task from surface vocabulary alone without learning anything about discourse coherence. RoBERTa (Liu et al., 2019) removed NSP entirely a year later, trained for longer with larger batches on more data, and improved on BERT across the board. Most subsequent encoders followed suit. The historical record of NSP is therefore mixed: it was in the original recipe, and BERT-base is normally trained with it, but it is not load-bearing, and modern descendants either drop it or replace it with stronger sentence-level objectives such as sentence-order prediction (ALBERT) or a pure document-level MLM with longer sequences.
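
A sketch of how the pretraining pairs described above are assembled; `corpus` here is assumed to be a list of documents, each a list of sentences, and each document is assumed to contain at least two sentences:

```python
import random

def make_nsp_example(corpus):
    """Build one (segment_a, segment_b, is_next) pair for NSP."""
    doc = random.choice(corpus)
    i = random.randrange(len(doc) - 1)
    segment_a = doc[i]
    if random.random() < 0.5:
        segment_b, is_next = doc[i + 1], 1        # genuine next sentence
    else:
        other = random.choice(corpus)             # random sentence from elsewhere
        segment_b, is_next = random.choice(other), 0
    return segment_a, segment_b, is_next

# Both objectives share one forward pass over [CLS] A [SEP] B [SEP]:
#   loss = mlm_cross_entropy + nsp_binary_cross_entropy
```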

Architecture

BERT is the encoder-only Transformer of §13.7, with no decoder, no causal mask, and full bidirectional self-attention at every layer. There are two standard sizes. BERT-base has 12 layers, 12 attention heads, model dimension $d = 768$, feed-forward dimension 3072, and approximately 110 million parameters. BERT-large has 24 layers, 16 heads, $d = 1024$, feed-forward dimension 4096, and roughly 340 million parameters. Both use GELU activations rather than ReLU, learned positional embeddings rather than sinusoids, and a maximum sequence length of 512 tokens.

Tokenisation is WordPiece, a sub-word scheme that splits rare words into common fragments (so unbelievable might become un, ##believ, ##able) and keeps the vocabulary fixed at around 30 000 entries. Two special tokens punctuate every input. [CLS] is prepended to every sequence, and its final-layer embedding is the canonical sequence-level representation: classification heads sit on top of it. [SEP] separates segments in sentence-pair inputs and also terminates the sequence. A learned segment embedding ($A$ or $B$) is added to each token's positional and word embeddings, allowing the model to distinguish the two halves of a pair.
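
The packing is easy to inspect with the Hugging Face tokenizer; the checkpoint name below is the standard public release, and the exact sub-word splits depend on the vocabulary:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare words break into '##'-prefixed pieces; common words may stay whole.
print(tok.tokenize("unbelievability"))

# Sentence pair: [CLS] A ... [SEP] B ... [SEP], with segment ids 0 and 1.
enc = tok("The cat sat on the mat.", "It was asleep.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # 0s for [CLS] + segment A + first [SEP], 1s for segment B + final [SEP]
```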

Pretraining ran on the concatenation of English Wikipedia (about 2.5 billion words, with lists, tables and headers stripped out) and BookCorpus (about 0.8 billion words, drawn from roughly 11 000 unpublished novels), giving a corpus of roughly 3.3 billion tokens. BERT-base was trained for 1 million optimiser steps with batch size 256 sequences of length 512, using Adam with a peak learning rate of $10^{-4}$ and a linear warm-up for the first 10 000 steps. The published recipe took four days on 16 TPU chips for BERT-base and four days on 64 TPU chips for BERT-large. By the standards of 2026 this is a small training run; by the standards of 2018 it was substantial but accessible, and the public release of pretrained weights meant that thousands of laboratories could fine-tune without ever paying the pretraining cost.
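
The learning-rate schedule is simple enough to write down. A sketch of the usual linear-warm-up, linear-decay shape (the decay details in the released code differ slightly, but the warm-up is as published):

```python
def bert_lr(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warm-up to peak_lr, then linear decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```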

Fine-tuning

Fine-tuning is the half of BERT that made it useful. The recipe is short. Take the pretrained encoder. Discard the masked-LM head used during pretraining. Attach a new task-specific head, usually a single linear layer. Train the entire model, encoder plus head, end-to-end on the labelled task data, with a small learning rate (typically between $2 \times 10^{-5}$ and $5 \times 10^{-5}$), for two to four epochs. Use batch size 16 or 32. That is the whole procedure.
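
A minimal sketch of that recipe with the Hugging Face `transformers` library; the two toy reviews stand in for the real labelled data, and `BertForSequenceClassification` already attaches the linear head on [CLS]:

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy stand-in for a labelled dataset.
texts = ["a wonderful, moving film", "tedious and badly acted"]
labels = torch.tensor([1, 0])
batch = tok(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

model.train()
for epoch in range(3):                       # two to four epochs is the usual range
    out = model(**batch, labels=labels)      # linear head on [CLS]; cross-entropy computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```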

The shape of the head depends on the task.

  • Sentence classification (sentiment, topic, natural language inference). Feed the input through BERT, take the final-layer embedding of [CLS], and pass it through a linear layer to the class logits. For pair inputs (premise and hypothesis, or question and passage), pack both segments into one sequence with [SEP] between them and let the model handle the rest.
  • Question answering (extractive, SQuAD-style). The question and the passage are concatenated with [SEP]. Two linear heads produce, for each token, the probability that it is the start of the answer span and the probability that it is the end. At inference, choose the start–end pair with highest combined log-probability under the constraint that end follows start (see the span-selection sketch after this list).
  • Named entity recognition and other sequence labelling. A linear classifier sits on every token's final-layer embedding, predicting one of the entity tags (B-PER, I-PER, B-ORG, O, etc.). Loss is summed across positions.
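
For the span-extraction case the decoding step is worth seeing concretely: given per-token start and end logits, pick the pair $(i, j)$ with $j \ge i$ that maximises the summed score. A minimal sketch, with random logits standing in for the outputs of the two linear heads:

```python
import torch

T = 20                              # passage length in tokens (stand-in value)
start_logits = torch.randn(T)       # would come from the two linear heads on BERT's outputs
end_logits = torch.randn(T)

# Score every candidate span; scores[i, j] = start_logits[i] + end_logits[j].
scores = start_logits[:, None] + end_logits[None, :]

# Forbid spans whose end precedes their start (production systems also cap span length).
invalid = torch.tril(torch.ones(T, T), diagonal=-1).bool()
scores[invalid] = float("-inf")

best = torch.argmax(scores)
start, end = divmod(best.item(), T)
print(f"predicted answer span: tokens {start}..{end}")
```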

Within months of the BERT release, every major NLP benchmark (GLUE, SQuAD, CoNLL-NER, RACE) was topped by a fine-tuned BERT variant, often by several points. The pretrain-then-fine-tune workflow became the default discipline for the next four years and remains the default for tasks where you have labelled data and care about cost.

Worked example

Consider sentiment classification on the IMDB movie review dataset: 25 000 training reviews, 25 000 test reviews, two classes (positive or negative). Before BERT, the strongest published from-scratch architectures, bidirectional LSTMs with attention, trained end-to-end on the labels, reached around 85 to 88 percent test accuracy. Fine-tuning BERT-base on the same training set is almost embarrassing in its simplicity: tokenise each review, truncate to 512 tokens, prepend [CLS], run through the encoder, take the [CLS] embedding, pass it through a single linear layer to two logits, train for three epochs at learning rate $2 \times 10^{-5}$. Test accuracy: around 94 percent. BERT-large pushes this to about 95 percent. The pretraining buys you something like six to nine percentage points on a task it never saw, learnt purely from filling in masked tokens on Wikipedia and novels. That gap, replicated across dozens of benchmarks, is what pretraining-as-default looks like.
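
The data-preparation half of that recipe, sketched with the Hugging Face `datasets` library (the column names follow the public IMDB dataset card; the fine-tuning loop itself is the one sketched in the previous subsection):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
imdb = load_dataset("imdb")        # 25 000 train / 25 000 test; label 0 = negative, 1 = positive

def encode(batch):
    # [CLS] and [SEP] are added automatically; reviews beyond 512 tokens are truncated.
    return tok(batch["text"], truncation=True, max_length=512)

imdb = imdb.map(encode, batched=True)
print(imdb["train"][0].keys())     # text, label, input_ids, token_type_ids, attention_mask
```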

Where in 2026

The headline-grabbing models today are decoder-only: GPT-4, Claude, Gemini, DeepSeek-V3, Llama 3. BERT lost the capability race because masked LM is less sample-efficient per token than autoregressive LM (BERT predicts only 15 percent of tokens per pass; GPT predicts every one), because you cannot easily generate from BERT, and because scaling decoder-only models turned out to unlock in-context learning in a way that scaling encoders did not. For raw question-answering and reasoning, you would not reach for BERT now.

But the encoder-only family did not die. It moved into the parts of the stack where bidirectional, fixed-vector representations are exactly what you want, and where running a 70-billion-parameter decoder for every query would be absurd. Three roles dominate.

Retrieval and embedding. Sentence-BERT (Reimers and Gurevych, 2019) fine-tunes BERT with a Siamese architecture and a contrastive loss so that semantically similar sentences produce embeddings with high cosine similarity. The descendants of SBERT (MPNet, E5, BGE, GTE, Voyage, the Cohere embedding family, the OpenAI text-embedding-3 models) power almost every production retrieval-augmented-generation system, every vector database query, every "find similar document" feature. They are typically 100 million to 1 billion parameters, trained on hundreds of millions of weakly-supervised pairs.
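
A typical usage sketch with the `sentence-transformers` package; the checkpoint named here is one widely used public model, not the only option:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")    # a small 6-layer SBERT-family encoder

docs = ["The cat sat on the mat.",
        "A feline was resting on the rug.",
        "Quarterly revenue rose by eight percent."]
emb = model.encode(docs, normalize_embeddings=True)

# Cosine similarity of the first sentence against the other two:
# the paraphrase scores high, the unrelated sentence low.
print(util.cos_sim(emb[0], emb[1:]))
```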

Reranking. A cross-encoder version of BERT, where query and candidate document are concatenated and fed through one forward pass to produce a single relevance score, is the standard reranker placed in front of generation in modern RAG pipelines. It is more expensive than a dual-encoder embedding lookup but vastly more accurate, and is cheap enough to apply to the top 100 retrieved candidates.
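
The same library exposes the cross-encoder pattern directly; the MS MARCO checkpoint named here is a common public reranker standing in for whatever a given pipeline deploys:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how does masked language modelling work"
candidates = [
    "BERT hides 15 percent of the tokens and predicts them from both sides.",
    "The 2018 World Cup was held in Russia.",
    "Masked positions become [MASK], a random token, or stay unchanged.",
]

# One forward pass per (query, candidate) pair; higher score = more relevant.
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
```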

Cheap classification. Toxicity filters, spam detectors, intent classifiers, language identifiers, content moderation, and the per-token NER inside production information-extraction pipelines almost all run on fine-tuned BERT or a smaller cousin (DistilBERT, TinyBERT, MiniLM, ModernBERT). Decoder-only LLMs can do these tasks zero-shot through prompting, but at one or two orders of magnitude more cost per call, with worse calibration, and with much less predictable latency. For high-volume classification, millions of inputs per day, sub-100-millisecond budgets, BERT-style encoders remain the sensible choice. A fine-tuned DistilBERT will happily run at thousands of inputs per second on a single GPU, or hundreds per second on CPU, and produce a calibrated probability that you can threshold and audit.
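
For this role the off-the-shelf route is a one-liner; the DistilBERT sentiment checkpoint named here is a public example, and a production system would fine-tune its own:

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

print(clf(["great service, would buy again", "this is a scam"]))
# Each result carries a label and a probability score you can threshold and audit.
```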

The lineage continues to be refined. RoBERTa fixed the training recipe; ELECTRA replaced masked LM with replaced-token detection for better sample efficiency; DeBERTa added disentangled attention over content and position; XLM-RoBERTa and mBERT extended the recipe to a hundred languages. The encoders are quieter than the chatbots, but they are everywhere.

What you should take away

  1. BERT showed pretraining wins. A single self-supervised objective on raw text, applied to the Transformer encoder, beats every bespoke architecture once you fine-tune briefly on the task. This was the moment NLP became a transfer-learning discipline.
  2. Masked language modelling is the load-bearing task. Hide 15 percent of the input tokens (with the 80/10/10 split between [MASK], random, and unchanged) and predict the originals from bidirectional context. Next sentence prediction was included in the original recipe but later judged unnecessary and dropped by RoBERTa.
  3. BERT-base is 110M parameters, BERT-large 340M. Encoder-only, learned positional embeddings, WordPiece tokenisation, [CLS] and [SEP] special tokens, pretrained on Wikipedia plus BookCorpus, roughly 3.3 billion tokens.
  4. Fine-tuning is short and uniform. Add a linear head, on [CLS] for classification, on every token for sequence labelling, on start and end positions for span extraction, and train end-to-end for two to four epochs at learning rate around $3 \times 10^{-5}$. Pretraining typically buys close to ten percentage points over from-scratch baselines.
  5. In 2026, BERT-style encoders own retrieval, reranking, and cheap classification. Frontier capability has moved to decoder-only models, but the embedding models inside every RAG system, the rerankers inside every search pipeline, and the lightweight classifiers inside every content-moderation stack are nearly all encoder-only descendants of BERT.
