Glossary

BERT

BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. in 2018, is a transformer encoder pretrained on large text corpora via masked language modelling: a random fraction (typically 15%) of input tokens is selected for prediction; of these, 80% are replaced with a special [MASK] token, 10% with a random token, and 10% left unchanged, and the model is trained to predict the original tokens from their context. Unlike GPT's left-to-right modelling, BERT sees tokens on both sides—hence "bidirectional"—yielding representations that incorporate full context.
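The corruption scheme above can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code; the token IDs, [MASK] id, and vocabulary size are placeholders matching BERT-base's published WordPiece vocabulary.

```python
import random

MASK_ID = 103          # [MASK] in BERT's WordPiece vocabulary
VOCAB_SIZE = 30522     # BERT-base vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, rng=random):
    """Return (corrupted_ids, labels); labels are -100 at unselected positions."""
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:      # select ~15% of tokens for prediction
            labels.append(tok)            # the model must recover the original
            r = rng.random()
            if r < 0.8:                   # 80% of selected: replace with [MASK]
                corrupted.append(MASK_ID)
            elif r < 0.9:                 # 10%: replace with a random token
                corrupted.append(rng.randrange(VOCAB_SIZE))
            else:                         # 10%: keep the token unchanged
                corrupted.append(tok)
        else:
            labels.append(-100)           # conventionally ignored by the loss
            corrupted.append(tok)
    return corrupted, labels
```

The 80/10/10 split (rather than always substituting [MASK]) reduces the mismatch between pretraining, where [MASK] appears, and fine-tuning, where it never does.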

BERT was a watershed for NLP. Its pretrained representations, when fine-tuned on downstream tasks (classification, named entity recognition, question answering, natural language inference), dramatically outperformed previous approaches across the GLUE and SQuAD benchmarks. The fine-tuning paradigm—take a pretrained BERT, add a small task-specific head, and fine-tune on labelled data—became the standard recipe for NLP and unlocked strong performance on tasks with limited training data.
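The fine-tuning recipe can be sketched minimally: the pretrained encoder produces a pooled sentence representation (e.g., the [CLS] vector), and a small task-specific head—here a single linear layer with illustrative, hypothetical dimensions—maps it to class logits. During fine-tuning, the head and (usually) the encoder weights are both updated on labelled data.

```python
import random

HIDDEN = 8        # BERT-base's hidden size is 768; kept tiny for illustration
NUM_CLASSES = 3   # e.g., a three-way sentiment task

random.seed(42)
# A freshly initialised classification head: weight matrix W and bias b.
W = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
     for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

def classify(cls_vector):
    """Map the encoder's pooled [CLS] vector to one logit per class."""
    return [sum(w_i * h_i for w_i, h_i in zip(row, cls_vector)) + b_j
            for row, b_j in zip(W, b)]

# In practice cls_vector comes from the pretrained encoder; a dummy stands in here.
logits = classify([0.5] * HIDDEN)
prediction = max(range(NUM_CLASSES), key=lambda j: logits[j])
```

The point of the recipe is that the head adds very few new parameters relative to the pretrained encoder, which is why fine-tuning works with limited labelled data.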

The original BERT was released in two sizes: BERT-base (12 layers, 110M parameters) and BERT-large (24 layers, 340M parameters). Its many descendants—RoBERTa (better training), ALBERT (parameter sharing), DistilBERT (distilled), DeBERTa (disentangled attention)—explored the design space. BERT's core insight—that bidirectional transformer encoders trained with self-supervised objectives produce powerful, transferable representations—remains central to modern NLP, even as decoder-only models like GPT have taken centre stage for generation. BERT and its variants remain widely used for encoding tasks in production systems.

Related terms: Transformer, GPT, Self-Supervised Learning

Discussed in:

Also defined in: Textbook of AI, Textbook of Medical AI