605 terms

Glossary

A B C D E F G H I J K L M N O P Q R S T U V W Z

A

  • A* Search , Hart, Nilsson and Raphael's 1968 best-first search using f(n) = g(n) + h(n)
  • Academic Paper Corpora (arXiv, S2ORC, PubMed Central) , Three open scientific corpora used to give LLMs technical and biomedical depth.
  • Accuracy, Precision, Recall and F1 , Standard metrics for evaluating classification performance
  • Activation Function , The nonlinear element that gives neural networks their expressive power.
  • AdaBoost , Freund and Schapire's 1995 adaptive boosting algorithm, reweights data to target hard examples.
  • Adam , Kingma and Ba's 2014 adaptive optimiser using running estimates of gradient moments
  • Adversarial Examples , Inputs perturbed by an imperceptible amount that cause a trained model to misclassify with high confidence, exposing fundamental fragility in deep networks.
  • Adversarial Training , Defence against adversarial examples that trains the model on worst-case perturbations of each input via min-max optimisation, with PGD as the inner-loop attack.
  • Adversarial Training (LLMs) , Training large language models on adversarial examples, jailbreaks, injections, harmful prompts, to improve robustness to attack.
  • Agent , An entity that perceives and acts in an environment to achieve goals
  • Agentic RAG , Variant of retrieval-augmented generation in which the **agent decides when, what, and how to retrieve**, including multi-step retrieval, query rewriting, and tool selection, rather than always retrieving once before answering.
  • AI Accelerator Landscape , The competitive map of AI hardware, Nvidia's near-monopoly, AMD's GPU challenger, Google's TPU, hyperscaler in-house chips, and a long tail of specialised architectures.
  • AI Alignment , Ensuring AI systems pursue objectives matching human values and intentions
  • AI Safety , Preventing unintended harms from AI systems
  • AI Safety Levels (ASL) , Anthropic's tiered capability classification, ASL-1 to ASL-4+, that defines escalating safety and security commitments under its Responsible Scaling Policy.
  • AI Winter , A period of reduced funding and interest in AI research; usually two episodes are distinguished
  • Aider , Open-source AI pair programmer (Paul Gauthier, 2023) that edits files in a local git repository via diff-based prompts; pioneered the **edit-format** approach now standard in coding agents.
  • AIME , 30 American Invitational Mathematics Examination problems used as a frontier reasoning benchmark.
  • AlexNet , The 2012 convolutional network that won ImageNet and launched the deep-learning era
  • ALPAC Report , The 1966 US government report that ended the early machine-translation funding boom
  • Alpha–Beta Search , A pruning improvement to minimax search that ignores branches provably irrelevant to the result
  • AlphaFold , DeepMind's protein-structure-prediction system; AlphaFold 2 won CASP14 in 2020
  • AlphaFold 2 CASP14 , DeepMind's December 2020 protein-structure-prediction breakthrough at CASP14
  • AlphaFold 3 , Diffusion-based biomolecular structure predictor that handles proteins, ligands, nucleic acids and complexes in a single architecture
  • AlphaGeometry 2 , DeepMind's neuro-symbolic geometry prover combining a symbolic deductive engine with a neural construction proposer, trained entirely on synthetic data.
  • AlphaGo , DeepMind's 2016 Go-playing system that defeated Lee Sedol, holder of 18 international Go titles
  • AlphaProof Internals , DeepMind's Lean-based formal-proof reinforcement learning system that achieved silver-medal IMO performance via 100M synthetic problems and AlphaZero-style proof-tree search.
  • Anomaly Detection , Identifying observations that deviate significantly from expected patterns
  • Anthropic , AI safety-focused organisation founded in 2021 by former OpenAI researchers
  • Anthropic HH-RLHF , Anthropic's helpful-and-harmless preference dataset, foundational training data for modern RLHF.
  • ARC , AI2 Reasoning Challenge, grade-school science questions with Easy and Challenge subsets.
  • ARC-AGI , Abstract Reasoning Corpus, grid-puzzle benchmark designed by François Chollet as an AGI test.
  • Artificial General Intelligence , Hypothetical AI with human-level flexibility across all domains
  • Artificial Intelligence , The study of building machines that perform tasks said to require human intelligence.
  • Attention Mechanism , A weighted aggregation that lets a model focus on different parts of its input
  • Attention Weights , Softmax-normalised scores determining how much each input contributes
  • AUC , Area under the ROC curve, a threshold-independent measure of binary classifier performance.
  • AUC-ROC , Area Under the Receiver Operating Characteristic curve
  • Audio Foundation Models , Class of large pretrained models that handle speech and audio as a first-class modality; beyond Whisper includes AudioPaLM, SeamlessM4T, Qwen-Audio, and AudioLM, supporting multilingual ASR, TTS, and speech-to-speech translation.
  • AudioLM , Google's 2023 hierarchical audio language model that separates semantic from acoustic tokens, enabling long-form coherent speech and music continuation.
  • AudioSet , Google's 2-million YouTube-clip audio-event dataset spanning 632 ontology classes.
  • Autoencoder , Neural network trained to reconstruct its input through a bottleneck
  • AutoGen , Microsoft's 2023 multi-agent framework built around **conversable agents** that exchange messages in group chats; pioneered the "agents as participants in a meeting" pattern.
  • Autonomous Driving Stack , The layered software architecture that converts sensor data into vehicle controls, spanning perception, prediction, planning and control
  • Autoregressive Image Models , Generative models that factorise the joint distribution over pixels into a product of conditionals and predict pixels sequentially, predecessors to diffusion for image generation.
  • Autoregressive Model , Model that generates sequences one element at a time, conditioned on previous elements
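Several entries above name concrete algorithms. As an illustrative sketch of the A* Search entry's f(n) = g(n) + h(n) rule (the 4x4 grid, unit costs, and Manhattan heuristic are invented for this example, not part of the glossary):

```python
import heapq

def a_star(start, goal, neighbours, h):
    """A* search: expand nodes in increasing order of f(n) = g(n) + h(n)."""
    open_heap = [(h(start), 0, start, [start])]  # entries are (f, g, node, path)
    best_g = {start: 0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, g
        for nxt, cost in neighbours(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_heap, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float("inf")

# Toy 4-connected 4x4 grid; Manhattan distance is admissible here,
# so A* returns an optimal path.
def grid_neighbours(p):
    x, y = p
    return [((x + dx, y + dy), 1) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 4 and 0 <= y + dy < 4]

path, cost = a_star((0, 0), (3, 3), grid_neighbours,
                    lambda p: abs(p[0] - 3) + abs(p[1] - 3))
print(cost)  # 6: the shortest-path length on the open grid
```

With an admissible heuristic (one that never overestimates), the first time the goal is popped its g-value is optimal, which is why the function can return immediately.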

B

  • Backdoors / Trojans , A trigger-conditional misbehaviour planted in a model during training; the model behaves normally except when the trigger appears.
  • Backpropagation , The algorithm for computing gradients in a layered neural network by recursive application of the chain rule
  • Bagging , Bootstrap aggregating, average models trained on bootstrap resamples to reduce variance.
  • Batch Normalisation , Ioffe and Szegedy's 2015 normalisation of pre-activations across a mini-batch
  • Batch Size , Number of examples processed in parallel during training
  • Bayes' Theorem , The rule for updating probabilities given evidence, $P(H \mid E) = P(E \mid H) P(H) / P(E)$
  • Bayesian Inference , Updating beliefs about parameters or hypotheses given observed data
  • Bayesian Network , A directed acyclic graph encoding the conditional independence structure of a joint distribution
  • Beam Search , Limited-width best-first decoding for sequence models
  • Behaviour Cloning , Imitation learning by supervised maximum-likelihood on (state, action) pairs from an expert, with distribution shift as its central failure mode.
  • Belief Propagation , Pearl's message-passing algorithm for exact and approximate inference in graphical models.
  • Bellman Equation , The recursive equation defining the value function in a Markov decision process
  • Bernoulli Distribution , The simplest discrete distribution, a single biased coin flip with success probability p.
  • BERT , Devlin et al.'s 2018 bidirectional encoder Transformer for NLP
  • BERT (mathematical detail) , Mathematical formulation of BERT's masked language modelling pre-training objective
  • Best-of-N Sampling , Sample N candidates, score each, return the highest-scoring
  • Bias (Fairness) , Systematic errors disadvantaging particular groups of people
  • Bias and Fairness , The study of, and intervention against, systematic disparities in how machine-learning systems treat individuals or groups defined by protected attributes.
  • Bias-Variance Tradeoff , The classical decomposition of expected prediction error into bias squared, variance, and irreducible noise, motivating model selection in the underparameterised regime but partially superseded by double descent in the overparameterised regime.
  • BIG-Bench and BBH , Collaborative 200+ task benchmark; BIG-Bench Hard is the 23 hardest multi-step reasoning tasks.
  • Blackboard Architecture , A multi-agent paradigm where knowledge sources cooperate via a shared global data structure.
  • Bletchley AI Safety Summit , The November 2023 summit that produced the Bletchley Declaration on frontier AI safety.
  • Bletchley Park , The British wartime codebreaking centre where Turing built the Bombe and helped break Enigma
  • BLEU , Papineni et al.'s 2002 machine-translation evaluation metric, n-gram precision
  • Blocks World , A simulated tabletop of coloured blocks; the canonical micro-world of 1960s and 1970s AI research.
  • Boltzmann Machine , A stochastic Hopfield network trained by contrastive statistics, ancestor of deep belief networks and modern diffusion models.
  • Books1, Books2, Books3 , Three book-text corpora used in GPT-3 and LLaMA training; Books3 became the centre of major copyright litigation.
  • Boolean Algebra , The two-valued algebra of logic founded by Boole and applied by Shannon to digital circuits, the bedrock of all digital computation.
  • Boosting , Combining many weak learners into a strong one through iterative reweighting
  • Browser-Use Agents , Agents that drive a real web browser, clicking, typing, scrolling, to accomplish tasks on arbitrary websites; productised as OpenAI Operator (2025), Claude Computer Use (browser mode), and the open-source `browser-use` library.
  • Byte-Pair Encoding , Subword tokenisation algorithm that iteratively merges the most frequent adjacent symbol pairs in a corpus, producing a vocabulary that balances character-level and word-level granularity.
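The Byte-Pair Encoding entry describes an iterative merge loop that is short enough to sketch directly. A minimal toy version (the three-word corpus is invented for illustration; production tokenisers add byte fallback, pre-tokenisation, and frequency-weighted words):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in corpus:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        new_corpus = []
        for word in corpus:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)  # apply the merge greedily left to right
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "low"], 2)
print(merges)  # ['lo', 'low']: frequent pairs become vocabulary units
```

After two merges the shared stem "low" is a single token, showing how the algorithm interpolates between character-level and word-level granularity.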

C

  • C2PA / Content Provenance , An open cryptographic standard, led by the Coalition for Content Provenance and Authenticity, for binding tamper-evident creator and edit metadata to media files.
  • C4 (Colossal Clean Crawled Corpus) , Google's 750 GB cleaned Common Crawl subset, introduced for T5 and now a standard pre-training benchmark.
  • Categorical Distribution , The discrete distribution over $K$ classes, multi-class generalisation of Bernoulli
  • Causal Inference , Inferring causal effects from observational or experimental data
  • Central Limit Theorem , Sums of many independent random variables converge to a Gaussian, regardless of the underlying distribution, which is why the bell curve is everywhere.
  • Chain Rule , The calculus rule for differentiating composed functions, the foundation of backpropagation
  • Chain-of-Thought , Prompting LLMs to produce intermediate reasoning before final answers
  • Channel Capacity , The maximum reliable information rate of a noisy channel, set by Shannon's noisy-channel coding theorem.
  • Chatbot Arena , LMSYS pairwise human-preference Elo leaderboard for chat models, the most-watched live LLM ranking.
  • ChatGPT , OpenAI's November 2022 conversational AI release that brought LLMs to mass attention
  • Chinchilla Scaling , Hoffmann et al.'s 2022 finding that compute-optimal training balances parameters and tokens
  • Church–Turing Thesis , The hypothesis that effectively computable functions are exactly those computable by a Turing machine
  • Circumscription , McCarthy's 1980 non-monotonic logic, minimise extension of abnormality predicates
  • Claude , Anthropic's family of large language models, named for Claude Shannon
  • Claude 3.5 Sonnet Computer Use , Anthropic's October 2024 capability allowing Claude to operate a computer by viewing screenshots and issuing mouse and keyboard actions through tool calls.
  • Claude 4 Family , Anthropic's late-2025 generation of frontier models, comprising Opus 4, Sonnet 4 and Haiku 4, with matured constitutional AI training and stronger reasoning, coding and long-context performance.
  • Claude Vision , Anthropic's multimodal capability across the Claude 3, 3.5, and 4 families (2024-2025), supporting image input alongside text and serving as the foundation for Claude Computer Use.
  • CLIP , Contrastive vision-language model aligning images and text in a shared space
  • CodeForces and Competitive Programming , Competitive-programming Elo benchmark, the headline coding metric for reasoning-model launches.
  • Common Crawl , Petabyte-scale open web archive that is the foundational input to nearly every modern large language model.
  • Compute Governance , Using control over AI compute resources, chips, data centres, cloud access, as a policy lever to influence frontier-AI development.
  • Computer Vision , Automatic extraction of meaning from images and video
  • Computer-Use Agents , Agents that control an entire desktop OS via screenshots and synthetic mouse and keyboard events; prefigured by Devin (Mar 2024), debuted as a general capability in Claude 3.5 Sonnet (Oct 2024), and extended by OpenAI Operator (Jan 2025).
  • Conceptual Dependency , Roger Schank's 1972 language-independent representation of meaning via a small set of primitive acts, foundational to symbolic NLP.
  • Conditional Probability , The probability of A given that B has occurred
  • Conditional Random Field , Lafferty, McCallum and Pereira's 2001 discriminative graphical model for sequence labelling, the workhorse of NLP before deep learning.
  • Confidence Interval , A range of plausible parameter values whose construction would, in repeated sampling, contain the true value with stated frequency.
  • Conformer , Gulati et al.'s 2020 convolution-augmented Transformer for speech, sandwiching attention and depthwise convolution between feed-forward halves to capture global and local structure.
  • Connectionism , The school of cognitive science modelling cognition as networks of simple interacting units
  • Constitutional AI , Anthropic's 2022 method for aligning LLMs using AI-generated rather than human feedback
  • Constitutional AI (mathematical detail) , Anthropic's 2022 alignment pipeline using AI feedback grounded in written principles
  • Constitutional AI Dataset , Anthropic's synthetic critique-and-revise pairs used to align Claude without human harm-labels.
  • Constrained Decoding , Decoder-time technique that restricts next-token sampling to those tokens permitted by a grammar, regular expression, or finite-state machine, used to enforce JSON, code syntax, or domain languages.
  • Continual Learning , The challenge of learning a sequence of tasks over time without catastrophically forgetting earlier ones, addressed by regularisation, replay, or architectural growth.
  • Continuous Batching , Dynamic request batching where new requests join and finished requests leave at every generation step rather than at batch boundaries.
  • Contrastive Divergence , Hinton's 2002 approximate learning algorithm for energy-based models
  • Contrastive Divergence (mathematical detail) , Hinton's 2002 approximate maximum-likelihood for energy-based models
  • Contrastive Learning , Self-supervised learning by pulling similar pairs together and pushing dissimilar apart
  • Control Theory , The mathematical and engineering study of feedback systems that maintain desired behaviour under disturbance, deeply related to reinforcement learning.
  • Convex Function , A function whose epigraph is a convex set; second derivative non-negative
  • Convex Optimisation , Minimising a convex function over a convex set, solvable globally
  • Convolution , The sliding-kernel operation underlying convolutional neural networks
  • Convolutional Neural Network , A neural network using convolutional layers, the dominant vision architecture from 2012 to ~2020
  • Crescendo Attack , A multi-turn jailbreak that gradually escalates the conversation through innocuous-seeming intermediate steps to elicit harmful content.
  • CrewAI , Production-focused multi-agent framework structured around **role-based crews**, Researcher, Writer, Editor, etc., that execute a task list cooperatively; popularised the "AI staffing" mental model.
  • Cross-Entropy Loss , The negative log-likelihood loss $-\sum_i y_i \log p_i$ for classification
  • Cross-Validation , Estimating generalisation performance by repeated train-test splits
  • CTC Loss , Connectionist Temporal Classification, an alignment-free loss that lets recurrent or Transformer encoders output label sequences shorter than the input frame sequence.
  • Curse of Dimensionality , Exponential data requirements and geometric anomalies in high dimensions
  • Cybernetics , The science of control and communication in animal and machine, founded by Wiener in 1948
  • CYC , Lenat's 1984 project to encode the entirety of human commonsense knowledge as logic
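The Cross-Entropy Loss entry gives the formula $-\sum_i y_i \log p_i$ explicitly; a minimal one-example sketch of it (the probability vectors are invented for illustration):

```python
import math

def cross_entropy(y_true, p_pred):
    """Cross-entropy loss -sum_i y_i * log(p_i) for a single example,
    with one-hot target y and predicted class probabilities p."""
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

# A confident correct prediction incurs low loss; a confident wrong one
# is heavily penalised, which is the loss's key property for classification.
good = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])  # -log(0.8)
bad = cross_entropy([0, 1, 0], [0.8, 0.1, 0.1])   # -log(0.1)
print(good, bad)
```

With a one-hot target the sum collapses to the negative log-probability assigned to the true class, which is why cross-entropy and negative log-likelihood coincide here.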

D

  • Dartmouth Workshop , The 1956 summer meeting that founded artificial intelligence as a research field
  • Data Drift , Change in the distribution of input features over time
  • Data Poisoning , Inserting adversarial examples into a training set to cause targeted misbehaviour or degraded performance in the resulting model.
  • DataComp , 2023 image-text pre-training benchmark, fixed compute, fixed model, compete on data curation.
  • DBSCAN , Density-based clustering that discovers arbitrarily shaped clusters and noise
  • DCLM (DataComp-LM) , 2024 Apple-led benchmark and 240-trillion-token corpus for evaluating language-model data-curation strategies.
  • Debate-Based Alignment , Two AI debaters argue opposing positions; a human judge picks the more convincing
  • Deceptive Alignment , A hypothetical failure mode where a model behaves aligned during training but pursues different goals at deployment
  • Decidability , The property of a problem having an algorithm that always halts with the correct answer, the boundary set by Church, Turing and Gödel.
  • Decision Transformer , Reformulates reinforcement learning as a sequence-modelling problem, predicting actions conditioned on returns-to-go and trajectory history with a causal transformer.
  • Decision Tree , A model that partitions feature space with axis-aligned splits
  • Deep Belief Network , Stacked RBMs trained layer-by-layer; Hinton's 2006 deep-learning launch architecture
  • Deep Blue versus Kasparov , The 1997 chess match in which a computer first beat a reigning world champion
  • Deep Learning , Machine learning with deep neural networks
  • Deep Reinforcement Learning , Reinforcement learning with deep neural networks as function approximators
  • Deepfakes , AI-generated synthetic images, audio or video that depict real people doing or saying things they did not.
  • DeepMind , London-based AI lab founded 2010, acquired by Google 2014, merged with Brain in 2023, behind AlphaGo, AlphaFold and Gemini.
  • DeepSeek R1-Zero , A January 2025 DeepSeek experiment showing that strong reasoning behaviour emerges from pure reinforcement learning on a base model, with no supervised fine-tuning beforehand.
  • DeepSeek-R1 Release , The January 2025 open-source reasoning model release that reshaped the AI competitive landscape
  • DeepSeek-V3 , A 671B-parameter mixture-of-experts model from DeepSeek released December 2024, trained for under $6M and matching frontier performance with open weights.
  • DENDRAL , The first major expert system; inferred organic molecular structure from mass-spectrometry data (Stanford, 1965).
  • Derivative , The instantaneous rate of change of a function
  • Devin / AI Software Engineer , An autonomous coding agent launched by Cognition Labs in March 2024, marketed as the "first AI software engineer" and credited with starting the AI-coding-agent product wave.
  • Differential Privacy , Mathematical privacy guarantee bounding how much a single individual's data can influence an algorithm's output, achieved in deep learning by clipping gradients and adding Gaussian noise.
  • Diffusion Model , Generative models that learn to invert a noise-adding process
  • Diffusion Policy , A robot policy that uses a diffusion model to represent the action distribution given observations, producing robust multimodal behaviour.
  • Dimensionality Reduction , Projecting high-dimensional data into fewer dimensions
  • Direct Preference Optimization , Rafailov et al.'s 2023 alternative to RLHF that aligns language models without an explicit reward model or reinforcement learning.
  • Dirichlet Distribution , A distribution over probability simplices, the conjugate prior for categorical
  • Distributed Data Parallel , PyTorch's standard data parallelism that replicates the model across devices and averages gradients via all-reduce.
  • Dolma and OLMo , Allen Institute for AI's 3-trillion-token open pre-training corpus, designed for full-stack open science.
  • Dot Product , Scalar measure of similarity between two vectors; the workhorse operation of numerical linear algebra and modern AI.
  • Double Descent , An empirical phenomenon where test error first rises then falls as model capacity increases past the interpolation threshold, contradicting the classical bias-variance U-curve.
  • DPO Variants , Family of preference-optimisation algorithms, IPO, KTO, ORPO, SimPO, that reformulate or relax DPO's objective to fix specific failure modes such as overfitting and the need for paired data.
  • DQN , Mnih et al.'s 2013-2015 Deep Q-Network, neural Q-learning with experience replay and target networks
  • DROP , Reading-comprehension benchmark requiring discrete reasoning over paragraphs (counting, sorting, arithmetic).
  • Dropout , Randomly zeroing neurons during training to prevent overfitting; one of deep learning's most effective regularisers.
  • Dropout (mathematical detail) , Srivastava et al.'s 2014 stochastic regularisation by random unit-zeroing
  • DSPy , Stanford framework (Khattab et al. 2023) that treats LLM programs as **modules with parameters that compile**, automatic prompt and few-shot optimisation replaces hand-crafted prompt strings.
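The Dropout entries describe random unit-zeroing during training; the standard "inverted" formulation can be sketched in a few lines (a toy activation vector, invented for illustration):

```python
import random

def inverted_dropout(activations, p_drop, training=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and scale survivors by 1/(1 - p_drop), so expected activations
    are unchanged and no rescaling is needed at test time."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)  # fixed seed so the illustration is reproducible
out = inverted_dropout([1.0] * 8, p_drop=0.5)
print(out)  # surviving units are scaled to 2.0, the rest are zeroed
```

At inference (`training=False`) the layer is the identity, which is the practical advantage of the inverted variant over the original formulation.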

E

  • Early Stopping , Halting training when validation performance stops improving; a free and effective regulariser.
  • Eigendecomposition , The factorisation $A = Q \Lambda Q^{-1}$ of a matrix into eigenvectors and eigenvalues
  • Eigenvalue and Eigenvector , A direction preserved by a linear transformation, and its scaling factor
  • Eliciting Latent Knowledge , The problem of getting a model to honestly report what it knows
  • ELIZA , Weizenbaum's 1966 conversational program; ancestor of every chatbot
  • Embedding , A dense vector representation of a discrete entity
  • Embedding Layer , A lookup table mapping discrete tokens to dense continuous vectors
  • Embeddings APIs , Hosted services that convert text (and increasingly images, audio, code) into fixed-dimensional vectors used for semantic search, RAG, clustering, and classification; the dominant providers are OpenAI, Cohere, Voyage, Anthropic, and the open BGE/E5/Jina families.
  • Embodied AI , Branch of AI in which the agent has a physical body (real or simulated) and learns by acting in and perceiving its environment; recently dominated by robot foundation models that treat actions as another modality.
  • Emergent Abilities , Capabilities that appear sharply at certain model scales rather than improving smoothly with size.
  • EnCodec , Meta's 2022 neural audio codec, convolutional encoder-decoder with residual vector quantisation, compressing audio to 1.5-24 kbps tokens that underlie modern audio language models.
  • Energy-Based Model , A generative model that assigns each input an unnormalised energy score; low energy means high probability.
  • Ensemble Methods , Combining multiple models into a single stronger predictor
  • Entropy , Average information or uncertainty in a probability distribution; the foundation of information theory.
  • Entscheidungsproblem , Hilbert's 1928 challenge for a general decision procedure for first-order logic; proven impossible by Church and Turing in 1936.
  • Epoch , One complete pass through the training dataset; a fundamental unit of training schedules.
  • ESM-2 , 15-billion-parameter protein language model from Meta FAIR that predicts structure from single sequences without multiple sequence alignments
  • Evaluations / Capability Evaluations , Structured tests measuring whether a frontier model has acquired specific dangerous or transformative capabilities.
  • Expectation , The mean or average value of a random variable
  • Expectation–Maximisation , Dempster, Laird and Rubin's 1977 algorithm for maximum-likelihood estimation with latent variables
  • Expected Calibration Error , Measures how well predicted probabilities match observed frequencies
  • Expert System , A rule-based system encoding the knowledge of human domain experts
  • Explainable AI , Making AI decisions understandable to humans
  • Extended Kalman Filter , A Kalman filter for non-linear systems that linearises the dynamics and observation around the current state estimate via Jacobians.
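The Entropy entry's "average information" definition is $H(p) = -\sum_i p_i \log p_i$; a minimal sketch with a few canonical distributions (chosen for illustration):

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in bits by default.
    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([1.0]))        # 0.0 bits: a certain outcome carries no surprise
print(entropy([0.25] * 4))   # 2.0 bits: uniform over four outcomes
```

The uniform distribution maximises entropy for a fixed number of outcomes, which is why the four-way uniform case reaches the full log2(4) = 2 bits.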

F

  • F1 Score , The harmonic mean of precision and recall
  • FAISS , Meta's open-source library for efficient similarity search and clustering of dense vectors, supporting exact and approximate nearest-neighbour indices over billions of items.
  • Feature Engineering , Designing informative input features for a machine learning model
  • Feature Map , Output of a convolutional layer representing detected features
  • Federated Learning , Distributed training paradigm in which a global model is learned across many decentralised devices without raw data leaving them, using local updates and a coordinating server.
  • Feedback Loop , A circular causal structure where a system's output influences its future input
  • Fifth Generation Computer Systems , Japan's 1981–92 national project to build "intelligent" parallel logic-programming computers
  • Fine-tune with LoRA , Recipe to fine-tune a pretrained LLM on instruction data using Low-Rank Adaptation, training <1% of parameters.
  • Fine-Tuning , Adapting a pretrained model to a new task by continuing training
  • FineWeb and FineWeb-Edu , Hugging Face's 15-trillion-token deduplicated Common Crawl corpus, with a curriculum-filtered educational subset.
  • Fisher Information , The information a random sample provides about the parameter generating it
  • Flamingo , DeepMind's 2022 few-shot visual language model (Alayrac et al.) using a Perceiver Resampler and gated cross-attention to bridge a frozen vision encoder and a frozen Chinchilla language model.
  • FlashAttention , Tri Dao's 2022 IO-aware attention algorithm, exact attention with reduced memory bandwidth
  • FlashAttention Internals , A reformulation of attention that streams the softmax through SRAM-resident tiles, eliminating quadratic HBM traffic and turning attention from a memory-bound kernel into a compute-bound one.
  • Foundation Model , A large model pre-trained on broad data, adaptable to many downstream tasks
  • Frame , Minsky's 1974 knowledge-representation primitive, a structured collection of slots with default values
  • Frame Problem , The problem of specifying what does NOT change when an action is taken
  • Frontier AI Safety Commitments , A set of voluntary, public undertakings by frontier-AI developers to test for, mitigate and report on dangerous capabilities, anchored in the Bletchley and Seoul declarations.
  • Frontier Lab Compute Consumption , The scale of compute used by the largest training runs, currently $10^{25}$–$10^{26}$ FLOPs and rising roughly an order of magnitude per generation.
  • FrontierMath , Epoch AI's research-level mathematics benchmark; frontier LLMs score around 25-50% in 2025.
  • Fully Sharded Data Parallel , Data-parallel training that shards parameters, gradients, and optimiser states across devices to fit larger models.
  • Function Calling , Native LLM capability, first shipped by OpenAI in June 2023, that lets the model emit a structured JSON object selecting a developer-supplied tool and its arguments, rather than inventing free-form syntax.
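The F1 Score entry defines it as the harmonic mean of precision and recall; a minimal computation from binary labels (the toy label vectors are invented for illustration):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 (their harmonic mean) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two of three positives found (recall 2/3); two of three positive
# predictions correct (precision 2/3); F1 is their harmonic mean.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(p, r, f1)
```

Because the harmonic mean is dominated by the smaller of the two, F1 penalises models that trade one of precision or recall away entirely.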

G

  • Gaussian Distribution , The bell-curve probability distribution, the most important continuous distribution in AI
  • Gaussian Mixture Model , A weighted sum of Gaussians, the canonical soft-clustering model
  • Gaussian Splatting , 3D scene representation introduced by Kerbl et al. (2023); represents a scene as millions of anisotropic 3D Gaussians with view-dependent colour, rendered in real time via differentiable rasterisation.
  • GCG Attack , A discrete-optimisation attack that finds universal, transferable adversarial token suffixes that jailbreak aligned language models.
  • Gemini 2.x , Google DeepMind's 2024-2025 model family, featuring native multimodal generation, integrated reasoning modes and context windows up to 2 million tokens.
  • Gemini Multimodal , Google DeepMind's natively multimodal model family (Gemini 1.0/1.5/2.0/2.5) trained from scratch on interleaved text, image, audio, and video tokens, with context windows up to 2M tokens.
  • Gemini Robotics , Google DeepMind's 2024-2025 vision-language-action models built on Gemini 2.0; specialised for dexterous manipulation across multiple robot embodiments, with a separate Gemini Robotics-ER variant for embodied reasoning.
  • General Problem Solver , Newell, Shaw and Simon's 1957 system that solved problems by means-ends analysis
  • Generalisation , A model's ability to perform well on data not seen during training
  • Generative Adversarial Network , Goodfellow et al.'s 2014 generator-vs-discriminator generative model
  • Generative Grammar , Chomsky's framework for describing natural-language syntax with formal recursive rules
  • Generative Model , Model that learns the data distribution and can sample from it
  • Genie 2 , A DeepMind world model from December 2024 that generates playable, action-controllable 3D environments from a single image prompt, used as a training ground for embodied agents.
  • Gibbs Sampling , An MCMC method that updates one variable at a time from its conditional distribution
  • GitHub Code Corpus , The world's largest open-source code archive, used in Codex, Code Llama and DeepSeek-Coder, with active licensing controversy.
  • GloVe , Pennington, Socher and Manning's 2014 count-based word embedding method
  • GLUE and SuperGLUE , General Language Understanding Evaluation suites, the benchmarks that defined the BERT era.
  • GNoME , DeepMind's Graph Networks for Materials Exploration that discovered 380,000 stable inorganic crystals via active-learning and DFT verification
  • Gödel's Incompleteness Theorems , Gödel's 1931 proof that any rich consistent formal system contains true unprovable statements
  • Goodhart's Law (in ML) , When a measure becomes a target, it ceases to be a good measure
  • GPQA , Graduate-level "Google-Proof" Q&A in physics, chemistry, and biology, frontier benchmark for reasoning.
  • GPT , OpenAI's 2018 decoder-only autoregressive Transformer language model
  • GPT (mathematical detail) , Mathematical formulation of GPT-style autoregressive language modelling
  • GPT-3 , OpenAI's 2020 175-billion-parameter language model that established in-context learning
  • GPT-3 Launch , OpenAI's June 2020 release of GPT-3, the first commercial frontier LLM API
  • GPT-4V and GPT-4o Vision , OpenAI's multimodal extensions of GPT-4 (vision input, 2023) and GPT-4o ("omni", 2024); native image, audio, and text processing in a single transformer enabling document understanding and screenshot-driven agents.
  • GPTQ , Layer-wise post-training quantisation method that uses approximate second-order information to compress LLM weights to 4 bits with minimal accuracy loss.
  • GPU Acceleration , Using graphics-processing units for parallel matrix arithmetic, enabling deep learning at scale
  • GPU Memory Hierarchy , The layered storage on a modern GPU spanning HBM, L2 cache, SRAM and registers, whose bandwidth and capacity asymmetries dictate how AI kernels must be written.
  • Gradient , The vector of partial derivatives of a scalar function
  • Gradient Boosting , Friedman's 2001 generalisation of boosting to arbitrary differentiable losses
  • Gradient Descent , Iteratively moving parameters in the direction of steepest descent of the loss
  • Graph Attention Network , A graph neural network that learns attention weights for each neighbour, replacing the GCN's fixed degree-based normalisation.
  • Graph Convolutional Network , A graph neural network whose aggregation is a symmetric normalised average of neighbour features, derived from spectral graph theory.
  • Graph Neural Network , A neural network that operates directly on graph-structured data via iterative message passing between nodes.
  • Graph of Thoughts , Besta et al.'s 2023 generalisation of Tree of Thoughts to **directed acyclic graphs** of reasoning steps, supporting aggregation, refinement, and arbitrary topology of dependencies.
  • GraphCast , DeepMind's graph neural network for medium-range global weather forecasting at 0.25-degree resolution, surpassing ECMWF HRES on most metrics
  • Greedy Decoding , At each step pick the highest-probability next token
  • Greshake's Indirect Prompt Injection , Prompt injection delivered through retrieved or third-party content the LLM consumes, web pages, emails, documents, tool outputs.
  • Group Relative Policy Optimization , DeepSeek's policy-gradient algorithm that replaces PPO's value function with group-relative advantage estimation, dramatically reducing memory cost for LLM RL.
  • GRU , Cho et al.'s 2014 simplified gated recurrent unit; two gates rather than LSTM's three
  • GSM8K , 8.5K grade-school math word problems with chain-of-thought solutions.
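
Several of the G entries above (Gradient, Gradient Descent, and the related Learning Rate entry under L) reduce to a few lines of code. A minimal sketch, assuming a simple one-dimensional quadratic loss; the function names are illustrative, not from any library:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2*(w - 3):
# repeatedly step against the gradient, scaled by a learning rate.
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move in the direction of steepest descent
    return w

# Converges geometrically towards the minimiser w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With this loss the update is a contraction, so the iterate approaches 3 from any starting point; on real losses the learning rate must be tuned, as the Learning Rate entry notes.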

H

  • Hallucination , LLM generating plausible-sounding but false information
  • Halting Problem , The undecidable problem of determining whether an arbitrary program halts on a given input
  • Hearsay-II , Carnegie Mellon's 1971–80 speech-understanding system; canonical blackboard architecture
  • Hebbian Learning , A learning rule strengthening synapses when pre- and post-synaptic neurons fire together
  • Helix , Figure AI's February 2025 end-to-end humanoid VLA; a two-system "fast and slow" architecture combining a 7B VLM (System 2) at 7-9Hz with an 80M visuomotor policy (System 1) at 200Hz.
  • HellaSwag , Commonsense sentence-completion benchmark using adversarially-filtered distractors.
  • HELM , Stanford CRFM's holistic evaluation framework spanning many tasks, metrics, and scenarios.
  • Hessian , The matrix of second partial derivatives of a scalar function
  • Heuristic Search , Search guided by an estimating function that orders candidate states by promise
  • Hidden Markov Model , A Markov chain over hidden states emitting observable symbols
  • Hierarchical Clustering , Build a tree of nested clusters by repeated merging or splitting
  • Hinge Loss , The SVM loss $\max(0, 1 - y \hat y)$, penalises margin violations linearly
  • HNSW , Hierarchical Navigable Small World graphs, an approximate-nearest-neighbour algorithm that searches a multi-layer proximity graph in roughly logarithmic time.
  • Hopfield Network , Hopfield's 1982 fully-connected recurrent network functioning as an associative memory
  • HumanEval , 164 hand-written Python programming problems for measuring code-generation correctness.
  • Humanity's Last Exam , Cross-disciplinary frontier benchmark from CAIS and Scale AI, 3,000 expert-written questions across all domains.
  • Hyperparameter Tuning , Searching for optimal model settings that are not learned from data
  • Hypothesis Testing , Formal procedure for deciding between competing claims about data
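
The Hinge Loss entry above gives the formula directly; a minimal sketch of how it behaves on the three qualitatively different cases (confidently correct, inside the margin, wrong side):

```python
# SVM hinge loss max(0, 1 - y * y_hat) for labels y in {-1, +1}.
def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)

loss_correct = hinge_loss(+1, 2.0)   # beyond the margin: zero loss
loss_margin = hinge_loss(+1, 0.5)    # inside the margin: linear penalty
loss_wrong = hinge_loss(-1, 0.5)     # wrong side: larger linear penalty
```

Zero loss beyond the margin is what makes the SVM solution depend only on the support vectors.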

I

  • ImageNet , Fei-Fei Li's 14-million-image dataset that catalysed the deep-learning vision revolution
  • ImageNet (ILSVRC) , 1.28M-image, 1000-class classification benchmark whose 2012 contest, won by AlexNet, catalysed the deep-learning revolution.
  • Implicit Regularisation , The phenomenon by which optimisation algorithms favour particular solutions among many that fit the training data, providing regularisation without an explicit penalty term in the loss.
  • In-Context Learning , Performing a new task from examples in the prompt, without any parameter updates
  • Induction Head , A 2-layer attention pattern implementing in-context copying, a key Transformer circuit
  • Inference Cost Economics , The dollar-per-token cost of serving large models, dominated by KV-cache memory bandwidth at long context, and falling roughly 10× per year at fixed quality.
  • Inference-Time Scaling , Generic term for spending more compute per query at deployment to improve answer quality, covering best-of-N, beam search, extended chains of thought, and tree search.
  • InfiniBand and RoCE , Cluster-scale networking technologies that connect AI servers at hundreds of gigabits per second using RDMA, the backbone of multi-thousand-GPU training.
  • InfoNCE , van den Oord et al.'s 2018 contrastive loss, categorical cross-entropy on similarities
  • Information Processing Language , The first list-processing language, designed by Shaw, Newell and Simon at RAND
  • Information Theory , Mathematical theory of communication, compression and uncertainty
  • Inner Alignment , The problem of making a model's internal objective match the training objective
  • InstantNGP , NVIDIA's 2022 acceleration of NeRF (Müller et al.); replaces sinusoidal positional encoding with a multi-resolution learned hash grid, training NeRF-quality scenes in seconds rather than days.
  • Integral , Continuous accumulation of a quantity, dual to the derivative
  • IVF-PQ , Billion-scale approximate-nearest-neighbour index combining inverted-file partitioning by k-means with product-quantisation compression of residuals.
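
The InfoNCE entry above describes the loss as categorical cross-entropy on similarities; a minimal sketch for one positive and a list of negatives, with a numerically stable log-sum-exp (the function name and temperature default are illustrative):

```python
import math

# InfoNCE: -log softmax of the positive's similarity among all candidates.
def info_nce(pos_sim, neg_sims, temperature=0.1):
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)  # cross-entropy with the positive as target

loss = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.2, 0.0])
```

The loss approaches zero only when the positive pair is scored far above every negative, which is exactly the behaviour contrastive pre-training rewards.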

J

  • Jacobian , The matrix of first partial derivatives of a vector-valued function
  • Jailbreak , An adversarial prompt that bypasses an LLM's safety training and elicits behaviour the model was trained to refuse.
  • Joint Distribution , The distribution over two or more random variables simultaneously

K

  • K-Means , Lloyd's algorithm for clustering data into k groups by minimising within-cluster variance
  • K-Nearest Neighbours , Classify or predict by voting among the k closest training examples
  • Kalman Filter , An optimal recursive estimator for a linear-Gaussian state-space model, alternating prediction and measurement-update steps.
  • Kernel Trick , Computing high-dimensional inner products without explicit mapping
  • KL Divergence , A measure of how one probability distribution differs from another
  • Knowledge Distillation , Training a small student model to match the soft output distribution of a large teacher model, transferring information beyond hard labels.
  • Knowledge Representation , Symbolic encoding of facts and relationships for reasoning
  • KV Cache , The cache of key and value tensors used to avoid recomputing attention over past tokens during autoregressive generation.
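
The KL Divergence entry above is easy to make concrete for discrete distributions; a minimal sketch, using the convention that terms with $P(x) = 0$ contribute nothing:

```python
import math

# D(P || Q) = sum_x P(x) * log(P(x) / Q(x)); non-negative, zero iff P == Q,
# and asymmetric in its two arguments.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)  # how poorly q models samples from p
d_qp = kl_divergence(q, p)  # differs from d_pq: KL is not a metric
```

The asymmetry matters in practice: minimising $D(P \| Q)$ versus $D(Q \| P)$ gives mode-covering versus mode-seeking fits.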

L

  • Lagrangian and KKT , The Lagrangian function and KKT conditions for constrained optimisation
  • LAION-400M and LAION-5B , Schuhmann et al.'s open image-text pair corpora that trained Stable Diffusion; centre of the 2023 CSAM controversy.
  • Lambda Calculus , A minimal formal system of anonymous functions; foundation of functional programming
  • LangChain , The original (Oct 2022) Python and JavaScript framework for building LLM applications; popularised the abstractions of **chains**, **agents**, **tools**, and **memory** that defined the early agentic-AI era.
  • Language Model , A model that assigns probabilities to sequences of tokens
  • Large Language Model , Massive transformer-based language model trained on vast text corpora
  • Latent Dirichlet Allocation , Blei, Ng and Jordan's 2003 generative model of documents as mixtures of topics
  • Latent Dirichlet Allocation (mathematical detail) , Full generative model and inference for LDA topic modelling
  • Layer Normalisation , Ba, Kiros and Hinton's 2016 alternative to batch normalisation, normalising across features per example
  • Learning Rate , Step size in gradient-based optimisation; the most important hyperparameter
  • LibriSpeech and LibriLight , Open speech corpora derived from LibriVox audiobooks; the canonical English ASR benchmarks.
  • Lighthill Report , Lighthill's 1973 report whose damning AI assessment caused the UK AI winter
  • Linear Regression , Fitting a linear function to data by minimising squared error
  • LISP , McCarthy's 1958 list-processing language; the lingua franca of symbolic AI for forty years
  • LiveBench , Contamination-resistant LLM benchmark with monthly question refresh, White et al. 2024.
  • Llama 3 / 3.1 / 3.3 , Meta's 2024-2025 family of open-weights language models, ranging from 8B to 405B parameters, with the 405B model reaching GPT-4-class performance.
  • LlamaIndex , Data-centric LLM framework launched by Jerry Liu in November 2022, originally **GPT Index**; specialises in **retrieval-augmented generation** and structured-data agents.
  • LLaVA , Large Language and Vision Assistant introduced by Liu et al. (2023); an open-source vision-language model combining a CLIP image encoder, a Vicuna language model, and a small MLP projection.
  • Logic Programming , Programming by declaring logical relations and querying for solutions
  • Logic Theorist , The 1956 Newell–Shaw–Simon program that proved theorems from Principia Mathematica
  • Logistic Regression , Linear classifier with sigmoid output and cross-entropy loss
  • LoRA , Edward Hu et al.'s 2021 low-rank adapter method for parameter-efficient fine-tuning
  • Loss Function , A function measuring how poorly a model's predictions match the truth
  • Lottery Ticket Hypothesis , The conjecture that dense randomly-initialised networks contain sparse subnetworks (winning tickets) that, trained in isolation from the original initialisation, match the dense network's performance.
  • LSTM , Long Short-Term Memory: a gated RNN that learns long-range dependencies
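
The Linear Regression entry above (fitting a line by minimising squared error) has a closed-form solution in one dimension; a minimal sketch, with illustrative function names:

```python
# Ordinary least squares for y ≈ a*x + b: the slope is the ratio of the
# covariance of x and y to the variance of x, and the intercept follows
# from passing through the mean point.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
```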

M

  • MACE , Equivariant message-passing interatomic potential using higher-order tensor messages, achieving quantum-mechanical accuracy in molecular dynamics
  • Machine Learning , Algorithms that improve at a task through experience with data
  • Machine Translation , Automatic translation between natural languages
  • Macy Conferences , The 1946–1953 interdisciplinary meetings that founded cybernetics
  • MADLAD-400 , Google's 3-trillion-token, 419-language web corpus for massively multilingual model training.
  • Mamba , Gu and Dao's 2023 state-space model, linear-time alternative to the Transformer
  • Mamba (mathematical detail) , Mathematical formulation of the Mamba selective state-space model
  • MAML , Model-Agnostic Meta-Learning, learns an initialisation from which a model can be fine-tuned to new tasks in just a few gradient steps.
  • Manifold Hypothesis , The conjecture that real-world high-dimensional data such as images and text lies on or near a low-dimensional manifold embedded in the ambient space.
  • Many-Shot Jailbreaking , A long-context jailbreak technique that fills the context window with dozens of fake examples of the model complying, then asks for the real harmful answer.
  • MAP Estimation , Maximum a posteriori, the parameter value that maximises the posterior
  • Mark I Perceptron , Rosenblatt's 1958 hardware perceptron, the first neural-network machine
  • Markov Chain , A stochastic process where the future depends only on the present, not the past
  • Markov Decision Process , The mathematical framework underlying reinforcement learning
  • MATH , 12,500 high-school competition mathematics problems with full step-by-step solutions.
  • Matrix , A rectangular array of numbers representing a linear transformation
  • Matrix Factorisation , A recommender-system technique that approximates a sparse user-item rating matrix as the product of two low-rank embedding matrices.
  • Matrix Multiplication , The core operation $C = AB$ where $C_{ij} = \sum_k A_{ik} B_{kj}$
  • Maximum Likelihood Estimation , Choosing parameters to maximise the probability of observed data
  • MBPP , Mostly Basic Python Problems, 974 short coding tasks crowd-sourced for code-generation eval.
  • McCulloch–Pitts Neuron , The first mathematical model of a neuron; a binary threshold unit
  • MCMC , Markov-chain Monte Carlo, sampling from intractable distributions via a constructed Markov chain
  • Mean Squared Error , The L2 regression loss $\frac{1}{N} \sum_n (y_n - \hat y_n)^2$
  • Means–Ends Analysis , Problem-solving by repeatedly applying operators that reduce the difference to the goal
  • Mechanisation of Thought Processes , The 1958 NPL conference, Britain's parallel to the Dartmouth workshop
  • Mechanistic Interpretability , Reverse-engineering neural networks into human-understandable algorithms
  • Med-PaLM , PaLM language model fine-tuned for medical question answering, the first to reach passing performance on USMLE-style questions
  • MedSAM , Foundation segmentation model adapted from Meta's Segment Anything for medical imaging across modalities
  • Membership Inference Attacks , An attack that, given access to a trained model, decides whether a specific record was in its training set, a direct privacy violation.
  • Memory and Context Management , Architectural pattern that gives LLM agents three tiers of memory, **short-term** (current context window), **long-term** (vector store), and **episodic** (past conversation summaries), to operate beyond a single context window.
  • Mesa-Optimisation , When a learned model contains an internal optimisation process pursuing its own objective
  • Message Passing Neural Network , A unifying framework that expresses most graph neural networks as message, update, and readout phases.
  • MetaGPT , Open-source multi-agent framework that simulates an entire software company, Product Manager, Architect, Engineer, QA, driven by **Standard Operating Procedures (SOPs)** rather than free-form chat.
  • METR and RE-Bench , Long-horizon AI research-engineering benchmark from METR, measures hours-scale agentic capability.
  • Micro-Worlds , The 1960s–70s AI strategy of working in small, fully-specified toy domains
  • MIT AI Lab , Founded by McCarthy and Minsky in 1959; one of the two great early AI institutions
  • Mixed Precision Training , Training in 16-bit floating-point formats while keeping a master copy of weights in FP32, with loss scaling to preserve numerical stability.
  • Mixture of Experts , Architecture where a gate routes inputs to a subset of expert networks
  • Mixture of Experts (mathematical detail) , Mathematical formulation of sparse mixture-of-experts routing
  • MLOps , Operational discipline for deploying and maintaining ML systems
  • MMLU , 57-subject multiple-choice benchmark spanning undergraduate to professional knowledge.
  • MMLU-Pro , Harder, ten-option, less-contaminated successor to MMLU released by TIGER-Lab in 2024.
  • MMMU , Massive Multi-discipline Multimodal Understanding, visual reasoning across 30 college subjects.
  • MNIST, Fashion-MNIST, CIFAR-10/100 , Foundational small image datasets used to teach and benchmark deep-learning vision models since the 1990s.
  • Model Context Protocol , An open protocol introduced by Anthropic in November 2024 for connecting LLM applications to external tools, data sources and prompts via a standardised client-server interface.
  • Model Predictive Control , A control method that solves a finite-horizon optimisation at each timestep, applies the first action, and repeats.
  • Model Stealing / Distillation Attacks , Reconstructing a model's behaviour, parameters or architecture by querying its API and training a substitute on the responses.
  • Mother of All Demos , Engelbart's 1968 live demo introducing the mouse, hypertext, video calling, and shared editing
  • MS COCO , Microsoft Common Objects in Context, 330k images with segmentation, keypoints, captions; the canonical detection benchmark.
  • Multi-Agent Orchestration , Architecture in which several LLM-driven agents, often with distinct roles, prompts, or models, collaborate via message passing to solve a task that exceeds a single agent's context or capability.
  • Multi-Agent System , A system of multiple interacting agents, often with conflicting objectives
  • Multi-Head Attention , Running several attention operations in parallel to capture diverse patterns
  • Multilayer Perceptron , Fully connected feedforward neural network with one or more hidden layers
  • Multimodal Model , Model that processes or generates multiple data modalities
  • MusicGen , Meta's 2023 single-stage Transformer language model over EnCodec tokens, text-to-music generation with optional melody conditioning, simplifying the AudioLM hierarchy.
  • Mutual Information , The reduction in uncertainty about one variable given knowledge of another
  • MuZero , Schrittwieser et al.'s 2020 model-based RL algorithm that learns its own world model from interaction, generalising AlphaZero to environments without known dynamics.
  • MYCIN , Stanford's 1974 expert system for diagnosing bacterial infections
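
The Matrix Multiplication entry above gives the defining formula $C_{ij} = \sum_k A_{ik} B_{kj}$; a minimal sketch writing it out directly over lists of lists (real systems use BLAS or GPU kernels, as the GPU Acceleration entry notes):

```python
# C = A B with C_ij = sum_k A_ik * B_kj, as a literal triple loop.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

I2 = [[1, 0], [0, 1]]   # 2x2 identity
A = [[1, 2], [3, 4]]
```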

N

  • Naive Bayes , Probabilistic classifier assuming features are conditionally independent given the class
  • Narrow AI , AI designed for a specific, bounded task
  • Natural Language Processing , Computer processing and understanding of human language
  • NETtalk , Sejnowski and Rosenberg's 1987 demonstration of neural-network speech generation
  • Neural Collaborative Filtering , A recommender architecture that replaces the inner product of matrix factorisation with a neural network over user and item embeddings.
  • Neural Network , A composition of linear and nonlinear transformations for learning
  • Neural Radiance Fields , Volumetric scene representation introduced by Mildenhall et al. (2020); a small MLP maps 3D position and viewing direction to colour and density, enabling photorealistic novel view synthesis from sparse photographs.
  • Neural Tangent Kernel , A kernel that captures the linearised training dynamics of an infinite-width neural network, equating gradient descent on the network with kernel regression in the infinite-width limit.
  • Newton's Method , Second-order optimisation using the Hessian
  • NIPS 2006 Deep Learning Workshop , The workshop that broadcast the deep-learning revival to the wider ML community
  • nnU-Net , Self-configuring U-Net pipeline that automatically derives preprocessing, architecture and training schedule from dataset properties
  • Non-Monotonic Reasoning , Reasoning where adding new premises can invalidate previously-drawn conclusions
  • Normalising Flow , Generative model that transforms a simple base distribution into a complex one through a sequence of invertible, differentiable maps with tractable Jacobian determinants.
  • NVLink and NVSwitch , Nvidia's proprietary GPU-to-GPU interconnect and crossbar switch, providing an order of magnitude more bandwidth than PCIe and forming the substrate for tightly coupled multi-GPU training.
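
The Newton's Method entry above is one line of algebra in one dimension: update $w \leftarrow w - f'(w)/f''(w)$, using curvature to scale the step. A minimal sketch on an illustrative quartic with a local minimum at $w = 2$:

```python
# Newton's method for minimisation: divide the gradient by the second
# derivative so the step size adapts to local curvature.
def newton_minimise(grad, hess, w0, steps=20):
    w = w0
    for _ in range(steps):
        w -= grad(w) / hess(w)
    return w

# Minimise f(w) = w^4 - 8w^2; starting at w0 = 3 it converges to the
# nearby local minimum at w = 2.
w_min = newton_minimise(lambda w: 4 * w**3 - 16 * w,
                        lambda w: 12 * w**2 - 16,
                        w0=3.0)
```

Near a minimum the convergence is quadratic, which is why second-order information is prized; the cost of forming and inverting the Hessian is what rules the full method out for large networks.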

O

  • o1 / Reasoning Models , A class of LLMs trained to use extended chain-of-thought via reinforcement learning
  • o1's Hidden Chain of Thought , OpenAI's choice to hide o1's internal reasoning tokens from API users, motivated by IP protection and distillation defence at the cost of transparency and debuggability.
  • Object Detection , Locating and classifying objects within an image via bounding boxes
  • Open Images , Google's 9-million-image labelled dataset spanning 600 detection classes and 19,995 image-level labels.
  • OpenAI , AI organisation founded in 2015; produced GPT-3, ChatGPT, GPT-4 and o1
  • OpenAI Board Crisis , The November 2023 episode in which Sam Altman was briefly removed as OpenAI CEO
  • OpenAI Codex (2025 generation) , OpenAI's 2025 family of cloud and CLI software-engineering agents built over o3-class reasoning models, distinct from the 2021 Codex code-completion model.
  • OpenAI o3 , OpenAI's December 2024 reasoning model that achieved a breakthrough on the ARC-AGI benchmark via large-scale reinforcement learning on chain-of-thought.
  • OpenAssistant Conversations (OASST) , LAION-led crowdsourced conversational instruction-tuning dataset, released April 2023.
  • OpenHands , Open-source autonomous software-engineer agent, formerly **OpenDevin**, released March 2024 in response to Cognition's Devin demo; full sandboxed VM with browser, terminal, and IDE.
  • OpenVLA , Stanford and Berkeley's 2024 open-source vision-language-action model (Kim et al.); 7B parameters, Llama-2 backbone with Prismatic vision, trained on 970k Open X-Embodiment trajectories.
  • OpenWebText and OpenWebText2 , Open-source community replications of OpenAI's closed WebText corpus, built from Reddit-linked pages.
  • Out-of-Distribution Generalisation , The challenge of producing accurate predictions on inputs drawn from distributions other than the training distribution, including covariate shift, label shift, concept drift, and domain shift.
  • Outcome Reward Model , Reward model that scores only the final answer of a reasoning chain, providing a single sparse reward per trajectory.
  • Outer Alignment , The problem of specifying a training objective that captures human values
  • Overfitting , Fitting noise in the training data, harming generalisation

P

  • P-value , Probability of observing data at least as extreme as actual, if null is true
  • PAC Learning , Valiant's 1984 framework, Probably Approximately Correct learning
  • PAC-Bayes , A generalisation framework that bounds the expected risk of a stochastic predictor by its empirical risk plus a KL-divergence complexity penalty, yielding the tightest non-vacuous bounds known for deep neural networks.
  • PagedAttention , vLLM's KV cache management technique that borrows virtual-memory paging from operating systems to eliminate fragmentation.
  • PaLI and PaLI-3 , Google Research's Pathways Language and Image models (Chen et al. 2022, 2023); multilingual vision-language models combining a ViT image encoder and an mT5 text encoder-decoder, trained on 100+ languages.
  • PaLM-E , Google's 562B-parameter embodied multimodal language model (Driess et al. 2023) integrating PaLM with vision and robot state inputs to produce both natural language and grounded robot plans.
  • Pandemonium Architecture , Selfridge's 1959 distributed pattern recogniser, populations of feature-detecting "demons"
  • Pangu-Weather , Huawei's 3D Earth-specific Transformer for global weather forecasting with hierarchical temporal aggregation
  • Parallel Distributed Processing , The 1986 framework reviving connectionist cognitive modelling
  • Partial Derivative , Derivative with respect to one variable, holding others fixed
  • Particle Filter , A sequential Monte Carlo method that represents the state posterior as a weighted set of samples and updates it via importance sampling and resampling.
  • PCA (mathematical detail) , Principal component analysis, eigendecomposition of the covariance matrix
  • Perceptron , The simplest neural network: a single linear classifier with step activation
  • Perplexity , The standard language-model evaluation metric, exponential of cross-entropy per token
  • PGD Attack , Projected Gradient Descent attack, the standard iterative white-box adversarial attack used to evaluate robustness, applying repeated small FGSM steps with projection back into a norm ball.
  • Pi-Zero , Physical Intelligence's 2024 general-purpose robot foundation model; combines a PaliGemma VLM backbone with a flow-matching action expert producing 50Hz continuous action chunks.
  • Pipeline Parallelism , Splitting a model's layers across devices so that activations flow through a hardware pipeline.
  • Planning , Computing a sequence of actions to achieve a goal
  • Policy Gradient Theorem , The foundational theorem of policy-based reinforcement learning
  • Pooling , Downsampling feature maps by summarising local regions
  • Positional Encoding , Injecting sequence order information into permutation-invariant attention
  • Power and Cooling , The thermal and electrical reality of modern AI clusters, hundreds of kilowatts per rack, liquid cooling required at high density, and site selection driven by grid availability.
  • PPO , Schulman et al.'s 2017 Proximal Policy Optimization, clipped surrogate policy gradient
  • Precision (classification) , Fraction of predicted positives that are true positives
  • Principal Component Analysis , Linear dimension reduction by projecting onto directions of maximum variance
  • Privacy in ML , Techniques and threat models for protecting personal data used in training, fine-tuning or inference of machine-learning models.
  • Probability Distribution , How probability is allocated across possible values of a random variable
  • Process Reward Model , Neural network that scores each intermediate step of a reasoning chain, trained on human or AI-generated step-correctness labels.
  • Process Supervision , Training paradigm that rewards each intermediate reasoning step rather than only the final answer, introduced by OpenAI's "Let's Verify Step by Step" paper.
  • Prolog , The canonical logic-programming language; backward chaining over Horn clauses
  • Prompt Engineering , Crafting effective prompts to elicit desired LLM behaviour
  • Prompt Injection , An attack in which adversarial instructions hidden in input or retrieved content override the developer's prompt.
  • Protein Folding , The process by which an amino-acid sequence determines a 3D protein structure
  • Pruning , Removing weights or structural units from a trained network to reduce size and inference cost while preserving accuracy.
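
The Perplexity entry above defines it as the exponential of cross-entropy per token; a minimal sketch, assuming the model's probability for each observed token is given directly:

```python
import math

# Perplexity = exp(mean negative log-probability per token). A model that
# assigns every token probability 1/k has perplexity exactly k, so it can
# be read as an effective branching factor.
def perplexity(token_probs):
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

ppl_uniform = perplexity([0.25] * 8)  # every token predicted with p = 1/4
```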

Q

  • Q-Learning , Watkins's 1989 model-free RL algorithm, learn the optimal action-value function
  • Quantisation , Compressing neural network weights and activations to low-bit integer or float formats for memory-efficient inference.
  • Quantisation for Inference , Reducing model numerical precision below BF16, to INT8, FP8, INT4 or FP4, to multiply inference throughput and shrink memory footprint, ideally with under 1% quality loss.
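
The two Quantisation entries above can be made concrete with the simplest scheme, symmetric per-tensor INT8: map floats in $[-\max|w|, +\max|w|]$ to integers in $[-127, 127]$ with a single scale. A minimal sketch (production schemes add per-channel scales, calibration, and outlier handling):

```python
# Symmetric INT8 quantisation: one scale per tensor, round to integers,
# then dequantise to see the rounding error that remains.
def quantise_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise(q, scale):
    return [qi * scale for qi in q]

w = [0.12, -0.5, 0.33, 1.0, -0.07]
q, s = quantise_int8(w)
w_hat = dequantise(q, s)  # each element recovered to within half a scale step
```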

R

  • Rademacher Complexity , A data-dependent measure of the richness of a function class equal to its expected best correlation with random sign noise, providing tight generalisation bounds that often improve on VC dimension.
  • Random Forest , Breiman's 2001 ensemble of decision trees with bagging and random feature selection
  • Random Forest (mathematical detail) , Random forest with full bagging and random feature selection algorithm
  • Random Variable , A numerical quantity whose value is determined by a random outcome
  • Re-Ranking , Second-stage retrieval refinement in which a slow, accurate **cross-encoder** rescores the top-$k$ candidates returned by a fast bi-encoder vector search, dramatically improving RAG quality.
  • ReAct , Agentic prompting pattern that interleaves verbal **Reasoning** ("Thought") with **Acting** (tool calls) and **Observation** of tool output, forming the foundation of modern LLM agent loops.
  • Reasoning Model Training , A 2024-2025 training paradigm in which models are reinforced for producing chain-of-thought traces that lead to verifiably correct outputs, distinguishing frontier models from earlier RLHF-only systems.
  • Recall (classification) , Fraction of actual positives correctly predicted
  • Recommendation System , Algorithms that suggest items users are likely to engage with
  • Recurrent Neural Network , A neural network with a feedback loop processing sequences one element at a time
  • Red-Teaming (LLMs) , Systematic adversarial testing of an AI system to discover failure modes, dangerous capabilities and exploitable vulnerabilities before deployment.
  • RedPajama , Together AI's open replication of the LLaMA training mixture, released in versions v1 (1.2T tokens) and v2 (30T tokens).
  • Regularisation , Adding a penalty to a learning objective to prevent overfitting
  • Reinforcement Learning , Learning a policy by interacting with an environment for reward
  • ReLU , The rectified linear unit $\mathrm{ReLU}(x) = \max(0, x)$
  • Residual Connection , Skip connection that adds a layer's input to its output
  • Residual Stream , The running activation in a Transformer that each layer reads from and writes to
  • ResNet , He et al.'s 2015 deep CNN with residual (skip) connections
  • Resolution , Robinson's 1965 single-rule complete inference procedure for first-order logic
  • Responsible Scaling Policy (RSP) , A framework, originated by Anthropic in 2023, in which capability thresholds trigger pre-specified safety commitments before further scaling.
  • Restricted Boltzmann Machine , A Boltzmann machine restricted to a bipartite visible-hidden graph
  • Retrieval-Augmented Generation , Augmenting a language model with retrieval from a corpus to reduce hallucination
  • Reward Hacking , An agent finding unintended ways to score high on its reward function
  • RFDiffusion , Diffusion model for de novo protein design built on RoseTTAFold, capable of generating novel proteins with prescribed function
  • RLAIF , Reinforcement Learning from AI Feedback, replaces human raters with an LLM rater to scale preference labelling, with Anthropic's Constitutional AI as the canonical example.
  • RLAIF and Magpie , Synthetic preference-data techniques: RLAIF substitutes AI feedback for human ratings; Magpie extracts instructions from LLMs themselves.
  • RLHF , Reinforcement Learning from Human Feedback: aligning LLMs to preferences
  • RLHF Pipeline , Three-stage recipe (SFT, reward model, PPO) to align a pretrained LLM with human preferences.
  • RNN-Transducer , Graves's 2012 streaming sequence-to-sequence model combining a frame encoder, a label prediction network, and a joint network, the workhorse of on-device speech recognition.
  • RoPE , Rotary Position Embedding, the standard positional encoding for modern LLMs
  • RT-1 and RT-2 , Google DeepMind's Robotic Transformer models (Brohan et al. 2022, 2023); RT-1 introduced a transformer policy trained on 130k robot demonstrations, RT-2 reformulated robot actions as text tokens emitted by a VLM.
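
The Recall entry above pairs with the Precision entry under P and the F1 score listed at the top of the glossary; a minimal sketch computing all three from binary labels:

```python
# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = harmonic mean of the two.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```

The harmonic mean makes F1 punish imbalance: a classifier with perfect recall but poor precision scores far below the arithmetic average of the two.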

S

  • Scalable Oversight , Providing useful training signal for AI tasks where humans cannot directly evaluate behaviour
  • Scaling Laws , Power-law relationships between model quality and compute, data, parameters
  • Score Matching , Training generative models by matching gradients of log-density rather than densities directly
  • Script , Schank and Abelson's 1977 knowledge structure for stereotyped sequences of events
  • Self-Attention , Attention where queries, keys and values all come from the same input
  • Self-Consistency , Wang et al.'s 2022 reasoning-improvement technique, sample many CoT, take majority answer
  • Self-Distillation , Training the next generation of a model on filtered outputs from the current generation, used in iterative refinement loops such as Llama 3 instruction tuning.
  • Self-Play on Verifiable Rewards , Reinforcement learning paradigm where an LLM trains itself by generating attempts on tasks whose correctness can be machine-checked, producing a clean reward signal without human labels.
  • Self-Reflection , Family of agentic techniques in which an LLM **evaluates its own output, identifies flaws, and revises**, including Reflexion (Shinn 2023), Self-Refine (Madaan 2023), and Constitutional AI critique passes.
  • Self-Supervised Learning , Learning from unlabelled data by generating supervision from its structure
  • Semantic Network , A graph of concepts connected by labelled relations; Quillian's 1968 model of memory
  • Semantic Segmentation , Assigning a class label to every pixel in an image
  • SentencePiece , Language-agnostic subword tokeniser that operates on raw byte streams including whitespace, supporting both BPE and unigram language model segmentation.
  • Sequence-to-Sequence , Encoder-decoder architecture mapping variable-length input to output
  • Sequential Recommendation , A class of recommender models that treat a user's interaction history as a sequence and predict the next item using a Transformer or RNN.
  • Shannon Entropy , The expected information content of a random variable, measured in bits
  • ShareGPT and Vicuna , 70,000-conversation ChatGPT-export dataset that powered the first open ChatGPT replicas (Vicuna, Koala).
  • SHRDLU , Winograd's 1970 natural-language system that conversed about a virtual blocks world
  • Sigmoid Function , The S-shaped function $\sigma(x) = 1/(1 + e^{-x})$ mapping reals to $(0, 1)$
  • Singular Value Decomposition , The factorisation $A = U \Sigma V^\top$ revealing the principal axes of any matrix
  • Skill Library , Agent architecture (Wang et al.'s **Voyager** 2023) in which the agent **writes, stores, and re-uses** its own programmatic skills as a growing library, demonstrated by an open-ended Minecraft agent.
  • SLAM , Simultaneous Localisation and Mapping, the joint estimation of a robot's pose and a map of its environment from sensor data.
  • SlimPajama , Cerebras's 627B-token deduplicated and quality-filtered subset of RedPajama, designed for efficient pre-training.
  • Softmax , The function mapping a vector of logits to a probability distribution
  • Solomonoff Induction , An idealised universal theory of inductive inference based on algorithmic probability
  • Sora , OpenAI's February 2024 diffusion Transformer for video, treating spacetime patches as tokens and trained on variable resolutions, durations, and aspect ratios.
  • Sparse Autoencoder (interpretability) , Wide overcomplete autoencoders that disentangle superposed features in neural networks
  • Specification Gaming , When an AI achieves the literal specification while violating the intent
  • Speculative Decoding , Leviathan et al.'s 2023 inference-time acceleration via a small draft model
  • SQuAD , Stanford Question Answering Dataset, foundational extractive reading-comprehension benchmark.
  • Stable Diffusion , Stability AI's 2022 latent diffusion model released open-source
  • Stack Exchange and Stack Overflow Corpus , Programmer Q&A archive, millions of question-answer pairs that gave early LLMs their coding ability.
  • Stanford AI Lab , Founded by McCarthy in 1963; one of the two great early AI institutions
  • Stanley (DARPA Grand Challenge) , The 2005 Stanford autonomous vehicle that won the DARPA Grand Challenge
  • STaR (Self-Taught Reasoner) , Zelikman et al. 2022 method that bootstraps reasoning ability by generating chain-of-thought rationales, filtering for those that reach correct answers, and fine-tuning on the survivors.
  • State-Space Model , A sequence model with continuous-time linear dynamics; basis of Mamba
  • Statistical Learning Theory , The mathematical theory of learning from finite samples, founded by Vapnik and Chervonenkis
  • Stochastic Gradient Descent , Gradient descent using noisy gradients from mini-batches of data
  • Stochastic Gradient Descent (mathematical detail) , SGD with full mathematical formulation, momentum, Nesterov, convergence analysis
  • STRIPS , Fikes and Nilsson's 1971 planning formalism, preconditions and add/delete lists
  • Structural Risk Minimisation , Choosing the model class with minimum capacity sufficient to fit the data
  • Structured Outputs , API mode that **guarantees** model output conforms to a developer-supplied JSON Schema, implemented at the decoder level via constrained sampling rather than prompt-engineered hope.
  • Subsumption Architecture , Brooks's 1986 layered control architecture for behaviour-based robotics
  • Supervised Learning , Learning a function from labelled input-output pairs
  • Support Vector Machine , Vapnik's maximum-margin classifier, dominant supervised method 1995–2012
  • SVM (mathematical detail) , Support vector machine, maximum-margin classifier with Lagrangian dual and kernel trick
  • SWE-Bench , A 2024 benchmark of real GitHub issues from popular Python repositories, requiring AI systems to generate code patches that pass the project's test suite.
  • Symbolic AI , The classical AI tradition based on explicit symbol manipulation and logical inference
  • Synthetic Content Detection , Classifiers and forensic methods that attempt to determine whether a piece of content was AI-generated rather than human-produced.
  • Synthetic Data for Reasoning , Generating artificial problem/solution pairs to train reasoning models, exploiting verifiable correctness in math, code, and formal proofs.
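Two of the S entries, Softmax and the Sigmoid Function, are one-liners in code. A minimal NumPy sketch, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(logits):
    """Map a vector of logits to a probability distribution.

    Subtracting the max first leaves the result unchanged (softmax
    is shift-invariant) but prevents exp() overflow on large logits.
    """
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x}), mapping reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```

The two are related: sigmoid is the two-class special case, since `softmax([x, 0])[0] == sigmoid(x)`.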

T

  • Tanh , The hyperbolic tangent activation function, a zero-centred sigmoid mapping to $(-1, 1)$.
  • TD-Gammon , Tesauro's 1992 self-play backgammon program, the foundational proof-of-concept for reinforcement learning with neural networks
  • Temperature (sampling) , A scalar that controls the sharpness of a softmax distribution
  • Temporal-Difference Learning , Sutton's 1988 method updating value estimates by the difference between successive predictions
  • Tensor , A multi-dimensional array generalising scalars, vectors and matrices, the fundamental data structure of modern deep learning.
  • Tensor Cores , Specialised Nvidia matrix-multiply units that perform a small dense MMA (matrix multiply-accumulate) per cycle, providing the bulk of an AI GPU's FLOPs across successive numerical formats.
  • Tensor Parallelism , Splitting individual matrix multiplications across devices, as in Megatron-LM's column-parallel and row-parallel linear layers.
  • Tesla FSD , Tesla's Full Self-Driving system, an end-to-end vision-only neural network mapping camera input directly to driving controls
  • Test-Time Compute Scaling , Increasing inference-time compute to improve model performance
  • Test-Time Compute Scaling Laws , Empirical relationships showing that performance on reasoning tasks improves predictably as more compute is spent at inference time, often dollar-for-dollar more efficiently than scaling training.
  • The Pile , EleutherAI's 825 GB curated mixture of 22 sources, foundational training set for GPT-Neo and Pythia.
  • The Stack and The Stack v2 , Hugging Face / BigCode's permissively licensed code corpus, the canonical open code training set.
  • Thinking Tokens , Special tokens that delimit a model's internal reasoning from its user-facing answer, central to the reasoning-model training paradigm of 2024–2025.
  • Tokenisation , Splitting text into discrete units for language model input
  • Tool Use , General capability of an LLM agent to invoke external software, calculators, web search, code execution, databases, REST APIs, in order to overcome the limits of pure parametric memory.
  • Toolformer , Self-supervised method (Schick et al. 2023) that teaches an LLM to decide *when* and *how* to call tools by automatically annotating its training corpus with tool calls and keeping only those that lower perplexity.
  • Top-k Sampling , Sample next token from the top k most probable; zero out the rest
  • Top-p (Nucleus) Sampling , Holtzman et al.'s 2019 adaptive sampling, keep smallest set whose probabilities sum past p
  • Topic Model , A statistical model that discovers thematic structure in document collections
  • TPU Systolic Array , Google's custom AI accelerator built around a large 2D grid of processing elements through which data flows rhythmically, optimised for dense matrix multiplication at pod scale.
  • Train a CLIP-style Multimodal Model , Recipe to pretrain a CLIP-style image–text dual encoder with symmetric InfoNCE on web image–caption pairs.
  • Train a Diffusion Model , End-to-end recipe to train a DDPM-style image diffusion model with UNet, EMA, and classifier-free guidance.
  • Train GPT-2 Recipe , End-to-end recipe to pretrain a 125M-parameter GPT-2-style decoder-only Transformer from scratch on a web corpus.
  • Train ResNet on ImageNet , End-to-end recipe to train ResNet-50 from scratch on ImageNet-1k to ~76% top-1 accuracy.
  • Training-Cluster Economics , The cost structure of training a frontier model, dominated over the hardware lifetime by power, cooling and networking rather than GPU purchase price.
  • Training, Validation, and Test Sets , Three-way data split for fitting, tuning, and honest evaluation
  • Transfer Learning , Adapting a model trained on one task to perform another
  • Transfer Learning (NLP) , Pretraining on large corpora and fine-tuning for specific tasks
  • Transformer , The 2017 attention-based architecture that became the foundation of modern AI
  • Transformer Decoder , Stack of causal self-attention and feed-forward layers for generation
  • Transformer Encoder , Stack of self-attention and feed-forward layers producing representations
  • Tree of Thoughts , Generalisation of chain-of-thought (Yao et al. 2023) in which the LLM explores **multiple reasoning paths in a tree**, evaluates partial states, and uses search algorithms (BFS, DFS, best-first) to find a good answer.
  • Triplet Loss , Margin-based loss for learning embeddings, anchor closer to positive than to negative
  • TruthfulQA , 817 questions designed to elicit imitative falsehoods, common misconceptions repeated by humans online.
  • Turing Machine , An abstract model of computation, finite states, an unbounded tape, a read-write head
  • Two-Tower Recommender , A neural recommender with separate encoders for users and items whose dot product is used for retrieval at scale.
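The three sampling entries above, Temperature, Top-k, and Top-p, compose naturally into a single decoding step. A minimal sketch of how the controls interact, not any particular library's API:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None,
                      rng=None):
    """One decoding step combining the three sampling controls.

    Temperature rescales the logits before softmax; top-k keeps only
    the k most probable tokens; top-p (nucleus) keeps the smallest
    set whose cumulative probability passes p. Survivors are
    renormalised before sampling.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]          # k-th largest probability
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        order = np.argsort(probs)[::-1]          # descending probability
        csum = np.cumsum(probs[order])
        keep = csum - probs[order] < top_p       # smallest set summing past p
        mask = np.zeros_like(probs)
        mask[order[keep]] = 1.0
        probs = probs * mask
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With `top_k=1` this reduces to greedy decoding, and temperature near zero approaches it; the two knobs are often used together in practice.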

U

  • U-Net , Encoder-decoder convolutional network with skip connections that became the foundational architecture for medical image segmentation
  • UltraChat and UltraFeedback , Tsinghua's synthetic datasets, 1.5M instruction-tuning conversations (UltraChat) and model-preference ratings (UltraFeedback), generated by GPT-3.5/GPT-4.
  • Underfitting , A model too simple or too constrained to capture the true structure of the data, exhibiting high training and high test error.
  • Universal Approximation Theorem , The theorem that a feedforward neural network with a single hidden layer and enough units can approximate any continuous function on a compact set to arbitrary precision.
  • Unsupervised Learning , Finding structure in data without labels

V

  • VALL-E , Microsoft's 2023 neural codec language model that performs zero-shot voice cloning from a 3-second prompt by treating EnCodec tokens as a discrete language.
  • Vanishing Gradient , Gradients shrinking to near-zero as they propagate through deep networks
  • Vanishing Gradient Problem , The exponential decay of gradients in deep or recurrent networks during backpropagation
  • Variance , A measure of how spread out a random variable is around its mean, central to uncertainty quantification in machine learning.
  • Variational Autoencoder , Kingma and Welling's 2013 probabilistic deep generative model
  • Variational Inference , Approximating intractable posteriors by optimisation rather than sampling
  • VC Dimension , Vapnik–Chervonenkis dimension, the capacity of a hypothesis class
  • Vector , An ordered list of numbers representing a point or direction in space, the basic data type of machine learning.
  • Vector Database , Specialised database for storing and querying high-dimensional embedding vectors via approximate nearest-neighbour search; the backing store for RAG, agent memory, and semantic search.
  • Vector Quantisation , Replacing a continuous vector with the index of the nearest entry in a learned codebook, used to discretise neural representations.
  • Veo , Google DeepMind's 2024 high-fidelity video generation model; Veo 2 produces 1080p clips with strong physical realism via a latent diffusion Transformer.
  • Verifiable Rewards , Tasks where ground-truth correctness can be checked by a deterministic procedure rather than learned from human preferences, enabling clean reinforcement-learning signals.
  • Video Diffusion Models , Class of diffusion models that generate temporally coherent video clips by extending image diffusion to spacetime; includes Sora, Veo, Runway Gen-3, Stable Video Diffusion, and Kling.
  • Video Understanding Models , Self-supervised video representation models such as VideoMAE, V-JEPA, and InternVideo; pretrained on unlabelled video to learn temporal and spatial features for downstream classification, retrieval, and VLM input.
  • Visible vs Hidden Thinking Tokens , Architectural choice in reasoning models between exposing extended thinking traces (Claude 4, DeepSeek-R1) or concealing them (o1, o3), affecting trust, distillability, and debuggability.
  • Vision Transformer , Dosovitskiy et al.'s 2020 application of the Transformer to image classification
  • Vision Transformer (mathematical detail) , Mathematical formulation of the Vision Transformer
  • Vision-Language Model , Generic class of neural networks that jointly process images and text in a single model, enabling tasks such as visual question answering, image captioning, and grounded reasoning across modalities.
  • vLLM , Open-source high-throughput LLM inference server built around PagedAttention, continuous batching, and prefix caching.
  • Von Neumann Architecture , The stored-program computer architecture that became the template for every general-purpose CPU
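The Vector Quantisation entry reduces to a nearest-neighbour lookup against a codebook. A minimal sketch with a fixed, untrained codebook (in VQ-VAE-style models the codebook is itself learned):

```python
import numpy as np

def quantise(vectors, codebook):
    """Replace each vector with the index of its nearest codebook
    entry under Euclidean distance, returning both the discrete
    indices and the reconstructed (quantised) vectors.
    """
    # Pairwise squared distances, shape (n_vectors, n_codes).
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)
    return indices, codebook[indices]
```

The indices are what downstream models consume, e.g. as the discrete "audio tokens" of neural codec language models such as VALL-E.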

W

  • Wasserstein Distance , The Earth Mover's Distance between probability distributions
  • Watermarking AI Content , Cryptographically or statistically embedding an imperceptible signal in AI-generated images, audio, video or text to enable later detection of synthesis.
  • wav2vec 2.0 , Baevski et al.'s 2020 self-supervised speech encoder, a CNN feature extractor followed by a Transformer trained with a contrastive loss over quantised latent codes.
  • WaveNet , DeepMind's 2016 autoregressive raw-audio generator using stacked dilated causal convolutions, the breakthrough that made neural text-to-speech sound human.
  • Waymo Driver , Waymo's autonomous-driving stack combining lidar, camera and radar in a modular perception-prediction-planning pipeline running a commercial robotaxi service
  • WebText and WebText2 , OpenAI's Reddit-karma-filtered web scrape, used to pre-train GPT-2 (WebText) and as a component of GPT-3 (WebText2).
  • Weight Decay , An L2 regularisation technique that pulls parameters toward zero at each update step.
  • Whisper , OpenAI's encoder-decoder Transformer for multilingual, multitask speech recognition trained on 680,000 hours of weakly supervised web audio.
  • Wikipedia (training corpus) , Universal pretraining ingredient, high-quality, encyclopedic, multilingual, freely licensed text.
  • WinoGrande , 44K Winograd-Schema-style pronoun-resolution problems for commonsense reasoning.
  • Word2Vec , Learning dense word embeddings by predicting context
  • WordPiece , Subword tokenisation algorithm that, like BPE, builds a vocabulary by merging symbol pairs, but selects merges that maximise corpus likelihood rather than raw frequency.
  • World Model , A learned generative model of an environment's dynamics, used for planning or policy learning by simulating imagined rollouts.
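The Weight Decay entry can be stated as a single update rule. A minimal sketch of the decoupled (AdamW-style) form; for plain SGD this coincides with L2 regularisation, but for adaptive optimisers the two differ:

```python
import numpy as np

def sgd_step_weight_decay(w, grad, lr=0.1, wd=0.01):
    """One SGD step with decoupled weight decay: parameters are
    shrunk toward zero by a factor of lr * wd each update, applied
    separately from the loss gradient."""
    return w - lr * grad - lr * wd * w
```

With a zero gradient the update is a pure multiplicative shrink by `(1 - lr * wd)`, which makes the "pulls parameters toward zero" behaviour easy to see.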

Z

  • ZeRO , DeepSpeed's Zero Redundancy Optimiser, which progressively partitions optimiser states, gradients, and parameters across data-parallel workers.

This site is currently in Beta. Contact: Chris Paton


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).