Glossary
150 terms
A
- Accuracy, Precision, Recall and F1 — Standard metrics for evaluating classification performance
- Activation Function — The nonlinear element that gives neural networks their expressive power
- Adam — Adaptive optimiser combining momentum with per-parameter learning rates
- Adversarial Example — Input with small perturbations crafted to fool a model
- Agent — An entity that perceives and acts in an environment to achieve goals
- AI Alignment — Ensuring AI systems pursue objectives matching human values and intentions
- AI Safety — Preventing unintended harms from AI systems
- Anomaly Detection — Identifying observations that deviate significantly from expected patterns
- Artificial General Intelligence — Hypothetical AI with human-level flexibility across all domains
- Artificial Intelligence — The study of making machines do tasks that require intelligence
- Attention Mechanism — Learning to weight input parts by relevance to an output position
- Attention Weights — Softmax-normalised scores determining how much each input contributes
- AUC — Area under the ROC curve, a threshold-independent classifier metric
- Autoencoder — Neural network trained to reconstruct its input through a bottleneck
- Autoregressive Model — Model that generates sequences one element at a time, conditioned on previous elements
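As a quick sketch of the first entry above, the four metrics can be computed directly from true and predicted binary labels (labels here are made up for illustration):

```python
# Precision, recall and F1 from binary true/predicted labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
```

Accuracy would simply count matching labels; F1 is the harmonic mean of precision and recall, so it is high only when both are.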
B
- Backpropagation — Algorithm for computing gradients in neural networks via the chain rule
- Batch Normalisation — Normalising activations within each mini-batch to stabilise training
- Batch Size — Number of examples processed in parallel during training
- Bayes' Theorem — Rule for updating beliefs given new evidence
- BERT — Bidirectional transformer encoder pretrained via masked language modelling
- Bias (Fairness) — Systematic errors disadvantaging particular groups of people
- Bias-Variance Tradeoff — Tension between model simplicity and flexibility in generalisation
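Bayes' theorem from this section can be illustrated with the classic screening-test calculation (the sensitivity, specificity and prevalence figures below are assumed purely for the example):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Assumed numbers: 99% sensitivity, 95% specificity, 1% prevalence.
p_disease = 0.01
p_pos_given_disease = 0.99       # sensitivity
p_pos_given_healthy = 0.05       # 1 - specificity

# Law of total probability: overall chance of a positive test.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive result.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Despite the accurate test, the posterior is only about 1/6, because the condition is rare, which is exactly the kind of belief update the theorem formalises.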
C
- Central Limit Theorem — Suitably normalised sums of many independent random variables tend toward a Gaussian
- Chain Rule — Rule for differentiating compositions of functions
- Chain-of-Thought — Prompting LLMs to produce intermediate reasoning before final answers
- CLIP — Contrastive vision-language model aligning images and text in a shared space
- Computer Vision — Automatic extraction of meaning from images and video
- Conditional Probability — The probability of A given that B has occurred
- Confidence Interval — Range of plausible values for a parameter, with stated confidence level
- Convolution — Sliding a small kernel over an input to produce a feature map
- Convolutional Neural Network — Neural network with convolutional layers, ideal for images
- Cross-Entropy — The loss function for classification under probabilistic predictions
- Cross-Validation — Estimating generalisation performance by repeated train-test splits
- Curse of Dimensionality — Exponential data requirements and geometric anomalies in high dimensions
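The convolution entry above can be made concrete with a minimal 1-D version (strictly speaking this is cross-correlation, as in most deep learning libraries; the kernel here is an assumed edge-detector-style example):

```python
# Minimal 1-D "convolution": slide a kernel over the input and take
# dot products at each position to produce a feature map.
def conv1d(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A [1, 0, -1] kernel responds to the local slope of the signal.
feature_map = conv1d([1, 2, 3, 4, 5], [1, 0, -1])
```

On a linear ramp the slope is constant, so every feature-map entry is the same.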
D
- Data Drift — Change in the distribution of input features over time
- DBSCAN — Density-based clustering that discovers arbitrarily shaped clusters and noise
- Decision Tree — A model that partitions feature space with axis-aligned splits
- Deep Learning — Machine learning with deep neural networks
- Derivative — The instantaneous rate of change of a function
- Differential Privacy — Rigorous framework for bounded information leakage from data analysis
- Diffusion Model — Generative model that learns to reverse a gradual noising process
- Dimensionality Reduction — Projecting high-dimensional data into fewer dimensions
- Dot Product — Scalar measure of similarity between two vectors
- Dropout — Randomly zeroing neurons during training to prevent overfitting
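The derivative entry above can be checked numerically: a central-difference quotient approximates the instantaneous rate of change without any symbolic calculus (the step size below is an assumed, conventional choice):

```python
# Central-difference approximation to a derivative:
# f'(x) ~ (f(x + h) - f(x - h)) / (2h) for small h.
def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

# For f(x) = x**2 the exact derivative at x = 3 is 2*3 = 6.
slope = central_difference(lambda x: x ** 2, 3.0)
```

This is also how gradient implementations are commonly sanity-checked ("gradient checking") before training.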
E
- Early Stopping — Halting training when validation performance starts to degrade
- Eigenvalue and Eigenvector — A direction preserved by a linear transformation (the eigenvector), and the factor by which it is scaled (the eigenvalue)
- Embedding — A dense vector representation of a discrete entity
- Ensemble Methods — Combining multiple models into a single stronger predictor
- Entropy — Average information or uncertainty in a distribution
- Epoch — One complete pass through the training dataset
- Expectation — The mean or average value of a random variable
- Explainable AI — Making AI decisions understandable to humans
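The entropy entry above translates directly into code; a sketch in bits, with the two coin distributions chosen purely as illustrative examples:

```python
import math

# Shannon entropy of a discrete distribution, in bits.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])    # fair coin: maximally uncertain, 1 bit
h_biased = entropy([0.9, 0.1])  # biased coin: less uncertain
```

A fair coin carries exactly one bit of uncertainty; any bias reduces the entropy.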
F
- Feature Engineering — Designing informative input features for a machine learning model
- Feature Map — Output of a convolutional layer representing detected features
- Federated Learning — Distributed training where data stays on user devices
- Fine-Tuning — Adapting a pretrained model to a new task by continuing training
G
- Gaussian Distribution — The normal bell curve, fundamental to statistics and ML
- Generative Adversarial Network — Two neural networks playing a minimax game for generative modelling
- Generative Model — Model that learns the data distribution and can sample from it
- GPT — Generative Pre-trained Transformer: decoder-only autoregressive language model
- Gradient — Vector of partial derivatives pointing toward steepest ascent
- Gradient Boosting — Sequential ensemble that fits each new learner to the loss's negative gradient
- Gradient Descent — Iterative optimisation by stepping downhill along the gradient
- GRU — Gated Recurrent Unit: a simpler alternative to LSTM
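Gradient descent from this section can be sketched in a few lines; the quadratic objective and learning rate below are assumed for the example:

```python
# Gradient descent on f(w) = (w - 4)**2, whose gradient is 2*(w - 4).
# Repeated steps downhill along the gradient converge toward w = 4.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 4)
    w -= learning_rate * grad
```

Stochastic gradient descent follows the same update rule but estimates `grad` from a random mini-batch rather than the full objective.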
H
- Hallucination — LLM generating plausible-sounding but false information
- Hierarchical Clustering — Building a tree of nested clusters by repeated merging or splitting
- Hyperparameter Tuning — Searching for optimal model settings that are not learned from data
- Hypothesis Testing — Formal procedure for deciding between competing claims about data
I
- In-Context Learning — LLMs adapting to new tasks via examples in the prompt, without weight updates
- Information Theory — Mathematical theory of communication, compression and uncertainty
- Integral — Continuous accumulation of a quantity, dual to the derivative
J
- Jacobian — Matrix of partial derivatives of a vector-valued function
- Joint Distribution — The distribution over two or more random variables simultaneously
K
- K-Means Clustering — Partitioning data into k clusters by alternating assignment and centroid updates
- K-Nearest Neighbours — Classifying or predicting by voting among the k closest training examples
- Kernel Trick — Computing high-dimensional inner products without explicit mapping
- KL Divergence — Asymmetric measure of difference between two distributions
- Knowledge Distillation — Training a smaller student model to mimic a larger teacher
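The asymmetry noted in the KL divergence entry is easy to demonstrate numerically (the two example distributions are assumed; values are in nats):

```python
import math

# KL divergence D(P || Q) between discrete distributions, in nats.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # D(P || Q)
reverse = kl_divergence(q, p)   # D(Q || P) -- generally different
```

Both directions are non-negative and zero only when the distributions match, but they disagree with each other, which is why the choice of direction matters in practice (e.g. in variational inference).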
L
- Language Model — A model that assigns probabilities to sequences of tokens
- Large Language Model — Massive transformer-based language model trained on vast text corpora
- Learning Rate — Step size in gradient-based optimisation; the most important hyperparameter
- Linear Regression — Fitting a linear relationship between features and a continuous target
- Logistic Regression — Linear classifier that outputs probabilities via the sigmoid
- Loss Function — A function measuring how poorly a model's predictions match the truth
- LSTM — Long Short-Term Memory: a gated RNN that learns long-range dependencies
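Simple linear regression from this section has a closed form worth seeing once: the least-squares slope is the covariance of x and y over the variance of x (the data points below are made up and lie exactly on y = 2x + 1):

```python
# Closed-form simple linear regression: fit y ~ a*x + b by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var          # slope = cov(x, y) / var(x)
    b = my - a * mx        # intercept passes through the means
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With more features there is an analogous matrix closed form (the normal equations), though in practice gradient-based fitting is common.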
M
- Machine Learning — Algorithms that improve at a task through experience with data
- Matrix — A rectangular array of numbers representing a linear transformation
- Matrix Multiplication — Composition of linear transformations via row-column dot products
- Maximum Likelihood Estimation — Choosing parameters to maximise the probability of the observed data
- Mixture of Experts — Architecture where a gate routes inputs to a subset of expert networks
- MLOps — Operational discipline for deploying and maintaining ML systems
- Multi-Head Attention — Running several attention operations in parallel to capture diverse patterns
- Multilayer Perceptron — Fully connected feedforward neural network with one or more hidden layers
- Multimodal Model — Model that processes or generates multiple data modalities
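The "row-column dot products" in the matrix multiplication entry can be written out literally (the 2x2 matrices below are arbitrary examples):

```python
# Matrix multiplication: C[i][j] is the dot product of row i of A
# with column j of B; inner dimensions must agree.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    assert len(A[0]) == inner, "inner dimensions must match"
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Real workloads use optimised libraries (BLAS, NumPy), but the triple loop is exactly what they compute.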
N
- Narrow AI — AI designed for a specific, bounded task
- Natural Language Processing — Computer processing and understanding of human language
- Neural Network — A composition of linear and nonlinear transformations for learning
- Normalising Flow — Generative model using invertible transformations for exact likelihood
O
- Object Detection — Locating and classifying objects within an image via bounding boxes
- Overfitting — Fitting noise in the training data, harming generalisation
P
- P-value — Probability of observing data at least as extreme as that actually seen, assuming the null hypothesis is true
- Partial Derivative — Derivative with respect to one variable, holding others fixed
- Perceptron — The simplest neural network: a single linear classifier with step activation
- Perplexity — Standard metric for language models: exponential of cross-entropy
- Pooling — Downsampling feature maps by summarising local regions
- Positional Encoding — Injecting sequence order information into permutation-invariant attention
- Principal Component Analysis — Dimensionality reduction by projecting onto directions of maximum variance
- Probability Distribution — How probability is allocated across possible values of a random variable
- Prompt Engineering — Crafting effective prompts to elicit desired LLM behaviour
- Pruning — Removing unnecessary parameters or structures from a neural network
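The perplexity entry defines it as the exponential of cross-entropy; a tiny sketch, using per-token probabilities assumed for the example:

```python
import math

# Perplexity = exp(average negative log-likelihood per token).
def perplexity(token_probs):
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that assigns probability 0.1 to every token behaves like a
# uniform guess over 10 options, so its perplexity is 10.
ppl = perplexity([0.1] * 5)
```

Intuitively, perplexity is the effective number of equally likely choices the model is "hesitating" between at each step; lower is better.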
Q
- Quantisation — Reducing numerical precision of model weights to save memory and compute
R
- Random Forest — Ensemble of decision trees trained via bagging and feature subsampling
- Random Variable — A numerical quantity whose value is determined by a random outcome
- Recommendation System — Algorithms that suggest items users are likely to engage with
- Recurrent Neural Network — Neural network with a hidden state that processes sequences step by step
- Regularisation — Techniques for constraining model complexity to improve generalisation
- Reinforcement Learning — Learning a policy by interacting with an environment for reward
- ReLU — Rectified Linear Unit: max(0, x), the dominant activation in deep learning
- Residual Connection — Skip connection that adds a layer's input to its output
- Retrieval-Augmented Generation — Grounding LLMs in external knowledge fetched at query time
- RLHF — Reinforcement Learning from Human Feedback: aligning LLMs to preferences
S
- Scaling Laws — Power-law relationships between model quality and compute, data, parameters
- Self-Attention — Attention where every position attends to every other position in the same sequence
- Self-Supervised Learning — Learning from unlabelled data by generating supervision from its structure
- Semantic Segmentation — Assigning a class label to every pixel in an image
- Sequence-to-Sequence — Encoder-decoder architecture mapping variable-length input to output
- Singular Value Decomposition — Factorisation of any matrix into rotation, scaling, rotation
- Softmax — Turns a vector of scores into a probability distribution
- Stochastic Gradient Descent — Gradient descent using noisy gradients from mini-batches of data
- Supervised Learning — Learning a function from labelled input-output pairs
- Support Vector Machine — Maximum-margin classifier leveraging the kernel trick
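The softmax entry above is short enough to implement in full; this sketch includes the standard max-subtraction trick that avoids overflow for large scores (the input scores are arbitrary examples):

```python
import math

# Softmax: map a vector of scores to a probability distribution.
# Subtracting the max score first is a standard numerical-stability
# trick and leaves the result unchanged.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive, sum to one, and preserve the ordering of the input scores, which is why softmax sits at the end of classifiers and inside attention.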
T
- Tensor — A multi-dimensional array generalising vectors and matrices
- Tokenisation — Splitting text into discrete units for language model input
- Tool Use — LLMs invoking external functions and APIs to extend their capabilities
- Training, Validation, and Test Sets — Three-way data split for fitting, tuning, and honest evaluation
- Transfer Learning — Adapting a model trained on one task to perform another
- Transfer Learning (NLP) — Pretraining on large corpora and fine-tuning for specific tasks
- Transformer — Deep learning architecture based entirely on self-attention
- Transformer Decoder — Stack of causal self-attention and feed-forward layers for generation
- Transformer Encoder — Stack of self-attention and feed-forward layers producing representations
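The tokenisation entry can be illustrated with a toy word-level tokeniser; real systems use subword schemes such as BPE, and the vocabulary below is entirely made up:

```python
# Toy whitespace tokeniser: split text, then map each token to an
# integer id via a small vocabulary, with an <unk> id for unknowns.
def tokenise(text, vocab):
    tokens = text.lower().split()
    unk = vocab["<unk>"]
    return [vocab.get(t, unk) for t in tokens]

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
ids = tokenise("The cat sat down", vocab)
```

"down" is absent from the vocabulary, so it maps to the `<unk>` id; subword tokenisers exist precisely to avoid throwing that information away.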
U
- Underfitting — Using a model too simple to capture the underlying structure
- Universal Approximation Theorem — A sufficiently wide neural network can approximate any continuous function on a compact domain to arbitrary accuracy
- Unsupervised Learning — Finding structure in data without labels
V
- Vanishing Gradient — Gradients shrinking to near-zero as they propagate through deep networks
- Variance — A measure of how spread out a random variable is around its mean
- Variational Autoencoder — Probabilistic autoencoder with a structured latent space
- Vector — An ordered list of numbers representing a point or direction
- Vision Transformer — Applying the transformer architecture to images
W
- Weight Decay — Shrinking weights toward zero at each update; equivalent to L2 regularisation under plain SGD
- Word2Vec — Learning dense word embeddings by predicting context
© 2026 Chris Paton. All rights reserved.