Glossary

A

  • Accuracy, Precision, Recall and F1 — Standard metrics for evaluating classification performance
  • Activation Function — The nonlinear element that gives neural networks their expressive power
  • Adam — Adaptive optimiser combining momentum with per-parameter learning rates
  • Adversarial Example — Input with small perturbations crafted to fool a model
  • Agent — An entity that perceives and acts in an environment to achieve goals
  • AI Alignment — Ensuring AI systems pursue objectives matching human values and intentions
  • AI Safety — Preventing unintended harms from AI systems
  • Anomaly Detection — Identifying observations that deviate significantly from expected patterns
  • Artificial General Intelligence — Hypothetical AI with human-level flexibility across all domains
  • Artificial Intelligence — The study of making machines perform tasks that would otherwise require human intelligence
  • Attention Mechanism — Learning to weight input parts by relevance to an output position
  • Attention Weights — Softmax-normalised scores determining how much each input contributes
  • AUC — Area under the ROC curve, a threshold-independent classifier metric
  • Autoencoder — Neural network trained to reconstruct its input through a bottleneck
  • Autoregressive Model — Model that generates sequences one element at a time, each conditioned on the previous elements
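
To make the first entry concrete, here is a minimal pure-Python sketch computing all four metrics from a small, made-up set of binary predictions:

```python
# Toy binary-classification results (illustrative values only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)           # fraction of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall = tp / (tp + fn)                      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

In practice these come from a library (e.g. scikit-learn), but the arithmetic is exactly this.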

B

  • Backpropagation — Algorithm for computing gradients in neural networks via the chain rule
  • Batch Normalisation — Normalising activations within each mini-batch to stabilise training
  • Batch Size — Number of examples processed in parallel during training
  • Bayes' Theorem — Rule for updating beliefs given new evidence
  • BERT — Bidirectional transformer encoder pretrained via masked language modelling
  • Bias (Fairness) — Systematic errors disadvantaging particular groups of people
  • Bias-Variance Tradeoff — Tension between model simplicity and flexibility in generalisation
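
The Bayes' Theorem entry can be illustrated with the classic rare-condition example (all numbers below are illustrative):

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E).
p_h = 0.01              # prior: P(condition)
p_e_given_h = 0.90      # sensitivity: P(positive test | condition)
p_e_given_not_h = 0.05  # false-positive rate: P(positive test | no condition)

# Total probability of a positive test (law of total probability).
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior: P(condition | positive test) -- surprisingly low, ~15%,
# because the condition is rare.
posterior = p_e_given_h * p_h / p_e
```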

C

  • Central Limit Theorem — Sums of independent random variables tend toward a Gaussian
  • Chain Rule — Rule for differentiating compositions of functions
  • Chain-of-Thought — Prompting LLMs to produce intermediate reasoning before final answers
  • CLIP — Contrastive vision-language model aligning images and text in a shared space
  • Computer Vision — Automatic extraction of meaning from images and video
  • Conditional Probability — The probability of A given that B has occurred
  • Confidence Interval — Range of plausible values for a parameter, with stated confidence level
  • Convolution — Sliding a small kernel over an input to produce a feature map
  • Convolutional Neural Network — Neural network with convolutional layers, ideal for images
  • Cross-Entropy — The loss function for classification under probabilistic predictions
  • Cross-Validation — Estimating generalisation performance by repeated train-test splits
  • Curse of Dimensionality — Exponential data requirements and geometric anomalies in high dimensions
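
A minimal sketch of the Convolution entry, sliding a kernel over a 1-D signal (ML convention: no kernel flip, i.e. technically cross-correlation):

```python
def conv1d(signal, kernel):
    """Slide a kernel over a 1-D signal (valid padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# A simple difference kernel responds to changes in the signal.
feature_map = conv1d([1, 2, 3, 4], [1, 0, -1])
```

The 2-D case used in CNNs is the same idea with the kernel sliding in two directions.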

D

  • Data Drift — Change in the distribution of input features over time
  • DBSCAN — Density-based clustering that discovers arbitrarily shaped clusters and noise
  • Decision Tree — A model that partitions feature space with axis-aligned splits
  • Deep Learning — Machine learning with deep neural networks
  • Derivative — The instantaneous rate of change of a function
  • Differential Privacy — Rigorous framework for bounded information leakage from data analysis
  • Diffusion Model — Generative model that learns to reverse a gradual noising process
  • Dimensionality Reduction — Projecting high-dimensional data into fewer dimensions
  • Dot Product — Scalar measure of similarity between two vectors
  • Dropout — Randomly zeroing neurons during training to prevent overfitting

E

  • Early Stopping — Halting training when validation performance starts to degrade
  • Eigenvalue and Eigenvector — A direction preserved by a linear transformation, and its scaling factor
  • Embedding — A dense vector representation of a discrete entity
  • Ensemble Methods — Combining multiple models into a single stronger predictor
  • Entropy — Average information or uncertainty in a distribution
  • Epoch — One complete pass through the training dataset
  • Expectation — The mean or average value of a random variable
  • Explainable AI — Making AI decisions understandable to humans
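
The Entropy entry in a few lines of Python (Shannon entropy in bits):

```python
import math

def entropy(p):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

fair_coin = entropy([0.5, 0.5])  # 1.0 bit: maximum uncertainty for two outcomes
biased = entropy([0.9, 0.1])     # lower: the outcome is more predictable
```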

F

  • Feature Engineering — Designing informative input features for a machine learning model
  • Feature Map — Output of a convolutional layer representing detected features
  • Federated Learning — Distributed training where data stays on user devices
  • Fine-Tuning — Adapting a pretrained model to a new task by continuing training

G

  • Gaussian Distribution — The normal bell curve, fundamental to statistics and ML
  • Generative Adversarial Network — Two neural networks playing a minimax game for generative modelling
  • Generative Model — Model that learns the data distribution and can sample from it
  • GPT — Generative Pre-trained Transformer: decoder-only autoregressive language model
  • Gradient — Vector of partial derivatives pointing toward steepest ascent
  • Gradient Boosting — Sequential ensemble that fits each new learner to the loss's negative gradient
  • Gradient Descent — Iterative optimisation by stepping downhill along the gradient
  • GRU — Gated Recurrent Unit: a simpler alternative to LSTM
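
A minimal sketch of Gradient Descent, minimising a toy quadratic whose gradient is known in closed form:

```python
def grad_descent(grad, x0, lr=0.1, steps=100):
    """Minimise a function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimise f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3.
x_min = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
```

In deep learning the gradient comes from backpropagation rather than a hand-written formula, but the update rule is the same.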

I

  • In-Context Learning — LLMs adapting to new tasks via examples in the prompt, without weight updates
  • Information Theory — Mathematical theory of communication, compression and uncertainty
  • Integral — Continuous accumulation of a quantity, dual to the derivative

J

  • Jacobian — Matrix of partial derivatives of a vector-valued function
  • Joint Distribution — The distribution over two or more random variables simultaneously

K

  • K-Means Clustering — Partition data into k clusters by alternating assignment and centroid update
  • K-Nearest Neighbours — Classify or predict by voting among the k closest training examples
  • Kernel Trick — Computing high-dimensional inner products without explicit mapping
  • KL Divergence — Asymmetric measure of difference between two distributions
  • Knowledge Distillation — Training a smaller student model to mimic a larger teacher
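
The K-Nearest Neighbours entry as a short sketch, with a toy 2-D dataset chosen purely for illustration:

```python
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Vote among the k training points closest (squared Euclidean) to the query."""
    dists = sorted(
        (sum((q - p) ** 2 for q, p in zip(query, pt)), lbl)
        for pt, lbl in zip(points, labels)
    )
    top = [lbl for _, lbl in dists[:k]]
    return Counter(top).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
pred = knn_classify((1, 1), points, labels, k=3)  # all 3 nearest neighbours are "a"
```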

L

  • Language Model — A model that assigns probabilities to sequences of tokens
  • Large Language Model — Massive transformer-based language model trained on vast text corpora
  • Learning Rate — Step size in gradient-based optimisation; the most important hyperparameter
  • Linear Regression — Fitting a linear relationship between features and a continuous target
  • Logistic Regression — Linear classifier that outputs probabilities via the sigmoid
  • Loss Function — A function measuring how poorly a model's predictions match the truth
  • LSTM — Long Short-Term Memory: a gated RNN that learns long-range dependencies
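
Linear Regression in one dimension has a closed-form least-squares solution; a minimal sketch on an exactly linear toy dataset:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - w * mx  # the fitted line passes through the mean point
    return w, b

w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 2x + 1
```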

M

  • Machine Learning — Algorithms that improve at a task through experience with data
  • Matrix — A rectangular array of numbers representing a linear transformation
  • Matrix Multiplication — Composition of linear transformations via row-column dot products
  • Maximum Likelihood Estimation — Choosing parameters to maximise the probability of the observed data
  • Mixture of Experts — Architecture where a gate routes inputs to a subset of expert networks
  • MLOps — Operational discipline for deploying and maintaining ML systems
  • Multi-Head Attention — Running several attention operations in parallel to capture diverse patterns
  • Multilayer Perceptron — Fully connected feedforward neural network with one or more hidden layers
  • Multimodal Model — Model that processes or generates multiple data modalities
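
Matrix Multiplication as row-column dot products, in plain Python:

```python
def matmul(a, b):
    """Multiply matrices: entry (i, j) is the dot product of row i of a
    with column j of b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

c = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Real systems use optimised BLAS routines, but the definition is exactly this triple loop.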

O

  • Object Detection — Locating and classifying objects within an image via bounding boxes
  • Overfitting — Fitting noise in the training data, harming generalisation

P

  • P-value — Probability of observing data at least as extreme as that actually seen, assuming the null hypothesis is true
  • Partial Derivative — Derivative with respect to one variable, holding others fixed
  • Perceptron — The simplest neural network: a single linear classifier with step activation
  • Perplexity — Standard metric for language models: exponential of cross-entropy
  • Pooling — Downsampling feature maps by summarising local regions
  • Positional Encoding — Injecting sequence order information into permutation-invariant attention
  • Principal Component Analysis — Dimensionality reduction by projecting onto directions of maximum variance
  • Probability Distribution — How probability is allocated across possible values of a random variable
  • Prompt Engineering — Crafting effective prompts to elicit desired LLM behaviour
  • Pruning — Removing unnecessary parameters or structures from a neural network
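
The Perplexity entry, sketched directly from its definition as the exponential of average negative log-probability:

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-probability the model assigned
    to each observed token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns each token probability 0.1 behaves like a uniform
# guess over 10 options, so its perplexity is 10.
ppl = perplexity([0.1] * 5)
```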

Q

  • Quantisation — Reducing numerical precision of model weights to save memory and compute
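
A minimal sketch of the Quantisation entry: symmetric linear quantisation of a few illustrative weights to signed 8-bit codes and back:

```python
def quantise(weights, bits=8):
    """Symmetric linear quantisation to signed integers, then dequantise."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for int8
    scale = max(abs(w) for w in weights) / qmax  # map the largest weight to qmax
    codes = [round(w / scale) for w in weights]  # integer codes (what gets stored)
    return [c * scale for c in codes]            # dequantised approximation

approx = quantise([0.12, -0.5, 0.33])  # close to, but not exactly, the originals
```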

R

  • Random Forest — Ensemble of decision trees trained via bagging and feature subsampling
  • Random Variable — A numerical quantity whose value is determined by a random outcome
  • Recommendation System — Algorithms that suggest items users are likely to engage with
  • Recurrent Neural Network — Neural network with a hidden state that processes sequences step by step
  • Regularisation — Techniques for constraining model complexity to improve generalisation
  • Reinforcement Learning — Learning a policy by interacting with an environment for reward
  • ReLU — Rectified Linear Unit: max(0, x), the dominant activation in deep learning
  • Residual Connection — Skip connection that adds a layer's input to its output
  • Retrieval-Augmented Generation — Grounding LLMs in external knowledge fetched at query time
  • RLHF — Reinforcement Learning from Human Feedback: aligning LLMs to preferences
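
The ReLU and Residual Connection entries combine naturally into a few lines:

```python
def relu(x):
    """Rectified Linear Unit: max(0, v) applied elementwise."""
    return [max(0.0, v) for v in x]

def residual_block(x, layer):
    """Skip connection: add the block's input to its output."""
    return [xi + yi for xi, yi in zip(x, layer(x))]

# Negative entries are zeroed by ReLU, then the original input is added back.
out = residual_block([1.0, -2.0, 3.0], relu)
```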

T

  • Tensor — A multi-dimensional array generalising vectors and matrices
  • Tokenisation — Splitting text into discrete units for language model input
  • Tool Use — LLMs invoking external functions and APIs to extend their capabilities
  • Training, Validation, and Test Sets — Three-way data split for fitting, tuning, and honest evaluation
  • Transfer Learning — Adapting a model trained on one task to perform another
  • Transfer Learning (NLP) — Pretraining on large corpora and fine-tuning for specific tasks
  • Transformer — Deep learning architecture based entirely on self-attention
  • Transformer Decoder — Stack of causal self-attention and feed-forward layers for generation
  • Transformer Encoder — Stack of self-attention and feed-forward layers producing representations
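
A deliberately naive sketch of the Tokenisation entry, using whitespace splitting and a toy vocabulary (real tokenisers use subword schemes such as BPE):

```python
def simple_tokenise(text, vocab):
    """Whitespace tokenisation followed by vocabulary lookup;
    unknown words map to the <unk> id."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
ids = simple_tokenise("The cat sat down", vocab)  # "down" is out of vocabulary
```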

V

  • Vanishing Gradient — Gradients shrinking to near-zero as they propagate through deep networks
  • Variance — A measure of how spread out a random variable is around its mean
  • Variational Autoencoder — Probabilistic autoencoder with a structured latent space
  • Vector — An ordered list of numbers representing a point or direction
  • Vision Transformer — Applying the transformer architecture to images
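
The Variance entry from its definition, on a small illustrative sample:

```python
def variance(xs):
    """Population variance: mean squared deviation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

v = variance([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, variance 4
```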

W

  • Weight Decay — L2 regularisation that pulls weights toward zero each step
  • Word2Vec — Learning dense word embeddings by predicting context
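
The Weight Decay entry as a single update step; the decoupled form sketched here is the AdamW-style variant, which coincides with L2 regularisation for plain SGD:

```python
def sgd_step_with_decay(w, grad, lr=0.1, wd=0.01):
    """One SGD update with decoupled weight decay: besides the gradient
    step, shrink the weight toward zero by lr * wd * w."""
    return w - lr * grad - lr * wd * w

w = 1.0
w = sgd_step_with_decay(w, grad=0.0)  # with zero gradient, decay alone shrinks w
```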