GPU acceleration is the use of graphics processing units, originally developed for real-time 3D graphics, for general-purpose parallel computation. The key breakthrough that enabled it was Nvidia's CUDA programming model (first released in 2007), which exposed the GPU as a general-purpose, massively parallel array processor programmable from C-like code.
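As a minimal sketch of that model (all names here are illustrative, not from any particular codebase): a CUDA program writes a "kernel" function in C-like code, and the host launches it across a grid of thread blocks, one lightweight thread per data element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];              // guard against overrun
}

int main() {
    const int n = 1 << 20;                     // one million elements
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified host/device memory
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y); // launch one thread per element
    cudaDeviceSynchronize();                   // wait for the GPU to finish

    printf("y[0] = %f\n", y[0]);               // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```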
The deep-learning revolution would not have been possible without GPU acceleration. A modern neural network's inner loop is dense matrix multiplication, exactly the workload GPUs are architected for: thousands of simple cores with high memory bandwidth, optimised for lockstep SIMD-style (SIMT) execution. A GPU typically executes large matrix multiplications 10× to 100× faster than a contemporaneous CPU, bringing training runs that would have taken weeks down to hours.
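To make the mapping from workload to hardware concrete, here is a hedged sketch of a naive CUDA matrix multiply, one thread per output element; kernel and variable names are illustrative, and production frameworks dispatch to tuned libraries such as cuBLAS rather than hand-written kernels like this one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive dense matmul C = A * B for row-major N x N matrices: one thread
// per output element, so an N = 1024 multiply launches ~1M threads.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)            // dot(row of A, column of B)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 1024;
    size_t bytes = (size_t)N * N * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 threads(16, 16);                      // 256 threads per block
    dim3 blocks((N + 15) / 16, (N + 15) / 16); // grid covering the matrix
    matmul<<<blocks, threads>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);               // expect 2.0 * N = 2048.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Each of the N² dot products is independent, which is why the work spreads cleanly across thousands of cores; the inner loop is also exactly the memory-bandwidth-bound access pattern that GPU memory systems are built around.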
AlexNet (2012) was trained on two consumer-grade GTX 580 GPUs in five to six days; the same training would plausibly have taken months on contemporary CPUs. Subsequent generations of Nvidia hardware (Kepler, Maxwell, Pascal, Volta, Turing, Ampere, Hopper, Blackwell) and competing accelerators (Google TPUs, AMD Instinct, Apple Neural Engine, Cerebras wafer-scale, Tenstorrent, Groq) have driven training-FLOP-per-dollar improvements estimated at roughly 30× per decade, without which the modern era of large language models would be infeasible.
GPU acceleration is also why deep learning's hardware requirements remain a structural constraint on the field: state-of-the-art training runs require thousands to tens of thousands of high-end accelerators in coordinated clusters, accessible only to a handful of well-resourced organisations.
Related terms: TPU Systolic Array, AlexNet, Deep Learning
Discussed in:
- Chapter 10: Training & Optimisation