2.12 Hardware reality

Open any introduction to algorithms and you will find a careful discussion of operation counts: this routine is $O(N^2)$, that one is $O(N^3)$, and the table at the end ranks them from fastest to slowest. The implicit assumption is that the count of arithmetic operations is what determines wall-clock time. On a modern GPU that assumption is wrong by a factor of one hundred. Two pieces of code that perform exactly the same multiplications and additions can finish in one millisecond or one hundred milliseconds depending on how their data is laid out in memory, whether their working set fits in cache, whether they keep the tensor cores fed, and whether they cross the boundary between graphics memory and host memory. The arithmetic count is, at best, an upper bound on what is possible; the bottleneck is almost always the rate at which bytes can be moved into the units that do the arithmetic.

This section gives the practitioner-level intuitions that explain those one-hundred-fold differences. It is not a survey of GPU microarchitecture and it is not a tuning guide. It is an attempt to install the small set of mental models (the memory hierarchy, arithmetic intensity and the roofline, contiguity, batching, mixed precision, and inter-device communication) that you will rely on whenever you ask "why is my training loop so slow?" or "why does this model fit on one card but not another?" These questions arrive on the first day of any serious deep-learning project and never really go away.

Where §2.10 gave the language and §2.11 gave the precision concerns, this section explains why the same expression can run at vastly different speeds.

Symbols used here

  • $N$ : matrix dimension
  • $\text{FLOPs}$ : floating-point operations
  • $\text{bytes}$ : memory traffic
  • $AI$ : arithmetic intensity = FLOPs / bytes

Memory hierarchy

A modern GPU is not one memory connected to one set of arithmetic units. It is a layered pyramid in which each level is faster, smaller, and closer to the silicon than the one below it. Understanding the pyramid is the single most useful thing you can do, because almost every performance problem reduces to asking "which level am I really running out of?"

Starting from the top:

  • Registers. A few hundred kilobytes per streaming multiprocessor, private to each thread, accessed in zero or one cycle. Your loop variables and accumulators live here. There is no faster store on the chip.
  • Shared memory and L1 cache. Tens to hundreds of kilobytes per block, shared by the threads of one block, accessed in roughly ten cycles. This is the scratchpad where a well-written matrix-multiplication kernel keeps the small tile of $A$ and $B$ that it is currently chewing on.
  • L2 cache. Tens of megabytes, shared chip-wide, accessed in roughly one hundred cycles. The L2 is the buffer between the streaming multiprocessors and the much slower main memory.
  • HBM (high-bandwidth memory). The "GPU memory" that you see when you read nvidia-smi: tens to hundreds of gigabytes, accessed in several hundred cycles. The bandwidth out of HBM is enormous by historical standards, three terabytes per second on an H100, but it is still the slowest store anything on the GPU has to talk to.
  • Host memory. The system RAM on the motherboard, reached over the PCIe bus. Capacity is large, a server may have a terabyte, but latency is enormous (tens of thousands of cycles) and bandwidth is a tiny fraction of HBM.

The practical consequence is simple. A computation whose working set fits in shared memory runs roughly fifty times faster than one that has to stream from HBM. That HBM kernel is itself perhaps twenty times faster than one whose tensors have spilled to host memory and have to crawl back over PCIe. Where your data is right now matters at least as much as what arithmetic you are doing on it. Two of the biggest performance wins of recent years, FlashAttention's tiling of attention into shared-memory blocks and PagedAttention's careful reuse of the KV cache in HBM, are wins about where the data lives, not about the arithmetic itself.
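
The PCIe tax is easy to see from Python. The sketch below, assuming a CUDA-capable GPU and PyTorch (the matrix size and iteration counts are arbitrary), times the same matrix multiplication twice: once with both operands already resident in HBM, and once where every call must first drag them over the bus from host memory.

```python
import time
import torch

assert torch.cuda.is_available()   # this sketch assumes a CUDA GPU

N = 4096
a_host = torch.randn(N, N)         # lives in host (CPU) RAM
b_host = torch.randn(N, N)
a_gpu = a_host.cuda()              # copied once into HBM
b_gpu = b_host.cuda()

def time_it(fn, warmup=3, iters=10):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# Operands already in HBM: only the matmul is measured.
t_hbm = time_it(lambda: a_gpu @ b_gpu)

# Operands start in host RAM: every call pays the PCIe transfer first.
t_pcie = time_it(lambda: a_host.cuda() @ b_host.cuda())

print(f"resident in HBM : {t_hbm * 1e3:.2f} ms")
print(f"via PCIe        : {t_pcie * 1e3:.2f} ms")
```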

Arithmetic intensity and the roofline

Once you accept that bandwidth is finite, the question becomes: how much arithmetic do I get for each byte I move? That ratio is called the arithmetic intensity of a kernel, its FLOPs divided by its bytes of memory traffic. Arithmetic intensity is the single most important number in any performance discussion you will ever have, because it determines whether the chip's compute units or its memory channels are the limiting factor.

The roofline model makes this precise. Plot achievable throughput against arithmetic intensity on a log–log graph. The chip's peak FLOPs draw a horizontal ceiling at the top. The chip's peak bandwidth draws a sloped line on the left. The achievable throughput of any kernel is the lower of the two: $\min(\text{peak FLOPs},\; AI \times \text{peak bandwidth})$. Below a certain arithmetic intensity (the "ridge point") the kernel is bandwidth-bound: it cannot fetch operands fast enough to keep the multipliers busy. Above it the kernel is compute-bound: arithmetic units are saturated and adding more bandwidth would not help.
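
The model is simple enough to write out in a few lines. The sketch below is a toy roofline calculator; the peak figures are illustrative placeholders in the same ballpark as the numbers quoted later in this section, not a specification of any particular card, and the example intensities are round numbers chosen to land on either side of the ridge.

```python
def attainable_tflops(ai, peak_tflops, peak_bw_tb_per_s):
    """Roofline model: attainable throughput is the lower of the compute
    ceiling and the bandwidth slope, min(peak FLOPs, AI * peak bandwidth).
    AI is in FLOPs per byte, bandwidth in TB/s, result in TFLOP/s."""
    return min(peak_tflops, ai * peak_bw_tb_per_s)

# Illustrative peaks (placeholders in the ballpark of a modern accelerator).
PEAK_TFLOPS = 989.0   # dense bf16 tensor-core peak, TFLOP/s
PEAK_BW_TBS = 3.0     # HBM bandwidth, TB/s

ridge = PEAK_TFLOPS / PEAK_BW_TBS   # intensity at which the two limits meet
print(f"ridge point ~ {ridge:.0f} FLOPs per byte")

for name, ai in [("element-wise op", 1 / 12),
                 ("softmax-like op", 1.0),
                 ("large matmul", 500.0)]:
    tf = attainable_tflops(ai, PEAK_TFLOPS, PEAK_BW_TBS)
    bound = "bandwidth-bound" if ai < ridge else "compute-bound"
    print(f"{name:>16}: AI = {ai:7.2f} -> {tf:7.1f} TFLOP/s ({bound})")
```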

A handful of worked examples make the framework concrete:

  • Element-wise add. One FLOP per output value, twelve bytes of memory traffic in fp32 (read $a$, read $b$, write $c$). Arithmetic intensity is $1/12 \approx 0.08$. This is firmly in the bandwidth-bound region. No matter how fast your multipliers are, the kernel can only run as fast as memory can deliver bytes.
  • Matrix multiplication. Multiplying two $N \times N$ fp32 matrices costs $2N^3$ FLOPs (one multiply, one add per inner-product term). The data moved is roughly $3N^2$ floats, read $A$, read $B$, write $C$, or $12 N^2$ bytes. Arithmetic intensity is therefore $2N^3 / 12N^2 = N/6$ FLOPs per byte; counted per element moved rather than per byte, the same ratio is $2N/3$. Either way it grows linearly with $N$: once $N$ climbs into the hundreds, matmul lifts above the ridge point and becomes compute-bound, and careful tiling that reuses each loaded tile many times is what keeps real kernels close to this ideal.
  • Softmax. A row-wise softmax must read the row twice, once to find the maximum (for the numerically stable version of §2.11), once to compute the normalised exponentials, and write it once. Arithmetic intensity is around one. Softmax is bandwidth-bound on nearly every modern chip, and that is exactly why every serious attention implementation fuses it into the surrounding matmul kernels.

The implication is stark. Matrix multiplication is the operation modern accelerators are built for; everything else is starved. This is the single biggest reason that "fused" kernels, fused softmax-matmul, fused layer-norm-matmul, FlashAttention's fused attention block, exist. They keep intermediate values in registers or shared memory and never write them out to HBM, raising arithmetic intensity by an order of magnitude with no change to the underlying mathematics.
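
In PyTorch the fused path is already packaged for the most important case. torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style kernel that keeps the $T \times T$ score tile on-chip, whereas the naive composition below writes it to HBM and reads it back. A minimal sketch, assuming a CUDA GPU; the shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

B, H, T, D = 4, 16, 1024, 64   # batch, heads, sequence length, head dim
q = torch.randn(B, H, T, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Naive composition: the T x T score matrix and the softmax output both make
# a full round trip to HBM, so the softmax step runs bandwidth-bound.
def naive_attention(q, k, v):
    scores = (q @ k.transpose(-1, -2)) / (D ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Fused path: on suitable hardware this dispatches to a FlashAttention-style
# kernel that never materialises the score matrix in HBM.
out_fused = F.scaled_dot_product_attention(q, k, v)
out_naive = naive_attention(q, k, v)
print("max abs difference:", (out_fused - out_naive).abs().max().item())
```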

Row-major vs column-major

Memory is one long ribbon of bytes; multidimensional arrays are flattened onto that ribbon in one of two conventions. Row-major order (C, NumPy's default, PyTorch's default) stores the first row, then the second, and so on, so that two elements in the same row are adjacent in memory. Column-major order (Fortran, MATLAB, BLAS internally) stores the first column, then the second, so that two elements in the same column are adjacent. The two conventions are mathematically equivalent but performance-wise emphatically not.

The reason is the cache line. When the chip fetches one float from HBM it actually pulls a contiguous block of typically 128 bytes, a whole cache line, into the L2. If the next access lands inside that block, it is essentially free. If the next access strides far away, the chip pays the full memory-latency tax again. Cache-friendly traversal walks along the storage axis; cache-hostile traversal jumps perpendicular to it.

Worked example. A $1000 \times 1000$ fp32 matrix in row-major order, summed row by row, runs in a few milliseconds: each inner loop sweeps a contiguous run of four kilobytes. The same matrix summed column by column on the same hardware is five to ten times slower, because each step of the inner loop strides four kilobytes forward in memory and busts the cache.
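
A minimal NumPy version of that experiment is below. The matrix is larger than the one in the worked example so that the effect is not masked by Python overhead or by the CPU caches; the exact ratio is machine-dependent, but the direction of the result is robust.

```python
import time
import numpy as np

# Row-major (C order) is NumPy's default. 256 MB is big enough that the
# matrix cannot hide in the CPU caches.
A = np.random.rand(8000, 8000).astype(np.float32)

def time_it(fn, iters=2):
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Row by row: each inner sweep reads a contiguous 32 kB run of memory.
t_rows = time_it(lambda: [A[i, :].sum() for i in range(A.shape[0])])

# Column by column: consecutive accesses are 32 kB apart, so almost every
# element costs a fresh cache line.
t_cols = time_it(lambda: [A[:, j].sum() for j in range(A.shape[1])])

print(f"row-wise    : {t_rows * 1e3:.1f} ms")
print(f"column-wise : {t_cols * 1e3:.1f} ms")
```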

PyTorch defaults to row-major (torch.contiguous_format), but several convolution kernels prefer channels-last order, in which the logical shape stays $(B, C, H, W)$ while the bytes are laid out as $(B, H, W, C)$, because that makes per-pixel work contiguous and lifts cache hit rates. A tensor that you reach by transpose or permute is a view (§2.10): the metadata changes but the bytes do not, so the result is generally non-contiguous, and operations that scan it linearly will pay the stride tax. Calling .contiguous() re-copies the tensor into a fresh row-major buffer; doing this once before a hot loop is almost always worth it.
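
The bookkeeping is easy to inspect from PyTorch itself, as the short sketch below shows; the shapes are arbitrary.

```python
import torch

x = torch.randn(64, 128)            # row-major, contiguous
y = x.t()                           # a view: strides swap, bytes stay put

print(x.is_contiguous(), x.stride())   # True,  (128, 1)
print(y.is_contiguous(), y.stride())   # False, (1, 128)

y_fixed = y.contiguous()            # one copy into a fresh row-major buffer
print(y_fixed.is_contiguous())      # True

# Channels-last for image tensors: the logical shape stays (B, C, H, W),
# only the underlying memory layout (the strides) changes.
img = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)
print(img.shape, img.stride())
```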

Batching for arithmetic intensity

A single matrix–vector multiply has arithmetic intensity of order one: it does $2N^2$ FLOPs while reading all $N^2$ weights from memory. That is firmly bandwidth-bound. Stacking $B$ vectors into a matrix and doing one batched matrix–matrix multiply does $2BN^2$ FLOPs for the same $N^2$ weights read, because each weight is reused $B$ times. Arithmetic intensity rises by a factor of $B$. The hardware, which only ever wanted to be doing matmul, finally goes fast.
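
The effect shows up directly in wall-clock time. The sketch below, assuming a CUDA GPU ($N$ and $B$ are arbitrary), compares $B$ separate matrix–vector products against one batched matrix–matrix product over the same weights.

```python
import time
import torch

N, B = 8192, 32
W = torch.randn(N, N, device="cuda", dtype=torch.bfloat16)
xs = torch.randn(B, N, device="cuda", dtype=torch.bfloat16)

def time_it(fn, warmup=3, iters=20):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

# B separate matrix-vector products: the weight matrix is re-read B times.
t_loop = time_it(lambda: [W @ xs[i] for i in range(B)])

# One batched matrix-matrix product: each weight is read once, reused B times.
t_batch = time_it(lambda: xs @ W.t())

print(f"{B} separate mat-vecs : {t_loop * 1e3:.2f} ms")
print(f"one batched matmul   : {t_batch * 1e3:.2f} ms")
```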

This is the deep reason mini-batches dominate deep-learning training. The statistical argument, that small batches give noisier gradients, larger batches smoother ones, is real but secondary. The primary reason is that batch size 1 leaves the silicon idle and batch size 32 keeps it busy. A standard rule of thumb for LLM inference is that serving at batch 1 gets perhaps ten per cent of theoretical peak FLOPs, while batch 32 reaches sixty per cent or more. Continuous batching, the central trick of vLLM-style serving stacks, stitches incoming user requests into a steady high-batch stream so that no token generation step ever has to run at low arithmetic intensity. It is one of the most important production-engineering wins of recent years.

Mixed precision and tensor cores

NVIDIA's tensor cores, introduced with Volta and refined in every generation since, are dedicated matrix-multiply units that consume fp16 or bf16 inputs and accumulate into fp32. They run roughly eight to sixteen times faster than the general-purpose fp32 ALUs for matmul of the same shape. On an H100, fp32 throughput is around sixty-seven TFLOPS, while bf16 throughput on the tensor cores is around 989 TFLOPS, almost fifteen times more. A deep-learning workload that does not use the tensor cores therefore runs its matmuls at a single-digit percentage of what the chip can deliver.

This is the engineering pressure behind mixed-precision training. Master copies of weights and the optimiser moments live in fp32 to preserve dynamic range; forward and backward matmuls cast their inputs to bf16 (or fp16 with loss scaling) so they hit the tensor cores; the optimiser step runs in fp32 against the master weights. In PyTorch, the casting half of this protocol is one context manager, torch.cuda.amp.autocast, with torch.cuda.amp.GradScaler supplying the loss scaling that fp16 needs. Hopper-generation chips push further with fp8, single-byte floats in two variants, E4M3 for precision and E5M2 for range, and frontier-scale training runs now routinely use fp8 for most matmuls. Inference-time quantisation to int8 or int4 goes further still, trading a small accuracy cost for a several-fold cut in both bandwidth and memory.
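
A minimal training step under this protocol looks roughly like the sketch below. The model, loss, and shapes are placeholders; bf16 is shown, so no loss scaling is needed.

```python
import torch

# A stand-in model and batch; the layer, loss, and shapes are placeholders.
model = torch.nn.Linear(4096, 4096).cuda()            # master weights stay fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # moments also fp32

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

for step in range(10):
    opt.zero_grad(set_to_none=True)
    # Inside autocast the matmuls run in bf16 on the tensor cores; the fp32
    # weights are cast on the fly, op by op.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()   # bf16 keeps fp32's exponent range, so no loss scaling
    opt.step()        # the update is applied to the fp32 master weights
```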

Communication: the other bottleneck

A frontier model no longer fits on a single card. A seventy-billion-parameter transformer in bf16 weighs 140 GB before you have stored a single optimiser moment; with an fp32 master copy and Adam's two fp32 moment buffers it is closer to a terabyte. Training therefore spreads across many GPUs, and each way of splitting the work has its own name. Data parallelism replicates the weights and partitions the batch, all-reducing the gradients each step. Tensor parallelism partitions individual matrix multiplications across devices, so one matmul's output is the concatenation of several devices' partial products. Pipeline parallelism assigns successive layers to successive devices and feeds micro-batches through them like an assembly line. Expert parallelism, in mixture-of-experts models, routes each token to whichever GPU happens to host its chosen expert.
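
The memory arithmetic at the start of that paragraph is worth writing out once. The sketch below counts only the persistent training state, before gradients or activations are added on top.

```python
params = 70e9                    # a 70-billion-parameter model

bf16_weights = params * 2        # 2 bytes per parameter
fp32_master  = params * 4        # fp32 master copy of the weights
adam_moments = params * 4 * 2    # two fp32 moment buffers for Adam

total = bf16_weights + fp32_master + adam_moments
print(f"bf16 weights alone     : {bf16_weights / 1e9:.0f} GB")   # 140 GB
print(f"persistent train state : {total / 1e9:.0f} GB")          # ~980 GB
```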

Each pattern has a different communication signature: an all-reduce per step for data parallelism, an all-gather inside every layer for tensor parallelism, point-to-point activations for pipeline, all-to-all for experts. The feasibility of each depends on the network fabric. NVLink and NVSwitch within a server give close to a terabyte per second; InfiniBand between servers gives on the order of one hundred to four hundred gigabits per second per link; ordinary Ethernet gives less still and is rarely used for training. Modern frontier training combines all four parallelism patterns simultaneously, and the art of distributed training is choosing the combination that minimises whichever communication step would otherwise dominate.
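
For the data-parallel case the communication step is a single collective call. The sketch below is a toy, assuming two or more GPUs on one machine and a launch via torchrun: it averages a pretend gradient tensor across ranks with NCCL's all-reduce. The other patterns substitute all-gather, point-to-point sends, or all-to-all in the same role.

```python
# Launch with, e.g.:  torchrun --nproc_per_node=2 <this script>
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun supplies rank and world size
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Stand-in for a gradient shard: each rank contributes its own values, and
# one all-reduce per step averages them across the data-parallel group.
grad = torch.full((1024, 1024), float(rank), device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.AVG)

if rank == 0:
    print("averaged value:", grad[0, 0].item())
dist.destroy_process_group()
```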

What you should take away

  1. The asymptotic FLOP count of a deep-learning kernel is at best an upper bound on its speed. The real performance question is how much arithmetic you get for each byte you have to move, its arithmetic intensity, and how fast the relevant level of the memory hierarchy can supply those bytes.
  2. The memory hierarchy is a pyramid: registers, shared memory, L2, HBM, host. A kernel that fits in shared memory runs roughly fifty times faster than one streaming from HBM, and the HBM kernel is roughly twenty times faster than one paging over PCIe.
  3. Matrix multiplication is the only common operation whose arithmetic intensity rises with problem size; on $N \times N$ matmul it grows linearly with $N$ (roughly $N/6$ FLOPs per byte in fp32). Everything else, element-wise ops, softmax, normalisation, is bandwidth-bound and benefits enormously from kernel fusion.
  4. Memory layout matters. Row-major versus column-major, contiguous versus strided, channels-first versus channels-last: each can change wall-clock time by a factor of five or ten without changing the mathematics. Calling .contiguous() before a hot loop is cheap insurance.
  5. Mixed precision, large batches, and continuous batching exist because they raise arithmetic intensity and feed the tensor cores. They are not optional micro-optimisations; they are the difference between using ten per cent of your hardware and using sixty per cent.
