7.2 Linear regression
Linear regression is the simplest meaningful regression model and the foundation on which almost every other regression and classification method is built. The premise is modest: assume the conditional mean of the target is a weighted sum of the inputs, fit the weights by minimising squared error, and read off the prediction. Despite its simplicity, in many practical settings (clinical risk scores, drug-dose response, financial returns, scientific dose–response curves, calibrating sensors against a gold standard) linear regression remains the right tool. It gives you a closed-form solution that runs in milliseconds, coefficients that domain experts can read and argue with, well-understood standard errors, and a century of statistical theory to back it up. There are no hyperparameters to tune in the basic version, no learning rate to nurse through training, and no random seed that changes the answer. The model either fits the data well or it does not, and the residual plots tell you which.
As a machine-learning algorithm, linear regression is unusual in being entirely closed form. Train time is one matrix solve. Inference time is one inner product. The objective is convex with a unique minimum (when the design has full column rank), so there is no question of getting stuck in a bad local optimum or initialising poorly. Memory cost at inference is $d+1$ floats. These properties make it an indispensable baseline: if a deep neural network cannot beat regularised linear regression on a tabular problem, the problem is probably not the algorithm.
Linear regression is the foundation for §7.3 (logistic regression, replacing the identity link with a sigmoid), for §7.4 (generalised linear models), and for the linear layer of every neural network in Chapter 9 onwards. When a Transformer multiplies its hidden state by a weight matrix, it is performing linear regression in disguise.
The model
The model says that the target $y$ is a linear function of the input features plus an error term:
$$\hat y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^\top \mathbf{x}.$$
The coefficient $w_0$ is the intercept or bias (the predicted value when every feature is zero). The coefficients $w_1, \ldots, w_d$ are the slopes (the predicted change in $y$ per unit change in the corresponding feature, holding all other features fixed). To absorb the intercept into the same notation as the slopes, we adopt a small but useful convention: prepend a constant 1 to every input vector, so that $\mathbf{x} \in \mathbb{R}^{d+1}$ and $w_0$ is just the first weight. From this point on we write the model as $\hat y = \mathbf{w}^\top\mathbf{x}$ and talk about a single weight vector $\mathbf{w}$.
Stack the $n$ training inputs as rows of a design matrix $\mathbf{X} \in \mathbb{R}^{n \times (d+1)}$ (the leading column is all ones; the remaining columns hold the features), and stack the targets in a vector $\mathbf{y} \in \mathbb{R}^n$. The vector of predictions is $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$ and the residuals are $\mathbf{r} = \mathbf{y} - \mathbf{X}\mathbf{w}$.
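To make the conventions concrete, here is a minimal NumPy sketch; the feature and target values are invented purely to show the shapes involved.

```python
import numpy as np

# Four invented samples with d = 2 raw features each.
X_raw = np.array([[2.0, 1.0],
                  [3.0, 0.5],
                  [5.0, 2.0],
                  [6.0, 1.5]])
y = np.array([4.0, 5.0, 9.0, 10.0])

# Prepend the constant-1 column so the intercept is absorbed into w.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # shape (n, d+1)

w = np.zeros(X.shape[1])   # placeholder weights, just to show the shapes
y_hat = X @ w              # predictions  X w
r = y - y_hat              # residuals    y - X w
```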
The loss is mean squared error:
$$L(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 = \frac{1}{n} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2.$$
Squared error is not an arbitrary choice. It penalises large residuals quadratically, which makes it sensitive to outliers but also smoothly differentiable everywhere, a property the closed-form derivation exploits. As we shall see, MSE is the negative log-likelihood under a Gaussian noise assumption, so minimising MSE is the same as maximum-likelihood estimation under that assumption. It also makes the loss surface a convex paraboloid in $\mathbf{w}$: there is a unique global minimum, and curvature is captured by a single positive semi-definite matrix.
The closed form
Because the loss is a smooth convex quadratic in $\mathbf{w}$, the minimum is the unique stationary point. Differentiate $L$ with respect to $\mathbf{w}$ and set the gradient to zero:
$$\nabla_{\mathbf{w}} L = \frac{2}{n} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{0}.$$
Dropping the harmless factor of $2/n$ and rearranging gives the normal equations:
$$\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}.$$
This is a square linear system of $d+1$ equations in $d+1$ unknowns. When $\mathbf{X}$ has full column rank (equivalently, when no feature is an exact linear combination of the others), the Gram matrix $\mathbf{X}^\top\mathbf{X}$ is invertible and the unique solution is
$$\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}.$$
Geometrically, $\hat{\mathbf{w}}$ is the unique vector whose corresponding $\mathbf{X}\hat{\mathbf{w}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual $\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}$ is orthogonal to every column of $\mathbf{X}$, which is exactly the content of the normal equations.
In practice you do not form $(\mathbf{X}^\top\mathbf{X})^{-1}$ explicitly: forming $\mathbf{X}^\top\mathbf{X}$ squares the condition number, and explicit inversion loses further precision. The standard implementation solves the system via Cholesky factorisation of $\mathbf{X}^\top\mathbf{X}$ or, more robustly, via QR or singular value decomposition of $\mathbf{X}$ directly. NumPy's np.linalg.lstsq and SciPy's scipy.linalg.lstsq take the latter route, delegating to SVD-based LAPACK least-squares routines.
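As a sketch of the two routes, on synthetic data (the true weights and noise level below are arbitrary), the normal-equations solve and np.linalg.lstsq agree when the problem is well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Textbook route: form and solve the normal equations.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred route: hand the least-squares problem to LAPACK directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True on this well-conditioned example
```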
When $\mathbf{X}^\top\mathbf{X}$ is singular or near-singular, two pathologies arise. Multicollinearity, where two or more features are highly correlated, makes the matrix ill-conditioned, blowing up the variance of $\hat{\mathbf{w}}$ even though the fitted values $\hat{\mathbf{y}}$ remain reasonable. In the extreme case where two columns are exactly equal, the matrix is singular and there are infinitely many solutions. A different pathology arises when $d > n$: with more features than samples, $\mathbf{X}^\top\mathbf{X}$ is rank-deficient by construction. For example, in genomics one might have $d = 20\,000$ gene expression measurements per patient and only $n = 200$ patients.
The remedy in both cases is to fall back on the Moore–Penrose pseudo-inverse $\mathbf{X}^+$, computed via singular value decomposition: $\hat{\mathbf{w}} = \mathbf{X}^+ \mathbf{y}$ returns the minimum-norm solution. A more useful remedy is ridge regularisation, which we shall meet shortly: it adds a small positive constant to the diagonal of $\mathbf{X}^\top\mathbf{X}$ and restores invertibility while shrinking large coefficients toward zero.
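A small sketch of the $d > n$ case, on arbitrary synthetic data: the pseudo-inverse returns an exact-fit solution of minimum norm even though $\mathbf{X}^\top\mathbf{X}$ is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                    # more features than samples: X^T X is rank-deficient
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# X^T X is singular, so the plain closed form is unusable; the pseudo-inverse
# returns the minimum-norm solution among all weight vectors that fit exactly.
w_pinv = np.linalg.pinv(X) @ y

print(np.allclose(X @ w_pinv, y))   # True: the training data is interpolated
print(np.linalg.norm(w_pinv))       # the smallest norm among all exact fits
```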
Worked example
Take a tiny dataset of four points to make the arithmetic transparent. Suppose we observe
$$\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 3 \\ 5 \\ 7 \\ 9 \end{pmatrix}.$$
The first column of $\mathbf{X}$ holds the intercept (all ones); the second column holds a single scalar feature $x$ taking values $1, 2, 3, 4$. We want to fit $\hat y = w_0 + w_1 x$.
Compute $\mathbf{X}^\top\mathbf{X}$:
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix}.$$
The top-left entry, $4$, is just $n$. The off-diagonals, both $10$, are $\sum x_i$. The bottom-right, $30$, is $\sum x_i^2$. Compute $\mathbf{X}^\top \mathbf{y}$:
$$\mathbf{X}^\top \mathbf{y} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \end{pmatrix}\begin{pmatrix} 3 \\ 5 \\ 7 \\ 9 \end{pmatrix} = \begin{pmatrix} 24 \\ 70 \end{pmatrix}.$$
The first entry, $24$, is $\sum y_i$. The second, $70$, is $\sum x_i y_i$.
Now solve the $2\times 2$ system
$$\begin{pmatrix} 4 & 10 \\ 10 & 30 \end{pmatrix} \begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \begin{pmatrix} 24 \\ 70 \end{pmatrix}.$$
The determinant of the matrix is $4 \cdot 30 - 10 \cdot 10 = 20$. Invert by the cofactor formula and multiply:
$$\begin{pmatrix} w_0 \\ w_1 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 30 & -10 \\ -10 & 4 \end{pmatrix}\begin{pmatrix} 24 \\ 70 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 30 \cdot 24 - 10 \cdot 70 \\ -10 \cdot 24 + 4 \cdot 70 \end{pmatrix} = \frac{1}{20}\begin{pmatrix} 20 \\ 40 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.$$
So $\hat{\mathbf{w}} = (1, 2)^\top$. The fitted line is $\hat y = 1 + 2x$. Check: at $x = 1$, $\hat y = 3$; at $x = 2$, $\hat y = 5$; at $x = 3$, $\hat y = 7$; at $x = 4$, $\hat y = 9$. The line passes exactly through every point, because the data is perfectly linear with intercept $1$ and slope $2$. The residuals are all zero, the training MSE is zero, and the fit is exact.
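The same arithmetic, checked by machine, is a three-line NumPy verification:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
y = np.array([3.0, 5.0, 7.0, 9.0])

w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)           # [1. 2.]  -> intercept 1, slope 2
print(y - X @ w)   # residuals are all zero: the fit is exact
```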
This worked example is the simplest possible non-trivial regression: one feature, one intercept, four observations, no noise. The same closed-form recipe (form $\mathbf{X}^\top\mathbf{X}$, form $\mathbf{X}^\top \mathbf{y}$, solve) scales unchanged to thousands of features and millions of observations. What changes with realistic data is that the residuals become non-zero and we have to talk about how good "good enough" is, which is what the Gauss–Markov theorem tells us next.
Properties
The OLS estimator has four classical properties that explain why it is still the default in inferential statistics.
Unbiasedness. Suppose the data are generated by $\mathbf{y} = \mathbf{X}\mathbf{w}^\star + \boldsymbol\varepsilon$ where $\mathbb{E}[\boldsymbol\varepsilon] = \mathbf{0}$. Plug into the closed form: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\mathbf{w}^\star + \boldsymbol\varepsilon) = \mathbf{w}^\star + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol\varepsilon$. Taking expectations across draws of the noise, $\mathbb{E}[\hat{\mathbf{w}}] = \mathbf{w}^\star$. The estimator is unbiased: averaged over many independent samples from the same data-generating process, the OLS estimate centres on the true coefficients.
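Unbiasedness is easy to see empirically. The sketch below (with an arbitrary true weight vector and noise level) redraws the noise many times and averages the OLS estimates, which settle on $\mathbf{w}^\star$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_star = np.array([1.0, -2.0, 0.5, 3.0])

# Many independent draws of the noise, one OLS fit per draw.
estimates = [np.linalg.lstsq(X, X @ w_star + rng.normal(size=n), rcond=None)[0]
             for _ in range(5000)]

print(np.mean(estimates, axis=0))   # close to w_star: the estimator is unbiased
```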
BLUE by Gauss–Markov. Suppose further that the errors are uncorrelated and homoskedastic: $\text{Cov}(\boldsymbol\varepsilon) = \sigma^2 \mathbf{I}$. The Gauss–Markov theorem says that among all linear unbiased estimators of $\mathbf{w}^\star$, that is, estimators of the form $\hat{\mathbf{w}} = \mathbf{C}\mathbf{y}$ with $\mathbb{E}[\hat{\mathbf{w}}] = \mathbf{w}^\star$, the OLS estimator has the smallest variance. "BLUE" stands for best linear unbiased estimator. This is the strongest possible efficiency guarantee that does not invoke a specific noise distribution, and it is why classical statisticians reach for OLS by default. Crucially, the theorem does not require the errors to be Gaussian, only zero-mean and constant-variance.
MLE under Gaussian residuals. If we additionally assume $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, the log-likelihood of $\mathbf{w}$ is $-\tfrac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ plus terms not depending on $\mathbf{w}$. Maximising the log-likelihood is identical to minimising the squared-error loss, so the OLS estimator is the maximum-likelihood estimator under Gaussian noise. Under these assumptions it attains the Cramér–Rao lower bound and gives a principled basis for the standard errors $\text{Var}(\hat{\mathbf{w}}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$, $t$-tests, $F$-tests, and confidence intervals reported by every statistical package.
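Those standard errors are one line of algebra away from the closed form. A minimal sketch on synthetic data, using the usual unbiased estimate of $\sigma^2$ from the residuals:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = X @ np.array([0.5, 2.0, -1.0]) + 0.3 * rng.normal(size=n)

w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ w_hat
sigma2_hat = resid @ resid / (n - X.shape[1])   # unbiased estimate of sigma^2
cov_w = sigma2_hat * np.linalg.inv(X.T @ X)     # Var(w_hat) = sigma^2 (X^T X)^{-1}
print(np.sqrt(np.diag(cov_w)))                  # standard error of each coefficient
```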
Closed form, no iteration. No gradient descent, no learning rate, no convergence diagnostics, no random initialisation. One matrix solve and you are done. For $n \lesssim 10^6$ and $d \lesssim 10^4$, this is the fastest possible regression on commodity hardware.
Ridge regression (L2)
When $\mathbf{X}^\top\mathbf{X}$ is ill-conditioned, OLS is unbiased but the variance of $\hat{\mathbf{w}}$ explodes: small changes in $\mathbf{y}$ produce wildly different coefficient estimates. Ridge regression (Hoerl and Kennard, 1970), also known as Tikhonov regularisation, treats the disease by adding a penalty proportional to the squared norm of $\mathbf{w}$:
$$L_{\text{ridge}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2.$$
Setting the gradient to zero gives the still-closed form
$$\hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}.$$
Adding $\lambda\mathbf{I}$ to the diagonal lifts every eigenvalue of $\mathbf{X}^\top\mathbf{X}$ by $\lambda$, guaranteeing invertibility for any $\lambda > 0$ regardless of the rank of $\mathbf{X}$.
The effect on the solution is to shrink every coefficient toward zero. Eigen-decompose $\mathbf{X}^\top\mathbf{X} = \mathbf{V}\boldsymbol\Lambda\mathbf{V}^\top$; then $\hat{\mathbf{w}}_{\text{ridge}} = \mathbf{V}(\boldsymbol\Lambda + \lambda \mathbf{I})^{-1}\mathbf{V}^\top \mathbf{X}^\top\mathbf{y}$. The $j$-th eigen-direction is shrunk by the factor $\lambda_j/(\lambda_j + \lambda)$. Directions with high data variance ($\lambda_j$ large) are barely shrunk; directions with low data variance ($\lambda_j$ small) are shrunk hard. Ridge thus stabilises the badly-determined directions while leaving the well-determined ones alone.
The trade-off is classical bias–variance. Ridge introduces a small bias toward $\mathbf{0}$ but reduces variance dramatically, often producing lower test error than OLS even though OLS is unbiased and BLUE. The hyperparameter $\lambda$ is chosen by cross-validation. By convention only the slopes are penalised, not the intercept; in practice this is done by zeroing the corresponding diagonal entry of the regularisation matrix.
Ridge has a Bayesian reading: it is the maximum a posteriori estimator under a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda)\mathbf{I})$ and Gaussian noise. Larger $\lambda$ corresponds to a tighter prior, stronger belief that the true coefficients are small.
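A minimal ridge sketch, with two deliberately near-duplicate features invented to provoke ill-conditioning; the intercept is left unpenalised as described above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge closed form; the first column of X is the intercept and is not shrunk."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

rng = np.random.default_rng(4)
n = 100
z = rng.normal(size=n)
# Two nearly identical columns make X^T X badly conditioned.
X = np.column_stack([np.ones(n), z, z + 1e-6 * rng.normal(size=n)])
y = 1.0 + 2.0 * z + 0.1 * rng.normal(size=n)

print(ridge_fit(X, y, lam=0.0))   # OLS: the twin features typically get huge, opposite-signed weights
print(ridge_fit(X, y, lam=1.0))   # ridge: the pair roughly split the true slope of 2
```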
Lasso (L1)
The lasso (Tibshirani, 1996), the least absolute shrinkage and selection operator, replaces the squared norm with the absolute-value norm:
$$L_{\text{lasso}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1, \qquad \|\mathbf{w}\|_1 = \sum_j |w_j|.$$
The $L_1$ ball $\{\mathbf{w} : \|\mathbf{w}\|_1 \leq t\}$ has corners on the coordinate axes. Geometrically, the elliptical level sets of the squared-error loss are most likely to first touch the constraint set at one of these corners, where some components of $\mathbf{w}$ are exactly zero. The lasso therefore performs automatic feature selection: as $\lambda$ increases, more and more coefficients are driven to exactly zero, leaving a sparse model.
The price is that the absolute value is non-differentiable at zero. There is no closed form. Standard solvers are iterative: coordinate descent updates one coefficient at a time using a soft-thresholding shrinkage operator $S_\lambda(z) = \text{sign}(z)\max(|z| - \lambda, 0)$; ISTA (iterative shrinkage-thresholding) and its accelerated variant FISTA generalise to wider problem classes; LARS (least-angle regression) constructs the entire regularisation path piecewise-linearly. scikit-learn's Lasso uses coordinate descent by default.
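To illustrate the soft-thresholding idea, here is a bare-bones ISTA sketch on synthetic data with three truly relevant features. It is not how scikit-learn's Lasso is implemented (that uses coordinate descent, and its alpha is scaled differently from the $\lambda$ below); it is just the simplest correct iterative solver for the objective above.

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise ||y - Xw||^2 + lam * ||w||_1 by iterative shrinkage-thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)                   # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)  # gradient step, then shrink
    return w

rng = np.random.default_rng(5)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]                            # only three features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = lasso_ista(X, y, lam=10.0)
print(np.nonzero(w_hat)[0])   # typically just the first three indices survive
```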
In Bayesian terms, the lasso is the MAP estimator under a Laplace prior on $\mathbf{w}$, which has a sharper peak at zero than the Gaussian prior of ridge. Elastic net combines $L_1$ and $L_2$ penalties to inherit sparsity from the lasso and stability under correlated features from ridge: when groups of correlated features are present, the lasso tends to pick one and drop the others arbitrarily, while elastic net keeps the whole group.
Polynomial and basis-function extensions
The "linear" in linear regression refers to the parameters, not the features. We can fit a curve by choosing basis functions $\phi_j(x)$ and modelling
$$\hat y = w_0 \phi_0(x) + w_1 \phi_1(x) + w_2 \phi_2(x) + \cdots + w_K \phi_K(x).$$
For polynomial regression we take $\phi_j(x) = x^j$, giving $\hat y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_K x^K$. The model is non-linear in $x$ but linear in $\mathbf{w}$. Build a feature matrix $\boldsymbol\Phi$ with columns $1, x, x^2, \ldots, x^K$ and apply the same closed form: $\hat{\mathbf{w}} = (\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}\boldsymbol\Phi^\top\mathbf{y}$. All the theory transfers unchanged: Gauss–Markov, ridge, lasso, the lot.
Other useful basis families: radial basis functions $\phi_j(x) = \exp(-(x - c_j)^2 / 2\sigma^2)$ for smooth interpolation; splines (piecewise polynomials joined at knots) for flexible curves with controlled degrees of freedom; Fourier features $\phi_j(x) = \sin(jx), \cos(jx)$ for periodic signals; interaction terms $\phi(x_1, x_2) = x_1 x_2$ for capturing multiplicative effects. High-degree polynomials oscillate badly between data points and overfit easily; splines and regularisation are usually preferable.
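A short sketch of the basis-function trick, fitting a degree-5 polynomial to an invented noisy sine curve with exactly the machinery of this section:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.shape)   # noisy non-linear target

# Columns 1, x, x^2, ..., x^5: the basis-expanded design matrix Phi.
Phi = np.vander(x, 6, increasing=True)

# Same closed form as before, applied to the expanded design matrix.
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.mean((Phi @ w - y) ** 2))   # training MSE of the polynomial fit
```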
This basis-function framework is the conceptual bridge from linear regression to kernel methods (§7.8) and to neural networks (chapter 9). A neural network can be read as learning the basis functions from data rather than fixing them by hand.
Where linear regression appears in modern AI
Linear regression is so foundational that it hides in places one might not expect.
The linear layer of every neural network, the nn.Linear(in, out) of every PyTorch model, is exactly $\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$, a vector-valued linear regression. At initialisation, before any training, the network is doing nothing more than linear regression with random weights. The non-linearity comes from activation functions stacked between linear layers; the linear pieces themselves are the algebraic content of the entire chapter compressed into one matrix multiplication.
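A two-line check of that claim, assuming PyTorch is available (the layer sizes are arbitrary): the forward pass of nn.Linear is literally a weight matrix times the input plus a bias.

```python
import torch

layer = torch.nn.Linear(4, 3)            # 4 inputs, 3 outputs
x = torch.randn(4)

# nn.Linear stores W with shape (out, in) and computes W x + b.
manual = layer.weight @ x + layer.bias
print(torch.allclose(layer(x), manual))  # True
```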
Probing classifiers are linear regressions or logistic regressions trained on the activations of a frozen pre-trained model to test whether some property (part of speech, syntactic dependency, geographical knowledge, factual recall) is linearly decodable from the representations. If a linear probe succeeds, the property is "encoded" in the embeddings; if it fails, the property is either absent or only non-linearly accessible.
Calibration of classifier outputs fits a single temperature parameter that rescales over-confident logits before the softmax. Platt scaling fits a logistic regression on top of a base classifier's decision values to convert margins into calibrated probabilities, itself a linear-model procedure.
Linear regression is also the canonical baseline in any ML experiment. Before reaching for a Transformer or a gradient-boosted ensemble, fit a regularised linear model. If the linear baseline scores within a few percent of the fancy method on validation, the fancy method is not earning its complexity, and on tabular data with $n \lesssim 10^4$, this scenario is common.
What you should take away
- Linear regression assumes $\mathbb{E}[y \mid \mathbf{x}] = \mathbf{w}^\top \mathbf{x}$, fits $\mathbf{w}$ by minimising mean squared error, and admits the closed form $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ whenever $\mathbf{X}^\top\mathbf{X}$ is invertible.
- Squared-error loss is not arbitrary: it is the negative log-likelihood under Gaussian noise, which makes OLS the maximum-likelihood estimator and connects it to the Bayesian framework via Gaussian priors.
- The Gauss–Markov theorem guarantees OLS is the best linear unbiased estimator under zero-mean homoskedastic noise, no normality required, which is the strongest distribution-free efficiency result in classical statistics.
- When $\mathbf{X}^\top\mathbf{X}$ is singular or ill-conditioned (multicollinearity, $d > n$), regularise: ridge ($L_2$) shrinks coefficients smoothly and keeps a closed form; lasso ($L_1$) sets coefficients to exactly zero and selects features but needs an iterative solver.
- Linear regression underpins logistic regression, generalised linear models, the linear layers of neural networks, probing classifiers, calibration, and every sensible baseline; master it once and you will see it everywhere for the rest of the book.