5.10 Linear Regression as Statistics
Linear regression is the simplest statistical model that does anything useful, and the best understood by a wide margin. You sketch a cloud of points, draw the line that passes closest to them, and use that line to predict new values. Nothing about that procedure is glamorous. Yet the same handful of equations powers least-squares fits in physics laboratories, the analysis of A/B tests in industry, the baseline against which every machine-learning model must justify itself, and the linear layer that sits at the heart of every neural network you will meet later in this book. If you understand linear regression as a statistical procedure, what it assumes, what it estimates, and what can break, you understand the scaffolding on which a great deal of modern AI is built.
This section is the statistical companion to §5.5, where we showed that ordinary least squares (OLS) is the maximum likelihood estimator under Gaussian errors. There we treated regression as an exercise in optimisation: write down a likelihood, take its log, minimise the negative. Here we step back and treat the whole apparatus as a statistical procedure. What population are we sampling from? What does it mean for the estimator to be unbiased? How wide are the confidence intervals around the fitted slope? When is it safe to trust the $p$-values reported by your software, and when is the report polite fiction? §5.11 will then generalise the framework to non-Gaussian outcomes via generalised linear models (GLMs), of which logistic regression is the most familiar special case.
The model
The linear model assumes that each observed outcome $y_i$ is generated by taking a weighted sum of $p$ predictor values and adding random noise:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}).$$
Here $\mathbf{X}$ is the $n \times p$ design matrix; row $i$ contains the $p$ predictor values for observation $i$, and the first column is conventionally a column of ones so that $\beta_0$ acts as an intercept. The vector $\boldsymbol{\beta}$ holds the $p$ coefficients we want to estimate. The error vector $\boldsymbol{\epsilon}$ is independent Gaussian noise with mean zero and the same variance $\sigma^2$ in every observation.
The OLS estimator chooses $\hat{\boldsymbol{\beta}}$ to minimise the squared residuals $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$. Setting the gradient with respect to $\boldsymbol{\beta}$ to zero produces the normal equations $\mathbf{X}^\top\mathbf{X}\,\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, whose solution is the closed form
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}.$$
This is one of the very few estimators in statistics that you can write down on a single line and compute exactly without any iterative procedure. Numerically, well-written software does not actually invert $\mathbf{X}^\top\mathbf{X}$; it uses a QR or SVD factorisation of $\mathbf{X}$ for stability. Mathematically, however, the formula is the answer.
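As a minimal sketch (with invented data and an arbitrary seed), both routes can be compared directly in NumPy: the normal equations solved explicitly, and the numerically preferred route via `np.linalg.lstsq`, which uses an SVD-based least-squares solver internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: n observations, an intercept column plus two predictors.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Textbook route: solve the normal equations directly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred route: an SVD-based least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)  # both are close to beta_true
print(beta_lstsq)   # and agree with each other to machine precision here
```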
The estimator has three properties that justify its dominance. First, it is unbiased: $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$, so on average across many imaginary repeats of the experiment the estimate equals the truth. Second, under mild conditions it is asymptotically normal, so as $n$ grows the sampling distribution of $\hat{\boldsymbol{\beta}}$ approaches a multivariate Gaussian centred at $\boldsymbol{\beta}$. Third, the Gauss-Markov theorem establishes that, among all linear unbiased estimators of $\boldsymbol{\beta}$, OLS has the smallest variance: it is BLUE, the best linear unbiased estimator. Note the weasel words. OLS is not the best estimator full stop; biased shrinkage estimators such as ridge or lasso can beat it on mean squared error, particularly when predictors are correlated. But within the class of linear unbiased estimators, you cannot do better than OLS without giving something up.
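Unbiasedness is easy to see in simulation. In the sketch below (true coefficients and noise level invented for illustration), the design is held fixed, the noise is redrawn many times, and the estimates are averaged:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed design and invented true coefficients; only the noise is redrawn.
n = 50
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
beta_true = np.array([0.5, 2.0])

estimates = [
    np.linalg.lstsq(X, X @ beta_true + rng.normal(scale=0.4, size=n),
                    rcond=None)[0]
    for _ in range(5000)  # imaginary repeats of the experiment
]
print(np.mean(estimates, axis=0))  # close to (0.5, 2.0): unbiasedness in action
```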
Worked numerical example
Algebra in matrix form is easy to mistake for hand-waving, so it is worth grinding through a tiny example by hand. Suppose we have three observations of a single predictor $x$ paired with an outcome $y$:
| $i$ | $x_i$ | $y_i$ |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 4 |
| 3 | 3 | 5 |
We will fit a model with an intercept and a slope, so the design matrix has two columns: a column of ones for the intercept and a column of $x$ values for the slope. That is,
$$\mathbf{X} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 2 \\ 4 \\ 5 \end{pmatrix}.$$
The first ingredient is $\mathbf{X}^\top\mathbf{X}$, a $2 \times 2$ matrix. The top-left entry is the sum of squared ones, which equals $n = 3$. The top-right and bottom-left entries are the sum of $x_i$, which is $1 + 2 + 3 = 6$. The bottom-right entry is the sum of $x_i^2$, which is $1 + 4 + 9 = 14$. So
$$\mathbf{X}^\top\mathbf{X} = \begin{pmatrix} 3 & 6 \\ 6 & 14 \end{pmatrix}.$$
The second ingredient is $\mathbf{X}^\top\mathbf{y}$, a $2 \times 1$ vector. Its first entry is the sum of the $y_i$, which is $2 + 4 + 5 = 11$. Its second entry is $\sum x_i y_i = 1\cdot 2 + 2\cdot 4 + 3\cdot 5 = 25$. So $\mathbf{X}^\top\mathbf{y} = (11, 25)^\top$.
Now we solve the normal equations $\mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}$. The determinant of $\mathbf{X}^\top\mathbf{X}$ is $3\cdot 14 - 6\cdot 6 = 42 - 36 = 6$. Inverting gives
$$(\mathbf{X}^\top\mathbf{X})^{-1} = \frac{1}{6}\begin{pmatrix} 14 & -6 \\ -6 & 3 \end{pmatrix}.$$
Multiplying by $\mathbf{X}^\top\mathbf{y} = (11, 25)^\top$:
$$\hat{\boldsymbol{\beta}} = \frac{1}{6}\begin{pmatrix} 14\cdot 11 - 6\cdot 25 \\ -6\cdot 11 + 3\cdot 25 \end{pmatrix} = \frac{1}{6}\begin{pmatrix} 4 \\ 9 \end{pmatrix} = \begin{pmatrix} 0.667 \\ 1.5 \end{pmatrix}.$$
So $\hat\beta_0 \approx 0.667$ and $\hat\beta_1 = 1.5$, and the fitted line is $\hat y = 0.667 + 1.5\,x$. To check, the fitted values are $\hat y_1 = 2.167$, $\hat y_2 = 3.667$ and $\hat y_3 = 5.167$. The residuals are $e_1 = 2 - 2.167 = -0.167$, $e_2 = 4 - 3.667 = 0.333$, and $e_3 = 5 - 5.167 = -0.167$. They sum to zero (a property guaranteed when the model includes an intercept), which is a useful sanity check that the arithmetic is correct.
This three-observation toy carries every essential idea of OLS: turn the $n$ data points into the matrices $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}^\top\mathbf{y}$, solve a linear system, read off the coefficients. With a million points and twenty predictors the structure is exactly the same; only the dimensions change.
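The same numbers fall out of a few lines of NumPy, which makes a handy check on the hand arithmetic:

```python
import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([2., 4., 5.])

XtX = X.T @ X            # [[3, 6], [6, 14]]
Xty = X.T @ y            # [11, 25]
beta = np.linalg.solve(XtX, Xty)
print(beta)              # [0.6667, 1.5]

residuals = y - X @ beta
print(residuals.sum())   # ~0, since the model includes an intercept
```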
Assumptions
OLS, together with its standard inferential output, is sound only under five assumptions. They are worth memorising because every diagnostic test you will encounter is checking one of them.
Linearity. The conditional expectation of $\mathbf{y}$ given $\mathbf{X}$ is linear in $\boldsymbol{\beta}$: $E[\mathbf{y}\mid\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$. Note that linear refers to linearity in the parameters, not in the predictors. You can include $x^2$ or $\log x$ as a column of $\mathbf{X}$ and the model is still linear in $\boldsymbol{\beta}$.
Independence. The errors $\epsilon_i$ are uncorrelated across observations. Time-series data and clustered survey data routinely violate this; observations close in time or within the same cluster tend to share noise.
Homoscedasticity. The errors all have the same variance $\sigma^2$. The classic violation is when the variance of the outcome grows with its mean, for example, expenditure variance grows with income.
Normality. The errors are Gaussian. This assumption is needed only for inference: confidence intervals, $p$-values, and $F$-tests. Estimation by OLS does not require Gaussianity, and the central limit theorem rescues most large-sample inference even when the errors are mildly non-Gaussian.
No multicollinearity. The design matrix $\mathbf{X}$ has full column rank, so $\mathbf{X}^\top\mathbf{X}$ is invertible. Perfect collinearity (one predictor is a linear combination of others) makes the inverse undefined; near-collinearity makes it numerically unstable, inflating the standard errors of the affected coefficients to the point of uselessness.
When these assumptions are violated, what fails depends on which one. A wrongly specified linear model (violation 1) gives biased coefficients full stop. Correlated errors (violation 2) leave the coefficients unbiased but make the standard errors too small, so confidence intervals shrink to a misleadingly narrow band. Heteroscedasticity (violation 3) is similar: the coefficients are still unbiased estimators of $\boldsymbol{\beta}$, but the textbook standard errors are wrong. Non-normal errors (violation 4) damage inference in small samples but are harmless asymptotically. Multicollinearity (violation 5) is the most insidious because $R^2$ and predictions remain fine; only the per-coefficient standard errors blow up, leaving the analyst unable to say which of two correlated predictors is doing the work.
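A short simulation makes the heteroscedasticity case concrete. In this sketch (the design and the noise model are invented for illustration), the slope estimate stays centred on the truth, but the textbook standard error no longer matches the actual sampling variability:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design; only the noise is redrawn on each repeat.
n, p = 200, 2
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

slopes, textbook_ses = [], []
for _ in range(2000):
    # Violation 3: the noise standard deviation grows with x.
    y = 1.0 + 2.0 * x + rng.normal(scale=0.2 + x, size=n)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    slopes.append(beta[1])
    textbook_ses.append(np.sqrt(s2 * XtX_inv[1, 1]))

print(np.mean(slopes))        # ~2.0: the slope is still unbiased
print(np.std(slopes))         # the actual sampling SD of the slope
print(np.mean(textbook_ses))  # the textbook SE, which no longer matches
```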
A useful rule of thumb is that OLS is robust for prediction (the fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ remain decent predictors under most violations) but fragile for inference. If your only goal is to predict next quarter's sales, modest assumption violations rarely hurt. If your goal is to make causal claims about the effect of one predictor, every assumption matters and most need to be defended explicitly.
Inference: $t$-tests for coefficients
Under the Gaussian-error assumption, the sampling distribution of $\hat{\boldsymbol{\beta}}$ is itself multivariate Gaussian:
$$\hat{\boldsymbol{\beta}} \sim \mathcal{N}\!\left(\boldsymbol{\beta},\ \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\right).$$
In particular, the marginal distribution of any single coefficient $\hat\beta_j$ is Gaussian with mean $\beta_j$ and variance $\sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$. We do not know $\sigma^2$ in practice, so we estimate it by the residual variance $s^2 = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 / (n - p)$, where the $n - p$ in the denominator corrects for the $p$ degrees of freedom we used up estimating $\boldsymbol{\beta}$. The estimated standard error of coefficient $j$ is then $\operatorname{SE}(\hat\beta_j) = s\sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$.
To test the null hypothesis $H_0: \beta_j = 0$ (the predictor has no effect once others are accounted for), form the standardised statistic
$$T = \frac{\hat\beta_j}{\operatorname{SE}(\hat\beta_j)}.$$
Under $H_0$ and the Gaussian assumption, $T$ follows a $t$-distribution with $n - p$ degrees of freedom; the $t$ rather than the Gaussian arises because we are using the estimate $s$ in place of the true $\sigma$. A two-sided $p$-value comes from $2\Pr(T_{n-p} > |t_{\text{obs}}|)$. A $(1 - \alpha)$ confidence interval is
$$\hat\beta_j \pm t_{n-p,\,\alpha/2}\cdot \operatorname{SE}(\hat\beta_j),$$
where $t_{n-p,\,\alpha/2}$ is the upper $\alpha/2$ quantile of the $t$ distribution. Once $n - p$ is more than about 30, the $t$ is indistinguishable from the standard normal and the interval becomes the familiar $\hat\beta_j \pm 1.96\operatorname{SE}$. The mistake to avoid is treating the $t$-test as a verdict on whether the predictor matters. It tests only whether the data are compatible with $\beta_j = 0$ given the other predictors in the model. A predictor can be highly statistically significant and substantively trivial, or substantively important and statistically insignificant in a small sample.
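Applied to the worked example from earlier, the whole inferential pipeline is a few lines (using SciPy's $t$ distribution). Note how weak the inference is with only one residual degree of freedom:

```python
import numpy as np
from scipy import stats

X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([2., 4., 5.])
n, p = X.shape

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
s2 = resid @ resid / (n - p)                       # s^2 with n - p = 1 df
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_stat = beta / se                                 # tests H0: beta_j = 0
p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p)   # two-sided p-values
crit = stats.t.ppf(0.975, df=n - p)                # 95% critical value
ci = np.column_stack([beta - crit * se, beta + crit * se])

print(t_stat)  # slope: ~5.2
print(p_val)   # slope: ~0.12 -- not significant, with only one df to spare
print(ci)      # very wide intervals, for the same reason
```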
Goodness of fit: $R^2$
The coefficient of determination summarises how much of the outcome's variability the model accounts for:
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \quad \text{RSS} = \sum_i (y_i - \hat y_i)^2, \quad \text{TSS} = \sum_i (y_i - \bar y)^2.$$
The total sum of squares (TSS) measures the variability of $y$ around its mean. The residual sum of squares (RSS) measures what is left after the model has done its work. $R^2$ ranges from 0 (the model explains nothing the mean does not) to 1 (the model fits perfectly). For our worked example, $\bar y = (2 + 4 + 5)/3 = 11/3 \approx 3.667$, so the TSS is $(2 - 3.667)^2 + (4 - 3.667)^2 + (5 - 3.667)^2 \approx 4.667$. The RSS is $(-0.167)^2 + (0.333)^2 + (-0.167)^2 \approx 0.167$. So $R^2 \approx 1 - 0.167/4.667 \approx 0.964$: the line absorbs about 96 per cent of the variability in three points, which sounds impressive but reflects the small sample more than the quality of any underlying pattern.
The treacherous feature of $R^2$ is that adding any predictor, even pure noise, can only make it go up or stay the same, because the optimiser is free to set the coefficient on the new predictor to zero if it does not help. Adjusted $R^2$ corrects for this by penalising additional predictors:
$$R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - p},$$
where $p$ counts every column of $\mathbf{X}$, intercept included, matching the $n - p$ residual degrees of freedom used earlier.
The adjustment is small for $n \gg p$ and aggressive when $p$ approaches $n$. Use $R^2$ to summarise a single fit; use adjusted $R^2$ (or, better, cross-validation as in §5.13) to compare models with different numbers of predictors.
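For the worked example, both quantities take one line each (with $p = 2$ counting the intercept column, per the convention above):

```python
import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([2., 4., 5.])
n, p = X.shape

beta = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta) ** 2)            # ~0.167
tss = np.sum((y - y.mean()) ** 2)            # ~4.667
r2 = 1 - rss / tss                           # ~0.964
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)    # ~0.929: the penalty bites hard
print(r2, r2_adj)                            # because p is close to n here
```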
Diagnostics
Reading the regression output is half the job. The other half is checking that the assumptions hold. The standard diagnostic plots are quick to produce and quick to read:
- Residuals vs fitted plot. Plot $e_i = y_i - \hat y_i$ against $\hat y_i$. The cloud should be flat around zero with no funnels or curves. A U-shape suggests missing nonlinearity; a fan that widens with $\hat y_i$ suggests heteroscedasticity.
- QQ plot of residuals. Order the standardised residuals and plot them against the corresponding quantiles of a standard normal. A straight line is what the Gaussian assumption demands. Curvature in the tails indicates heavy tails or skew.
- Leverage and Cook's distance. The hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ has diagonal $h_{ii}$ measuring how much the fitted value $\hat y_i$ depends on $y_i$ itself. Average leverage is $p/n$; values above $2p/n$ are worth inspecting. Cook's distance $D_i$ combines leverage with the residual to measure how much the whole coefficient vector would shift if observation $i$ were removed. Influential observations are not necessarily wrong, but they are doing more work than their share.
- Variance inflation factor (VIF). For predictor $j$, regress it on all the other predictors and read off $R^2_j$; then $\text{VIF}_j = 1/(1 - R^2_j)$. A value above five (some say ten) indicates that the predictor's coefficient standard error has been inflated by collinearity.
A clean residuals-vs-fitted plot and a straight QQ plot together vouch for most of the assumptions you needed to make. Glance at them before you read a single $p$-value.
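The numerical diagnostics are a few lines of NumPy. This sketch (data invented for illustration, with `x2` built to be nearly collinear with `x1`) computes leverage, Cook's distance, and the VIF directly from their definitions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data with two nearly collinear predictors plus an intercept.
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)       # almost a copy of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + x1 + rng.normal(size=n)
p = X.shape[1]

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - p)

# Leverage: the diagonal of the hat matrix H = X (X'X)^{-1} X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
print((h > 2 * p / n).sum(), "points above the 2p/n leverage threshold")

# Cook's distance combines each residual with its leverage.
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2
print(cooks.max())

# VIF for x1: regress it on the other columns and invert 1 - R^2_j.
others = np.column_stack([np.ones(n), x2])
coef = np.linalg.lstsq(others, x1, rcond=None)[0]
r2_j = 1 - np.sum((x1 - others @ coef) ** 2) / np.sum((x1 - x1.mean()) ** 2)
print(1 / (1 - r2_j))    # far above 5: collinearity has inflated the SE
```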
Where regression fits in modern AI
It is tempting to think of linear regression as a relic from before the deep-learning era, useful only for explaining things to undergraduates. The opposite is closer to the truth. Almost every model in this book is linear regression with the dial turned up.
- The linear layer of every neural network computes the same affine map: a matrix multiplied by an input plus a bias. The differences are that the parameters are fitted by stochastic gradient descent rather than the closed-form normal equations, and the layer is composed with non-linear activations.
- Logistic regression is the GLM analogue, covered in §5.11. The outcome is Bernoulli rather than Gaussian, and the link function is the sigmoid rather than the identity. The training objective is no longer least squares but cross-entropy.
- Bayesian linear regression places a Gaussian prior on $\boldsymbol{\beta}$ and yields a Gaussian posterior in closed form, giving uncertainty estimates around predictions for free. It is the simplest place to see the Bayesian machinery of §5.6 in action.
- Ridge ($L_2$-penalised) and lasso ($L_1$-penalised) regression extend OLS for high-dimensional or correlated predictor settings (ridge's one-line closed form is sketched after this list). Both are biased estimators that often beat OLS on out-of-sample mean squared error, and lasso has the bonus of zeroing out coefficients on irrelevant predictors, doing implicit feature selection.
- Polynomial regression is the same machinery applied to expanded features $(x, x^2, x^3, \ldots)$. The model is non-linear in $x$ but still linear in $\boldsymbol{\beta}$, so all the OLS theory carries over unchanged.
- Kernel regression and Gaussian processes are infinite-dimensional cousins where $\mathbf{X}$ is replaced by a feature map, and transformers can be read as a particularly elaborate non-linear regression in which the design matrix is built dynamically from attention.
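As a minimal sketch of the ridge idea referenced above (penalty strength $\lambda$ and data both invented for illustration), the closed form changes by a single term: add $\lambda \mathbf{I}$ to $\mathbf{X}^\top\mathbf{X}$ before solving, so $\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented, strongly collinear predictors, where OLS coefficients are unstable.
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = x1 + x2 + rng.normal(size=n)

lam = 1.0                     # illustrative penalty strength
I = np.eye(X.shape[1])
I[0, 0] = 0.0                 # conventionally leave the intercept unpenalised

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print(beta_ols)     # wild, offsetting coefficients on the collinear pair
print(beta_ridge)   # shrunk toward sensible shared values
```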
Once you have internalised OLS, the rest of supervised learning is a series of generalisations: change the loss to handle a non-Gaussian outcome, regularise to handle high-dimensional predictors, compose with non-linearities to handle complicated functions, and replace the closed form with stochastic optimisation when the data no longer fits in memory. Each step builds on the last.
What you should take away
- The OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is the closed-form solution to the normal equations and the maximum likelihood estimator under Gaussian errors.
- Five assumptions (linearity, independence, homoscedasticity, normality, and no multicollinearity) underwrite the standard inferential output. Estimation is more robust than inference; predictions can survive violations that destroy $p$-values.
- The sampling distribution $\hat{\boldsymbol{\beta}} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$ generates the $t$-tests and confidence intervals you read off the regression table.
- $R^2$ summarises in-sample fit but is non-decreasing in the number of predictors; use adjusted $R^2$ or cross-validation to compare models honestly. Always look at the residual and QQ plots before trusting any number in the output.
- Linear regression is not a museum piece. It is the dense layer of every neural network, the kernel of every GLM, and the baseline against which more elaborate models must justify their additional complexity.