Linear Regression, Glossary, Textbook of AI

Linear regression fits a linear function $y = w^\top x + b$ to data $\{(x_n, y_n)\}_{n=1}^N$ by minimising the squared error:

$$\min_{w, b} \frac{1}{N} \sum_n (y_n - w^\top x_n - b)^2$$

In matrix form (absorbing $b$ into $w$ by appending a 1 to $x$): with $X \in \mathbb{R}^{N \times d}$ stacking the inputs and $y \in \mathbb{R}^N$ the targets,

$$\min_w \|y - Xw\|^2$$

Closed-form solution (when $X^\top X$ is invertible):

$$\hat w = (X^\top X)^{-1} X^\top y$$

These are the normal equations; the matrix $(X^\top X)^{-1} X^\top$ is the pseudoinverse $X^+$ when $X$ has full column rank.

Probabilistic interpretation: assuming $y_n = w^\top x_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, maximum-likelihood estimation of $w$ is exactly squared-error minimisation. Linear regression is the MLE under Gaussian noise.

Regularised variants:

Ridge regression (Tikhonov regularisation): add $\frac{\lambda}{2} \|w\|^2$.

$$\hat w_\mathrm{ridge} = (X^\top X + \lambda I)^{-1} X^\top y$$

The added $\lambda I$ stabilises the matrix inversion when $X^\top X$ is ill-conditioned and shrinks coefficients towards zero. Equivalent to MAP estimation with a Gaussian prior $w \sim \mathcal{N}(0, (\lambda)^{-1} I)$.

Lasso regression adds $\lambda \|w\|_1$. No closed form; solved by coordinate descent or proximal methods. Induces sparsity, many coefficients exactly zero, useful for feature selection.

Elastic net combines both: $\lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|^2$.

Computational notes:

Direct matrix inversion: $O(d^3)$, prohibitive for $d > 10^4$.
QR decomposition or Cholesky of $X^\top X$: more stable, same complexity.
Conjugate gradient: iterative method, $O(d^2)$ per iteration, useful for very large $d$.
SGD for streaming/online linear regression.

Generalisations:

Polynomial regression: linear regression in polynomial features $\phi(x) = (1, x, x^2, \ldots)$.
Generalised linear models (GLM): replace Gaussian likelihood with another exponential family (logistic regression for Bernoulli, Poisson regression for counts, etc.).
Generalised additive models (GAM): $y = \sum_i f_i(x_i) + \epsilon$ with each $f_i$ a smooth nonlinear function.
Neural networks: replace the linear $w^\top x$ with a deep nonlinear function.

Despite its simplicity, linear regression remains the standard baseline and a foundational tool in statistics, econometrics and machine learning. The R-squared, residual analysis, leverage, and influence diagnostics all developed for linear regression carry over to many more sophisticated methods.

Video

Discussed in:

Chapter 7: Supervised Learning, Supervised Learning

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.