Linear regression fits a linear function $y = w^\top x + b$ to data $\{(x_n, y_n)\}_{n=1}^N$ by minimising the squared error:
$$\min_{w, b} \frac{1}{N} \sum_n (y_n - w^\top x_n - b)^2$$
In matrix form (absorbing $b$ into $w$ by appending a 1 to $x$): with $X \in \mathbb{R}^{N \times d}$ stacking the inputs and $y \in \mathbb{R}^N$ the targets,
$$\min_w \|y - Xw\|^2$$
Closed-form solution (when $X^\top X$ is invertible):
$$\hat w = (X^\top X)^{-1} X^\top y$$
These are the normal equations; the matrix $(X^\top X)^{-1} X^\top$ is the pseudoinverse $X^+$ when $X$ has full column rank.
Probabilistic interpretation: assuming $y_n = w^\top x_n + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, maximum-likelihood estimation of $w$ is exactly squared-error minimisation. Linear regression is the MLE under Gaussian noise.
Regularised variants:
Ridge regression (Tikhonov regularisation): add $\frac{\lambda}{2} \|w\|^2$.
$$\hat w_\mathrm{ridge} = (X^\top X + \lambda I)^{-1} X^\top y$$
The added $\lambda I$ stabilises the matrix inversion when $X^\top X$ is ill-conditioned and shrinks coefficients towards zero. Equivalent to MAP estimation with a Gaussian prior $w \sim \mathcal{N}(0, (\lambda)^{-1} I)$.
Lasso regression adds $\lambda \|w\|_1$. No closed form; solved by coordinate descent or proximal methods. Induces sparsity, many coefficients exactly zero, useful for feature selection.
Elastic net combines both: $\lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|^2$.
Computational notes:
- Direct matrix inversion: $O(d^3)$, prohibitive for $d > 10^4$.
- QR decomposition or Cholesky of $X^\top X$: more stable, same complexity.
- Conjugate gradient: iterative method, $O(d^2)$ per iteration, useful for very large $d$.
- SGD for streaming/online linear regression.
Generalisations:
- Polynomial regression: linear regression in polynomial features $\phi(x) = (1, x, x^2, \ldots)$.
- Generalised linear models (GLM): replace Gaussian likelihood with another exponential family (logistic regression for Bernoulli, Poisson regression for counts, etc.).
- Generalised additive models (GAM): $y = \sum_i f_i(x_i) + \epsilon$ with each $f_i$ a smooth nonlinear function.
- Neural networks: replace the linear $w^\top x$ with a deep nonlinear function.
Despite its simplicity, linear regression remains the standard baseline and a foundational tool in statistics, econometrics and machine learning. The R-squared, residual analysis, leverage, and influence diagnostics all developed for linear regression carry over to many more sophisticated methods.
Video
Related terms: Logistic Regression, Mean Squared Error, Regularisation, Maximum Likelihood Estimation
Discussed in:
- Chapter 7: Supervised Learning, Supervised Learning