5.11 Generalised Linear Models

Ordinary least squares, the workhorse of §5.10, asks the world to behave Gaussianly: continuous responses, additive errors of constant variance, a mean that moves linearly with the inputs. The world rarely obliges. A clinician records whether a patient survived ninety days, not by how much they survived. An epidemiologist counts cases of a notifiable disease per district per week, non-negative integers, bunched near zero, occasionally exploding. A reliability engineer measures the time until a pump fails, strictly positive, right-skewed, often heavy-tailed. None of these outcomes is Gaussian, and forcing them through OLS produces predictions that fall outside the support (negative counts, probabilities greater than one) and standard errors that are simply wrong.

Generalised linear models, introduced by John Nelder and Robert Wedderburn in 1972, dissolve this awkwardness by separating what kind of randomness the response has from how the mean depends on the inputs. A single inferential machinery, iteratively reweighted least squares, deviance, score tests, then covers linear regression, logistic regression, Poisson regression, gamma regression and a dozen relatives, each emerging as the same template instantiated with a different distribution and link.

For a modern AI reader the payoff is twofold. First, every classification network you train ends in a GLM: the final softmax-and-cross-entropy layer is multinomial logistic regression sitting on top of a learned feature extractor. Second, the recipe for honest uncertainty about non-Gaussian outcomes, Wald intervals, likelihood-ratio tests, deviance residuals, comes from the GLM literature, not from the deep-learning literature. §5.10 was OLS; this section generalises it. §5.12 will go one floor up and let the coefficients themselves vary across groups.

Symbols Used Here
| Symbol | Meaning |
| --- | --- |
| $\mathbf{x}$ | feature vector |
| $\eta = \mathbf{w}^\top \mathbf{x}$ | linear predictor |
| $g$ | link function |
| $\mu = g^{-1}(\eta)$ | mean of the response |
| $V(\mu)$ | variance function from the exponential family |

Three components

A GLM is built from three pieces, and once you can name them you can read any GLM table in any paper.

Random component. The response $Y_i$, conditional on $\mathbf{x}_i$, follows a distribution from the exponential family, written in the canonical form

$$f(y; \theta, \phi) = \exp\!\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\},$$

where $\theta$ is the natural parameter, $\phi$ is a (possibly known) dispersion, and $b$, $c$ are distribution-specific functions. The mean and variance are encoded in $b$: $\mathbb{E}[Y] = b'(\theta) = \mu$ and $\operatorname{Var}(Y) = \phi b''(\theta) = \phi V(\mu)$. The variance is therefore not free; it is tied to the mean through the variance function $V$. For the Gaussian, $V(\mu) = 1$ and the mean and variance decouple. For the Poisson, $V(\mu) = \mu$, so the variance grows with the mean. For the Bernoulli, $V(\mu) = \mu(1 - \mu)$, peaking at $\mu = 0.5$. This mean-variance link is what makes a GLM a GLM, and it is what OLS gets wrong when the response is not Gaussian.
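
To make the canonical form concrete, write the Poisson probability mass function with mean $\mu$ in that shape:

$$f(y; \mu) = \frac{\mu^y e^{-\mu}}{y!} = \exp\{y \log\mu - \mu - \log y!\},$$

so $\theta = \log\mu$, $b(\theta) = e^\theta$, $\phi = 1$ and $c(y, \phi) = -\log y!$. Differentiating twice gives $b'(\theta) = e^\theta = \mu$ and $b''(\theta) = e^\theta = \mu = V(\mu)$, exactly the mean-variance link claimed above.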

Systematic component. The features enter through a linear predictor $\eta_i = \mathbf{w}^\top \mathbf{x}_i$. Linearity in the parameters, not the features: you may include $x_j^2$, $\log x_j$, splines, or interactions, exactly as in OLS. The coefficient vector $\mathbf{w}$ is what you fit.

Link function. A monotone, differentiable $g$ ties the two components together by $g(\mu_i) = \eta_i$, equivalently $\mu_i = g^{-1}(\eta_i)$. The link controls the scale on which inputs combine additively. For probabilities you want $g$ to map $(0,1)$ to $\mathbb{R}$; for positive means, $(0, \infty)$ to $\mathbb{R}$. The canonical link is the one that makes $\theta = \eta$, collapsing the natural and linear parameters; it gives algebraically tidy score equations and is the default in most software. You are free to choose a non-canonical link, probit instead of logit, identity instead of log, when interpretation demands it.

That is the entire frame. Pick a distribution; pick a link; the rest is mechanics.
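
In software the two choices are literally two arguments. A minimal sketch with Python's statsmodels (the arrays X, y and y01 are hypothetical placeholders, and the capitalised link classes assume a recent statsmodels version):

```python
import statsmodels.api as sm

# Distribution + canonical link: Poisson with log link.
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Same frame, non-canonical link: Bernoulli response with probit instead of logit.
probit_fit = sm.GLM(
    y01, X,
    family=sm.families.Binomial(link=sm.families.links.Probit()),
).fit()
```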

Common GLMs

| Distribution | Canonical link | Variance function $V(\mu)$ | Typical use |
| --- | --- | --- | --- |
| Gaussian | identity, $\eta = \mu$ | $1$ | Continuous, symmetric responses |
| Bernoulli / Binomial | logit, $\log\frac{\mu}{1-\mu}$ | $\mu(1-\mu)$ | Binary classification, proportions |
| Categorical / Multinomial | softmax (multivariate logit) | block diagonal | Multi-class classification |
| Poisson | log, $\log\mu$ | $\mu$ | Counts with no upper bound |
| Negative binomial | log | $\mu + \mu^2/k$ | Overdispersed counts |
| Gamma | inverse, $1/\mu$ (or log) | $\mu^2$ | Positive continuous, right-skewed (waiting times, costs) |
| Inverse-Gaussian | $1/\mu^2$ | $\mu^3$ | Positive heavy-tailed durations |

A few practical remarks. Binomial with a known number of trials $n_i$ subsumes Bernoulli and is what glm(cbind(success, failure) ~ x, family = binomial) expects in R. Negative binomial is the standard remedy for Poisson overdispersion, where the empirical variance exceeds the mean, common in clinical counts (admissions, infections) because of unmodelled patient heterogeneity. Gamma with a log link is a workmanlike model for healthcare costs, insurance claims and pump failure times: positive support, right skew, multiplicative covariate effects. The softmax entry is not separate machinery: it is the multivariate logit, with $K-1$ linear predictors and a reference category, and the cross-entropy loss is its negative log-likelihood. When a deep network ends in softmax followed by categorical cross-entropy, it is fitting a multinomial GLM whose features happen to be learned by the layers below.
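
For the binomial-with-trials case, statsmodels accepts the same (success, failure) pairing as R's cbind; a sketch with hypothetical count arrays successes and n_trials:

```python
import numpy as np
import statsmodels.api as sm

# Two-column response: successes and failures per group.
endog = np.column_stack([successes, n_trials - successes])
binom_fit = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
```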

The table is also a checklist. Faced with a response, ask: is it bounded? on what set? is the variance constant or growing with the mean? is it skewed? are zeros special? The answers point to a row of the table.

Worked example: Poisson regression

Suppose a hospital records the number of unplanned readmissions per ward per week, and we want to relate it to staffing levels and case mix. The response is a non-negative integer count with no fixed ceiling, so a Poisson GLM with log link is the natural starting point:

$$Y_i \mid \mathbf{x}_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = \mathbf{w}^\top \mathbf{x}_i.$$

The log link guarantees $\mu_i > 0$ and turns coefficients into multiplicative effects on the rate: a coefficient of $0.2$ on a binary indicator means readmissions are $e^{0.2} \approx 1.22$ times higher when the indicator is on. If wards are observed for different lengths of time, an offset $\log t_i$ enters the linear predictor with a fixed coefficient of one, converting counts into rates per unit time.
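
In statsmodels the offset is a keyword argument; a sketch with hypothetical exposure times t:

```python
import numpy as np
import statsmodels.api as sm

# log mu_i = w^T x_i + log t_i: the offset enters with coefficient fixed at one.
rate_fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(t)).fit()
rate_ratios = np.exp(rate_fit.params)  # multiplicative effects on the rate
```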

The log-likelihood is

$$\ell(\mathbf{w}) = \sum_i \big[y_i \mathbf{w}^\top \mathbf{x}_i - e^{\mathbf{w}^\top \mathbf{x}_i} - \log y_i!\big],$$

with score $\nabla \ell = \mathbf{X}^\top(\mathbf{y} - \boldsymbol{\mu})$ and Hessian $\nabla^2 \ell = -\mathbf{X}^\top \mathbf{W} \mathbf{X}$ where $\mathbf{W} = \operatorname{diag}(\mu_i)$. The Hessian is negative definite as long as $\mathbf{X}$ has full column rank and the $\mu_i$ are positive, so the likelihood is strictly concave: a unique maximum, found by iteratively reweighted least squares. Each IRLS step solves a weighted least-squares problem with weights $\mathbf{W}$ and pseudo-response $\mathbf{z} = \boldsymbol{\eta} + \mathbf{W}^{-1}(\mathbf{y} - \boldsymbol{\mu})$, then updates $\boldsymbol{\mu}$ and repeats. Convergence is typically under ten iterations.
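
To see how little machinery is involved, here is a minimal IRLS loop for this Poisson model in plain NumPy, a sketch for exposition rather than a production fitter:

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ w                     # linear predictor
        mu = np.exp(eta)                # inverse link
        z = eta + (y - mu) / mu         # pseudo-response
        XtW = X.T * mu                  # X^T W with W = diag(mu_i)
        w_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares
        if np.max(np.abs(w_new - w)) < tol:
            break
        w = w_new
    return w_new
```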

Two diagnostics deserve attention. The deviance, $D = 2[\ell_{\text{sat}} - \ell(\hat{\mathbf{w}})]$, compares the fitted model to the saturated one and is the GLM analogue of the residual sum of squares. Pearson residuals $r_i = (y_i - \hat\mu_i)/\sqrt{\hat\mu_i}$ should look roughly homoscedastic and free of structure when plotted against $\hat\eta_i$; a fan shape signals overdispersion and points you towards a negative binomial model.
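
Both diagnostics are a few lines of NumPy; for the Poisson, the saturated model puts $\hat\mu_i = y_i$, giving the closed form below (with the convention $0 \log 0 = 0$):

```python
import numpy as np

def poisson_diagnostics(y, mu_hat):
    """Deviance and Pearson residuals for a fitted Poisson GLM."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu_hat), 0.0)  # 0 log 0 = 0
    deviance = 2.0 * np.sum(term - (y - mu_hat))
    pearson = (y - mu_hat) / np.sqrt(mu_hat)
    return deviance, pearson
```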

Logistic regression as a GLM

For binary responses $y_i \in \{0,1\}$ with $\Pr(Y_i = 1 \mid \mathbf{x}_i) = p_i$, the Bernoulli GLM with canonical logit link is

$$\log \frac{p_i}{1 - p_i} = \mathbf{w}^\top \mathbf{x}_i, \qquad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i),$$

with $\sigma(z) = 1/(1 + e^{-z})$. The log-likelihood

$$\ell(\mathbf{w}) = \sum_i \big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$

has score $\mathbf{X}^\top(\mathbf{y} - \mathbf{p})$ and Hessian $-\mathbf{X}^\top \mathbf{W}\mathbf{X}$ with $W_{ii} = p_i(1 - p_i) > 0$. It is again strictly concave; Newton–Raphson, equivalently IRLS, converges in five to eight iterations on well-conditioned data.

Coefficients are log odds ratios: $w_j$ is the change in $\log[p/(1-p)]$ for a unit increase in $x_j$, holding the rest fixed; $e^{w_j}$ is the corresponding odds ratio, which is what clinicians actually report. Decision boundaries are linear in $\mathbf{x}$ at any chosen probability threshold. Logistic regression is also the maximum-entropy distribution consistent with matching the empirical feature expectations under a binary outcome, a satisfying coincidence that reappears in conditional random fields and energy-based models.
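
Extracting those odds ratios from a fit is one line each; a statsmodels sketch on hypothetical X, y (with X including an intercept column):

```python
import numpy as np
import statsmodels.api as sm

logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
odds_ratios = np.exp(logit_fit.params)    # e^{w_j} per unit increase in x_j
or_ci = np.exp(logit_fit.conf_int())      # 95% Wald interval, odds-ratio scale
```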

Inference

Maximum likelihood gives more than point estimates. The observed information $\mathbf{I}(\hat{\mathbf{w}}) = \mathbf{X}^\top \hat{\mathbf{W}} \mathbf{X}$, the same matrix that drives IRLS, is, at convergence, the inverse of the asymptotic covariance of $\hat{\mathbf{w}}$. From it we read off standard errors, Wald confidence intervals $\hat w_j \pm 1.96\,\widehat{\operatorname{SE}}(\hat w_j)$, and Wald $z$-statistics for testing $w_j = 0$. Likelihood-ratio tests compare nested models by twice the difference in log-likelihoods, distributed asymptotically as $\chi^2$ on the difference in parameter counts; they are usually preferred to Wald tests when the likelihood is asymmetric. The deviance plays the role of residual sum of squares: an asymptotic $\chi^2_{n - p}$ reference (when valid) gives a goodness-of-fit check, and changes in deviance compare nested models exactly as ANOVA does for OLS.
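
A likelihood-ratio test between nested fits takes three lines; a sketch assuming hypothetical nested designs X_null and X_full:

```python
from scipy import stats
import statsmodels.api as sm

fit_full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
fit_null = sm.GLM(y, X_null, family=sm.families.Poisson()).fit()

lr_stat = 2.0 * (fit_full.llf - fit_null.llf)   # twice the log-likelihood gap
df = X_full.shape[1] - X_null.shape[1]          # number of extra parameters
p_value = stats.chi2.sf(lr_stat, df)            # asymptotic chi-squared reference
```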

In practice, glm() in R and statsmodels.GLM in Python report all of this from one call: coefficient table with estimates, SEs, $z$-values and $p$-values; null and residual deviance; AIC; and dispersion estimate. For binomial and multinomial models, predict(type = "response") returns probabilities; for Poisson, fitted rates. Robust ("sandwich") standard errors are a one-liner when the variance assumption is suspect, and bootstrap intervals (§5.9) are a model-agnostic fallback when the asymptotic theory wobbles.

Three caveats are worth carrying in your back pocket. Separation in logistic regression, a feature that perfectly predicts the outcome on the training data, pushes $\hat w_j$ to infinity, blows up standard errors and renders Wald tests meaningless; Firth's penalised likelihood or a weakly informative Bayesian prior fixes it. Overdispersion in Poisson models, empirical variance much greater than the mean, invalidates Wald and likelihood-ratio inference; switch to a negative binomial GLM or a quasi-Poisson with a free dispersion parameter. Sparse data at small sample sizes makes the asymptotic $\chi^2$ reference distribution untrustworthy; exact tests, the bootstrap, or a fully Bayesian fit (§5.6) are safer.
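
The overdispersion check is mechanical; a sketch on a hypothetical fitted Poisson result fit (note that sm.families.NegativeBinomial fixes the shape parameter at a default rather than estimating it, so treat this as a first pass):

```python
import statsmodels.api as sm

# Pearson chi-squared over residual degrees of freedom: ~1 if Poisson holds.
phi_hat = fit.pearson_chi2 / fit.df_resid
if phi_hat > 2:   # rule-of-thumb threshold, not a formal test
    nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()
```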

Where GLMs appear in modern ML

GLMs are not a quaint corner of classical statistics; they sit at several hot junctions of contemporary machine learning.

  • The output layer of every classification neural network is a GLM. A sigmoid head with binary cross-entropy is logistic regression on the penultimate features; a softmax head with categorical cross-entropy is multinomial logistic regression. Backpropagation simply propagates the GLM score $\mathbf{X}^\top(\mathbf{y} - \mathbf{p})$, evaluated on learned features, through the rest of the network.
  • Calibration. Modern classifiers, especially deep ones, often produce overconfident probabilities. Platt scaling fits a one-dimensional logistic GLM to the classifier's logits to recover honest probabilities; temperature scaling is its single-parameter cousin. Both are post-hoc GLM fits; a sketch follows this list.
  • Tabular baselines. On structured data, a regularised GLM (logistic with $\ell_1/\ell_2$, Poisson with log link, gamma with log link) is the baseline gradient boosting must beat, and frequently does so by less than practitioners assume.
  • Survival and recurrent events. The Cox proportional hazards model is a partial-likelihood GLM on the log-hazard scale; Poisson regression on person-time slices recovers it as a limiting case.
  • Domain practice. Clinical trials, pharmacoepidemiology, insurance pricing and econometrics live on GLMs. When an AI system is deployed in those domains, its outputs are typically benchmarked against, and combined with, GLM forecasts.
  • Variational inference. Exponential-family conjugacy makes the variational lower bound factorise into closed-form coordinate updates in models such as latent Dirichlet allocation; the same structure recurs in mean-field Bayesian neural networks.
  • Generalised additive models replace each $w_j x_j$ with a smooth function $f_j(x_j)$ fitted by penalised splines, preserving the GLM scaffolding; they remain a strong, interpretable baseline against tree ensembles and shallow networks on tabular data.
  • Mixed effects and longitudinal data. Generalised linear mixed models add random effects to a GLM linear predictor, supporting clustered data such as repeated measurements on the same patient or pupils within schools, the bridge to §5.12.
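
The calibration bullet above is worth seeing as code: Platt scaling really is a one-dimensional logistic GLM fitted to held-out logits. A minimal sketch, using the numerically stable cross-entropy identity $\log(1 + e^z) - yz$:

```python
import numpy as np
from scipy.optimize import minimize

def platt_scale(logits, labels):
    """Fit p = sigmoid(a * logit + b) by maximum likelihood: a 1-D logistic GLM."""
    def nll(params):
        a, b = params
        z = a * logits + b
        return np.mean(np.logaddexp(0.0, z) - labels * z)  # stable cross-entropy
    res = minimize(nll, x0=np.array([1.0, 0.0]), method="BFGS")
    return res.x  # recalibration slope a and intercept b
```

Fixing b = 0 and fitting only the slope a recovers temperature scaling (with a as the inverse temperature).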

A small habit is worth forming. When you reach for a sigmoid-and-cross-entropy loss in PyTorch or JAX, pause for a beat and ask the GLM questions out loud: what is the response distribution? what is the link? what would the Wald interval on this coefficient look like? are my assumptions about the variance function plausible? The answers may not change the code, but they make the resulting probabilities far more trustworthy in the report you write afterwards.

What you should take away

  1. Three components. Pick a distribution from the exponential family, a linear predictor $\eta = \mathbf{w}^\top \mathbf{x}$ and a link $g$ with $g(\mu) = \eta$. Everything else follows.
  2. The link sets the scale. Logit for probabilities, log for rates, identity for unbounded continuous responses, inverse for positive scales. Canonical links are tidy; non-canonical links are sometimes more interpretable.
  3. One algorithm fits them all. IRLS, Newton–Raphson on the log-likelihood, converges in a handful of iterations and falls out of the same matrix algebra as weighted least squares.
  4. You get honest inference for free. Standard errors, Wald and likelihood-ratio tests, deviance and AIC arrive automatically from the maximum-likelihood machinery.
  5. Every classification network is a GLM at the top. Softmax-plus-cross-entropy is multinomial logistic regression on learned features; calibration, robustness and uncertainty quantification all draw on the GLM toolkit.
