7.4 Generalised linear models

Linear and logistic regression are members of a single family, the generalised linear models (GLMs) of Nelder and Wedderburn (1972). A GLM has three ingredients:

  1. A random component: $y\mid\mathbf{x}$ comes from an exponential-family distribution.
  2. A linear predictor: $\eta = \mathbf{w}^\top\mathbf{x}$.
  3. A link function $g$: $g(\mu) = \eta$, where $\mu = \mathbb{E}[y\mid\mathbf{x}]$.

The exponential family has density of the form

$$p(y\mid \theta, \phi) = \exp\!\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right),$$

where $\theta$ is the natural parameter, $b(\cdot)$ is the cumulant function (with $\mathbb{E}[y]=b'(\theta)$ and $\text{Var}(y)=a(\phi) b''(\theta)$), and $\phi$ is a dispersion parameter. The canonical link is the function that makes $\theta = \eta$ directly.
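As a sanity check on these definitions, the Bernoulli distribution can be put into this form, and the logit drops out as its canonical link:

$$p(y\mid\mu) = \mu^y (1-\mu)^{1-y} = \exp\!\left(y \log\frac{\mu}{1-\mu} + \log(1-\mu)\right),$$

so $\theta = \log\frac{\mu}{1-\mu}$, $a(\phi) = 1$, and $b(\theta) = \log(1 + e^\theta)$; differentiating confirms $b'(\theta) = \frac{e^\theta}{1+e^\theta} = \mu$, as required.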

| Distribution | Support | Canonical link $g(\mu)$ | Use case |
|---|---|---|---|
| Gaussian | $\mathbb{R}$ | identity, $\mu$ | linear regression |
| Bernoulli | $\{0,1\}$ | logit, $\log\frac{\mu}{1-\mu}$ | logistic regression |
| Categorical | $\{1,\ldots,K\}$ | softmax | multinomial logit |
| Poisson | $\mathbb{N}$ | log, $\log\mu$ | counts (clinic visits, photon counts) |
| Gamma | $\mathbb{R}_{>0}$ | inverse, $1/\mu$ (or log) | positive skewed data (insurance claims, durations) |
| Inverse Gaussian | $\mathbb{R}_{>0}$ | $1/\mu^2$ | reaction times |

Poisson regression. For count data with mean $\mu_i = \exp(\mathbf{w}^\top\mathbf{x}_i)$, the log-likelihood is $\sum_i [y_i\,\mathbf{w}^\top\mathbf{x}_i - \exp(\mathbf{w}^\top\mathbf{x}_i)]$ up to constants. The gradient is $\sum_i (y_i - \mu_i)\mathbf{x}_i$, the same residual-weighted feature pattern seen in linear and logistic regression. Poisson regression is the workhorse for clinic-visit counts, A/B-test click-throughs, and photon and packet counts.
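A minimal sketch of fitting Poisson regression by Fisher scoring on simulated data, using NumPy only (variable names and the simulation setup are mine, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
w_true = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ w_true))                    # counts with mean exp(w'x)

w = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ w)
    grad = X.T @ (y - mu)             # score: sum_i (y_i - mu_i) x_i, as above
    fisher = X.T @ (mu[:, None] * X)  # expected Hessian: sum_i mu_i x_i x_i^T
    w = w + np.linalg.solve(fisher, grad)
```

At convergence `w` is close to `w_true` and the score is (numerically) zero; this Fisher-scoring step is exactly one IRLS iteration for the log link.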

Gamma regression. When $y>0$ and the variance grows like the square of the mean (a common pattern in insurance claims, hospital lengths-of-stay, and times-to-event), the gamma GLM with a log link is the natural choice. Its variance function is $V(\mu) = \mu^2$, so observations with larger expected values are downweighted accordingly.
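With a log link the gamma score equations simplify: the (quasi-)score is $\sum_i (y_i/\mu_i - 1)\mathbf{x}_i$ and the Fisher information is proportional to $X^\top X$, since the $1/V(\mu)$ weighting exactly cancels the $(\partial\mu/\partial\eta)^2 = \mu^2$ factor. A sketch on simulated data (NumPy only; the shape parameter `k` belongs to the simulation, not the fit):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([0.3, 0.5])
mu_true = np.exp(X @ w_true)
k = 5.0                                    # gamma shape; dispersion ~ 1/k
y = rng.gamma(shape=k, scale=mu_true / k)  # E[y] = mu_true, Var = mu_true^2 / k

w = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ w)
    grad = X.T @ (y / mu - 1.0)   # quasi-score for log link, V(mu) = mu^2
    fisher = X.T @ X              # V(mu) cancels (dmu/deta)^2 under the log link
    w = w + np.linalg.solve(fisher, grad)
```

Note the residual here is relative, $y_i/\mu_i - 1$: that is the $V(\mu)=\mu^2$ downweighting of large-mean observations in action.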

Quasi-likelihood and overdispersion. Real count data often have variance greater than $\mu$ (overdispersion). Quasi-Poisson keeps the Poisson coefficient estimates unchanged and inflates the standard errors by an estimated dispersion factor; a negative-binomial GLM instead changes the variance function to $\mu + \mu^2/r$, which can shift the coefficient estimates as well as the standard errors.
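A standard diagnostic is the Pearson dispersion statistic $\hat\phi = \frac{1}{n-p}\sum_i (y_i - \hat\mu_i)^2/\hat\mu_i$, which should be near 1 for genuinely Poisson data. A sketch on simulated overdispersed counts (a gamma–Poisson mixture, i.e. negative binomial, with an intercept-only model; the setup is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, r = 2000, 2.0, 1.0
# Gamma-Poisson mixture = negative binomial: Var = mu + mu^2 / r > mu
y = rng.poisson(rng.gamma(shape=r, scale=mu / r, size=n))

mu_hat = y.mean()   # intercept-only Poisson MLE is just the sample mean
p = 1               # number of fitted parameters
phi_hat = ((y - mu_hat) ** 2 / mu_hat).sum() / (n - p)
print(phi_hat)      # well above 1: the Poisson variance assumption fails here
```

For this simulation the true dispersion is $1 + \mu/r = 3$, so $\hat\phi$ lands far from 1; quasi-Poisson would multiply every standard error by $\sqrt{\hat\phi}$.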

GLMs are fit by iteratively reweighted least squares (IRLS), the same algorithm we used for logistic regression; the link and variance functions determine the weights and working responses at each iteration. R's glm() and statsmodels' GLM cover the entire family.
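The IRLS recipe needs only the inverse link, its derivative, and the variance function. Here is a sketch specialized to the logit link, where $\partial\mu/\partial\eta$ and $V(\mu)$ coincide at $\mu(1-\mu)$ (NumPy only; the function name and simulation are mine):

```python
import numpy as np

def irls_logistic(X, y, iters=25):
    """Fit logistic regression by IRLS: repeatedly solve a weighted
    least-squares problem on a 'working response' z."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ w
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse logit link
        s = mu * (1.0 - mu)               # dmu/deta; also V(mu) for Bernoulli
        z = eta + (y - mu) / s            # working response
        # weighted least squares: w = (X'SX)^{-1} X'S z with S = diag(s)
        w = np.linalg.solve(X.T @ (s[:, None] * X), X.T @ (s * z))
    return w

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([0.5, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
w_hat = irls_logistic(X, y)
```

Swapping in a different link and variance function (e.g. $\exp$ and $\mu$ for Poisson) turns the same loop into the fitter for any other GLM in the table above.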
