Logistic Regression, Glossary, Textbook of AI

Logistic regression is a linear classifier that models the conditional probability of class membership by passing a linear score through a sigmoid (or, for multi-class problems, a softmax).

Binary logistic regression:

$$P(y = 1 | x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

where $\sigma$ is the sigmoid function. Equivalently, the log-odds (logit) is linear in $x$:

$$\log \frac{P(y=1|x)}{P(y=0|x)} = w^\top x + b$$
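As a small illustration of the two formulas above, the following sketch computes $P(y = 1 | x)$ and then recovers the linear score from the log-odds. The weights, bias and input are made-up values, and the helper names `sigmoid` and `predict_proba` are chosen here only for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so that exp never receives a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x, b):
    """P(y = 1 | x) for a weight list w, feature list x and bias b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)

# Hypothetical parameters and input, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]

p = predict_proba(w, x, b)          # sigmoid of the score 2*1 - 1*3 + 0.5 = -0.5
log_odds = math.log(p / (1.0 - p))  # recovers the linear score w.x + b
```

Taking the log-odds of the predicted probability returns the original score, which is the sense in which the logit is linear in $x$.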

Multi-class logistic regression (also called softmax regression):

$$P(y = k | x) = \frac{e^{w_k^\top x + b_k}}{\sum_j e^{w_j^\top x + b_j}}$$
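The softmax formula can be sketched in NumPy as below. Subtracting the per-row maximum before exponentiating is a common numerical safeguard rather than something the formula itself requires; it leaves the result unchanged because the shift cancels between numerator and denominator. The weight matrix `W`, bias `b` and inputs `X` are hypothetical values.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical 3-class model on 2 features.
W = np.array([[1.0, -1.0], [0.0, 0.5], [-1.0, 2.0]])  # one row w_k per class
b = np.array([0.0, 0.1, -0.1])
X = np.array([[1.0, 2.0], [1000.0, -1000.0]])  # second row would overflow a naive exp

P = softmax(X @ W.T + b)  # shape (2, 3); each row is a probability distribution
```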

Maximum likelihood training maximises the log-likelihood, equivalently minimises binary cross-entropy (or categorical cross-entropy for multi-class):

$$\mathcal{L}(w, b) = -\sum_n \!\left[y_n \log \hat p_n + (1 - y_n) \log(1 - \hat p_n)\right]$$

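where $\hat p_n = \sigma(w^\top x_n + b)$. The objective is convex, so every local minimum is a global minimum. The minimiser is finite and unique provided the data is not perfectly separable and the features are not linearly dependent. On perfectly separable data the likelihood can always be increased by scaling up $w$, so no finite MLE exists and regularisation is essential.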
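A sketch of how this loss might be evaluated in practice, assuming NumPy. Rather than computing $\hat p_n$ and taking its logarithm, it uses the algebraically equivalent per-example form $\log(1 + e^{z_n}) - y_n z_n$ with $z_n = w^\top x_n + b$, which stays finite when a probability rounds to 0 or 1. The function name `bce_loss` and the toy data are illustrative only.

```python
import numpy as np

def bce_loss(w, b, X, y):
    """Summed binary cross-entropy, computed from the scores z = Xw + b.

    Uses the identity -[y log p + (1 - y) log(1 - p)] = log(1 + e^z) - y z,
    which avoids taking the log of a probability that has rounded to 0 or 1.
    """
    z = X @ w + b
    return np.sum(np.logaddexp(0.0, z) - y * z)

# Toy data, made up for illustration.
X = np.array([[0.5, 1.0], [-1.5, 0.2], [2.0, -0.3], [-0.4, -1.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# With all parameters at zero every prediction is 0.5, so the loss is 4 * log(2).
loss_at_zero = bce_loss(np.zeros(2), 0.0, X, y)
```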

Gradient with respect to weights:

$$\nabla_w \mathcal{L} = \sum_n (\hat p_n - y_n) x_n$$

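This simple form, error times input, matches the gradient of the squared-error loss for linear regression up to a constant factor, with $\hat p_n$ in place of the linear prediction $w^\top x_n + b$. The shared form across regression and classification reflects the underlying unity of generalised linear models.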
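One way to gain confidence in the gradient formula is to compare it against central finite differences of the loss. The sketch below does this on random data, assuming NumPy; the data, seed and step `eps` are arbitrary choices, and the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # stable form of 1 / (1 + exp(-z))

def loss(w, b, X, y):
    z = X @ w + b
    return np.sum(np.logaddexp(0.0, z) - y * z)  # summed binary cross-entropy

def grad_w(w, b, X, y):
    """Analytic gradient with respect to w: sum_n (p_n - y_n) x_n."""
    return X.T @ (sigmoid(X @ w + b) - y)

# Random data and parameters, used only to exercise the formulas.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.1

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, b, X, y) - loss(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_diff = np.max(np.abs(numeric - grad_w(w, b, X, y)))
```

If the analytic gradient is right, `max_abs_diff` should be tiny relative to the gradient entries themselves.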

Optimisation: because the objective is convex, any method that converges reaches a global optimum. Common choices are gradient descent (converges for a sufficiently small learning rate), Newton's method (uses the Hessian and converges quadratically near the optimum), L-BFGS (quasi-Newton, a standard choice for moderate-sized problems), and coordinate descent.
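The sketch below compares two of these optimisers on synthetic data, assuming NumPy. It is an illustration under stated assumptions, not a production solver: the bias is folded into the parameter vector through a column of ones, the gradient-descent step is set to $1/L$ with $L = \tfrac14 \lambda_{\max}(X^\top X)$ (an upper bound on the curvature of the summed loss), and `fit_gd`, `fit_newton` and `true_theta` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # stable form of 1 / (1 + exp(-z))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y)

def fit_gd(X, y, n_iter=2000):
    """Gradient descent with step 1/L, where L = 0.25 * lambda_max(X^T X)."""
    step = 1.0 / (0.25 * np.linalg.eigvalsh(X.T @ X)[-1])
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= step * gradient(theta, X, y)
    return theta

def fit_newton(X, y, n_iter=25, tol=1e-10):
    """Newton's method; the Hessian is X^T diag(p (1 - p)) X."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        g = X.T @ (p - y)
        if np.linalg.norm(g) < tol:
            break
        H = X.T @ (X * (p * (1.0 - p))[:, None])
        theta -= np.linalg.solve(H, g)
    return theta

# Synthetic data with noisy labels (so not separable); the last column of ones carries the bias.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2))
X = np.hstack([features, np.ones((500, 1))])
true_theta = np.array([1.0, -2.0, 0.5])  # hypothetical generating parameters
y = (rng.random(500) < sigmoid(X @ true_theta)).astype(float)

theta_gd = fit_gd(X, y)
theta_newton = fit_newton(X, y)
```

Both routines should arrive at essentially the same parameters, with Newton's method needing far fewer iterations at the cost of solving a linear system in each one.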

Regularisation:

L2 (ridge): add $\frac{\lambda}{2} \|w\|^2$, equivalent to MAP estimation with a Gaussian prior on $w$.
L1 (lasso): add $\lambda \|w\|_1$, equivalent to a Laplace prior on $w$; induces sparsity, useful for feature selection.
Elastic net: a weighted combination of the L1 and L2 penalties.
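To illustrate why a penalty matters on separable data, the sketch below fits a one-feature model with and without the L2 term, assuming NumPy. The data, the choice $\lambda = 1$, the omission of a bias term and the function name `fit_l2` are all simplifications made for this example. Only the L2 case is shown; the L1 penalty is not differentiable at zero and is usually handled by coordinate descent or proximal methods instead.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # stable form of 1 / (1 + exp(-z))

def fit_l2(X, y, lam, n_iter=5000):
    """Gradient descent on cross-entropy + (lam / 2) * ||w||^2 (no bias term)."""
    step = 1.0 / (0.25 * np.linalg.eigvalsh(X.T @ X)[-1] + lam)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) + lam * w  # the penalty contributes lam * w
        w -= step * grad
    return w

# Perfectly separable toy data: the sign of x determines the label.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = fit_l2(X, y, lam=0.0)  # keeps growing the longer it is trained
w_ridge = fit_l2(X, y, lam=1.0)  # settles at a finite value
```

Without the penalty the weight keeps increasing with more iterations; with it, the gradient of the penalised objective reaches zero at a finite weight.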

Connection to neural networks: a logistic regression model is the simplest possible neural network, a single fully-connected layer with a sigmoid output. The output layer of a typical classification network is, in effect, logistic or softmax regression applied to the learned features.

Strengths:

Calibrated probabilities, when the model is well-specified.
Interpretable coefficients, each $w_i$ is the change in log-odds per unit change in feature $i$ with the other features held fixed, so $e^{w_i}$ is the corresponding odds ratio.
Convex optimisation, guaranteed global minimum.
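The coefficient interpretation can be checked numerically: raising one feature by a single unit while holding the rest fixed should multiply the odds by $e^{w_i}$. The sketch below assumes NumPy; the coefficients, bias and input are hypothetical values rather than a fitted model.

```python
import numpy as np

w = np.array([0.7, -1.2])  # hypothetical coefficients
b = -0.3

def odds(x):
    """Odds P(y=1|x) / P(y=0|x) under the model."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return p / (1.0 - p)

x = np.array([1.0, 2.0])
x_plus = x + np.array([1.0, 0.0])  # raise feature 0 by one unit, hold feature 1 fixed

ratio = odds(x_plus) / odds(x)  # equals exp(w[0]) up to rounding
```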

Weaknesses:

Linear decision boundary in feature space, limits expressiveness on complex data.
Feature engineering (for example polynomial or interaction features) is needed for non-linear problems; the alternatives are kernel methods or neural networks.
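A classic illustration of both weaknesses is XOR-style data, sketched below assuming NumPy. With only the raw features no linear boundary can classify more than three of the four points, whereas adding the product $x_1 x_2$ as an engineered feature makes the classes linearly separable. The helper names `fit` and `accuracy`, the step size and the iteration count are arbitrary choices for this example.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # stable form of 1 / (1 + exp(-z))

def fit(X, y, n_iter=500, step=0.1):
    """Plain gradient descent on the summed cross-entropy."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        theta -= step * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def accuracy(theta, X, y):
    """Fraction of points whose score sign matches the label."""
    return float(np.mean((X @ theta > 0) == (y == 1)))

# XOR-style data: the label is 1 exactly when the two features differ in sign.
raw = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
ones = np.ones((4, 1))  # column of ones carries the bias

X_linear = np.hstack([raw, ones])
X_product = np.hstack([raw, raw[:, :1] * raw[:, 1:], ones])  # adds x1 * x2

acc_linear = accuracy(fit(X_linear, y), X_linear, y)
acc_product = accuracy(fit(X_product, y), X_product, y)
```

The model itself is unchanged between the two fits; only the feature map differs, which is what "linear in feature space" means in practice.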

Logistic regression remains the standard baseline in nearly every supervised classification problem and the workhorse of medical statistics, biostatistics, social-science modelling and many other applied fields.


Related terms: Sigmoid Function, Cross-Entropy Loss, Linear Regression

Discussed in:

Chapter 7: Supervised Learning, Supervised Learning
