7.3 Logistic regression

Logistic regression is what happens when you take linear regression and ask it a different question. Instead of "what is the expected value of $y$ given $\mathbf{x}$?", you ask "what is the probability that $y$ is 1 given $\mathbf{x}$?". The answer must lie in $(0, 1)$, because probabilities cannot exceed one or fall below zero. A raw linear combination $\mathbf{w}^\top\mathbf{x}$ can take any real value, so it cannot itself be a probability. The fix is small but consequential: pass the linear combination through a squashing function, the sigmoid, that bends the real line into the unit interval. Train the model by minimising binary cross-entropy, the negative log-likelihood under a Bernoulli assumption, and you have logistic regression.

Despite the misleading name inherited from nineteenth-century demography, logistic regression is not a regression algorithm. It is a classifier. The "regression" refers only to the historical lineage: the same generalised-linear-model machinery that fits a regression line is being recycled to fit a classification boundary. The model itself is the workhorse of applied statistics, the single most-deployed classification algorithm in production use today, running quietly behind credit-scoring decisions, hospital risk calculators, marketing churn predictions, and click-through-rate forecasts. It is also lurking inside every modern neural network: the output layer of any binary classifier with a sigmoid head is, mathematically, a logistic regression on top of learned features. Multi-class softmax heads are its categorical generalisation. Understand this section and you understand the final layer of nearly every classification model in deep learning.

This section is the binary-classification counterpart to §7.2. Linear regression assumed Gaussian noise around a linear mean, and the maximum-likelihood solution was ordinary least squares. Logistic regression assumes Bernoulli noise around a sigmoid-of-linear mean, and the maximum-likelihood solution is cross-entropy minimisation. Same recipe, different distribution.

Symbols Used Here
$\mathbf{x}$ : input feature vector
$y \in \{0, 1\}$ : binary class label
$\mathbf{w}$ : learned coefficient vector (intercept absorbed)
$\sigma(z) = 1/(1+e^{-z})$ : sigmoid (logistic) function
$\hat p = \sigma(\mathbf{w}^\top\mathbf{x})$ : predicted probability of class 1
$\mathcal{L}$ : loss (binary cross-entropy, averaged over $n$ examples)

The model

We assume each label is drawn from a Bernoulli distribution whose parameter depends on the input:

$$P(y=1\mid\mathbf{x}) = \hat p = \sigma(\mathbf{w}^\top\mathbf{x}), \qquad \sigma(z) = \frac{1}{1+e^{-z}}.$$

The sigmoid takes any real number and returns a value in $(0, 1)$. It is monotonic, smooth, and symmetric about zero: $\sigma(0) = 0.5$, $\sigma(z) \to 1$ as $z \to \infty$, $\sigma(z) \to 0$ as $z \to -\infty$. The slope is steepest at the origin (a quarter, exactly) and flattens out at the extremes.

The decision rule is straightforward. To turn a probability into a discrete prediction, threshold at one half: predict class 1 if $\hat p > 0.5$, else class 0. Because the sigmoid crosses 0.5 exactly when its argument crosses zero, the decision boundary is the hyperplane

$$\mathbf{w}^\top\mathbf{x} = 0.$$

That is a flat surface in feature space, a line in two dimensions, a plane in three, a hyperplane in higher dimensions. So although the relationship between features and probability is non-linear (the sigmoid is curved), the boundary between the two classes is perfectly linear. Logistic regression is a linear classifier.
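The forward pass is only a few lines. Here is a minimal NumPy sketch; the function names predict_proba and predict_class are illustrative (they echo scikit-learn's naming but are not tied to it), and the intercept is assumed to be absorbed as a column of ones in the design matrix:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigma(z) = 1 / (1 + exp(-z)) for an array of logits."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])                # exp() only ever sees non-positive arguments
    out[~pos] = expz / (1.0 + expz)
    return out

def predict_proba(X, w):
    """P(y = 1 | x) = sigma(w^T x) for each row of X."""
    return sigmoid(X @ w)

def predict_class(X, w):
    """Threshold at 0.5: equivalently, predict class 1 exactly when w^T x > 0."""
    return (X @ w > 0).astype(int)
```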

The geometry becomes clearer if you rearrange the model in terms of log-odds. The odds of class 1 are $\hat p / (1 - \hat p)$. The log of the odds is the logit:

$$\mathrm{logit}(\hat p) = \log\frac{\hat p}{1 - \hat p} = \mathbf{w}^\top\mathbf{x}.$$

This is the inverse of the sigmoid: applying the logit to a probability returns the linear combination that produced it. The model is therefore linear on the log-odds scale, even though it is non-linear on the probability scale. This is the defining property of a generalised linear model with a logit link function (we will meet the wider GLM family in §7.4).

The interpretability is genuinely useful. A unit increase in feature $x_j$ adds $w_j$ to the log-odds, which multiplies the odds by $e^{w_j}$. Clinicians and statisticians read coefficients as odds ratios all the time: a coefficient of $0.7$ corresponds to roughly a doubling of odds per unit change.
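A two-line check makes the odds-ratio reading concrete; the coefficient values below are invented purely for illustration:

```python
import numpy as np

w = np.array([0.7, -0.2, 1.1])   # hypothetical fitted coefficients for three features

# A unit increase in feature j multiplies the odds of class 1 by exp(w_j).
print(np.exp(w))                 # approx [2.01, 0.82, 3.00]
```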

The loss: binary cross-entropy

To fit $\mathbf{w}$ we maximise the likelihood of the observed labels. Because each $y_i \sim \mathrm{Bernoulli}(\hat p_i)$, the probability mass function evaluated at the observed label is

$$P(y_i \mid \mathbf{x}_i) = \hat p_i^{y_i}(1 - \hat p_i)^{1 - y_i}.$$

This compact expression handles both cases at once: if $y_i = 1$ it returns $\hat p_i$; if $y_i = 0$ it returns $1 - \hat p_i$. Taking the log of the likelihood across $n$ independent training examples and averaging, we obtain the log-likelihood per example:

$$\frac{1}{n}\sum_{i=1}^n \big[y_i \log \hat p_i + (1 - y_i) \log(1 - \hat p_i)\big].$$

By convention we minimise the negative of this quantity, giving the binary cross-entropy loss:

$$\mathcal{L}(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^n \big[y_i \log \hat p_i + (1 - y_i) \log(1 - \hat p_i)\big].$$

Two derivations lead to exactly the same loss, which is reassuring. The probabilistic route, just shown, is maximum likelihood under a Bernoulli model. The information-theoretic route arrives at cross-entropy by asking how many bits are needed to encode the true label distribution under a model distribution: minimising cross-entropy minimises the gap (the Kullback–Leibler divergence) between the two distributions.

The shape of the loss is worth understanding intuitively. When $y_i = 1$ and $\hat p_i$ is close to one, $\log \hat p_i \approx 0$, almost no penalty. When $y_i = 1$ and $\hat p_i$ drifts towards zero, $\log \hat p_i \to -\infty$, the penalty grows without bound. Cross-entropy is unforgiving of confident wrong answers; it is the same property that makes it the loss of choice for neural classifiers. Compared with squared error on probabilities, cross-entropy yields steeper gradients when the model is badly wrong, which speeds learning.
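The asymmetry is easy to see numerically. Below is a minimal implementation of the loss; the eps clipping is a standard numerical guard against $\log 0$, not part of the definition:

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Average negative Bernoulli log-likelihood of the labels under the predictions."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_hat) + (1.0 - y) * np.log(1.0 - p_hat))

y = np.array([1.0, 1.0, 0.0])
print(binary_cross_entropy(y, np.array([0.9, 0.8, 0.1])))  # confidently right: ~0.14
print(binary_cross_entropy(y, np.array([0.1, 0.2, 0.9])))  # confidently wrong: ~2.07
```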

The gradient takes a clean form:

$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{n}\sum_{i=1}^n (\hat p_i - y_i)\,\mathbf{x}_i.$$

The error $(\hat p_i - y_i)$ is the difference between predicted probability and observed label. The gradient is the input vector weighted by this prediction error and averaged. This is the same form as the gradient of squared error in linear regression, a deliberate consequence of the cross-entropy / sigmoid pairing, and it is one reason backpropagation through sigmoid output layers is numerically well-behaved.
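Because the gradient is just the input weighted by the prediction error, batch gradient descent is a short loop. A minimal NumPy sketch follows; the learning rate and iteration count are illustrative defaults rather than tuned values, and fit_logistic_gd is a name invented here:

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on binary cross-entropy.
    X is (n, d) with a leading column of ones for the intercept; y holds 0/1 labels."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (p_hat - y) / n             # (1/n) sum_i (p_hat_i - y_i) x_i
        w -= lr * grad
    return w
```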

No closed form, but convex

Linear regression had a closed form: the normal equations $\mathbf{w}^\star = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ give the optimum in one matrix inversion. Logistic regression has no such luxury. Setting the gradient to zero, $\sum_i (\hat p_i - y_i)\,\mathbf{x}_i = \mathbf{0}$, produces a system of equations that is non-linear in $\mathbf{w}$: the sigmoid sits stubbornly inside $\hat p_i$. There is no algebraic rearrangement that isolates the optimum.

Fortunately, the loss has a property that makes iterative optimisation almost as good: it is convex in $\mathbf{w}$. Convexity means the loss surface is bowl-shaped; any local minimum is the global minimum, and gradient descent, provided the step size is reasonable, is guaranteed to converge to it. There are no awkward saddle points or competing basins of attraction. This is a direct consequence of the sigmoid–cross-entropy pairing: the Hessian of $\mathcal{L}$ is positive semidefinite, equal to $\frac{1}{n}\mathbf{X}^\top\mathbf{D}\mathbf{X}$ where $\mathbf{D}$ is the diagonal matrix with entries $\hat p_i(1 - \hat p_i)$.

Two algorithms dominate in practice. Plain gradient descent (or its stochastic variants) repeats $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} \mathcal{L}$ until convergence and is what neural-network training uses by default. Newton's method, the classical second-order alternative, takes the step $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1}\nabla_{\mathbf{w}} \mathcal{L}$. When you write Newton's method out for logistic regression, the resulting iteration looks structurally identical to weighted least squares on a transformed target, hence its venerable name, iteratively reweighted least squares (IRLS). IRLS converges in a handful of iterations on well-conditioned problems and is the algorithm behind R's glm() and most classical statistical packages. For very high-dimensional problems, L-BFGS or stochastic gradient methods are preferred, because Newton's method requires inverting a $d \times d$ Hessian per step.
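For completeness, a bare-bones Newton/IRLS iteration is sketched below, following the update $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{H}^{-1}\nabla_{\mathbf{w}}\mathcal{L}$ with the Hessian given above. The small jitter term added to the Hessian is a practical safeguard against perfectly separable data, not part of the classical algorithm:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iters=25, jitter=1e-8):
    """Newton's method / IRLS for logistic regression. Each step solves a weighted
    least-squares system with weights p_i (1 - p_i); the common 1/n factor cancels."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        D = p * (1.0 - p)                                 # diagonal weights
        H = X.T @ (X * D[:, None]) + jitter * np.eye(d)   # X^T D X
        g = X.T @ (p - y)                                 # gradient of the summed loss
        w -= np.linalg.solve(H, g)
    return w
```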

Worked example

Consider a small two-feature problem with four training points, two per class:

$x_1$ $x_2$ $y$
2.0 2.0 1
2.5 1.5 1
0.5 1.0 0
1.0 1.0 0

Suppose, after training, the fitted coefficients are $\mathbf{w} = (-3, 1, 1)$, where the first entry is the bias and the others multiply $x_1$ and $x_2$ respectively. The model is

$$\hat p(\mathbf{x}) = \sigma(-3 + x_1 + x_2).$$

Evaluate it on a positive example $\mathbf{x} = (2, 2)$. The linear combination is $-3 + 2 + 2 = 1$. The sigmoid of one is

$$\sigma(1) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.3679} \approx 0.731.$$

So $\hat p \approx 0.731$, comfortably above 0.5: predict class 1.

Now evaluate on a negative example $\mathbf{x} = (1, 1)$. The linear combination is $-3 + 1 + 1 = -1$. The sigmoid of minus one is

$$\sigma(-1) = \frac{1}{1 + e^{1}} = \frac{1}{1 + 2.7183} \approx 0.269.$$

So $\hat p \approx 0.269$, comfortably below 0.5: predict class 0. The two predictions are mirror images about 0.5 because the inputs are symmetric about the decision boundary $x_1 + x_2 = 3$.

A point on the boundary, say $\mathbf{x} = (1.5, 1.5)$, has linear combination $-3 + 1.5 + 1.5 = 0$, so $\hat p = \sigma(0) = 0.5$, maximum uncertainty. Move a small distance perpendicular to the boundary and the probability shifts smoothly: $(1.6, 1.6)$ gives $\sigma(0.2) \approx 0.550$, while $(1.4, 1.4)$ gives $\sigma(-0.2) \approx 0.450$. The sigmoid is steepest near the boundary and saturates further away, a useful property, because the model expresses appropriate uncertainty in the contested middle and high confidence on points it considers far from the boundary.

It is instructive to compute the loss on this small dataset. The point $(2.5, 1.5)$ also gives $\sigma(1) \approx 0.731$, contributing $-\log(0.731) \approx 0.313$. The negative point $(0.5, 1.0)$ has logit $-1.5$, $\hat p \approx 0.182$, contributing $-\log(0.818) \approx 0.201$. The negative point $(1.0, 1.0)$ contributes $-\log(0.731) \approx 0.313$, as does the first positive point $(2.0, 2.0)$. Average loss is roughly $0.285$ nats per example, well below the $\log 2 \approx 0.693$ that an uninformed model would incur on a balanced sample.
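The whole worked example fits in a few lines of NumPy, reproducing the probabilities and the average loss quoted above:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# The four training points from the table, with a leading 1 for the intercept.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 2.5, 1.5],
              [1.0, 0.5, 1.0],
              [1.0, 1.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([-3.0, 1.0, 1.0])   # the fitted coefficients assumed in the text

p_hat = sigmoid(X @ w)
print(p_hat.round(3))            # [0.731 0.731 0.182 0.269]

loss = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(round(loss, 3))            # approx 0.285
```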

Multinomial extension: softmax regression

For $K > 2$ classes, the binary sigmoid is replaced by the softmax function. Each class $k$ has its own coefficient vector $\mathbf{w}_k$, stacked into a matrix $\mathbf{W}$. The model is

$$\hat p_k(\mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top\mathbf{x})}{\sum_{j=1}^K \exp(\mathbf{w}_j^\top\mathbf{x})}.$$

The vector $\mathbf{W}\mathbf{x}$ is called the logits vector, the unnormalised scores, or relative log-odds, for each class. Softmax exponentiates each logit and divides by the sum, producing a probability distribution over the $K$ classes that sums to one by construction.

The corresponding loss is categorical cross-entropy:

$$\mathcal{L}(\mathbf{W}) = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}[y_i = k]\,\log \hat p_k(\mathbf{x}_i),$$

which simplifies, because the indicator picks out a single term per example, to the negative log-probability assigned to the true class. When $K = 2$ this reduces exactly to binary cross-entropy with a sigmoid (the two coefficient vectors collapse into their difference, which plays the role of $\mathbf{w}$), so softmax regression is a strict generalisation.

Softmax has a redundancy: adding the same constant to every logit leaves the probabilities unchanged, because the constant cancels in the numerator and denominator. To remove the ambiguity one column of $\mathbf{W}$ can be fixed at zero, yielding the asymmetric "reference category" parameterisation familiar from epidemiological multinomial logistic regression. Most deep-learning implementations skip the constraint and let the optimiser find any equivalent solution.

The gradient of categorical cross-entropy with respect to the logits is again clean: $\hat{\mathbf{p}} - \mathbf{y}$, where $\mathbf{y}$ is the one-hot label. This is why softmax + cross-entropy is the default classification head in every neural-network library: the joint gradient is the prediction error, with no awkward sigmoid–log products to compute. Numerically, a fused log-softmax + negative-log-likelihood kernel avoids overflow when logits are large, and is what torch.nn.CrossEntropyLoss implements internally.
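To make the numerical point concrete, here is a minimal NumPy version of the stable log-softmax plus negative log-likelihood, together with the $\hat{\mathbf{p}} - \mathbf{y}$ gradient with respect to the logits. This is a sketch of the idea, not the fused kernel a deep-learning library would ship:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Categorical cross-entropy from raw logits via a stable log-softmax
    (subtracting the row maximum before exponentiating avoids overflow).
    logits: (n, K) array of scores W x_i; labels: (n,) integer class indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    return -log_probs[np.arange(n), labels].mean()

def softmax_cross_entropy_grad(logits, labels):
    """Gradient of the average loss with respect to the logits: (p_hat - one_hot(y)) / n."""
    p_hat = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_hat /= p_hat.sum(axis=1, keepdims=True)
    one_hot = np.eye(logits.shape[1])[labels]
    return (p_hat - one_hot) / logits.shape[0]
```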

Regularisation

Logistic regression overfits in the same way linear regression does, by allowing coefficients to grow large on noisy features, and is regularised in exactly the same two ways.

L2 (ridge) logistic regression adds $\lambda \|\mathbf{w}\|_2^2$ to the loss. This is the default in scikit-learn and most production fits. The penalty shrinks coefficients smoothly towards zero, reduces variance, and makes the optimisation strictly convex even when features are perfectly collinear. On small samples or with many correlated features, ridge is almost always a good idea; cross-validate $\lambda$.

L1 (lasso) logistic regression adds $\lambda \|\mathbf{w}\|_1$. The L1 ball has corners on the coordinate axes, so the optimum often lands on a corner and many coefficients become exactly zero. The result is a sparse model, useful in genetics, text classification, and any setting where most features are expected to be irrelevant. The optimisation is no longer differentiable at zero, so coordinate descent or proximal-gradient methods replace plain gradient descent.

Elastic net, $\lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$, blends the two and handles correlated feature groups more gracefully than pure lasso, which arbitrarily picks one feature from each group.

A practical note: the bias (intercept) term is usually not regularised; penalising the intercept biases the model towards the wrong base rate.
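In scikit-learn all three penalties are a constructor argument away. A minimal sketch, assuming a recent version in which the saga solver supports the L1 and elastic-net penalties; note that the C parameter is the inverse of $\lambda$, so smaller C means heavier shrinkage:

```python
from sklearn.linear_model import LogisticRegression

ridge = LogisticRegression(penalty="l2", C=1.0)                 # the default
lasso = LogisticRegression(penalty="l1", C=1.0, solver="saga")  # sparse coefficients
enet = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5, solver="saga")

# LogisticRegressionCV cross-validates C over a grid, e.g. LogisticRegressionCV(Cs=10, cv=5).
```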

Where logistic regression appears in modern AI

Logistic regression is older than most of the field, but it has aged better than almost any other algorithm. It is everywhere.

  • Output layer of every binary-classification neural network. When a deep model ends in a single sigmoid unit trained with binary cross-entropy, the final layer is a logistic regression on top of learned features. Medical-imaging classifiers, fraud detectors and abuse filters all terminate in this single equation.

  • Multi-class classification heads with softmax + categorical cross-entropy. ImageNet classifiers, language-model next-token predictions (a softmax over the vocabulary), and speech recognisers all end in softmax regression on learned representations. The "language model head" of GPT-style transformers is exactly this.

  • Calibration baselines. Raw classifier outputs, especially from boosted trees or deep nets, are often miscalibrated. Platt scaling fits a one-dimensional logistic regression on held-out scores to produce well-calibrated probabilities, used from radiology AI to advertising auctions.

  • Probing classifiers. To ask "does layer 7 of this transformer encode part-of-speech?", researchers freeze the network, extract activations, and fit a logistic regression on top. High accuracy with low capacity means the information is linearly decodable, the standard tool of mechanistic interpretability.

  • Production ML where interpretability matters. Credit scoring (where regulators require explainable models), insurance underwriting, hospital risk calculators (Wells score, PRECISE-DAPT), churn prediction, A/B-test covariate adjustment. The coefficients are odds ratios; the model is auditable; the output is a probability. For regulated domains this is decisive.

  • Causal inference. Logistic regression is the workhorse of propensity-score modelling, where a binary treatment indicator is regressed on covariates for matching or inverse-probability weighting.

Logistic regression is the simplest model that produces calibrated probabilities for a binary or categorical outcome. Whenever you need a probability and you can afford a linear decision boundary, or you have already built non-linear features through a neural network, logistic regression is the right final step.

What you should take away

  1. Logistic regression is a classifier, not a regression algorithm. The name is historical baggage. The model passes a linear combination through a sigmoid (or softmax for multi-class) to produce a probability, then thresholds at 0.5 to produce a class.

  2. The loss is binary cross-entropy, derived as the negative log-likelihood of a Bernoulli model. Categorical cross-entropy is its multi-class generalisation. The gradient with respect to the logits is the prediction error $(\hat p - y)$, which is why this loss pairs so cleanly with sigmoid and softmax in neural networks.

  3. There is no closed form, but the loss is convex. Gradient descent and Newton's method (IRLS) both converge to the unique global optimum. Convexity is what makes the model trustworthy in regulated settings: re-fitting on the same data gives the same answer.

  4. Regularise, almost always. Ridge by default, lasso when you want sparsity, elastic net when features are grouped. Do not penalise the intercept. Cross-validate the strength.

  5. It is the final layer of every modern classifier. Understanding logistic regression is understanding the output of every neural-network classification model, every Platt-scaled probability, every linear probe in interpretability research, and most of applied biostatistics. It is the smallest model that earns its keep in production, and the largest model's last word.
