The sigmoid function, also known as the logistic function, maps any real number to the open interval $(0, 1)$:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}.$$
Its S-shaped graph is symmetric about the point $\bigl(0, \tfrac{1}{2}\bigr)$, satisfies the reflection identity $\sigma(-x) = 1 - \sigma(x)$, has horizontal asymptotes at 0 and 1, and possesses the clean derivative
$$\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \in \bigl(0, \tfrac{1}{4}\bigr],$$
with the maximum $\tfrac{1}{4}$ attained at $x = 0$.
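As a concrete illustration, here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are illustrative, not from any particular library); the piecewise form uses the two equivalent expressions above so that `exp` never overflows:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid: 1/(1+e^-x) for x >= 0,
    e^x/(1+e^x) for x < 0, so exp never sees a large argument."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_grad(y):
    """Derivative from the activation itself: sigma'(x) = y * (1 - y)."""
    return y * (1.0 - y)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = sigmoid(x)
print(y)                 # approx [4.5e-05, 0.269, 0.5, 0.731, 1.0]
print(sigmoid_grad(y))   # peaks at 0.25 for x = 0
```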
History
The logistic curve was introduced by the Belgian mathematician Pierre-François Verhulst in 1838 as a model of population growth under resource limits, refining Malthusian exponential growth with a carrying-capacity term. It re-entered statistics through Berkson (1944) as the basis of logistic regression, and entered neural-network practice via smooth variants of Rosenblatt's perceptron (1958), the backpropagation-trained multilayer perceptron (Rumelhart, Hinton & Williams, 1986) and the original LSTM gates (Hochreiter & Schmidhuber, 1997).
Properties for neural networks
The sigmoid was the dominant hidden-layer activation for nearly four decades because it is bounded, monotone, smooth and differentiable everywhere, and because once $y = \sigma(x)$ is known the derivative costs a single multiply, $\sigma'(x) = y(1 - y)$. It has two practical drawbacks for deep networks:
- Vanishing gradients. For $|x| \gtrsim 5$, $\sigma'(x) \approx 0$, so gradients shrink rapidly through stacked sigmoid layers; since $\sigma' \le \tfrac{1}{4}$, the product of activation derivatives through $L$ sigmoid layers is at most $4^{-L}$ (see the sketch after this list).
- Non-zero-centred outputs. All outputs are positive, biasing the gradients on the next layer's weights to share signs and producing zigzag optimisation paths.
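To make the attenuation concrete, here is a toy sketch: a depth-$L$ chain of sigmoids with unit weights and no biases (an assumption for illustration; real networks also multiply in weight terms). The backpropagated gradient factor is the product of the per-layer derivatives and stays below the $4^{-L}$ bound:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy chain y = sigma(sigma(...sigma(x))) with unit weights: the
# backpropagated gradient is the product of the local derivatives.
x = 0.0       # best case, since sigma'(0) = 0.25 is the maximum
grad = 1.0
a = x
for L in range(1, 11):
    a = sigmoid(a)
    grad *= a * (1.0 - a)   # sigma'(z) evaluated as y * (1 - y)
    print(f"L={L:2d}  gradient factor = {grad:.3e}  bound 4^-L = {4.0**-L:.3e}")
```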
These problems were among the central motivations for the shift to ReLU (Glorot, Bordes & Bengio, 2011) and its successors (Leaky ReLU, GELU, Swish) in modern deep architectures.
Where the sigmoid still lives
In modern deep networks the sigmoid has been largely displaced by ReLU and its variants in hidden layers. It remains the standard choice for:
- The output layer of binary classifiers, converting a logit $z$ into a probability estimate $p = \sigma(z)$.
- LSTM and GRU gates (LSTM's input, forget and output gates; GRU's update and reset gates), where bounded outputs in $(0,1)$ are needed to act as multiplicative masks on the cell or hidden state.
- Attention masking and mixture-of-experts gating in some Transformer variants.
- Multi-label classification, where each label gets an independent sigmoid, in contrast to softmax for mutually exclusive classes (see the sketch after this list).
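A minimal sketch of the multi-label case (synthetic logits, illustrative only): each label gets its own sigmoid, so the per-label probabilities need not sum to one, whereas softmax couples the scores into a single distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])   # one logit per label

# Multi-label: independent Bernoulli probability per label.
print(sigmoid(logits))   # [0.881, 0.269, 0.622] -- need not sum to 1
# Mutually exclusive classes: one distribution over all classes.
print(softmax(logits))   # [0.786, 0.039, 0.175] -- sums to 1
```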
The logit and logistic regression
The sigmoid is the inverse of the logit function
$$\mathrm{logit}(p) = \log\frac{p}{1-p},$$
so $\sigma(\mathrm{logit}(p)) = p$. Logistic regression fits
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top} \mathbf{x} + b)$$
by maximising the Bernoulli log-likelihood, equivalent to minimising binary cross-entropy. Logistic regression remains a workhorse of clinical risk modelling, credit scoring, epidemiology and A/B-testing analysis, partly because the coefficients $w_j$ have a clean odds-ratio interpretation: a unit increase in $x_j$ multiplies the odds by $e^{w_j}$.
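A small sketch (synthetic numbers, illustrative names) of the logit/sigmoid inverse pair and the odds-ratio reading of a coefficient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

p = 0.3
assert np.isclose(sigmoid(logit(p)), p)   # sigma(logit(p)) = p

# Odds-ratio reading: a unit step in x_j multiplies the
# odds p/(1-p) by e^{w_j}.
w_j, z = 0.7, -0.2               # illustrative coefficient and base logit
odds_before = np.exp(z)          # odds = e^{logit}
odds_after = np.exp(z + w_j)     # after a one-unit increase in x_j
print(odds_after / odds_before)  # e^{0.7} ~ 2.01
```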
The softmax is the multinomial generalisation of the sigmoid; for two classes the two formulations coincide up to parameter identifiability.
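Concretely, for two classes with logits $z_1$ and $z_2$,
$$\mathrm{softmax}(z_1, z_2)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2),$$
so only the difference $z_1 - z_2$ is identifiable, which is the sense in which the two formulations coincide.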
Related terms: Softmax, Logistic Regression, ReLU, LSTM, Cross-Entropy Loss, Activation Function
Discussed in:
- Chapter 6: ML Fundamentals, Activation Functions