The Perceptron, introduced by Frank Rosenblatt in 1958, is the simplest possible neural network: a single computational unit that takes an input vector $\mathbf{x}$, computes a weighted sum $z = \mathbf{w}^T \mathbf{x} + b$, and applies a step function to produce a binary output. Geometrically, the perceptron implements a hyperplane decision boundary in feature space, classifying points on one side as positive and the other as negative, it is a linear classifier.
The perceptron learning rule updates weights iteratively: for each misclassified example, $\mathbf{w} \leftarrow \mathbf{w} + \eta(y_i - \hat{y}_i)\mathbf{x}_i$. The perceptron convergence theorem guarantees that if the training data is linearly separable, the algorithm finds a separating hyperplane in finitely many steps. The bound depends on the margin, the distance from the nearest point to the boundary, foreshadowing the margin-based reasoning of support vector machines.
The perceptron generated enormous excitement, but Minsky and Papert's 1969 book Perceptrons showed rigorously that a single-layer perceptron cannot compute the XOR function or any non-linearly-separable function. This result cast a long shadow, contributing to the first AI winter, even though multilayer networks could compute XOR, the problem was that no one yet knew how to train them efficiently. That problem was eventually solved by backpropagation, and the perceptron's conceptual descendants, modern neural networks with nonlinear activations and many layers, dominate contemporary AI.
Mathematics
The perceptron computes $\hat y = \mathrm{sign}(w^\top x + b) \in \{-1, +1\}$ for input $x \in \mathbb{R}^d$ with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. The perceptron learning rule updates only on misclassifications: for a training example $(x, y)$ with $y \in \{-1, +1\}$, if $y \cdot (w^\top x + b) \leq 0$ then
$$w \leftarrow w + \eta \, y \, x, \qquad b \leftarrow b + \eta \, y$$
with learning rate $\eta > 0$.
Convergence theorem (Rosenblatt 1962, Novikoff 1962): if the data is linearly separable with margin $\gamma > 0$, there exists a unit-norm $w^*$ with $y_n (w^* \cdot x_n) \geq \gamma$ for all $n$, and $\|x_n\| \leq R$ for all $n$, then the perceptron learning rule converges in at most $(R/\gamma)^2$ updates, regardless of dimension.
The multilayer perceptron (MLP) stacks linear layers with non-linear activations:
$$h^{(1)} = \phi(W^{(1)} x + b^{(1)}), \quad h^{(2)} = \phi(W^{(2)} h^{(1)} + b^{(2)}), \quad \ldots$$
A two-layer MLP with sufficiently many hidden units is a universal function approximator (Cybenko 1989, Hornik 1991): any continuous function on a compact domain can be approximated to arbitrary accuracy. The bound is non-constructive , it does not say how many units or how to find the weights, but it justifies the modern reliance on stacked linear-plus-nonlinearity blocks.
Interactive
Video
Related terms: Neural Network, Activation Function
Discussed in:
- Chapter 9: Neural Networks, The Perceptron