9.1 From neuroscience to artificial neurons

This section explains where the idea of an "artificial neuron" came from and how the very first learning machines worked. We start in the 1940s, when neuroscientists and mathematicians began asking whether the activity of brain cells could be described in the same language as logic gates. By the end of the section you will understand a complete, working classifier, the perceptron, and you will have hand-traced its learning rule on a small numerical example. None of this requires linear algebra fluency; every symbol will be defined the moment it appears.

Modern deep networks are stacks of slightly more sophisticated artificial neurons, trained by a generalisation of the perceptron's learning rule. §9.2 shows why a single perceptron is not enough; the rest of the chapter shows how to fix the problem.

Symbols Used Here
  • $x_i$: the $i$-th input to a neuron (a real number, often 0 or 1 in early models)
  • $d$: the number of inputs to the neuron
  • $w_i$: the weight applied to input $x_i$ (a real number; negative means inhibitory)
  • $\theta$: the firing threshold (a real number)
  • $y$: the neuron's output (a single real number, often 0 or 1)
  • $\mathbb{1}[\cdot]$: indicator function: 1 if the bracketed condition is true, 0 otherwise
  • $\eta$: the learning rate, a small positive real number controlling step size
  • $\Delta w$: the change applied to a weight at a learning step
  • $\mathbf{w}^\top \mathbf{x}$: the dot product $\sum_i w_i x_i$

McCulloch and Pitts (1943): a neuron is a logic gate

Warren McCulloch was a neurophysiologist working at the University of Illinois. Walter Pitts was a self-taught logician, still in his teens, who had run away from home and was sleeping rough in Chicago when McCulloch took him in. In 1943, in the middle of the Second World War, the two of them wrote a paper called A Logical Calculus of the Ideas Immanent in Nervous Activity. Their question was simple but audacious: can the firing of brain cells be described by the same mathematics that describes "and", "or" and "not" in formal logic? If so, then thinking, at least in principle, is a kind of computation, and a network of neurons is at least as powerful as a digital computer.

To make the question precise they invented an idealised neuron, now called a threshold logic unit. A threshold logic unit takes some number of inputs, multiplies each one by a fixed weight, adds the results together, and then asks: is the total at least as big as a threshold? If yes, the unit outputs a 1, meaning "fired". If not, it outputs a 0, meaning "silent".

Let us write that down very carefully. Suppose the unit has $d$ inputs, where $d$ is just a positive whole number, for example, $d = 2$ or $d = 3$. Each input is a number called $x_i$, where the subscript $i$ runs from 1 up to $d$, so the inputs are $x_1, x_2, \ldots, x_d$. In the original McCulloch–Pitts model each input is either 0 or 1. Each input has its own weight $w_i$, also a number. Negative weights represent inhibitory connections (an active input reduces the chance of firing); positive weights are excitatory (an active input increases it). There is one more number, the threshold $\theta$ (the Greek letter theta), which sets how much total excitation the unit needs in order to fire.

The unit's output, called $y$, is then defined by the equation

$$y = \mathbb{1}\!\left[\sum_{i=1}^{d} w_i x_i \ge \theta\right] .$$

Read that line by line. The expression $\sum_{i=1}^{d} w_i x_i$ means "add up $w_i x_i$ for every $i$ from 1 to $d$", so for $d = 2$ it is just $w_1 x_1 + w_2 x_2$. The square brackets $\mathbb{1}[\cdots]$ mean: "look inside the brackets; if what is inside is true, this whole expression equals 1, otherwise it equals 0". The condition inside is "is the weighted sum at least as big as the threshold?". So the equation says, in plain English: "compute the weighted sum; output 1 if it reaches the threshold, 0 otherwise".

Worked example, AND. We want a unit that outputs 1 when both inputs are 1, and 0 in every other case. Use $d = 2$, $w_1 = 0.5$, $w_2 = 0.5$, $\theta = 0.6$. Try the four possible input pairs.

  • Inputs $(x_1, x_2) = (0, 0)$. Weighted sum $= 0.5 \cdot 0 + 0.5 \cdot 0 = 0$. Is $0 \ge 0.6$? No. So $y = 0$.
  • Inputs $(0, 1)$. Sum $= 0.5 \cdot 0 + 0.5 \cdot 1 = 0.5$. Is $0.5 \ge 0.6$? No. $y = 0$.
  • Inputs $(1, 0)$. Sum $= 0.5$. Same as above. $y = 0$.
  • Inputs $(1, 1)$. Sum $= 0.5 \cdot 1 + 0.5 \cdot 1 = 1.0$. Is $1.0 \ge 0.6$? Yes. $y = 1$.

That is the truth table for AND, computed by a single threshold logic unit.

Worked example, OR. Use the same $d = 2$ but with $w_1 = 1$, $w_2 = 1$, $\theta = 0.5$. Now $(0,0)$ gives sum 0, output 0. $(0,1)$ and $(1,0)$ both give sum 1, which is at least 0.5, output 1. $(1,1)$ gives sum 2, output 1. That is OR.

Worked example, NOT. We want a unit that outputs 1 when its single input $x_1$ is 0, and 0 when $x_1$ is 1. Use $d = 1$ with $w_1 = -1$ and $\theta = 0$. If $x_1 = 0$, the sum is 0, and $0 \ge 0$ is true, so $y = 1$. If $x_1 = 1$, the sum is $-1$, and $-1 \ge 0$ is false, so $y = 0$. The negative weight is essential: it lets a present input suppress firing.
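All three gates can be checked mechanically. The sketch below (the function name `tlu` is ours, not from the original paper) implements the threshold-unit equation above and verifies the three truth tables:

```python
def tlu(x, w, theta):
    """McCulloch-Pitts threshold logic unit: 1 if the weighted sum reaches theta."""
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total >= theta else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: w = (0.5, 0.5), theta = 0.6 -- fires only when both inputs are 1
assert [tlu(x, (0.5, 0.5), 0.6) for x in inputs] == [0, 0, 0, 1]

# OR: w = (1, 1), theta = 0.5 -- fires unless both inputs are silent
assert [tlu(x, (1, 1), 0.5) for x in inputs] == [0, 1, 1, 1]

# NOT: w = (-1,), theta = 0 -- the inhibitory weight suppresses firing
assert [tlu((x1,), (-1,), 0) for x1 in (0, 1)] == [1, 0]
```

The same function, with different weights and thresholds, computes all three gates; only the numbers change, never the mechanism.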

McCulloch and Pitts then proved a striking result. By wiring such units together, feeding the output of one unit into the input of another, you can compute any function of binary inputs that propositional logic can express. Since AND, OR and NOT together are enough to build every Boolean function (for instance via the standard disjunctive normal form construction), and since a digital computer at the hardware level is just a vast tangle of Boolean logic, a network of threshold logic units is at least as powerful as a digital computer. Their formal proof was phrased in the language of recursive function theory and finite automata.

This was a remarkable historical bridge. It tied the wet biology of neurons, on one side, to the dry mathematics of logic, on the other, and it suggested that "thought" and "computation" might be two names for the same thing. The cybernetics movement, computational neuroscience, and ultimately deep learning all trace their lineage to this paper. Two caveats deserve emphasis. First, real neurons do not actually behave like threshold logic units: they fire trains of electrical spikes, are influenced by neuromodulators, and do considerable computation in their dendrites, so the model is a caricature, not a description. Second, the McCulloch–Pitts unit has no learning rule: weights and thresholds are fixed by hand. The model proves that neurons can in principle represent any Boolean function; it does not say how a brain might find the right weights from experience. That gap is what Hebb addressed next.

Hebb (1949): "neurons that fire together, wire together"

Donald Hebb, a Canadian psychologist, published The Organization of Behavior in 1949. The book proposes a hypothesis about how the brain learns. In Hebb's own careful prose: "When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." This is now compressed into the slogan "neurons that fire together, wire together". The deep idea is that learning is local: each connection between two cells changes by an amount that depends only on what those two cells are doing right now, with no global supervisor telling them what the right answer is.

To turn this into mathematics we need names for "what cell A is doing", "what cell B is doing", and "the strength of the connection between them". Let $x_i$ be the activity of the input cell (a number; for binary cells, 1 if firing and 0 if silent). Let $y_j$ be the activity of the output cell, defined the same way. Let $w_{ij}$ be the weight (the connection strength) from input $i$ to output $j$. Hebb's rule says: change the weight by an amount proportional to the product of the two activities. Writing $\Delta w_{ij}$ for the change applied to the weight in one learning step, and $\eta$ (the Greek letter eta) for a small positive number called the learning rate that controls how big the steps are,

$$\Delta w_{ij} = \eta \, x_i \, y_j .$$

In words: if input $i$ and output $j$ are both active at the same time, increase $w_{ij}$ a little; if either is silent, do nothing.

Worked example. Suppose $\eta = 0.1$, the input cell is active so $x_i = 1$, and the output cell is also active so $y_j = 1$. Then $\Delta w_{ij} = 0.1 \cdot 1 \cdot 1 = 0.1$, and the connection strengthens by 0.1 on this presentation. If we now present a case where the input is silent, $x_i = 0$, then $\Delta w_{ij} = 0.1 \cdot 0 \cdot 1 = 0$, and the weight is unchanged. Repeated co-firing slowly grows the weight; absence of co-firing leaves it alone.
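The worked numbers above can be reproduced in a few lines (a minimal sketch; the function name is ours):

```python
def hebb_update(w, x, y, eta=0.1):
    """Hebb's rule: increase the weight by eta * x * y (zero if either cell is silent)."""
    return w + eta * x * y

w = 0.0
w = hebb_update(w, x=1, y=1)  # both cells active: w grows by 0.1
w = hebb_update(w, x=0, y=1)  # input silent: no change
w = hebb_update(w, x=1, y=1)  # co-firing again: w grows to 0.2
print(w)  # 0.2
```

Note what is absent: the rule never asks whether $y$ was the right answer, so repeated co-firing strengthens the connection whether or not the association is correct.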

Hebb's contribution was the first concrete learning rule in the artificial-neuron tradition. It inspired later models such as Hopfield networks, Boltzmann machines and self-organising maps. But it has a serious limitation: there is no notion of an error signal. The rule does not know whether the output was correct; it only notices that two cells were active together. If $y_j$ is wrong, Hebb's rule will happily strengthen the wrong association. Variants such as Oja's rule add normalisation to keep weights bounded, and spike-timing-dependent plasticity in computational neuroscience refines the timing details. Hinton's forward–forward algorithm (2022) is a modern attempt to design a biologically plausible learning procedure in the same spirit. None of these has displaced backpropagation, which we shall develop later in the chapter, but they remind us that learning need not require the global, non-local error signal that backprop relies on.

Rosenblatt's perceptron (1958)

Frank Rosenblatt was a psychologist at Cornell. In 1958 he combined McCulloch and Pitts' threshold unit with Hebb's idea of an adjustable connection, and added the missing ingredient: an error-driven learning rule. He called the result the perceptron. His Mark I Perceptron, funded by the US Office of Naval Research, was an analogue computer the size of a wardrobe. It had 400 photocells arranged as a 20-by-20 retina; each cell was wired through an adjustable potentiometer (a knob whose resistance set the weight) to a single output neuron. The knobs were turned by small electric motors driven by the perceptron learning rule. The machine could learn to classify simple shapes, an early physical instance of supervised learning, arriving two years after the term "artificial intelligence" was coined at the 1956 Dartmouth workshop.

Mathematically, the perceptron is a small upgrade of the threshold unit. The inputs are now real numbers, not just 0 or 1, and they are gathered into a single object called a vector. A vector in this context is just an ordered list of $d$ numbers; we write it in bold as $\mathbf{x} = (x_1, x_2, \ldots, x_d)$. The weights are gathered into another vector $\mathbf{w} = (w_1, w_2, \ldots, w_d)$. There is also a single number $b$ called the bias, which plays the role of a threshold but written on the other side of the equation (so that "$\ge \theta$" becomes "$+ b \ge 0$"; the bias is essentially $-\theta$).

The perceptron computes the dot product of the weight and input vectors. The dot product is the sum of products of corresponding entries:

$$\mathbf{w}^\top \mathbf{x} = \sum_{i=1}^{d} w_i x_i = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d .$$

This is the same weighted sum that appeared in the McCulloch–Pitts model. The notation $\mathbf{w}^\top \mathbf{x}$ (read "w-transpose x") is just a compact way of writing it. The perceptron's predicted output, written $\hat{y}$ (read "y-hat", with the hat indicating a predicted value), is then

$$\hat{y} = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b) ,$$

where the function $\mathrm{sign}(z)$ equals $+1$ if $z \ge 0$ and $-1$ otherwise. The perceptron labels every input as either $+1$ or $-1$.

The geometric picture is clean. The set of all input vectors $\mathbf{x}$ for which the score $\mathbf{w}^\top \mathbf{x} + b$ is exactly zero forms a flat surface called a hyperplane. In two dimensions a hyperplane is a straight line; in three dimensions a flat plane; in $d$ dimensions a flat $(d-1)$-dimensional surface. The weight vector $\mathbf{w}$ points in the direction perpendicular to this hyperplane (it is normal to it). Points lying on the side of the hyperplane that $\mathbf{w}$ points towards get label $+1$; points on the other side get label $-1$. The bias $b$ shifts the hyperplane towards or away from the origin. So a perceptron is a linear classifier: it carves the input space into two halves with a single straight cut, and labels the two halves $+1$ and $-1$. This is what we mean when we call a dataset linearly separable: the two classes can be cleanly cut apart by some such hyperplane.

The perceptron learning rule is short. We loop through the training examples one at a time. For each example $(\mathbf{x}_i, y_i)$, where $y_i \in \{-1, +1\}$ is the true label, we compute the perceptron's prediction $\hat{y}_i$. If the prediction is correct, we do nothing. If it is wrong, we update the weights and bias:

$$\mathbf{w} \leftarrow \mathbf{w} + \eta \, y_i \, \mathbf{x}_i , \qquad b \leftarrow b + \eta \, y_i .$$

The arrow $\leftarrow$ means "replace the left-hand side by the right-hand side". The learning rate $\eta$ is again a small positive number. Because the update only fires on mistakes and is always in the direction $y_i \mathbf{x}_i$, it nudges the weight vector towards the misclassified example whenever the true label is $+1$, and away from it whenever the true label is $-1$. Geometrically, the hyperplane rotates a little so that, next time, this example is more likely to fall on the correct side.

A full worked update. Suppose $d = 2$, weights start at $\mathbf{w} = (0, 0)$, bias $b = 0$, and learning rate $\eta = 1$. The first training example is $\mathbf{x}_1 = (2, 1)$ with true label $y_1 = +1$. Compute the score: $\mathbf{w}^\top \mathbf{x}_1 + b = 0 \cdot 2 + 0 \cdot 1 + 0 = 0$. The sign of 0 is $+1$ by convention, so $\hat{y}_1 = +1$: correct, no update. Now suppose instead that the same example $\mathbf{x}_1 = (2, 1)$ carries the true label $y_1 = -1$. The score is still 0 and the prediction is still $+1$, but the true label is now $-1$, so this is a mistake. Update:

$$\mathbf{w} \leftarrow (0, 0) + 1 \cdot (-1) \cdot (2, 1) = (-2, -1), \qquad b \leftarrow 0 + 1 \cdot (-1) = -1 .$$

Re-check: with the new weights, $\mathbf{w}^\top \mathbf{x}_1 + b = (-2) \cdot 2 + (-1) \cdot 1 + (-1) = -4 - 1 - 1 = -6$, which is negative, so $\hat{y}_1 = -1$. The example is now correctly classified. The hyperplane has rotated so that $(2, 1)$ falls on the negative side, exactly as the label demanded.
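The same mistaken-example update can be traced in code. This sketch (the helper name `predict` is ours) uses the "sign of 0 is $+1$" convention from the text:

```python
def predict(w, b, x):
    """Perceptron prediction: sign of the score, with sign(0) = +1 by convention."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b, eta = [0.0, 0.0], 0.0, 1.0
x1, y1 = (2, 1), -1                  # the mistaken example from the worked update

if predict(w, b, x1) != y1:          # prediction +1, label -1: a mistake
    w = [wi + eta * y1 * xi for wi, xi in zip(w, x1)]
    b = b + eta * y1

print(w, b)                          # [-2.0, -1.0] -1.0
print(predict(w, b, x1))             # -1: now correctly classified
```

The printed weights match the hand calculation, and the re-check confirms the example has moved to the correct side of the hyperplane.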

The headline result about the perceptron is Novikoff's perceptron convergence theorem (1962). In plain English: if there exists any hyperplane that perfectly separates your two classes, that is, the data is linearly separable, and if the data points all live within some bounded region, then the perceptron, run long enough on the training set, is guaranteed to find a separating hyperplane in a finite number of mistakes. Specifically, if $R$ is the radius of a ball that contains all the training points and $\gamma > 0$ is the margin of the best separating hyperplane (how much room there is between the closest points and the boundary), then the perceptron makes at most $(R/\gamma)^2$ mistakes before it stops making any. The proof, which we shall not work through here, uses two short bounds, one growing the alignment of $\mathbf{w}$ with the optimal separator, the other bounding the squared length of $\mathbf{w}$. Two consequences worth remembering. First, the bound depends only on the geometry of the data ($R$ and $\gamma$), not on the dimension $d$ of the input space; the perceptron is therefore well-behaved in very high-dimensional spaces. Second, the bound is silent about non-separable data: if no hyperplane separates the classes, the perceptron never converges and oscillates forever. That failure mode will become important in the next section.
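Novikoff's guarantee can also be watched happening. The sketch below runs the full perceptron loop on a tiny linearly separable dataset (AND with labels recoded to $\pm 1$ — our choice of toy data, not from the text) and stops as soon as one whole pass makes no mistakes:

```python
def train_perceptron(data, eta=1.0, max_epochs=100):
    """Cycle through (x, y) pairs; update only on mistakes; stop at a clean pass."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for epoch in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1 if score >= 0 else -1      # sign(0) = +1 convention
            if y_hat != y:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:                        # a separating hyperplane was found
            return w, b, epoch
    return w, b, max_epochs                      # non-separable data never gets here

# AND with labels in {-1, +1}: only (1, 1) is positive -- linearly separable
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
w, b, epochs = train_perceptron(data)
print(w, b, epochs)
```

On separable data like this the loop terminates after a handful of passes, as the theorem promises. Replacing the labels with XOR's (making $(0,1)$ and $(1,0)$ positive instead) would leave the loop cycling until `max_epochs`, which is exactly the non-separable failure mode discussed above.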

Why is this called a "neural" network?

The artificial neuron has biological roots, but its mathematical character matters far more than its biological inspiration. The name is partly historical, McCulloch was a neurophysiologist, Hebb a psychologist, Rosenblatt a psychologist again, and partly a metaphor that has stuck.

Real cortical neurons are wet, slow, noisy and stochastic. They communicate in trains of electrical spikes, not single real numbers. Their behaviour is shaped by neuromodulators (dopamine, serotonin, acetylcholine and others) that change firing properties on time scales of seconds to hours. Each dendritic tree performs its own non-trivial computation before any signal even reaches the cell body. Synapses are influenced by the precise relative timing of pre- and post-synaptic spikes, not by simple co-activation. Connectivity is sparse and largely local. Energy is at a premium: the human brain runs on roughly 20 watts.

Artificial neurons are dry, fast, deterministic and differentiable. They communicate in floating-point numbers. They have no dendrites, no neuromodulators, no spikes, no timing. They are connected densely, every unit in one layer feeding every unit in the next. A modern large language model contains hundreds of billions of weights and consumes megawatts of power for days or weeks during training. The biological analogy provided the original motivation, many inputs, weighted summation, a non-linear response, a single output, and that motif turns out to be effective. But the engineering of modern deep learning is shaped by the demands of optimisation on GPUs, not by the biology of the cortex. Treat the word "neural" as a tribute to its origins and a useful organising metaphor, not as a claim that the systems we shall build are good models of brains.

What you should take away

  1. The McCulloch–Pitts threshold logic unit (1943) showed that a network of brain-like cells can in principle compute any Boolean function. It set the stage but had no learning rule.

  2. Hebb's rule (1949) gave the first concrete, local learning rule: strengthen a connection between two cells when they are active together. It introduced learning but lacked an error signal.

  3. Rosenblatt's perceptron (1958) is a linear classifier whose weight vector defines a separating hyperplane. Its error-driven learning rule provably converges in finitely many mistakes whenever the data is linearly separable (Novikoff's theorem).

  4. The perceptron cannot represent every Boolean function, most famously, it cannot represent XOR. Section 9.2 will dwell on this and the crisis it provoked.

  5. Biological inspiration motivated the form of artificial neurons but the mathematics can be derived independently of any neuroscience. The perceptron is the simplest member of the family that includes every modern deep network; everything else in this chapter generalises it.
