3.3 Derivatives in one variable
A derivative tells you how fast something is changing. If $f(x)$ is some quantity that depends on $x$, perhaps the loss of a neural network as you nudge a single weight, or the height of a hill as you walk along a path, then the derivative $f'(x)$ measures the instantaneous rate of change of $f$ at the point $x$. Geometrically, it is the slope of the tangent line: the straight line that best matches the graph of $f$ at the point $(x, f(x))$. The flatter the graph, the smaller the derivative; the steeper the graph, the larger the derivative in magnitude. A negative derivative simply means the graph is sloping downwards as $x$ increases.
In machine learning, the derivative of the loss with respect to a single parameter answers the most practical question we ever ask of a model during training: if I make this parameter slightly larger, does the loss go up or down, and by how much? That single number is enough to take a step in the right direction. Repeated billions of times across billions of parameters, this is gradient descent, the entire engine of modern AI. Everything in Chapter 9 on optimisation, every line of every backpropagation pass in §3.7, and every gradient identity in §3.12 is ultimately built on the one-dimensional derivative we develop here. Master this and the rest is, mostly, careful bookkeeping.
The single-variable case is where the ideas first appear in their cleanest form. §3.4 generalises everything here to functions of many variables, where the derivative becomes a vector, the gradient.
Definition
The derivative of $f$ at the point $x$ is the limit
$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}.$$
Read this aloud: take the change in $f$ caused by a small change in $x$, divide by the size of that change in $x$, and see what number this ratio approaches as the change shrinks to zero. The fraction $[f(x + \Delta x) - f(x)] / \Delta x$ is the slope of the secant line through the two points $(x, f(x))$ and $(x + \Delta x, f(x + \Delta x))$. As $\Delta x$ shrinks, the second point slides along the curve towards the first, and the secant line rotates into the tangent line. The limiting slope is $f'(x)$.
There is an equivalent way of saying the same thing that is more useful in practice. For small $\Delta x$,
$$f(x + \Delta x) \approx f(x) + f'(x) \, \Delta x.$$
This says: near $x$, the function $f$ is well approximated by a straight line of slope $f'(x)$. The approximation gets better and better as $\Delta x$ shrinks; the error is of order $(\Delta x)^2$. This best linear approximation view of the derivative is the one that generalises cleanly to many variables (where the line becomes a tangent plane) and to vector-valued functions (where the slope becomes a Jacobian matrix). Whenever you see a Taylor series, a Newton step, or a backpropagation update, this is the picture lurking underneath.
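If you want to see that quadratic error concretely, the short script below is a minimal sketch of ours in Python (the choice of $f(x) = e^x$, the point $x = 1$, and the step sizes are illustrative, not from the text). It compares $f(x + \Delta x)$ against the linear approximation and watches the error shrink like $(\Delta x)^2$:

```python
import math

# A minimal numerical check of the linear approximation
#   f(x + dx) ~ f(x) + f'(x) * dx
# using f(x) = exp(x) at x = 1.0, where the exact derivative is also exp(1.0).
x = 1.0
exact_derivative = math.exp(x)

for dx in (1e-1, 1e-2, 1e-3, 1e-4):
    error = math.exp(x + dx) - (math.exp(x) + exact_derivative * dx)
    print(f"dx = {dx:.0e}   error = {error:.3e}   error / dx^2 = {error / dx**2:.4f}")

# The error shrinks like dx^2: the last column settles near exp(1)/2, about 1.359,
# the coefficient of the quadratic term in the Taylor expansion developed later
# in this section.
```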
Worked example. Take $f(x) = x^2$ and apply the definition directly. Compute the change in $f$:
$$f(x + \Delta x) - f(x) = (x + \Delta x)^2 - x^2 = 2x\,\Delta x + (\Delta x)^2.$$
Divide by $\Delta x$:
$$\frac{f(x + \Delta x) - f(x)}{\Delta x} = 2x + \Delta x.$$
Now take the limit as $\Delta x \to 0$. The first term, $2x$, does not depend on $\Delta x$ at all. The second term vanishes. So $f'(x) = 2x$. At $x = 3$ the slope is $6$; at $x = 0$ the slope is $0$ (the parabola is momentarily flat at the origin); at $x = -2$ the slope is $-4$ (the parabola is heading downwards).
This is the prototype for everything that follows. Every standard derivative below could in principle be derived this way; in practice we derive a few from first principles and then use the resulting rules to assemble the rest. The same pattern recurs in every science: define an operation precisely once, work a small canonical example to make sure the definition behaves the way intuition demands, and from then on rely on a short library of derived rules. By the time you reach Chapter 9 you will be writing chain-rule derivations for whole neural networks without ever returning to the limit definition, but the limit is what it ultimately means, and a clear mental picture of secant lines collapsing onto a tangent will repeatedly rescue you when an unfamiliar formula needs a sanity check.
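That sanity check is easy to automate. The snippet below is a small Python sketch of ours (not part of the worked example) that watches the secant slopes of $f(x) = x^2$ at $x = 3$ collapse onto the tangent slope $6$:

```python
# Secant slopes of f(x) = x**2 at x = 3. The worked example above predicts that
# each slope equals 2x + dx, so they should close in on f'(3) = 6 as dx shrinks.
def f(x):
    return x ** 2

x = 3.0
for dx in (1.0, 0.1, 0.01, 0.001):
    secant_slope = (f(x + dx) - f(x)) / dx
    print(f"dx = {dx:<6} secant slope = {secant_slope:.6f}")

# Prints 7.000000, 6.100000, 6.010000, 6.001000: exactly 2x + dx, closing in on 6.
```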
Standard derivatives
The following list gives the derivatives you will see most often. Each one has a one-line justification; the proofs use either the definition above or the rules in the next subsection. A numerical spot-check of the whole list appears after it.
- Constant. $\frac{d}{dx} c = 0$. A constant function does not change, so its rate of change is zero.
- Power rule. $\frac{d}{dx} x^n = n x^{n-1}$ for any real $n$ (with $x > 0$ when $n$ is not an integer). For positive integers this follows from the binomial expansion; for general real $n$ one uses $x^n = e^{n \ln x}$ together with the chain rule.
- Exponential. $\frac{d}{dx} e^x = e^x$. The number $e$ is defined as the unique base for which the exponential function is its own derivative.
- Logarithm. $\frac{d}{dx} \ln x = 1/x$ for $x > 0$. This is the inverse of the exponential identity: $\ln$ undoes $e^x$, and the derivative of an inverse function is the reciprocal of the original derivative evaluated at the inverse point.
- Sine and cosine. $\frac{d}{dx} \sin x = \cos x$ and $\frac{d}{dx} \cos x = -\sin x$. These follow from the angle-addition formulae and the small-angle limits $\sin(\Delta x)/\Delta x \to 1$ and $(1 - \cos \Delta x)/\Delta x \to 0$.
- Sigmoid. $\frac{d}{dx} \sigma(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$, where $\sigma(x) = 1/(1 + e^{-x})$. Derivation below.
- Hyperbolic tangent. $\frac{d}{dx} \tanh x = 1 - \tanh^2 x$. The proof uses the quotient rule on $(e^x - e^{-x})/(e^x + e^{-x})$.
- ReLU. $\frac{d}{dx} \max(0, x) = 1$ if $x > 0$, $0$ if $x < 0$, and undefined at $x = 0$. In practice frameworks return $0$ at the kink because it is convenient and almost never matters.
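Any entry in this list can be spot-checked numerically. The sketch below is our own Python (the helper name `sigmoid`, the test point $x = 0.7$, and the step $h$ are illustrative choices); it compares each claimed derivative against the central difference $\bigl[f(x + h) - f(x - h)\bigr]/2h$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Each entry pairs a function with its claimed derivative from the list above.
checks = {
    "x**3":    (lambda x: x ** 3,      lambda x: 3 * x ** 2),
    "exp":     (math.exp,              math.exp),
    "log":     (math.log,              lambda x: 1.0 / x),
    "sin":     (math.sin,              math.cos),
    "sigmoid": (sigmoid,               lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "tanh":    (math.tanh,             lambda x: 1 - math.tanh(x) ** 2),
    "relu":    (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
}

h, x = 1e-6, 0.7   # a point away from the ReLU kink and the edge of log's domain
for name, (f, df) in checks.items():
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # central difference
    print(f"{name:8s} claimed {df(x):+.6f}   numeric {numeric:+.6f}")
```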
Worked example: the sigmoid derivative. The sigmoid $\sigma(x) = 1/(1 + e^{-x})$ is the canonical squashing non-linearity. Its derivative is so cheap that it is worth deriving once and remembering. Using the quotient rule with numerator $1$ and denominator $1 + e^{-x}$,
$$\sigma'(x) = \frac{0 \cdot (1 + e^{-x}) - 1 \cdot (-e^{-x})}{(1 + e^{-x})^2} = \frac{e^{-x}}{(1 + e^{-x})^2}.$$
Now split this into two factors:
$$\sigma'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x) \cdot \bigl(1 - \sigma(x)\bigr),$$
since $1 - \sigma(x) = e^{-x}/(1 + e^{-x})$. The derivative of $\sigma$ at $x$ is computable from $\sigma(x)$ alone: there is no need to recompute $e^{-x}$ during the backward pass, because the activation already carries everything needed. Numerically, at $x = 0$ the sigmoid is $\sigma(0) = 0.5$, so $\sigma'(0) = 0.5 \cdot (1 - 0.5) = 0.25$. This is the largest the derivative ever gets. Even where the sigmoid is steepest, gradient flow through it is divided by $4$; away from that point it shrinks rapidly towards zero. This is the origin of the vanishing-gradient problem in deep sigmoid networks, and a large reason ReLU and its descendants displaced sigmoid in hidden layers.
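To make the reuse-the-activation point and the vanishing gradient concrete, here is a small Python sketch of ours (not any framework's autograd) that stores $\sigma(x)$ once, as a forward pass would, and reads the local gradient off it:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (0.0, 2.0, 5.0, 10.0):
    s = sigmoid(x)           # stored during the forward pass
    grad = s * (1.0 - s)     # backward pass needs only s, never exp(-x) again
    print(f"x = {x:>4}   sigma(x) = {s:.6f}   sigma'(x) = {grad:.2e}")

# sigma'(0) = 0.25 is the maximum; by x = 10 the local gradient is roughly 4.5e-05.
# Multiply a few of these factors together across layers and you have the
# vanishing-gradient problem described above.
```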
Linearity, product rule, quotient rule, chain rule
Four rules let you differentiate almost anything you will meet:
- Linearity. $(a f + b g)' = a f' + b g'$ for constants $a$ and $b$. Differentiation respects sums and scalar multiples.
- Product rule. $(f g)' = f' g + f g'$. Each factor takes a turn being differentiated; sum the results.
- Quotient rule. $(f/g)' = (f' g - f g')/g^2$ wherever $g \neq 0$. A consequence of the product rule applied to $f \cdot g^{-1}$.
- Chain rule. $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$. Compose two functions; multiply their slopes evaluated at the right points.
Worked product rule. Take $f(x) = x \sin x$. Set $u = x$ and $v = \sin x$, so $u' = 1$ and $v' = \cos x$. Then
$$(uv)' = u' v + u v' = 1 \cdot \sin x + x \cdot \cos x = \sin x + x \cos x.$$
You can sanity-check this at $x = 0$: $f(0) = 0$ and the derivative is $\sin 0 + 0 \cdot \cos 0 = 0$, which matches the fact that the graph passes flat through the origin.
Worked quotient rule. Take $f(x) = e^x / (1 + x)$. The numerator is $e^x$ with derivative $e^x$; the denominator is $1 + x$ with derivative $1$. Then
$$f'(x) = \frac{e^x (1 + x) - e^x \cdot 1}{(1 + x)^2} = \frac{e^x \bigl[(1 + x) - 1\bigr]}{(1 + x)^2} = \frac{x \, e^x}{(1 + x)^2}.$$
At $x = 0$ this gives $f'(0) = 0$: the curve is momentarily flat there, consistent with $f(0) = 1$ being a local minimum of this expression (the derivative is negative just to the left of $0$ and positive just to the right).
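Both worked examples can be confirmed numerically. The sketch below is our own Python (the sample points are arbitrary); it compares each claimed derivative against a central difference:

```python
import math

# f(x) = x * sin(x) and its product-rule derivative from the worked example above.
def product_example(x):
    return x * math.sin(x)

def product_derivative(x):
    return math.sin(x) + x * math.cos(x)

# f(x) = exp(x) / (1 + x) and its quotient-rule derivative from the worked example above.
def quotient_example(x):
    return math.exp(x) / (1.0 + x)

def quotient_derivative(x):
    return x * math.exp(x) / (1.0 + x) ** 2

h = 1e-6
for name, f, df in [("product", product_example, product_derivative),
                    ("quotient", quotient_example, quotient_derivative)]:
    for x in (0.0, 0.5, 1.5):
        numeric = (f(x + h) - f(x - h)) / (2 * h)
        print(f"{name:8s} x = {x:<4} claimed {df(x):+.6f}   numeric {numeric:+.6f}")
```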
Worked chain rule. Take $h(x) = \sin(x^2)$. Decompose: let $g(x) = x^2$ (the inner function) and $f(u) = \sin u$ (the outer function), so $h = f \circ g$. Then $g'(x) = 2x$ and $f'(u) = \cos u$, so
$$h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x) = \cos(x^2) \cdot 2x = 2x \cos(x^2).$$
The chain rule is the operation behind backpropagation. Every layer of a neural network is a composed function; every gradient computed during a backward pass is an iterated application of the chain rule, multiplying local Jacobians from output back to input. Once you understand the line above, you understand the mechanical core of training neural networks. We will use the chain rule constantly from §3.5 onwards, where it gets dressed up with vectors and matrices but never changes in spirit.
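Here is that chain-rule line written out in the forward-then-backward shape that backpropagation uses: store the intermediate value on the way forward, then multiply local derivatives on the way back. This is a hand-rolled Python sketch of ours, not any framework's API:

```python
import math

x = 1.3

# Forward pass through h(x) = sin(x**2), storing the intermediate u = g(x).
u = x ** 2
h = math.sin(u)

# Backward pass: multiply the local derivatives, outer first, inner second.
dh_du = math.cos(u)      # derivative of the outer function f(u) = sin(u) at u = g(x)
du_dx = 2 * x            # derivative of the inner function g(x) = x**2 at x
dh_dx = dh_du * du_dx    # chain rule: 2x * cos(x**2)

# Sanity check against a central difference.
eps = 1e-6
numeric = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
print(f"chain rule {dh_dx:.8f}   finite difference {numeric:.8f}")
```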
Higher-order derivatives
Once you have $f'(x)$, it too is a function of $x$, and you can differentiate it again. The result, $f''(x)$, is the second derivative. Geometrically, $f'$ is the slope of $f$, so $f''$ is the rate at which the slope is changing, the curvature of the graph. Where $f''(x) > 0$, the graph is curving upwards (think of a bowl); we say $f$ is convex there. Where $f''(x) < 0$, the graph is curving downwards (a dome); $f$ is concave. Where $f''(x) = 0$ and the sign changes on either side, the graph has an inflection point: the curvature flips direction.
Worked example. Take $f(x) = x^3$. Differentiate once: $f'(x) = 3x^2$. Differentiate again: $f''(x) = 6x$. Then $f''(0) = 0$: an inflection point sits at the origin, where the cubic switches from concave to convex. To the right, $f''(1) = 6 > 0$ and the graph curves upwards. To the left, $f''(-1) = -6 < 0$ and it curves downwards. The classic S-shape of the cubic falls out of these signs.
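The same finite-difference idea that checks first derivatives also checks curvature. The symmetric second difference $\bigl[f(x+h) - 2f(x) + f(x-h)\bigr]/h^2$ approximates $f''(x)$; the sketch below is our own Python (the step $h$ is an arbitrary choice) and recovers $6x$ together with the sign flip at the origin:

```python
def f(x):
    return x ** 3

h = 1e-4
for x in (-1.0, 0.0, 1.0):
    second = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2   # symmetric second difference
    print(f"x = {x:+.1f}   numeric f'' = {second:+.4f}   exact 6x = {6 * x:+.1f}")

# Negative on the left (concave), zero at the inflection point, positive on the right (convex).
```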
You can keep differentiating: $f'''$, $f^{(4)}$, and so on, provided $f$ is smooth enough at the point of interest. Higher-order derivatives appear in three places later in this book: Newton's method (§3.10), which uses $f''$ to build a step that already accounts for curvature, jumping straight to the minimum of the local quadratic rather than crawling along the slope; Taylor series (the next subsection), which uses every derivative at a single point to reconstruct the function in a neighbourhood; and second-order optimisation methods, which exploit Hessian information to converge in dramatically fewer iterations than plain gradient descent, at the cost of computing or approximating curvature, which for a network with billions of parameters is itself a serious engineering problem.
Taylor series
Around a point $x_0$, a sufficiently smooth function $f$ admits the expansion
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2 + \tfrac{1}{6} f'''(x_0)(x - x_0)^3 + \cdots,$$
with the $n$-th term being $f^{(n)}(x_0) (x - x_0)^n / n!$. The first two terms give the linear approximation, the tangent line. The first three give the quadratic approximation, the parabola that matches $f$ in value, slope, and curvature at $x_0$. For optimisation, this distinction is everything: gradient descent uses the linear approximation (it follows the tangent slope); Newton's method uses the quadratic (it jumps to the minimum of the local parabola).
Worked example. Take $f(x) = e^x$ around $x_0 = 0$. Every derivative of the exponential is itself, so $f^{(n)}(0) = e^0 = 1$ for all $n$. The Taylor series is therefore
$$e^x = 1 + x + \tfrac{x^2}{2} + \tfrac{x^3}{6} + \tfrac{x^4}{24} + \cdots.$$
Plug in $x = 0.1$: the partial sum $1 + 0.1 + 0.005 + 0.000167 + \cdots$ comes to $1.10517 \ldots$, matching the true value $e^{0.1} = 1.10517 \ldots$ to many decimal places after only a handful of terms. This rapid convergence is why floating-point libraries compute exponentials via short polynomial approximations and why Taylor expansions are everywhere in numerical analysis and physics.
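The computation is easy to reproduce. The loop below is a Python sketch of ours (the cutoff of eight terms is arbitrary); it builds each term from the previous one, since multiplying the $n$-th term by $x/(n+1)$ gives the next, and prints the running partial sums:

```python
import math

x = 0.1
term, partial = 1.0, 0.0
for n in range(8):
    partial += term
    print(f"after term n = {n}: partial sum = {partial:.10f}")
    term *= x / (n + 1)           # next term: x**(n+1) / (n+1)!

print(f"math.exp(0.1)     = {math.exp(x):.10f}")
# By n = 6 the printed partial sum already matches math.exp(0.1) to the ten
# decimal places shown.
```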
Critical points and optima
A critical point of $f$ is any $x$ at which $f'(x) = 0$. These are the candidates for local extrema; once you have located them you classify them with the second derivative:
- $f''(x) > 0$: the graph is curving upwards through a horizontal tangent, a local minimum.
- $f''(x) < 0$: the graph is curving downwards through a horizontal tangent, a local maximum.
- $f''(x) = 0$: the test is inconclusive. The point may be an inflection, a flat plateau, or a higher-order critical point that needs further analysis.
Worked example. Take $f(x) = x^3 - 3x$. Differentiate: $f'(x) = 3x^2 - 3 = 3(x^2 - 1)$, which vanishes at $x = \pm 1$. Differentiate again: $f''(x) = 6x$. At $x = 1$ we have $f''(1) = 6 > 0$, so $x = 1$ is a local minimum (with value $f(1) = -2$). At $x = -1$ we have $f''(-1) = -6 < 0$, so $x = -1$ is a local maximum (with value $f(-1) = 2$). Both are local, not global: as $x \to \pm \infty$ the cubic runs off to $\pm \infty$, so this function has no global extrema.
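The classification procedure is mechanical enough to script. The sketch below is our own Python; the candidate points are the ones found by hand above, and the code simply confirms that $f'$ vanishes there and reads off the sign of $f''$:

```python
# Classify the critical points of f(x) = x**3 - 3x found by solving f'(x) = 0.
def f(x):
    return x ** 3 - 3 * x

def f_prime(x):
    return 3 * x ** 2 - 3

def f_double_prime(x):
    return 6 * x

for x in (-1.0, 1.0):                        # the hand-derived candidates
    assert abs(f_prime(x)) < 1e-12           # confirm each really is a critical point
    kind = "local minimum" if f_double_prime(x) > 0 else "local maximum"
    print(f"x = {x:+.0f}: f = {f(x):+.0f}, f'' = {f_double_prime(x):+.0f} -> {kind}")

# Prints: x = -1 is a local maximum with f = +2; x = +1 is a local minimum with f = -2.
```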
In machine learning, the loss landscapes we navigate are vastly higher-dimensional and rarely have a single, clean global minimum. They contain a complex tangle of local minima, saddle points (where the gradient vanishes but the Hessian has both positive and negative eigenvalues, so the surface curves up in some directions and down in others), wide flat plateaus where progress crawls, and narrow ravines where it can oscillate. The simple one-dimensional recipe (locate the critical points, classify them with the second derivative) is the conceptual core, but the practical work of finding good parameters in a network with billions of weights is dominated by the geometry of these high-dimensional surfaces. Gradient descent and its cousins navigate them using only first-derivative information, with the occasional dose of second-derivative or curvature-approximating tricks. We develop this story in detail in Chapters 9 and 10; the ideas in this section are the alphabet of that story.
What you should take away
A derivative is a slope. $f'(x)$ is the instantaneous rate of change of $f$ at $x$, equivalently the slope of the tangent line, equivalently the coefficient of $\Delta x$ in the best linear approximation $f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x$.
A handful of standard derivatives covers almost everything. Constants, powers, $e^x$, $\ln x$, $\sin$, $\cos$, sigmoid, $\tanh$, and ReLU. Memorise these; the rest you can compose.
Four rules combine them. Linearity, product, quotient, and chain. The chain rule is the heart of backpropagation.
Second derivatives encode curvature. $f'' > 0$ is convex (local minimum candidate), $f'' < 0$ is concave (local maximum candidate). Curvature drives Newton's method, Hessian-based optimisers, and the Taylor expansions used throughout numerical analysis.
Critical points are where $f'(x) = 0$. They are candidates for extrema, classified by the sign of $f''$. In ML, the loss surface is full of them: minima, maxima, and saddle points alike. Most of optimisation theory is the study of how to navigate this zoo.