3.2 Limits, continuity, and differentiability
The previous section motivated calculus as the engine of learning: gradient descent moves a model's parameters in the direction that reduces a loss, and the gradient is built from derivatives. That description glossed over a subtle question. What does it actually mean to talk about "the slope of $f$ at the point $x$" when a single point has no width and a slope needs two points to be defined? The classical answer, and the one that powers every line of backpropagation code, is that the slope at $x$ is the limiting value of slopes of secant lines as the second point is brought arbitrarily close to $x$. Limits are the conceptual hinge on which the rest of calculus turns.
To make this precise we need three connected ideas. Limits describe the value a function approaches near a point, even when the function may not be defined exactly at that point. Continuity says the function has no sudden jumps, holes, or runaways to infinity, so its limiting value at a point matches its actual value there. Differentiability says the function is locally well approximated by a straight line, whose slope is the derivative (and, in several variables, the gradient). Each idea builds on the previous, and each one shows up in the practice of training neural networks: limits underwrite the very definition of a derivative; continuity is the minimum we ask of an activation function; differentiability is what makes the chain rule of backpropagation legal.
The treatment here is precise enough for ML practice but stops short of measure theory and real analysis.
Limits, intuitively
Imagine you are trying to predict the value of a function at a point you cannot quite reach. Perhaps the formula explicitly forbids that point, perhaps a measurement instrument cannot record the value there, perhaps the function is defined by a process that never finishes exactly. The limit is the value the function appears to be heading towards as you sneak up on the awkward point from either side.
The classical worked example is
$$ f(x) = \frac{x^2 - 1}{x - 1}. $$
At $x = 1$ the denominator is zero and $f$ is undefined; the formula has a hole there. Yet for any $x \ne 1$ we can simplify by factoring the numerator, $x^2 - 1 = (x-1)(x+1)$, and cancelling the common factor: $f(x) = x + 1$. So as $x$ wanders towards 1, the function values wander towards 2, even though the function itself never quite reaches 2 along its own graph. We write $\lim_{x \to 1} f(x) = 2$.
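A quick numerical check makes the hole concrete. The sketch below, in plain Python, evaluates $f$ at points creeping towards 1 from both sides; evaluating at $x = 1$ itself would divide by zero.

```python
def f(x):
    # (x**2 - 1) / (x - 1): undefined at x = 1, equal to x + 1 everywhere else
    return (x**2 - 1) / (x - 1)

for h in [0.1, 0.01, 0.001, 1e-6]:
    print(f"f(1 - {h}) = {f(1 - h):.6f}    f(1 + {h}) = {f(1 + h):.6f}")
# Both columns head towards 2.0, even though f(1) itself raises ZeroDivisionError.
```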
The limit captures the function's "intended" value at the awkward point, the value it would have if we patched the hole. Notice three things about this idea. First, the limit is concerned with values near the point of interest, call it $a$, not at $a$ itself; whatever happens at $a$ is irrelevant to the limit. Second, both directions matter: $x$ must approach $a$ from the left and from the right, and the two approaches must agree. Third, the limit may simply not exist, for instance when a function jumps or oscillates. These three observations carry directly into machine learning. When we define the derivative as a limit, we never actually divide by zero; we use the limit to give meaning to a quantity that would otherwise be $0/0$.
Limits, formally
The intuitive picture is fine for hand-waving but useless for proving anything. The standard $\epsilon$-$\delta$ definition, due to Cauchy and Weierstrass, makes the idea precise. We say
$$ \lim_{x \to a} f(x) = L $$
if and only if for every $\epsilon > 0$ there exists a $\delta > 0$ such that whenever $0 < |x - a| < \delta$, we have $|f(x) - L| < \epsilon$. Read this as a game between two players. A user picks any tolerance $\epsilon$, no matter how small. We must respond with a window width $\delta$ such that, whenever $x$ is inside that window around $a$ but not equal to $a$ itself, the function value $f(x)$ is within the requested tolerance of $L$. If we can win this game for any tolerance the user picks, the limit equals $L$.
The condition $0 < |x - a|$ excludes $x = a$, again reflecting the fact that the limit does not care what happens exactly at $a$.
A short worked example illustrates the mechanics. Claim: $\lim_{x \to 0} (3x + 5) = 5$. Take any $\epsilon > 0$ and choose $\delta = \epsilon / 3$. Then whenever $|x| < \delta$ we have
$$ |f(x) - 5| = |3x + 5 - 5| = 3|x| < 3 \delta = 3 \cdot \frac{\epsilon}{3} = \epsilon. $$
So the chosen $\delta$ wins the game for any $\epsilon$, and the limit is indeed 5. The pattern is typical: write the gap $|f(x) - L|$ in terms of $|x - a|$, find the constant of proportionality, and divide $\epsilon$ by it to obtain $\delta$.
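The game can also be played numerically. The sketch below samples points inside the $\delta$-window and checks the tolerance; `wins_game` is an illustrative helper, not a standard routine, and a finite sample can only support the claim, not prove it.

```python
import random

def wins_game(f, a, L, eps, delta, trials=10_000):
    """Empirical check: does 0 < |x - a| < delta force |f(x) - L| < eps?"""
    for _ in range(trials):
        x = a + random.uniform(-delta, delta)
        if x != a and abs(f(x) - L) >= eps:
            return False
    return True

f = lambda x: 3 * x + 5
for eps in [1.0, 0.1, 1e-4]:
    delta = eps / 3                     # the delta chosen in the proof above
    print(eps, wins_game(f, a=0.0, L=5.0, eps=eps, delta=delta))
# Prints True for every tolerance, matching the epsilon-delta argument.
```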
Two facts about limits are worth keeping in your pocket because they recur. The algebra of limits says that limits respect addition, multiplication and (with a non-zero denominator) division: if $\lim_{x \to a} f(x) = L$ and $\lim_{x \to a} g(x) = M$, then $\lim_{x \to a} (f \pm g) = L \pm M$, $\lim_{x \to a} fg = LM$, and $\lim_{x \to a} f/g = L/M$ provided $M \ne 0$. The squeeze theorem says that if $g(x) \le f(x) \le h(x)$ near $a$ and $g$ and $h$ both tend to the same limit $L$, then $f$ is forced to tend to $L$ as well. The squeeze theorem is the classical engine behind the result $\lim_{x \to 0} \sin(x)/x = 1$, and the same trick recurs whenever an awkward function is sandwiched between two friendlier ones.
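The squeeze is easy to watch numerically: for $0 < |x| < \pi/2$ the classical bounds $\cos x \le \sin(x)/x \le 1$ hold, and both bounds close in on 1. A small NumPy sketch:

```python
import numpy as np

x = np.array([0.5, 0.1, 0.01, 0.001])
lower = np.cos(x)           # g(x): the lower bound near 0
middle = np.sin(x) / x      # f(x): the awkward quotient
for xi, lo, mid in zip(x, lower, middle):
    print(f"x={xi:<6}  cos(x)={lo:.6f} <= sin(x)/x={mid:.6f} <= 1")
# As x -> 0 both bounds tend to 1, so sin(x)/x is squeezed to 1.
```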
We also need one-sided limits, $x \to a^+$ and $x \to a^-$, which use the same definition but restrict $x$ to lie above or below $a$ respectively. The two-sided limit exists precisely when both one-sided limits exist and agree. The asymmetry of one-sided limits is exactly what lets us talk about ReLU's left and right derivatives at zero, and what therefore lets us discuss its kink without ambiguity.
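A concrete one-sided example is $f(x) = |x|/x$, which is $+1$ to the right of zero and $-1$ to the left (a minimal sketch):

```python
def f(x):
    # |x| / x equals +1 for x > 0 and -1 for x < 0; it is undefined at 0
    return abs(x) / x

for h in [0.1, 0.01, 1e-4]:
    print(f"f({h}) = {f(h):+.0f}    f({-h}) = {f(-h):+.0f}")
# The right-hand limit at 0 is +1 and the left-hand limit is -1,
# so the two-sided limit does not exist.
```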
Continuity
A function $f$ is continuous at $a$ if three conditions hold simultaneously: (i) $f(a)$ is defined, (ii) $\lim_{x \to a} f(x)$ exists, and (iii) the two coincide, $\lim_{x \to a} f(x) = f(a)$. If a function is continuous at every point of its domain we say it is continuous on that domain.
Intuitively, continuity means the graph can be drawn without lifting the pen: no holes, no jumps, no infinities. The functions we meet most often in machine learning are continuous wherever they are defined. Polynomials are continuous everywhere. The exponential $e^x$ is continuous everywhere; the natural logarithm $\ln x$ is continuous on $(0, \infty)$ where it is defined; $\sin$ and $\cos$ are continuous everywhere; the standard activation functions sigmoid, tanh, GELU and ReLU are all continuous on the real line.
Some familiar functions are not continuous everywhere. The tangent function $\tan x$ blows up to $\pm \infty$ as $x$ approaches $\pi/2 + k\pi$; it is undefined at those points and cannot be extended continuously across them. The indicator function $\mathbb{1}[x > 0]$, which returns 0 for non-positive inputs and 1 for positive ones, jumps abruptly at 0 and is therefore discontinuous there.
Why does machine learning care about continuity? Because backpropagation propagates gradients through whatever activation functions you choose, and for those gradients to be well defined you need at the very least continuity. A step-shaped function such as the sign function is flat, with zero derivative, everywhere except at zero, where it jumps and has no derivative at all; if you used it as a hidden activation, no gradient signal would flow back through it and learning would stall. This is why early connectionists, who started with the perceptron's hard threshold, switched to the smooth sigmoid the moment they wanted to train multi-layer networks.
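A tiny autograd experiment makes the stalling visible. The sketch below assumes PyTorch and uses its built-in `torch.sign` as the hard threshold; the gradients assigned to non-smooth ops are framework conventions worth checking against your own installation.

```python
import torch

x = torch.tensor([0.3, -1.2, 2.0], requires_grad=True)

# Smooth activation: gradient signal survives.
torch.sigmoid(x).sum().backward()
print(x.grad)        # non-zero entries

x.grad = None        # reset before the second experiment

# Hard threshold: the derivative is 0 wherever it is defined.
torch.sign(x).sum().backward()
print(x.grad)        # all zeros: nothing flows back to earlier layers
```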
Two classical theorems for continuous functions on a closed bounded interval $[a, b]$ are worth knowing. The intermediate value theorem says that a continuous function takes every value between $f(a)$ and $f(b)$ at some point in the interval; this is what justifies bisection-style root finders, including the line searches that some optimisers use. The extreme value theorem says a continuous function on a closed bounded interval attains its maximum and minimum on that interval. Without compactness this can fail: $f(x) = 1/x$ on the half-open interval $(0, 1]$ has no maximum because the function blows up as $x \to 0$.
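The intermediate value theorem is exactly what licenses bisection: if a continuous $f$ changes sign on $[a, b]$, it must cross zero somewhere inside. A minimal sketch:

```python
def bisect(f, a, b, tol=1e-10):
    """Root of a continuous f on [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    while b - a > tol:
        m = 0.5 * (a + b)
        fm = f(m)
        if fa * fm <= 0:          # the sign change lies in [a, m]
            b, fb = m, fm
        else:                     # otherwise it lies in [m, b]
            a, fa = m, fm
    return 0.5 * (a + b)

print(bisect(lambda x: x**3 - 2, 1.0, 2.0))   # ~1.259921, the cube root of 2
```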
Differentiability
A function $f$ is differentiable at $a$ if the limit
$$ f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h} $$
exists. Geometrically, $f'(a)$ is the slope of the tangent line at $a$, the limiting value of secant slopes as the second point $a + h$ slides into $a$. Analytically it is the best linear approximation: $f(a + h) = f(a) + f'(a) h + o(h)$ as $h \to 0$, where the error shrinks faster than $h$ itself.
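Numerically, the definition says the secant slopes settle down as $h$ shrinks. For $f(x) = x^2$ at $a = 1$ they settle on 2 (a minimal sketch):

```python
def f(x):
    return x**2

a = 1.0
for h in [0.1, 0.01, 1e-4, -0.1, -0.01, -1e-4]:
    slope = (f(a + h) - f(a)) / h     # slope of the secant through a and a + h
    print(f"h={h:<8}  secant slope = {slope:.6f}")
# The slopes approach 2.0 from both sides: f'(1) = 2.
```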
Differentiability is strictly stronger than continuity. Every differentiable function is continuous (if $f(a + h) - f(a)$ did not shrink to zero as $h \to 0$, the difference quotient could not converge), but not every continuous function is differentiable. The standard counterexample is $f(x) = |x|$ at $x = 0$: the function is continuous everywhere, but the difference quotient $|h|/h$ equals $-1$ for $h < 0$ and $+1$ for $h > 0$, so the left and right one-sided limits disagree and the two-sided limit does not exist. The graph has a sharp corner at the origin.
Worked example. Consider ReLU, $f(x) = \max(0, x)$. The function is continuous everywhere; for $x > 0$ it equals $x$ and has derivative 1, for $x < 0$ it equals 0 and has derivative 0. At the kink $x = 0$ the left derivative is 0 and the right derivative is 1, so ReLU is not differentiable at the origin in the strict sense. By convention, the standard frameworks pick $f'(0) = 0$ (a few pick 1); it does not matter in practice, because the set of inputs that land exactly on the kink has measure zero, so a pre-activation essentially never hits exactly zero during training. We have a single offending point, and we patch it with a sensible convention.
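To see a framework's convention in action, the snippet below (assuming PyTorch; the value assigned at the kink is an implementation choice worth verifying on your version) asks autograd for ReLU's gradient exactly at zero.

```python
import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # expected: tensor([0., 0., 1.]) -- gradient 0 at the kink
```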
The same pattern appears throughout deep learning. A function fails to be differentiable on a tiny exceptional set, the optimiser ignores the kink, and training proceeds. The conceptual price is that the chain rule no longer holds at the bad points, but the practical price is zero because we essentially never visit those points exactly.
Subdifferentials
When we cannot lean on a classical derivative we have a more permissive substitute, at least for convex functions. If $f$ is convex but not differentiable at $a$, the subdifferential $\partial f(a)$ is the set of all real numbers $g$ such that the affine function $\ell(x) = f(a) + g(x - a)$ lies on or below $f$ everywhere and touches $f$ at $a$. Each such $g$ is called a subgradient. When $f$ is differentiable at $a$, the subdifferential collapses to the singleton $\{f'(a)\}$; when $f$ has a kink, the subdifferential typically contains a whole interval of slopes.
For $f(x) = |x|$ at the origin the subdifferential is the closed interval $[-1, 1]$: any line through the origin with slope between $-1$ and $1$ stays on or below $|x|$ and touches it at zero. For ReLU at zero, the subdifferential is $[0, 1]$, which is exactly the range from the left to the right derivative.
Subgradients let us run "gradient descent" on convex functions that are not differentiable everywhere. The subgradient method picks any element of $\partial f(a)$ and uses it in place of the gradient. The classical use case in ML is L1-regularised regression (the lasso), where the regulariser $\lambda \|w\|_1$ has a kink wherever any coordinate of $w$ is zero; subgradient or proximal methods handle these kinks gracefully.
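A minimal subgradient-descent sketch for the lasso objective $\tfrac{1}{2}\|Xw - y\|^2 + \lambda \|w\|_1$, using NumPy on synthetic data (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])          # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)

lam, step = 0.5, 1e-3
w = np.zeros(5)
for _ in range(5000):
    # The smooth part has an ordinary gradient; |w_i| contributes a subgradient:
    # sign(w_i) away from zero, any value in [-1, 1] at w_i = 0 (np.sign picks 0).
    subgrad = X.T @ (X @ w - y) + lam * np.sign(w)
    w -= step * subgrad

print(np.round(w, 3))   # close to w_true, with the truly-zero coordinates kept small
```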
Almost-everywhere differentiability
Many functions in machine learning are differentiable everywhere except on a small exceptional set. ReLU fails at zero, max-pooling fails at ties, the absolute value fails at zero, and so on. In each case the exceptional set has Lebesgue measure zero: it is so thin that any reasonable probability distribution puts zero probability on it.
For practical purposes, a function differentiable almost everywhere may as well be treated as differentiable everywhere. Backpropagation reads off whatever derivative is defined at the visited points; the bad set is essentially never hit; gradient descent makes progress. The framework simply bakes in a convention for the exceptional points (ReLU at zero gets derivative 0, max-pooling ties get split or all-or-nothing) and moves on.
The mathematical theorem behind this comfort is Rademacher's theorem: every Lipschitz function on $\mathbb{R}^n$ is differentiable almost everywhere. ReLU, max-pooling, leaky ReLU, the absolute value, and many other piecewise-linear primitives are Lipschitz, so Rademacher hands us almost-everywhere differentiability for free. The picture you should carry around is that "differentiable in ML practice" means "differentiable on a set of full measure", and the optimisation theorems that rely on it (descent, convergence rates) are written to tolerate the bad set.
Smoothness and Lipschitz continuity
Two further notions sharpen what we mean by a "well-behaved" function. The first is Lipschitz continuity. A function $f$ is Lipschitz continuous with constant $L$ if for all $x, y$ in its domain,
$$ |f(x) - f(y)| \le L \, |x - y|. $$
Lipschitz continuity bounds how fast the function can change: no matter where you stand, the function never out-paces a fixed slope $L$. ReLU is 1-Lipschitz because its slope is at most 1. The sigmoid $\sigma$ is $\tfrac{1}{4}$-Lipschitz because $|\sigma'(x)| \le \tfrac{1}{4}$ everywhere, the maximum being attained at the origin. A linear layer $x \mapsto W x$ has Lipschitz constant equal to the spectral norm of $W$, the largest singular value. Lipschitz constants control how a network amplifies small input perturbations, and they appear directly in robustness analyses, certified defences against adversarial examples, and convergence proofs for GAN training.
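The spectral-norm claim is easy to probe empirically: no pair of inputs should be stretched by more than the largest singular value (a NumPy sketch with a random $W$; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
spec = np.linalg.norm(W, ord=2)         # spectral norm = largest singular value

ratios = []
for _ in range(10_000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    ratios.append(np.linalg.norm(W @ x - W @ y) / np.linalg.norm(x - y))

print(f"spectral norm      : {spec:.4f}")
print(f"largest ratio seen : {max(ratios):.4f}")   # never exceeds the spectral norm
```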
The second notion is smoothness. A function is $C^k$ if it is $k$-times differentiable and the $k$-th derivative is continuous; it is $C^\infty$, or simply smooth, if derivatives of every order exist and are continuous. Most ML losses built from polynomials, exponentials and logarithms are at least $C^2$, so we can talk about Hessians and quadratic Taylor expansions. Some theoretical analyses require $C^\infty$; classical examples are polynomials, $e^x$, $\sin$ and $\cos$, all of which are infinitely differentiable. Activation functions like GELU and softplus are $C^\infty$, while ReLU is only $C^0$ globally and $C^\infty$ on each side of the kink.
In gradient-descent theory we usually assume the loss $\mathcal{L}$ has $L$-Lipschitz gradients, $\|\nabla \mathcal{L}(\theta) - \nabla \mathcal{L}(\theta')\| \le L \|\theta - \theta'\|$. Under that assumption the descent lemma guarantees a decrease in the loss at every step provided the learning rate stays below $2/L$, which is the textbook reason learning rates must be tuned and why curvature-aware methods can step more aggressively.
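A one-dimensional illustration: for the quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2} L \theta^2$ the gradient $L\theta$ is $L$-Lipschitz, and gradient descent contracts for learning rates below $2/L$ but blows up just above it (a minimal sketch):

```python
def run_gd(lr, L=10.0, theta0=1.0, steps=50):
    """Gradient descent on 0.5 * L * theta**2, whose gradient L * theta is L-Lipschitz."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * L * theta
    return theta

L = 10.0
for lr in [0.5 / L, 1.9 / L, 2.1 / L]:
    print(f"lr = {lr:.3f}: theta after 50 steps = {run_gd(lr, L):.3e}")
# Below 2/L the iterate shrinks towards the minimiser at 0; just above 2/L it diverges.
```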
What you should take away
- The limit of $f$ as $x$ approaches $a$ is the value $f$ heads towards near $a$, regardless of what happens at $a$ itself; the formal $\epsilon$-$\delta$ definition turns this picture into a proof discipline.
- Continuity at $a$ requires that $f(a)$ exists, the limit exists, and they agree; ReLU and the standard activations are continuous, while step functions are not, which is exactly why step functions cannot be trained through by gradient descent when used as hidden activations.
- Differentiability at $a$ requires the difference-quotient limit to exist; it implies continuity but not the converse, and many ML primitives fail at a measure-zero set of points which we patch by convention.
- The subdifferential generalises the derivative for convex non-smooth functions and gives the lasso, hinge loss and friends a clean optimisation theory via subgradient methods.
- Lipschitz continuity bounds how fast a function can change and underwrites both robustness analyses and the convergence proofs for gradient descent; smoothness classes $C^k$ tell us how many derivatives are at our disposal for higher-order methods such as Newton's.