A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0, 1]$:
$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$
Geometrically, the chord between $(x, f(x))$ and $(y, f(y))$ lies on or above the graph of $f$. The function is strictly convex if the inequality is strict for all $x \neq y$ and $\lambda \in (0, 1)$.
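As a quick illustration, here is a minimal numerical sketch (assuming NumPy; the helper name `jensen_violated` is only for this example) that spot-checks the defining inequality along a single chord. Passing the check proves nothing, but any violation disproves convexity:

```python
import numpy as np

def jensen_violated(f, x, y, n_lambdas=100):
    """Return True if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y)
    for some sampled lam. A spot check along one chord: it cannot prove
    convexity, but a single violation disproves it."""
    for lam in np.linspace(0.0, 1.0, n_lambdas):
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + 1e-9:  # small tolerance for floating-point error
            return True
    return False

x, y = np.array([1.0, -2.0]), np.array([3.0, 0.5])
convex_f = lambda v: float(v @ v)        # ||v||^2 is convex
concave_g = lambda v: -float(v @ v)      # -||v||^2 is concave, not convex
print(jensen_violated(convex_f, x, y))   # False: no violation found
print(jensen_violated(concave_g, x, y))  # True: the graph rises above the chord
```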
Equivalent characterisations (for differentiable $f$):
- First-order: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$; the tangent hyperplane (first-order Taylor approximation) is a global lower bound.
- Second-order (for twice-differentiable $f$): the Hessian $\nabla^2 f(x)$ is positive semi-definite for all $x$.
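The second-order condition lends itself to a simple numerical spot check: sample some points and verify that the smallest Hessian eigenvalue is non-negative. A minimal sketch, again assuming NumPy and a user-supplied Hessian function (a finite sample cannot certify convexity everywhere; it only fails to find a counterexample):

```python
import numpy as np

def hessian_psd_at(hess, points, tol=1e-8):
    """Second-order spot check: smallest Hessian eigenvalue >= -tol at each
    sampled point. `hess` maps a point x to the n x n Hessian matrix."""
    return all(np.linalg.eigvalsh(hess(x)).min() >= -tol for x in points)

# f(x) = 0.5 x^T A x with A positive definite has constant Hessian A,
# so the check passes at every sampled point.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
points = [np.random.randn(2) for _ in range(10)]
print(hessian_psd_at(lambda x: A, points))  # True
```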
Examples of convex functions:
- Affine: $f(x) = a^\top x + b$ (both convex and concave).
- Quadratic: $f(x) = \frac{1}{2} x^\top A x + b^\top x$ where $A$ is symmetric positive semi-definite.
- Exponential: $e^x$.
- Negative log: $-\log x$ on $x > 0$.
- Norms: $\|x\|_p$ for $p \geq 1$.
- Log-sum-exp: $\log \sum_i e^{x_i}$ (a numerical check follows this list).
- Indicator function of a convex set ($0$ on the set, $+\infty$ outside).
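For instance, the log-sum-exp entry can be checked via the second-order condition: its Hessian is $\operatorname{diag}(p) - p p^\top$ with $p = \operatorname{softmax}(x)$, which is positive semi-definite. A small NumPy sketch:

```python
import numpy as np

def logsumexp_hessian(x):
    """Hessian of log-sum-exp at x: diag(p) - p p^T with p = softmax(x)."""
    p = np.exp(x - x.max())      # shift for numerical stability
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

x = np.random.randn(5)
eigs = np.linalg.eigvalsh(logsumexp_hessian(x))
print(eigs.min() >= -1e-12)      # True: positive semi-definite (one eigenvalue is 0)
```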
Examples of non-convex functions:
- Neural network loss landscapes (in general).
- Trigonometric functions like $\sin$ (a one-chord counterexample follows this list).
- Products of convex functions (in general).
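A single violated chord is enough to disprove convexity. For $\sin$, one such chord:

```python
import numpy as np

# One violated chord disproves convexity of sin: take x = 0, y = pi, lam = 1/2.
x, y, lam = 0.0, np.pi, 0.5
lhs = np.sin(lam * x + (1 - lam) * y)          # sin(pi/2) = 1
rhs = lam * np.sin(x) + (1 - lam) * np.sin(y)  # 0.5 * (0 + 0) = 0
print(lhs <= rhs)                              # False: the defining inequality fails
```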
Convexity-preserving operations:
- Sum of convex functions is convex.
- Composition $f(g(x))$ is convex when $f$ is convex and non-decreasing and $g$ is convex.
- Maximum $\max_i f_i(x)$ of convex functions is convex.
- Affine substitution $f(Ax + b)$ inherits convexity from $f$.
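These rules compose. The sketch below (assuming NumPy; all names are illustrative) builds a piecewise-linear function as the maximum of affine pieces, convex by the max rule combined with affine substitution, and spot-checks the defining inequality along random chords:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # five affine pieces a_i^T x + b_i in R^3
b = rng.standard_normal(5)

def f(x):
    # Pointwise maximum of affine functions: convex by the max rule
    # combined with affine substitution.
    return float(np.max(A @ x + b))

# Spot-check the defining inequality along random chords.
for _ in range(1000):
    x, y, lam = rng.standard_normal(3), rng.standard_normal(3), rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
print("no violations found")
```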
Why convexity matters:
Every local minimum is a global minimum, so any algorithm that converges to a stationary point, such as gradient descent with a suitable step size, has found a global optimum.
Strong convexity ($\nabla^2 f(x) \succeq \mu I$ for some $\mu > 0$ and all $x$) implies a unique minimiser and linear (geometric) convergence rates for gradient descent.
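As an illustration of the linear-rate claim, here is a minimal gradient-descent sketch on a strongly convex quadratic (assuming NumPy; the step size $1/L$ is one standard choice, not the only one):

```python
import numpy as np

# Gradient descent on the strongly convex quadratic f(x) = 0.5 x^T A x,
# where mu and L are the smallest/largest eigenvalues of A. With step 1/L
# the distance to the minimiser (the origin) shrinks by a constant factor
# per iteration, i.e. linear (geometric) convergence.
A = np.diag([1.0, 10.0])              # mu = 1, L = 10
L = np.linalg.eigvalsh(A).max()
x = np.array([5.0, -3.0])
for k in range(41):
    if k % 10 == 0:
        print(k, np.linalg.norm(x))   # decays roughly like (1 - mu/L)^k
    x = x - (1.0 / L) * (A @ x)       # gradient of f is A x
```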
Duality theory is exact: the primal and dual problems have the same optimal value (under mild constraint qualifications).
Many classical ML algorithms have convex losses:
- Linear regression (squared loss): convex.
- Logistic regression: convex (a Hessian check follows this list).
- Linear SVM with hinge loss: convex.
- Lasso/ridge regression: convex.
- MLE for exponential family: log-likelihood is concave (so negative log-likelihood is convex).
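As a concrete check of the logistic-regression entry: the Hessian of the unregularised logistic log-loss with respect to the weights is $X^\top S X$ with $S = \operatorname{diag}(p \odot (1 - p))$ and $p = \sigma(Xw)$, which is positive semi-definite for every $w$. A small NumPy sketch:

```python
import numpy as np

def logistic_loss_hessian(X, w):
    """Hessian (w.r.t. w) of the unregularised logistic log-loss: X^T S X,
    S = diag(p * (1 - p)), p = sigmoid(X w). The labels drop out because
    sigma(z)(1 - sigma(z)) = sigma(-z)(1 - sigma(-z))."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ np.diag(p * (1.0 - p)) @ X

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
w = rng.standard_normal(4)
eigs = np.linalg.eigvalsh(logistic_loss_hessian(X, w))
print(eigs.min() >= -1e-10)  # True: PSD for any w, so the loss is convex in w
```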
Neural network training objectives are non-convex: the composition of layers destroys convexity, even when each layer is built from convex pieces, because the monotonicity required by the composition rule above fails. Despite this, SGD finds good solutions empirically; understanding why is one of the central theoretical questions of deep learning.
The 2019 Belkin-Hsu-Ma-Mandal "double descent" phenomenon and the empirical observation that overparameterised networks often interpolate the training data (zero training loss) without overfitting are signs that the classical convex-optimisation intuitions need substantial revision in the neural-network regime.
Related terms: Convex Optimisation, Gradient, Hessian
Discussed in:
- Chapter 3: Calculus