A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0, 1]$:
$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$
Geometrically, the chord between $(x, f(x))$ and $(y, f(y))$ lies on or above the graph of $f$. The function is strictly convex if the inequality is strict for all $x \neq y$ and $\lambda \in (0, 1)$.
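As a quick illustration, here is a minimal numerical sketch (assuming NumPy; the helper name `jensen_violated` is only for this example) that spot-checks the defining inequality along a single chord. Passing the check proves nothing, but any violation disproves convexity:

```python
import numpy as np

def jensen_violated(f, x, y, n_lambdas=100):
    """Return True if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y)
    for some sampled lam. A spot check along one chord: it cannot prove
    convexity, but a single violation disproves it."""
    for lam in np.linspace(0.0, 1.0, n_lambdas):
        lhs = f(lam * x + (1 - lam) * y)
        rhs = lam * f(x) + (1 - lam) * f(y)
        if lhs > rhs + 1e-9:  # small tolerance for floating-point error
            return True
    return False

x, y = np.array([1.0, -2.0]), np.array([3.0, 0.5])
convex_f = lambda v: float(v @ v)        # ||v||^2 is convex
concave_g = lambda v: -float(v @ v)      # -||v||^2 is concave, not convex
print(jensen_violated(convex_f, x, y))   # False: no violation found
print(jensen_violated(concave_g, x, y))  # True: the graph rises above the chord
```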
Equivalent characterisations (for differentiable $f$):
- First-order: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$; the tangent hyperplane (first-order Taylor approximation) is a global lower bound.
- Second-order (for twice-differentiable $f$): the Hessian $\nabla^2 f(x)$ is positive semi-definite for all $x$.
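The second-order condition lends itself to a simple numerical spot check: sample some points and verify that the smallest Hessian eigenvalue is non-negative. A minimal sketch, again assuming NumPy and a user-supplied Hessian function (a finite sample cannot certify convexity everywhere; it only fails to find a counterexample):

```python
import numpy as np

def hessian_psd_at(hess, points, tol=1e-8):
    """Second-order spot check: smallest Hessian eigenvalue >= -tol at each
    sampled point. `hess` maps a point x to the n x n Hessian matrix."""
    return all(np.linalg.eigvalsh(hess(x)).min() >= -tol for x in points)

# f(x) = 0.5 x^T A x with A positive definite has constant Hessian A,
# so the check passes at every sampled point.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
points = [np.random.randn(2) for _ in range(10)]
print(hessian_psd_at(lambda x: A, points))  # True
```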
Examples of convex functions:
- Affine: $f(x) = a^\top x + b$ (both convex and concave).
- Quadratic: $f(x) = \frac{1}{2} x^\top A x + b^\top x$ where $A$ is symmetric positive semi-definite.
- Exponential: $e^x$.
- Negative log: $-\log x$ on $x > 0$.
- Norms: $\|x\|_p$ for $p \geq 1$.
- Log-sum-exp: $\log \sum_i e^{x_i}$ (a numerical check follows this list).
- Indicator function of a convex set ($0$ on the set, $+\infty$ outside).
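For instance, the log-sum-exp entry can be checked via the second-order condition: its Hessian is $\operatorname{diag}(p) - p p^\top$ with $p = \operatorname{softmax}(x)$, which is positive semi-definite. A small NumPy sketch:

```python
import numpy as np

def logsumexp_hessian(x):
    """Hessian of log-sum-exp at x: diag(p) - p p^T with p = softmax(x)."""
    p = np.exp(x - x.max())      # shift for numerical stability
    p /= p.sum()
    return np.diag(p) - np.outer(p, p)

x = np.random.randn(5)
eigs = np.linalg.eigvalsh(logsumexp_hessian(x))
print(eigs.min() >= -1e-12)      # True: positive semi-definite (one eigenvalue is 0)
```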
Examples of non-convex functions:
- Neural network loss landscapes (in general).
- Trigonometric functions like $\sin$ (a one-chord counterexample follows this list).
- Products of convex functions (in general).
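A single violated chord is enough to disprove convexity. For $\sin$, one such chord:

```python
import numpy as np

# One violated chord disproves convexity of sin: take x = 0, y = pi, lam = 1/2.
x, y, lam = 0.0, np.pi, 0.5
lhs = np.sin(lam * x + (1 - lam) * y)          # sin(pi/2) = 1
rhs = lam * np.sin(x) + (1 - lam) * np.sin(y)  # 0.5 * (0 + 0) = 0
print(lhs <= rhs)                              # False: the defining inequality fails
```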
Convexity-preserving operations:
- Sum of convex functions is convex.
- Composition $f(g(x))$ is convex when $f$ is convex and non-decreasing and $g$ is convex.
- Maximum $\max_i f_i(x)$ of convex functions is convex.
- Affine substitution $f(Ax + b)$ inherits convexity from $f$.
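These rules compose. The sketch below (assuming NumPy; all names are illustrative) builds a piecewise-linear function as the maximum of affine pieces, convex by the max rule combined with affine substitution, and spot-checks the defining inequality along random chords:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # five affine pieces a_i^T x + b_i in R^3
b = rng.standard_normal(5)

def f(x):
    # Pointwise maximum of affine functions: convex by the max rule
    # combined with affine substitution.
    return float(np.max(A @ x + b))

# Spot-check the defining inequality along random chords.
for _ in range(1000):
    x, y, lam = rng.standard_normal(3), rng.standard_normal(3), rng.uniform()
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-9
print("no violations found")
```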
Why convexity matters:
Every local minimum is a global minimum, so any algorithm that converges to a stationary point, such as gradient descent with a suitable step size, has found a global optimum.
Strong convexity ($\nabla^2 f(x) \succeq \mu I$ for some $\mu > 0$ and all $x$) implies a unique minimiser and linear (geometric) convergence rates for gradient descent.
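As an illustration of the linear-rate claim, here is a minimal gradient-descent sketch on a strongly convex quadratic (assuming NumPy; the step size $1/L$ is one standard choice, not the only one):

```python
import numpy as np

# Gradient descent on the strongly convex quadratic f(x) = 0.5 x^T A x,
# where mu and L are the smallest/largest eigenvalues of A. With step 1/L
# the distance to the minimiser (the origin) shrinks by a constant factor
# per iteration, i.e. linear (geometric) convergence.
A = np.diag([1.0, 10.0])              # mu = 1, L = 10
L = np.linalg.eigvalsh(A).max()
x = np.array([5.0, -3.0])
for k in range(41):
    if k % 10 == 0:
        print(k, np.linalg.norm(x))   # decays roughly like (1 - mu/L)^k
    x = x - (1.0 / L) * (A @ x)       # gradient of f is A x
```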
Duality theory is exact: the primal and dual problems have the same optimal value (under mild constraint qualifications).
Many classical ML algorithms have convex losses:
- Linear regression (squared loss): convex.
- Logistic regression: convex (a Hessian check follows this list).
- Linear SVM with hinge loss: convex.
- Lasso/ridge regression: convex.
- MLE for exponential family: log-likelihood is concave (so negative log-likelihood is convex).
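As a concrete check of the logistic-regression entry: the Hessian of the unregularised logistic log-loss with respect to the weights is $X^\top S X$ with $S = \operatorname{diag}(p \odot (1 - p))$ and $p = \sigma(Xw)$, which is positive semi-definite for every $w$. A small NumPy sketch:

```python
import numpy as np

def logistic_loss_hessian(X, w):
    """Hessian (w.r.t. w) of the unregularised logistic log-loss: X^T S X,
    S = diag(p * (1 - p)), p = sigmoid(X w). The labels drop out because
    sigma(z)(1 - sigma(z)) = sigma(-z)(1 - sigma(-z))."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ np.diag(p * (1.0 - p)) @ X

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
w = rng.standard_normal(4)
eigs = np.linalg.eigvalsh(logistic_loss_hessian(X, w))
print(eigs.min() >= -1e-10)  # True: PSD for any w, so the loss is convex in w
```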
Neural network training objectives are non-convex: the composition of layers destroys convexity, even when each layer is built from convex pieces, because the monotonicity required by the composition rule above fails. Despite this, SGD finds good solutions empirically; understanding why is one of the central theoretical questions of deep learning.
The 2019 Belkin-Hsu-Ma-Mandal "double descent" phenomenon and the empirical observation that overparameterised networks often interpolate the training data (zero training loss) without overfitting are signs that the classical convex-optimisation intuitions need substantial revision in the neural-network regime.
Related terms: Convex Optimisation, Gradient, Hessian
Discussed in:
- Chapter 3: Calculus