Glossary

Convex Function

A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y$ and $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$

Geometrically, the chord joining $(x, f(x))$ and $(y, f(y))$ lies on or above the graph of $f$. The function is strictly convex if the inequality is strict whenever $x \neq y$ and $\lambda \in (0, 1)$.

Equivalent characterisations (for differentiable $f$):

  • First-order: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$: the first-order Taylor approximation (the tangent hyperplane) is a global underestimator of $f$.
  • Second-order (for twice-differentiable $f$): the Hessian $\nabla^2 f(x)$ is positive semi-definite for all $x$.
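The first-order condition can be spot-checked numerically. A minimal sketch, using log-sum-exp (whose gradient is the softmax) as the test function; the function names and the number of trial points are illustrative choices, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # log-sum-exp, a smooth convex function
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    # gradient of log-sum-exp is the softmax of x
    e = np.exp(x)
    return e / e.sum()

# First-order condition: f(y) >= f(x) + grad f(x)·(y - x) for all x, y.
ok = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    if f(y) < f(x) + grad_f(x) @ (y - x) - 1e-12:
        ok = False
        break
print(ok)  # True: the tangent plane never overshoots f
```

A random check like this can only falsify convexity, never prove it, but a single violating pair is enough to rule a function out.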

Examples of convex functions:

  • Affine: $f(x) = a^\top x + b$ (both convex and concave).
  • Quadratic: $f(x) = \frac{1}{2} x^\top A x + b^\top x$ where $A$ is positive semi-definite.
  • Exponential: $e^x$.
  • Negative log: $-\log x$ on $x > 0$.
  • Norms: $\|x\|_p$ for $p \geq 1$.
  • Log-sum-exp: $\log \sum_i e^{x_i}$.
  • Indicator of a convex set.
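The defining inequality gives a quick numerical test for the examples above. A sketch (the helper name `jensen_holds` and the sampling ranges are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def jensen_holds(f, sampler, trials=2000, tol=1e-9):
    """Spot-check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)."""
    for _ in range(trials):
        x, y = sampler(), sampler()
        lam = rng.uniform()
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + tol:
            return False  # found a violation: not convex
    return True

std = lambda: rng.normal()            # samples on the real line
pos = lambda: rng.uniform(0.1, 5.0)   # samples on x > 0

conv_exp = jensen_holds(np.exp, std)                   # e^x is convex
conv_neglog = jensen_holds(lambda x: -np.log(x), pos)  # -log x convex on x > 0
conv_sin = jensen_holds(np.sin, std)                   # sin is not convex
print(conv_exp, conv_neglog, conv_sin)
```

The first two checks pass, while random sampling readily finds a violating triple for $\sin$, confirming it is non-convex.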

Examples of non-convex functions:

  • Neural network loss landscapes (in general).
  • Trigonometric functions like $\sin$.
  • Products of convex functions (in general).

Convexity-preserving operations:

  • Sum of convex functions is convex.
  • Composition $f(g(x))$ is convex when $f$ is convex and nondecreasing and $g$ is convex.
  • Maximum $\max_i f_i(x)$ of convex functions is convex.
  • Affine substitution $f(Ax + b)$ inherits convexity from $f$.
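These rules compose: a function built from max, sum, and affine substitution of convex pieces is convex without any further calculation. A minimal sketch that assembles such a function and spot-checks the defining inequality (the specific matrix sizes and trial counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.normal(size=(3, 3))
b = rng.normal(size=3)

def f(x):
    z = A @ x + b                 # affine substitution preserves convexity
    return np.max(z) + 0.5 * z @ z  # max of affine pieces + convex quadratic, summed

# Spot-check convexity of the composite on random points.
ok = all(
    f(l * x + (1 - l) * y) <= l * f(x) + (1 - l) * f(y) + 1e-9
    for x, y, l in (
        (rng.normal(size=3), rng.normal(size=3), rng.uniform())
        for _ in range(1000)
    )
)
print(ok)  # True: the composition rules guarantee convexity
```

This is the everyday workflow of disciplined convex modelling: verify convexity by construction rather than by analysing the final expression directly.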

Why convexity matters:

Every local minimum is a global minimum, so there are no spurious local minima to get trapped in: any descent method that converges to a stationary point has found a global optimum.

Strong convexity ($\nabla^2 f \succeq \mu I$ for some $\mu > 0$) implies linear convergence rates for gradient descent and uniqueness of the optimum.
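The linear (geometric) rate is easy to see on a strongly convex quadratic. A sketch, assuming $f(x) = \tfrac{1}{2} x^\top Q x$ with eigenvalues in $[\mu, L]$ and the standard $1/L$ step size (the specific eigenvalues are illustrative):

```python
import numpy as np

# f(x) = 0.5 x^T Q x with eigenvalues mu = 1, L = 10.
Q = np.diag([1.0, 10.0])
x = np.array([5.0, 5.0])
step = 1.0 / 10.0               # classic 1/L step size

errs = []
for _ in range(50):
    x = x - step * (Q @ x)      # gradient step: grad f(x) = Q x
    errs.append(np.linalg.norm(x))

# Distance to the optimum (the origin) shrinks by a constant
# factor each iteration: a linear, i.e. geometric, rate.
contracting = all(errs[i + 1] / errs[i] < 1 - 1e-3 for i in range(10, 20))
print(contracting, errs[-1])
```

After 50 steps the error has contracted by roughly $(1 - \mu/L)^{50}$, matching the textbook rate for gradient descent on strongly convex functions.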

Duality theory is exact: the primal and dual problems have the same optimal value (under mild constraint qualifications).

Many classical ML algorithms have convex losses:

  • Linear regression: squared error.
  • Logistic regression: log loss (cross-entropy).
  • Support vector machines: hinge loss, plus a convex regulariser.
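The logistic-regression case can be checked directly: the loss $\log(1 + e^{-y\, w^\top x})$ has Hessian $\sigma(1-\sigma)\, x x^\top$, which is positive semi-definite. A sketch spot-checking Jensen's inequality on the averaged loss (the data shapes and trial counts are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 20 points in 4 dimensions, labels in {-1, +1}.
X = rng.normal(size=(20, 4))
y = rng.choice([-1.0, 1.0], size=20)

def loss(w):
    # Average logistic loss over the dataset; convex in w.
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

ok = all(
    loss(l * u + (1 - l) * v) <= l * loss(u) + (1 - l) * loss(v) + 1e-9
    for u, v, l in (
        (rng.normal(size=4), rng.normal(size=4), rng.uniform())
        for _ in range(500)
    )
)
print(ok)  # True: the logistic loss is convex in the weights
```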

Neural networks are non-convex: the composition of layers destroys convexity even when each layer's activation is convex. Despite this, SGD finds good solutions empirically; understanding why is one of the central theoretical questions of deep learning.

The "double descent" phenomenon (Belkin, Hsu, Ma, and Mandal, 2019) and the empirical observation that overparameterised networks often interpolate the training data (zero training loss) without overfitting are signs that classical convex-optimisation intuitions need substantial revision in the neural-network regime.

Related terms: Convex Optimisation, Gradient, Hessian
