Glossary

Information Theory

Information Theory, founded by Claude Shannon in 1948, provides a mathematical framework for quantifying information, uncertainty, and the cost of communication. Its core concepts—entropy, cross-entropy, KL divergence, and mutual information—have become indispensable in machine learning, where they serve as the theoretical basis for loss functions, model comparison, and representation learning.

Mutual information $I(X; Y) = H(X) - H(X \mid Y) = D_{KL}(p(x,y) \parallel p(x)p(y))$ quantifies how much knowing $Y$ reduces uncertainty about $X$. Unlike correlation, it captures all forms of statistical dependence, not just linear ones. It is symmetric, non-negative, and zero if and only if $X$ and $Y$ are independent. Mutual information is used in feature selection, in the information bottleneck method for learning compressed representations, and in analysing deep networks—though estimating it in high dimensions is notoriously difficult.
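The two expressions for $I(X;Y)$ above can be checked numerically on a small discrete distribution. The sketch below uses a hypothetical 2×2 joint table (the values are illustrative, not from any dataset) and computes the mutual information both as $H(X) - H(X \mid Y)$ and as the KL divergence between the joint and the product of marginals:

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y) (illustrative values only).
p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])

p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X; Y) = H(X) - H(X | Y), with H(X | Y) = H(X, Y) - H(Y)
h_x_given_y = entropy(p_xy.ravel()) - entropy(p_y)
mi_entropy = entropy(p_x) - h_x_given_y

# Equivalently, I(X; Y) = D_KL( p(x, y) || p(x) p(y) )
indep = np.outer(p_x, p_y)
mask = p_xy > 0
mi_kl = np.sum(p_xy[mask] * np.log2(p_xy[mask] / indep[mask]))

print(mi_entropy, mi_kl)  # the two values agree
```

Both routes give the same positive number, since the off-diagonal mass makes $X$ and $Y$ dependent; replacing `p_xy` with `np.outer(p_x, p_y)` drives the mutual information to zero, matching the independence condition stated above.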

The data-processing inequality states that post-processing cannot increase mutual information—a fundamental constraint on what any learning algorithm can achieve. Information theory also connects to the geometry of probability distributions via the Fisher information matrix, which defines a Riemannian metric on parameter space; natural gradient descent follows the steepest descent direction in this metric. The deep connection between entropy, divergence, and geometry shows that information theory is not merely a collection of formulas but a coherent lens for understanding learning.
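The data-processing inequality can also be illustrated numerically. The sketch below builds a Markov chain $X \to Y \to Z$ from randomly drawn (hypothetical) channels and verifies that $I(X;Z) \le I(X;Y)$, i.e. that processing $Y$ into $Z$ cannot add information about $X$:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X; Y) in bits from a joint distribution table."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0  # marginals are positive wherever the joint is
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

rng = np.random.default_rng(0)

# Markov chain X -> Y -> Z with hypothetical random channels.
p_x = rng.dirichlet(np.ones(4))                  # p(x)
p_y_given_x = rng.dirichlet(np.ones(4), size=4)  # row i is p(y | x=i)
p_z_given_y = rng.dirichlet(np.ones(4), size=4)  # row j is p(z | y=j)

p_xy = p_x[:, None] * p_y_given_x  # joint p(x, y)
p_xz = p_xy @ p_z_given_y          # joint p(x, z): Z sees X only through Y

i_xy = mutual_information(p_xy)
i_xz = mutual_information(p_xz)
print(i_xz <= i_xy + 1e-12)  # True: post-processing cannot increase I(X; .)
```

The inequality holds for any choice of channels, not just this seed; changing the Dirichlet draws only changes how much information the second channel destroys.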

Related terms: Entropy, KL Divergence, Cross-Entropy

Also defined in: Textbook of AI