Glossary

Channel Capacity

The channel capacity of a noisy communication channel is the maximum mutual information between its input $X$ and output $Y$ achievable by any choice of input distribution:

$$C = \max_{p(x)} I(X; Y) = \max_{p(x)} \big[H(Y) - H(Y \mid X)\big],$$

measured in bits per channel use (or bits per second for continuous-time channels). Capacity is a property of the channel alone -- the conditional distribution $p(y \mid x)$ -- and depends on neither the source nor the code.
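The maximisation over $p(x)$ is rarely done by hand. For a discrete memoryless channel it can be computed numerically with the Blahut--Arimoto algorithm, which alternates between updating the posterior $q(x \mid y)$ and the input distribution until $I(X;Y)$ converges to $C$. A minimal sketch in Python/NumPy (function and variable names are our own, not from any library):

```python
import numpy as np

def blahut_arimoto(P, iters=200):
    """Capacity (bits per use) of a discrete memoryless channel.

    P[x, y] = p(y | x); each row must sum to 1, and every output
    symbol should be reachable from some input.
    """
    n_x = P.shape[0]
    r = np.full(n_x, 1.0 / n_x)            # input distribution, start uniform
    for _ in range(iters):
        joint = r[:, None] * P             # p(x, y)
        q = joint / joint.sum(axis=0)      # posterior q(x | y)
        # update r(x) proportional to exp( sum_y p(y|x) log q(x|y) )
        log_q = np.log(q, where=q > 0, out=np.zeros_like(q))
        s = (P * log_q).sum(axis=1)
        r = np.exp(s - s.max())            # subtract max for stability
        r /= r.sum()
    # evaluate I(X; Y) at the final input distribution
    p_y = (r[:, None] * P).sum(axis=0)
    ratio = np.divide(P, p_y, where=P > 0, out=np.ones_like(P))
    return (r[:, None] * P * np.log2(ratio)).sum(), r

# Sanity check: a BSC with crossover 0.11 should give C = 1 - H(0.11) ~ 0.5.
P = np.array([[0.89, 0.11],
              [0.11, 0.89]])
C, r_opt = blahut_arimoto(P)
print(f"C = {C:.4f} bits/use at input distribution {r_opt}")
```

For the symmetric channel above the uniform input is already optimal; for asymmetric channels such as the Z-channel, the same iteration converges to a non-uniform optimal input.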

Claude Shannon's noisy-channel coding theorem (1948) is the foundational result: for any rate $R < C$ and any target error probability, there exist codes of sufficiently large block length that meet it; conversely, for any rate $R > C$, no sequence of codes can drive the error probability to zero. The proof uses random coding on the achievability side and Fano's inequality on the converse. The theorem cleanly separates communication into two essentially independent problems -- source coding (compression down to the entropy $H$) and channel coding (error correction up to the capacity $C$) -- a separation principle that is asymptotically optimal even though joint design can be more efficient at finite block lengths.
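What the theorem rules in and out is easiest to see against a naive baseline. An $n$-fold repetition code on a binary symmetric channel can make the error probability as small as you like, but only by letting the rate $1/n$ collapse towards zero; Shannon instead guarantees vanishing error at any fixed rate below $C$. A small illustrative computation (a sketch with our own function name, majority decoding over odd $n$):

```python
from math import comb

def repetition_error(p: float, n: int) -> float:
    """Probability that majority decoding of an n-fold repetition
    code fails on a BSC with crossover probability p (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.11                   # BSC(0.11): capacity is about 0.5 bit per use
for n in (1, 3, 5, 11, 21):
    print(f"n={n:2d}  rate={1/n:.3f}  P(error)={repetition_error(p, n):.2e}")
```

The error decays exponentially in $n$ while the rate heads to zero; capacity-achieving codes keep the rate fixed (here, anywhere below roughly $0.5$) and still drive the error down.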

For specific channels, capacity has clean closed forms. The binary symmetric channel with crossover probability $p$ has $C = 1 - H(p)$ bits per use, where $H(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy. The binary erasure channel with erasure probability $\varepsilon$ has $C = 1 - \varepsilon$. The additive white Gaussian noise (AWGN) channel with bandwidth $W$ and signal-to-noise ratio $S/N$ has

$$C = W \log_2\!\left(1 + \frac{S}{N}\right) \quad \text{bits per second},$$

the celebrated Shannon--Hartley theorem, which sets the ultimate physical limits on every modem, wireless link, optical fibre and satellite channel ever deployed. For multi-antenna (MIMO) channels in rich scattering, capacity scales linearly with the minimum of the transmit and receive antenna counts, providing the theoretical foundation for 4G/5G wireless.
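Each of these closed forms is a one-liner to evaluate. A minimal sketch (the helper names are our own), ending with the textbook example of a 3 kHz telephone channel at 30 dB SNR:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Binary symmetric channel: C = 1 - H(p) bits per use."""
    return 1.0 - binary_entropy(p)

def bec_capacity(eps: float) -> float:
    """Binary erasure channel: C = 1 - eps bits per use."""
    return 1.0 - eps

def awgn_capacity(bandwidth_hz: float, snr: float) -> float:
    """Shannon-Hartley: C = W log2(1 + S/N) bits/s (snr is linear, not dB)."""
    return bandwidth_hz * math.log2(1.0 + snr)

print(f"BSC(0.11): {bsc_capacity(0.11):.3f} bit/use")     # ~0.500
print(f"BEC(0.5):  {bec_capacity(0.5):.3f} bit/use")      # 0.500
# 30 dB SNR corresponds to a linear S/N of 10**(30/10) = 1000:
print(f"AWGN 3 kHz @ 30 dB: {awgn_capacity(3e3, 1e3):,.0f} bit/s")  # ~29,902
```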

Designing capacity-achieving codes was a half-century quest. Early codes -- Hamming, BCH, Reed--Solomon, convolutional codes -- delivered useful but suboptimal performance. Turbo codes (Berrou, Glavieux and Thitimajshima, 1993) came within a fraction of a decibel of capacity at moderate block lengths via iterative decoding between two convolutional encoders. Low-density parity-check (LDPC) codes (Gallager 1962, rediscovered by MacKay and Luby in the late 1990s) are similarly near-capacity and underlie Wi-Fi 6, 5G data channels and DVB-S2. Polar codes (Arıkan, 2009) provably achieve capacity in the limit through a recursive channel-polarisation construction and are deployed in 5G control channels. Spatially coupled codes push closer still under iterative decoding.

Channel capacity also frames problems in machine learning and neuroscience. Mutual-information-based representation learning seeks features that maximise $I(\text{features}; \text{labels})$. The information bottleneck trades off compression of the input against prediction of the target (written as a Lagrangian below). Capacity likewise bounds the rate at which a neuron can transmit information about a stimulus, a bound central to theoretical neuroscience.
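To make the bottleneck trade-off concrete: for a compressed representation $T$ of the input $X$, the information-bottleneck objective (in the notation of Tishby, Pereira and Bialek) is

$$\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y),$$

where minimising $I(X; T)$ enforces compression and the multiplier $\beta$ sets how much predictive information about the target $Y$ must be kept.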

Related terms: Claude Shannon, Mutual Information, Entropy, Information Theory
