4.16 Summary
- Probability is a calculus for reasoning under uncertainty. Kolmogorov's three axioms generate everything else.
- Bayes' theorem updates beliefs in light of evidence. The base-rate fallacy and prosecutor's fallacy are concrete cautionary tales (a worked example follows this list).
- A small zoo of distributions (Bernoulli, Binomial, Poisson, univariate and multivariate Gaussian, Beta, Dirichlet, Gamma) covers most ML modelling needs. Many sit inside the exponential family, which gives conjugate priors and clean MLEs; the Beta-Binomial update is sketched below.
- Expectation is linear; variance is not. Conditioning gives the tower rule and the law of total variance (checked by simulation below). Higher moments matter when noise is heavy-tailed.
- Markov, Chebyshev, Jensen, and Hoeffding turn fuzzy notions of "with high probability" into concrete bounds, which are the foundation of generalisation theory (a Hoeffding check appears below).
- The LLN and CLT explain why empirical means converge and why so many empirical phenomena look Gaussian (see the simulation sketch below).
- The multivariate Gaussian is closed under linear maps, marginalisation, and conditioning (the conditioning formula is sketched below). These closures power Gaussian processes, Kalman filters, and PCA.
- Information theory gives the mathematical scaffolding for ML losses: cross-entropy is forward KL plus the entropy of the data distribution, which is constant with respect to the model; minimising NLL is maximum likelihood. A numeric check of the decomposition follows the list.
- Sampling (inverse-CDF, rejection, importance) is how probability becomes computation; an inverse-CDF sketch appears below. MCMC handles intractable normalisers (Chapter 14).
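To make the base-rate point concrete, here is a minimal Bayes'-theorem sketch. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen for illustration, not figures from the chapter.

```python
# Base-rate fallacy: P(disease | positive test) via Bayes' theorem.
# All numbers below are hypothetical, chosen only for illustration.
prior = 0.01           # P(disease): 1% prevalence
sensitivity = 0.90     # P(positive | disease)
false_positive = 0.05  # P(positive | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive)
posterior = sensitivity * prior / evidence                     # P(disease | positive)
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.154, far below 0.90
```

Despite a 90%-sensitive test, the low base rate drags the posterior down to roughly 15%, which is exactly the trap the base-rate fallacy describes.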
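The conjugacy claim can be illustrated with the Beta-Binomial pair: a Beta(a, b) prior on a coin's bias, updated with k heads in n flips, is again a Beta. The prior pseudo-counts and data below are arbitrary illustrative values.

```python
# Beta-Binomial conjugacy: Beta(a, b) prior + Binomial likelihood -> Beta posterior.
a, b = 2.0, 2.0   # hypothetical prior pseudo-counts (heads, tails)
k, n = 7, 10      # hypothetical data: 7 heads in 10 flips

post_a, post_b = a + k, b + (n - k)          # posterior is Beta(9, 5)
posterior_mean = post_a / (post_a + post_b)  # 9/14 ~ 0.643
mle = k / n                                  # 0.7, the clean Binomial MLE
print(f"posterior Beta({post_a:.0f}, {post_b:.0f}), mean {posterior_mean:.3f}, MLE {mle:.1f}")
```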
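The law of total variance, Var(Y) = E[Var(Y | X)] + Var(E[Y | X]), can be checked by simulation on a small hierarchical model; the model below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hierarchical model (arbitrary choice): X ~ N(0, 1), Y | X ~ N(2X, 1).
x = rng.normal(0.0, 1.0, n)
y = rng.normal(2.0 * x, 1.0)

total_var = y.var()                 # Var(Y), should be close to 2^2 * 1 + 1 = 5
decomposed = 1.0 + (2.0 * x).var()  # E[Var(Y|X)] + Var(E[Y|X])
print(f"Var(Y) = {total_var:.3f}, E[Var(Y|X)] + Var(E[Y|X]) = {decomposed:.3f}")
```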
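As one example of a concrete bound, Hoeffding's inequality for i.i.d. variables in [0, 1] gives P(|x̄ - μ| ≥ t) ≤ 2 exp(-2nt²). The sketch below compares the bound with an empirical tail probability; the sample size, threshold, and Bernoulli(0.5) choice are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000

# Bernoulli(0.5) samples lie in [0, 1], so Hoeffding's inequality applies directly.
means = rng.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")
```

The empirical tail sits well below the bound, as it must: concentration inequalities are guarantees, not tight estimates.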
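A short simulation illustrates both limits at once: sample means of a decidedly non-Gaussian distribution (uniform, an arbitrary choice) drift towards the true mean, and their standardised fluctuations look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# LLN: the running mean of uniform draws approaches the true mean 0.5.
x = rng.uniform(0.0, 1.0, 100_000)
running_mean = x.cumsum() / np.arange(1, x.size + 1)
print(f"mean after 100 draws: {running_mean[99]:.3f}, after 100k: {running_mean[-1]:.4f}")

# CLT: standardised sample means (n = 30 per mean) are approximately N(0, 1).
means = rng.uniform(0.0, 1.0, size=(50_000, 30)).mean(axis=1)
z = (means - 0.5) / (np.sqrt(1 / 12) / np.sqrt(30))  # Var of Uniform(0,1) is 1/12
print(f"fraction of |z| < 1.96: {np.mean(np.abs(z) < 1.96):.3f}")  # close to 0.95
```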
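The conditioning closure has a closed form: for a joint Gaussian over (x1, x2) with blocks μ1, μ2, Σ11, Σ12, Σ22, the conditional x1 | x2 is Gaussian with mean μ1 + Σ12 Σ22⁻¹ (x2 - μ2) and covariance Σ11 - Σ12 Σ22⁻¹ Σ21. A minimal sketch on an arbitrary two-dimensional example (scalar blocks, so no matrix inverses are needed):

```python
import numpy as np

# Joint Gaussian over (x1, x2): arbitrary illustrative parameters.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x2_observed = 2.0
# Gaussian conditioning formulas for x1 | x2.
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_observed - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(f"x1 | x2=2 is N({cond_mean:.2f}, {cond_var:.2f})")  # N(0.80, 1.36)
```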
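The decomposition H(p, q) = H(p) + KL(p ‖ q) can be verified numerically on any pair of discrete distributions; the two distributions below are arbitrary.

```python
import numpy as np

# Arbitrary discrete distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])   # data distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
forward_kl = np.sum(p * np.log(p / q))
print(f"H(p,q) = {cross_entropy:.4f}, H(p) + KL(p||q) = {entropy + forward_kl:.4f}")
```

Since H(p) does not depend on q, minimising cross-entropy in q is the same as minimising forward KL, which is why the cross-entropy loss and maximum likelihood coincide.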
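Inverse-CDF (inverse transform) sampling reduces to one line once the inverse CDF is known; for an Exponential(λ) it is F⁻¹(u) = -ln(1 - u)/λ. A minimal sketch, with λ = 2 as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(0.0, 1.0, 1_000_000)

# Inverse-CDF sampling for Exponential(lam): push uniforms through the inverse CDF.
samples = -np.log(1.0 - u) / lam
print(f"sample mean = {samples.mean():.4f}  (true mean 1/lam = {1 / lam})")
```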
The next chapter (Statistics) builds on these foundations: estimation theory, hypothesis testing, the bias-variance decomposition, and the bootstrap. By the end you will be ready for the supervised-learning machinery in Chapters 6 and 7.