4.16 Summary
- Probability is a calculus for reasoning under uncertainty. Kolmogorov's three axioms generate everything else.
- Bayes' theorem updates beliefs in light of evidence. The base-rate fallacy and prosecutor's fallacy are concrete cautionary tales (a worked example follows this list).
- A small zoo of distributions (Bernoulli, Binomial, Poisson, univariate and multivariate Gaussian, Beta, Dirichlet, Gamma) covers most ML modelling needs. Many sit inside the exponential family, which gives conjugate priors and clean MLEs; the Beta-Binomial update is sketched below.
- Expectation is linear; variance is not. Conditioning gives the tower rule and the law of total variance (checked by simulation below). Higher moments matter when noise is heavy-tailed.
- Markov, Chebyshev, Jensen, and Hoeffding turn fuzzy notions of "with high probability" into concrete bounds, which are the foundation of generalisation theory (a Hoeffding check appears below).
- The LLN and CLT explain why empirical means converge and why so many empirical phenomena look Gaussian (see the simulation sketch below).
- The multivariate Gaussian is closed under linear maps, marginalisation, and conditioning (the conditioning formula is sketched below). These closures power Gaussian processes, Kalman filters, and PCA.
- Information theory gives the mathematical scaffolding for ML losses: cross-entropy is forward KL plus the entropy of the data distribution, which is constant with respect to the model; minimising NLL is maximum likelihood. A numeric check of the decomposition follows the list.
- Sampling (inverse-CDF, rejection, importance) is how probability becomes computation; an inverse-CDF sketch appears below. MCMC handles intractable normalisers (Chapter 14).
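To make the base-rate point concrete, here is a minimal Bayes'-theorem sketch. The prevalence, sensitivity, and false-positive rate are hypothetical numbers chosen for illustration, not figures from the chapter.

```python
# Base-rate fallacy: P(disease | positive test) via Bayes' theorem.
# All numbers below are hypothetical, chosen only for illustration.
prior = 0.01           # P(disease): 1% prevalence
sensitivity = 0.90     # P(positive | disease)
false_positive = 0.05  # P(positive | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive)
posterior = sensitivity * prior / evidence                     # P(disease | positive)
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.154, far below 0.90
```

Despite a 90%-sensitive test, the low base rate drags the posterior down to roughly 15%, which is exactly the trap the base-rate fallacy describes.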
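The conjugacy claim can be illustrated with the Beta-Binomial pair: a Beta(a, b) prior on a coin's bias, updated with k heads in n flips, is again a Beta. The prior pseudo-counts and data below are arbitrary illustrative values.

```python
# Beta-Binomial conjugacy: Beta(a, b) prior + Binomial likelihood -> Beta posterior.
a, b = 2.0, 2.0   # hypothetical prior pseudo-counts (heads, tails)
k, n = 7, 10      # hypothetical data: 7 heads in 10 flips

post_a, post_b = a + k, b + (n - k)          # posterior is Beta(9, 5)
posterior_mean = post_a / (post_a + post_b)  # 9/14 ~ 0.643
mle = k / n                                  # 0.7, the clean Binomial MLE
print(f"posterior Beta({post_a:.0f}, {post_b:.0f}), mean {posterior_mean:.3f}, MLE {mle:.1f}")
```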
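The law of total variance, Var(Y) = E[Var(Y | X)] + Var(E[Y | X]), can be checked by simulation on a small hierarchical model; the model below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hierarchical model (arbitrary choice): X ~ N(0, 1), Y | X ~ N(2X, 1).
x = rng.normal(0.0, 1.0, n)
y = rng.normal(2.0 * x, 1.0)

total_var = y.var()                 # Var(Y), should be close to 2^2 * 1 + 1 = 5
decomposed = 1.0 + (2.0 * x).var()  # E[Var(Y|X)] + Var(E[Y|X])
print(f"Var(Y) = {total_var:.3f}, E[Var(Y|X)] + Var(E[Y|X]) = {decomposed:.3f}")
```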
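As one example of a concrete bound, Hoeffding's inequality for i.i.d. variables in [0, 1] gives P(|x̄ - μ| ≥ t) ≤ 2 exp(-2nt²). The sketch below compares the bound with an empirical tail probability; the sample size, threshold, and Bernoulli(0.5) choice are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 100_000

# Bernoulli(0.5) samples lie in [0, 1], so Hoeffding's inequality applies directly.
means = rng.binomial(1, 0.5, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {hoeffding:.4f}")
```

The empirical tail sits well below the bound, as it must: concentration inequalities are guarantees, not tight estimates.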
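A short simulation illustrates both limits at once: sample means of a decidedly non-Gaussian distribution (uniform, an arbitrary choice) drift towards the true mean, and their standardised fluctuations look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# LLN: the running mean of uniform draws approaches the true mean 0.5.
x = rng.uniform(0.0, 1.0, 100_000)
running_mean = x.cumsum() / np.arange(1, x.size + 1)
print(f"mean after 100 draws: {running_mean[99]:.3f}, after 100k: {running_mean[-1]:.4f}")

# CLT: standardised sample means (n = 30 per mean) are approximately N(0, 1).
means = rng.uniform(0.0, 1.0, size=(50_000, 30)).mean(axis=1)
z = (means - 0.5) / (np.sqrt(1 / 12) / np.sqrt(30))  # Var of Uniform(0,1) is 1/12
print(f"fraction of |z| < 1.96: {np.mean(np.abs(z) < 1.96):.3f}")  # close to 0.95
```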
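The conditioning closure has a closed form: for a joint Gaussian over (x1, x2) with blocks μ1, μ2, Σ11, Σ12, Σ22, the conditional x1 | x2 is Gaussian with mean μ1 + Σ12 Σ22⁻¹ (x2 - μ2) and covariance Σ11 - Σ12 Σ22⁻¹ Σ21. A minimal sketch on an arbitrary two-dimensional example (scalar blocks, so no matrix inverses are needed):

```python
import numpy as np

# Joint Gaussian over (x1, x2): arbitrary illustrative parameters.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x2_observed = 2.0
# Gaussian conditioning formulas for x1 | x2.
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_observed - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
print(f"x1 | x2=2 is N({cond_mean:.2f}, {cond_var:.2f})")  # N(0.80, 1.36)
```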
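The decomposition H(p, q) = H(p) + KL(p ‖ q) can be verified numerically on any pair of discrete distributions; the two distributions below are arbitrary.

```python
import numpy as np

# Arbitrary discrete distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])   # data distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
forward_kl = np.sum(p * np.log(p / q))
print(f"H(p,q) = {cross_entropy:.4f}, H(p) + KL(p||q) = {entropy + forward_kl:.4f}")
```

Since H(p) does not depend on q, minimising cross-entropy in q is the same as minimising forward KL, which is why the cross-entropy loss and maximum likelihood coincide.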
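Inverse-CDF (inverse transform) sampling reduces to one line once the inverse CDF is known; for an Exponential(λ) it is F⁻¹(u) = -ln(1 - u)/λ. A minimal sketch, with λ = 2 as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(0.0, 1.0, 1_000_000)

# Inverse-CDF sampling for Exponential(lam): push uniforms through the inverse CDF.
samples = -np.log(1.0 - u) / lam
print(f"sample mean = {samples.mean():.4f}  (true mean 1/lam = {1 / lam})")
```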
The next chapter (Statistics) builds on these foundations: estimation theory, hypothesis testing, the bias-variance decomposition, and the bootstrap. By the end you will be ready for the supervised-learning machinery in Chapters 6 and 7.