4.16 Summary

  • Probability is a calculus for reasoning under uncertainty. Kolmogorov's three axioms generate everything else.
  • Bayes' theorem updates beliefs in light of evidence. The base-rate fallacy and prosecutor's fallacy are concrete cautionary tales.
  • A small zoo of distributions (Bernoulli, Binomial, Poisson, uni- and multivariate Gaussian, Beta, Dirichlet, Gamma) covers most ML modelling needs. Many sit inside the exponential family, which gives conjugate priors and clean MLEs.
  • Expectation is linear; variance is not. Conditioning gives the tower rule and the law of total variance. Higher moments matter when noise is heavy-tailed.
  • Markov, Chebyshev, Jensen, and Hoeffding turn fuzzy notions of "with high probability" into concrete bounds, the foundation of generalisation theory.
  • The LLN and CLT explain why empirical means converge and why so many empirical phenomena look Gaussian.
  • The multivariate Gaussian is closed under linear maps, marginalisation, and conditioning. These closures power Gaussian processes, Kalman filters, and PCA.
  • Information theory gives the mathematical scaffolding for ML losses: cross-entropy is forward KL plus the data entropy, a constant in the model's parameters, so minimising NLL is maximum likelihood.
  • Sampling (inverse-CDF, rejection, importance) is how probability becomes computation. MCMC handles intractable normalisers (Chapter 14).
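The base-rate caution above can be made concrete with a quick calculation. This is a sketch with illustrative numbers (a 1% prevalence, 95% sensitivity and specificity; none of these figures come from the chapter):

```python
# Hedged sketch: a disease test with 1% prevalence, 95% sensitivity,
# and 95% specificity. All numbers are illustrative.
prior = 0.01          # P(disease)
sensitivity = 0.95    # P(positive | disease)
specificity = 0.95    # P(negative | no disease)

# P(positive) by the law of total probability
p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' theorem: P(disease | positive)
posterior = sensitivity * prior / p_pos
print(round(posterior, 3))  # 0.161: far below the test's 95% accuracy
```

Despite the "95% accurate" test, a positive result only raises the probability of disease to about 16%, which is exactly the trap the base-rate fallacy describes.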
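Conjugacy, mentioned in the distributions bullet, reduces Bayesian updating to arithmetic. A minimal Beta-Bernoulli sketch with made-up counts:

```python
# Beta-Bernoulli conjugacy: the posterior is the prior with observed
# counts added to its pseudo-counts. Numbers are illustrative.
alpha, beta = 2.0, 2.0        # Beta(2, 2) prior over the coin's bias
heads, tails = 7, 3           # observed Bernoulli outcomes

# Conjugate update: posterior is Beta(alpha + heads, beta + tails)
alpha_post, beta_post = alpha + heads, beta + tails

# The mean of Beta(a, b) is a / (a + b)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(round(posterior_mean, 3))  # 0.643: shrunk from the raw 7/10 toward the prior's 1/2
```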
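The concentration bullet can be illustrated with Hoeffding's inequality, which for i.i.d. samples bounded in [0, 1] states P(|empirical mean - true mean| >= t) <= 2·exp(-2nt²). A sketch (the accuracy and failure-probability targets are arbitrary choices, not from the chapter):

```python
import math

# Hoeffding's inequality for n i.i.d. samples bounded in [0, 1]:
# P(|empirical mean - true mean| >= t) <= 2 * exp(-2 * n * t**2)
def hoeffding_bound(n, t):
    return 2 * math.exp(-2 * n * t * t)

# How many samples to estimate a mean within 0.05,
# with failure probability at most 0.01?
n = 1
while hoeffding_bound(n, 0.05) > 0.01:
    n += 1
print(n)  # 1060
```

This is the basic move behind generalisation bounds: a desired accuracy and confidence level are turned into an explicit sample size.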
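The closure of Gaussians under conditioning is simple enough to compute by hand in two dimensions. A sketch with made-up parameters: for jointly Gaussian (X1, X2), the conditional X1 | X2 = x2 is Gaussian with mean mu1 + (cov12/var2)(x2 - mu2) and variance var1 - cov12²/var2.

```python
# Conditioning a bivariate Gaussian in closed form (illustrative numbers).
mu1, mu2 = 0.0, 0.0
var1, var2, cov12 = 1.0, 1.0, 0.8
x2 = 2.0  # the observed value of X2

cond_mean = mu1 + (cov12 / var2) * (x2 - mu2)
cond_var = var1 - cov12 ** 2 / var2
print(cond_mean, round(cond_var, 2))  # 1.6 0.36
```

Observing a correlated coordinate both shifts the mean and shrinks the variance; Gaussian-process regression and the Kalman filter iterate exactly this computation.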
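The cross-entropy identity in the information-theory bullet, H(p, q) = H(p) + KL(p || q), can be checked numerically on a small discrete distribution (the probabilities below are illustrative):

```python
import math

# Numeric check of H(p, q) = H(p) + KL(p || q) for discrete p, q.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

entropy = -sum(pi * math.log(pi) for pi in p)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))

print(abs(cross_entropy - (entropy + kl)) < 1e-12)  # True
```

Since H(p) does not depend on the model q, minimising cross-entropy and minimising forward KL pick out the same q.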
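Inverse-CDF sampling, the first method in the sampling bullet, can be sketched for the exponential distribution, whose CDF F(x) = 1 - exp(-rate·x) inverts in closed form (the rate and sample count below are arbitrary):

```python
import math
import random

# Inverse-CDF sampling for Exponential(rate):
# F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
def sample_exponential(rate, rng):
    u = rng.random()               # uniform on [0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to the true mean 1 / rate = 0.5
```

Pushing uniform noise through the inverse CDF is the simplest instance of "probability becomes computation"; rejection and importance sampling handle the distributions whose CDFs do not invert so neatly.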

The next chapter (Statistics) builds on these foundations: estimation theory, hypothesis testing, the bias-variance decomposition, and the bootstrap. By the end you will be ready for the supervised-learning machinery in Chapters 6 and 7.

This site is currently in Beta. Contact: Chris Paton

