Visualisation

Maximum likelihood: peak of the likelihood curve

Last reviewed 5 May 2026

Sweep the parameter, plot the likelihood, take the maximum.

From the chapter: Chapter 5: Statistics

Glossary: maximum likelihood estimation

Transcript

We have data drawn from a Gaussian with unknown mean. We want to estimate that mean.

The likelihood function: fix a candidate mean, then compute the probability of having seen the data under that hypothesis. Sweep the candidate from low to high.

For very low candidates, the data look unlikely. Likelihood is small.

For very high candidates, same. Likelihood is small.

Somewhere in the middle, the data are most consistent with the parameter. The likelihood peaks. The peak is the maximum likelihood estimate.

For a Gaussian's mean, the MLE is just the sample mean. For other distributions, it can be more involved.
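The sweep can be sketched in a few lines of NumPy. This is a minimal illustration, not a library routine: the synthetic data, the grid of candidates, and the known standard deviation are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=50)  # synthetic Gaussian data, true mean 3.0

def likelihood(mu, x, sigma=1.0):
    # Probability density of the whole dataset under candidate mean mu:
    # product of per-point Gaussian densities (independence assumed).
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2))

candidates = np.linspace(0.0, 6.0, 601)               # sweep the candidate mean
curve = np.array([likelihood(m, data) for m in candidates])
mle = candidates[np.argmax(curve)]                    # the peak of the likelihood curve

print(mle, data.mean())  # the peak sits at (essentially) the sample mean
```

Plotting `curve` against `candidates` gives exactly the picture described above: small at both ends, peaked in the middle, with the peak at the sample mean.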

Why work with the log instead? Likelihoods multiply over independent data points; their logs add. The log-likelihood is numerically stable and easier to differentiate.

Take the log-likelihood, set its derivative to zero, solve. The MLE.
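Written out for the Gaussian mean with known $\sigma$, that recipe looks like this:

$$\ell(\mu) = \sum_{i=1}^{n}\left[-\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\right]$$

$$\frac{d\ell}{d\mu} = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2} = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The derivative vanishes exactly at the sample mean, which is why the peak of the curve lands there.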

Most of statistics is built on this trick. Logistic regression, neural networks trained with cross-entropy, hidden Markov models, all are MLE in disguise. Even least-squares regression is the MLE under Gaussian noise.
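The least-squares connection can be checked directly. Under Gaussian noise with fixed variance, the log-likelihood of a fitted line is a negative constant times the sum of squared errors, so the two objectives pick the same parameter. A small sketch (the line, the noise level, and the grid of slopes are all assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + rng.normal(0.0, 0.5, size=100)   # a line with Gaussian noise, true slope 2.0

slopes = np.linspace(0.0, 4.0, 401)
# Least-squares objective: sum of squared residuals for each candidate slope
sse = np.array([np.sum((y - a * x) ** 2) for a in slopes])
# Gaussian log-likelihood with the additive constant dropped
loglik = np.array([np.sum(-(y - a * x) ** 2 / (2 * 0.5 ** 2)) for a in slopes])

# Minimizing squared error and maximizing the log-likelihood pick the same slope
print(slopes[np.argmin(sse)], slopes[np.argmax(loglik)])
```

Because `loglik` is just `-sse` rescaled, the argmin of one is the argmax of the other, term for term.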

The likelihood function turns "what value best explains the data" into a numerical optimisation.

Contact: Chris Paton

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).