Moritz Hardt, Benjamin Recht & Yoram Singer (2016). Train faster, generalize better: Stability of stochastic gradient descent.
In Proceedings of the 33rd International Conference on Machine Learning (ICML).
URL: https://arxiv.org/abs/1509.01240
Abstract. Provides one of the canonical theoretical analyses of why SGD generalises in the overparameterised regime. Uses algorithmic stability: a learning algorithm is uniformly stable if swapping out one training example changes the resulting model's loss at any test point only slightly, and stable algorithms generalise. Bounds the stability of SGD as a function of the step sizes and the number of passes over the data. This implies an implicit-regularisation effect: early stopping is itself a form of regularisation, even on convex objectives, because longer training reduces stability. The argument has been extended in many directions and undergirds much of the modern view that SGD-derived models inherit a flat-minimum bias.
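Notes. The paper's headline convex-case bound makes the learning-rate and pass-count dependence explicit: for a convex, L-Lipschitz, \beta-smooth loss and step sizes \alpha_t \le 2/\beta, SGD run for T steps on n examples is uniformly stable with

    \epsilon_{\mathrm{stab}} \le \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t

so the stability gap, and with it the generalisation gap it controls, grows with the sum of step sizes; stopping earlier or shrinking the learning rate tightens the bound.

A minimal sketch of the experiment the stability definition suggests (not code from the paper): run the same SGD trajectory on two datasets that differ in a single example and watch the parameter gap grow with the number of passes. The dataset, model, and hyperparameters below are illustrative assumptions.

    import numpy as np

    # Illustrative setup (not from the paper): logistic regression on synthetic data.
    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) > 0).astype(float)

    # Neighbouring dataset: replace a single training example.
    X2, y2 = X.copy(), y.copy()
    X2[0] = rng.normal(size=d)
    y2[0] = 1.0 - y2[0]

    def sgd(X, y, epochs, lr=0.1, seed=1):
        """SGD on the logistic loss; a fixed seed keeps the sample order
        identical across runs so only the swapped example differs."""
        w = np.zeros(X.shape[1])
        order = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in order.permutation(len(y)):
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))  # logistic prediction
                w -= lr * (p - y[i]) * X[i]          # gradient of the log loss
        return w

    for epochs in (1, 5, 25):
        gap = np.linalg.norm(sgd(X, y, epochs) - sgd(X2, y2, epochs))
        print(f"epochs={epochs:2d}  parameter gap = {gap:.4f}")

With a constant step size the printed gap should grow with the number of epochs, mirroring the \sum_t \alpha_t dependence above and the early-stopping point in the abstract.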
Tags: optimisation generalisation theory