Moritz Hardt, Benjamin Recht & Yoram Singer (2016). Train faster, generalize better: Stability of stochastic gradient descent.
In Proceedings of the 33rd International Conference on Machine Learning (ICML).
URL: https://arxiv.org/abs/1509.01240
Abstract. Provides one of the canonical theoretical analyses of why SGD generalises in the overparameterised regime. Uses algorithmic stability: a learning algorithm is uniformly stable if swapping out one training example changes the resulting model's loss at any test point only slightly, and stable algorithms generalise. Bounds the stability of SGD as a function of the step sizes and the number of passes over the data. This implies an implicit-regularisation effect: early stopping is itself a form of regularisation, even on convex objectives, because longer training reduces stability. The argument has been extended in many directions and undergirds much of the modern view that SGD-derived models inherit a flat-minimum bias.
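Notes. The paper's headline convex-case bound makes the learning-rate and pass-count dependence explicit: for a convex, L-Lipschitz, \beta-smooth loss and step sizes \alpha_t \le 2/\beta, SGD run for T steps on n examples is uniformly stable with

    \epsilon_{\mathrm{stab}} \le \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t

so the stability gap, and with it the generalisation gap it controls, grows with the sum of step sizes; stopping earlier or shrinking the learning rate tightens the bound.

A minimal sketch of the experiment the stability definition suggests (not code from the paper): run the same SGD trajectory on two datasets that differ in a single example and watch the parameter gap grow with the number of passes. The dataset, model, and hyperparameters below are illustrative assumptions.

    import numpy as np

    # Illustrative setup (not from the paper): logistic regression on synthetic data.
    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) > 0).astype(float)

    # Neighbouring dataset: replace a single training example.
    X2, y2 = X.copy(), y.copy()
    X2[0] = rng.normal(size=d)
    y2[0] = 1.0 - y2[0]

    def sgd(X, y, epochs, lr=0.1, seed=1):
        """SGD on the logistic loss; a fixed seed keeps the sample order
        identical across runs so only the swapped example differs."""
        w = np.zeros(X.shape[1])
        order = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in order.permutation(len(y)):
                p = 1.0 / (1.0 + np.exp(-X[i] @ w))  # logistic prediction
                w -= lr * (p - y[i]) * X[i]          # gradient of the log loss
        return w

    for epochs in (1, 5, 25):
        gap = np.linalg.norm(sgd(X, y, epochs) - sgd(X2, y2, epochs))
        print(f"epochs={epochs:2d}  parameter gap = {gap:.4f}")

With a constant step size the printed gap should grow with the number of epochs, mirroring the \sum_t \alpha_t dependence above and the early-stopping point in the abstract.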
Tags: optimisation generalisation theory