Abstract. An empirical demonstration that adaptive methods such as Adam can converge to solutions that generalise worse than those found by well-tuned SGD with momentum, particularly on image classification tasks, motivating caution in their use and the development of improved variants.
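To make the contrast concrete, here is a minimal sketch of the two update rules being compared: classical SGD with momentum, and Adam, whose per-coordinate rescaling by a second-moment estimate is the adaptivity at issue. The toy quadratic objective, learning rates, and iteration count are illustrative choices, not taken from the paper.

```python
import math

def sgd_momentum_step(x, v, grad, lr=0.1, beta=0.9):
    # Classical momentum: accumulate a velocity, then step along it.
    v = beta * v + grad
    x = x - lr * v
    return x, v

def adam_step(x, m, s, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first- and second-moment estimates; dividing
    # by sqrt(s_hat) rescales each coordinate's step adaptively.
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    x = x - lr * m_hat / (math.sqrt(s_hat) + eps)
    return x, m, s

# Toy problem: minimise f(x) = x^2, so grad = 2x.
x_sgd, v = 3.0, 0.0
x_adam, m, s = 3.0, 0.0, 0.0
for t in range(1, 51):
    x_sgd, v = sgd_momentum_step(x_sgd, v, 2 * x_sgd)
    x_adam, m, s = adam_step(x_adam, m, s, 2 * x_adam, t)

print(abs(x_sgd), abs(x_adam))  # both iterates move towards the minimum at 0
```

Both optimisers reduce the loss on this convex toy problem; the paper's point is that on non-convex problems such as deep image classifiers, the solutions they reach can differ in generalisation quality even when training loss is comparable.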
Tags: optimisation, adam, generalisation