Razvan Pascanu, Tomas Mikolov, & Yoshua Bengio (2013). "On the difficulty of training recurrent neural networks."
International Conference on Machine Learning.
URL: https://arxiv.org/abs/1211.5063
Abstract. A clean exposition of the vanishing- and exploding-gradient problem in recurrent networks, building on the earlier qualitative analysis of Bengio, Simard and Frasconi (1994). The product of step Jacobians $\prod_j J_j$, with $J_j = \partial h_j / \partial h_{j-1}$, governs how a perturbation propagates through time; in generic networks it grows or shrinks exponentially with sequence length. The paper proposes gradient norm clipping as the mitigation for the exploding case (and a soft regularisation constraint for the vanishing case), and shows that clipping stabilises RNN training on a range of sequence tasks. Gradient clipping remains a default tool in modern deep-learning libraries.
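A minimal NumPy sketch of the norm-clipping rule, assuming the gradient has been flattened into a single vector; the function name and threshold value are illustrative, not taken from the paper:

    import numpy as np

    def clip_gradient_norm(grad, threshold):
        # If the gradient norm exceeds the threshold, rescale the
        # gradient so its norm equals the threshold; otherwise pass
        # it through unchanged. This is the norm-clipping rule the
        # paper proposes for the exploding-gradient case.
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad

    # Example: a gradient that has exploded to norm 100 is rescaled
    # to norm 5; its direction is preserved.
    g = np.full(4, 50.0)                          # ||g|| = 100
    print(clip_gradient_norm(g, threshold=5.0))   # each entry 2.5, ||g|| = 5

Rescaling rather than element-wise truncation preserves the gradient's direction, which is why the step remains a descent direction after clipping.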
Tags: rnn optimisation sequence-models
Cited in: