Razvan Pascanu, Tomas Mikolov, & Yoshua Bengio (2013). "On the difficulty of training recurrent neural networks."
International Conference on Machine Learning.
URL: https://arxiv.org/abs/1211.5063
Abstract. A clean exposition of the vanishing- and exploding-gradient problem in recurrent networks, building on the earlier qualitative analysis of Bengio, Simard and Frasconi (1994). The product of step Jacobians $\prod_j J_j$, with $J_j = \partial h_j / \partial h_{j-1}$, governs how a perturbation propagates through time; in generic networks it grows or shrinks exponentially with sequence length. The paper proposes gradient norm clipping as the mitigation for the exploding case (and a soft regularisation constraint for the vanishing case), and shows that clipping stabilises RNN training on a range of sequence tasks. Gradient clipping remains a default tool in modern deep-learning libraries.
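A minimal NumPy sketch of the norm-clipping rule, assuming the gradient has been flattened into a single vector; the function name and threshold value are illustrative, not taken from the paper:

    import numpy as np

    def clip_gradient_norm(grad, threshold):
        # If the gradient norm exceeds the threshold, rescale the
        # gradient so its norm equals the threshold; otherwise pass
        # it through unchanged. This is the norm-clipping rule the
        # paper proposes for the exploding-gradient case.
        norm = np.linalg.norm(grad)
        if norm > threshold:
            grad = grad * (threshold / norm)
        return grad

    # Example: a gradient that has exploded to norm 100 is rescaled
    # to norm 5; its direction is preserved.
    g = np.full(4, 50.0)                          # ||g|| = 100
    print(clip_gradient_norm(g, threshold=5.0))   # each entry 2.5, ||g|| = 5

Rescaling rather than element-wise truncation preserves the gradient's direction, which is why the step remains a descent direction after clipping.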
Tags: rnn optimisation sequence-models
Cited in: