Further reading
- Goodfellow, Bengio and Courville (2016), Deep Learning, Chapter 10, recurrent and recursive networks.
- Jurafsky and Martin (forthcoming third edition), Speech and Language Processing, chapters on language models, embeddings, RNNs, and attention.
- Olah (2015), "Understanding LSTM Networks", the canonical visual exposition.
- Karpathy (2015), "The Unreasonable Effectiveness of Recurrent Neural Networks", the min-char-rnn blog post.
- Bahdanau, Cho and Bengio (2015), the original attention paper.
- Sutskever, Vinyals and Le (2014), sequence-to-sequence learning.
- Hochreiter and Schmidhuber (1997), the LSTM paper.
- Mikolov et al. (2013), Pennington et al. (2014), and Bojanowski et al. (2017), the word-embedding triumvirate: word2vec, GloVe, and fastText.
- Sennrich, Haddow and Birch (2016), byte-pair encoding for neural machine translation.
- Graves et al. (2006), Connectionist Temporal Classification.
- Holtzman et al. (2019), nucleus sampling.
- Vaswani et al. (2017), "Attention Is All You Need" (preview of Chapter 13).
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).