References

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, & Furu Wei (2023)

arXiv:2307.08621.

URL: https://arxiv.org/abs/2307.08621

Abstract. Introduces the Retentive Network (RetNet) architecture. Reformulates the sequence layer as a retention operator that admits three mathematically equivalent computational forms: a parallel form for efficient training (resembling attention with a complex-exponential decay), a recurrent form for $O(1)$-memory inference, and a chunkwise form for long-sequence modelling. RetNet matches Transformer perplexity at scale while removing the $O(n^2)$ inference cost of attention. The architecture belongs to the cluster of "linear attention with structured decay" models that includes Mamba, RWKV, and Gated Linear Attention.
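
A minimal NumPy sketch (not the authors' code) of the parallel/recurrent equivalence for a single head with a real-valued decay $\gamma$; the complex rotation and multi-scale decays of the actual RetNet are omitted here.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: attention-like product with causal decay mask D[n, m] = gamma**(n-m)."""
    n = Q.shape[0]
    idx = np.arange(n)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: O(1)-memory state update S_t = gamma * S_{t-1} + K_t^T V_t, O_t = Q_t S_t."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((Q.shape[0], d_v))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

# Both forms produce the same outputs on random inputs.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```

The chunkwise form described in the paper interpolates between these two: it runs the parallel computation within fixed-size chunks and carries the recurrent state across chunk boundaries.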

Tags: sequence-models language-models architecture

Cited in:
