References

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stanislaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, & Rui-Jie Zhu (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023.

URL: https://arxiv.org/abs/2305.13048

Abstract. Introduces RWKV (Receptance Weighted Key Value), a community-developed recurrent architecture that admits a parallelisable training form and an RNN-like inference form. The core layer is a hand-designed mixture of a linear-attention-style channel (the WKV recurrence) and a token-shift channel. RWKV scales to 14B parameters, matches contemporary GPT-class Transformers on language-modelling benchmarks, and offers $O(1)$-memory inference per token. The architecture has been among the most heavily community-iterated non-attention models of the post-Transformer era, reaching its sixth version (RWKV-6) by 2024.
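The RNN-mode WKV recurrence and the token-shift interpolation mentioned above can be summarised in a few lines. The sketch below is a simplified, numerically naive NumPy rendering of the RWKV-4-style recurrence: the parameter names (`w`, `u`, `mu`) follow the paper, but the random parameters, the dimensions, and the identity stand-ins for the learned key/value projections are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def wkv_recurrent_step(state, k_t, v_t, w, u):
    """One RNN-mode step of a simplified RWKV-4-style WKV recurrence.

    state = (a, b): running (decayed) weighted sums of values and of weights,
    so memory is constant in sequence length.
    w: per-channel positive decay; u: per-channel bonus for the current token.
    """
    a, b = state
    # Output mixes the accumulated history with the current token.
    wkv = (a + np.exp(u + k_t) * v_t) / (b + np.exp(u + k_t))
    # Decay the history and fold in the current key/value.
    a = np.exp(-w) * a + np.exp(k_t) * v_t
    b = np.exp(-w) * b + np.exp(k_t)
    return wkv, (a, b)

def token_shift(x_t, x_prev, mu):
    """Token-shift channel: per-channel interpolation between the current
    and previous token representations."""
    return mu * x_t + (1.0 - mu) * x_prev

# Tiny streaming usage example with random per-channel parameters (illustrative only).
d = 8
rng = np.random.default_rng(0)
w, u, mu = rng.uniform(0.1, 1.0, d), rng.normal(size=d), rng.uniform(0.0, 1.0, d)
state = (np.zeros(d), np.zeros(d))
x_prev = np.zeros(d)
for _ in range(5):                      # tokens arrive one at a time
    x_t = rng.normal(size=d)
    xs = token_shift(x_t, x_prev, mu)   # shifted input feeds the key/value projections
    k_t, v_t = xs, xs                   # stand-ins for the learned projections
    out, state = wkv_recurrent_step(state, k_t, v_t, w, u)
    x_prev = x_t
```

The point of the recurrence is that the state `(a, b)` is a fixed-size pair of vectors regardless of context length, which is what underlies the $O(1)$-memory inference claim; the official implementation additionally carries a running-maximum term so the exponentials stay numerically stable, which this sketch omits.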

Tags: sequence-models rnn language-models

Cited in:
