Yaniv Leviathan, Matan Kalman, & Yossi Matias (2023)
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2211.17192
Abstract. Introduces speculative decoding, an inference technique that accelerates autoregressive Transformer decoding without changing the output distribution. A small draft model cheaply proposes $k$ candidate tokens; the large target model scores all $k$ positions in a single parallel forward pass and accepts each token via a rejection-sampling rule that provably preserves the target distribution. Accepted tokens come essentially for free; on the first rejection, a replacement token is sampled from an adjusted residual distribution, so every target forward pass yields at least one new token. In practice a 2-3× wall-clock speedup is typical, more when the draft and target distributions agree closely. Speculative decoding has become a standard production-inference technique.
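A minimal NumPy sketch of one speculative step, assuming the target and draft distributions at each position are already computed; the function name and array shapes are illustrative, not the paper's code. Each draft token $x$ is accepted with probability $\min(1, p(x)/q(x))$; on rejection, a token is drawn from the normalized residual $\max(0, p - q)$.

```python
import numpy as np

def speculative_step(target_probs, draft_probs, draft_tokens, rng):
    """One speculative decoding step (sketch).

    target_probs: (k+1, V) target distributions at each draft position
    draft_probs:  (k, V)   draft distributions used to sample draft_tokens
    draft_tokens: (k,)     tokens proposed by the draft model
    Returns the accepted tokens plus one token sampled from the target.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept x with probability min(1, p(x)/q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(int(x))
        else:
            # Rejection: sample from the residual max(0, p - q), normalized.
            # This correction is what preserves the target distribution.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted  # stop at the first rejection
    # All k drafts accepted: sample a bonus token from the (k+1)-th
    # target distribution, so the step yields k+1 tokens in total.
    p_last = target_probs[len(draft_tokens)]
    accepted.append(int(rng.choice(len(p_last), p=p_last)))
    return accepted
```

Note that the step always returns between 1 and $k+1$ tokens, which is why the expected tokens-per-target-pass, and hence the speedup, grows with the draft/target agreement rate.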
Tags: inference language-models
Cited in: