Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, & Ion Stoica (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP '23).
URL: https://arxiv.org/abs/2309.06180
Abstract. The vLLM paper. Identifies that naive Transformer serving wastes most of the GPU memory allocated to the KV cache because of internal fragmentation and conservative pre-allocation. Introduces PagedAttention, a virtual-memory-style scheme that stores KV blocks in non-contiguous physical memory, indexed by a per-request page table. PagedAttention enables fine-grained memory sharing across concurrent requests and dynamic growth of the cache. Combined with continuous batching, vLLM achieves 2-4× higher serving throughput than the best previous systems and is now the dominant open-source LLM-serving stack.
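To make the paging scheme concrete, here is a minimal Python sketch of PagedAttention-style block management. This is a hypothetical illustration, not vLLM's actual code or API: the `BlockAllocator` and `Sequence` classes and their methods are invented for exposition. It shows the core ideas the paper describes: fixed-size KV blocks, a per-request block table mapping logical positions to non-contiguous physical blocks, on-demand growth, and reference-counted sharing across sequences.

```python
# Hypothetical sketch of PagedAttention-style block management (not vLLM's
# actual implementation). The KV cache is split into fixed-size blocks; each
# request owns a block table mapping logical block index -> physical block,
# so physical blocks need not be contiguous and can be shared across requests.

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative granularity)

class BlockAllocator:
    """Free-list allocator over a fixed pool of physical KV blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}  # physical block -> number of sequences using it

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

@dataclass
class Sequence:
    """Per-request page table: logical KV positions -> physical blocks."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # Allocate a new physical block only when the last one fills up,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per request,
        # instead of pre-allocating for the maximum possible length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.alloc())
        self.num_tokens += 1

    def fork(self, allocator: BlockAllocator) -> "Sequence":
        # Fine-grained sharing (e.g. parallel sampling from one prompt):
        # the child reuses the parent's blocks by bumping refcounts. A real
        # system would copy-on-write the last, partially filled block before
        # either sequence appends to it.
        for block in self.block_table:
            allocator.share(block)
        return Sequence(block_table=list(self.block_table),
                        num_tokens=self.num_tokens)

# Usage: two sampling branches share their common prefix's blocks.
allocator = BlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):           # 40-token prompt -> ceil(40/16) = 3 blocks
    seq.append_token(allocator)
branch = seq.fork(allocator)  # prefix blocks now have refcount 2
```

The design point the sketch illustrates is that decoupling logical from physical KV layout lets the scheduler pack many requests into GPU memory at block granularity, which is what makes continuous batching at high occupancy feasible.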
Tags: inference serving language-models