Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, & Ion Stoica (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention.
Proceedings of the ACM Symposium on Operating Systems Principles (SOSP '23).
URL: https://arxiv.org/abs/2309.06180
Abstract. The vLLM paper. Identifies that naive Transformer serving wastes most of the GPU memory allocated to the KV cache because of internal fragmentation and conservative pre-allocation. Introduces PagedAttention, a virtual-memory-style scheme that stores KV blocks in non-contiguous physical memory, indexed by a per-request page table. PagedAttention enables fine-grained memory sharing across concurrent requests and dynamic growth of the cache. Combined with continuous batching, vLLM achieves 2-4× higher serving throughput than the best previous systems and is now the dominant open-source LLM-serving stack.
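To make the paging scheme concrete, here is a minimal Python sketch of PagedAttention-style block management. This is a hypothetical illustration, not vLLM's actual code or API: the `BlockAllocator` and `Sequence` classes and their methods are invented for exposition. It shows the core ideas the paper describes: fixed-size KV blocks, a per-request block table mapping logical positions to non-contiguous physical blocks, on-demand growth, and reference-counted sharing across sequences.

```python
# Hypothetical sketch of PagedAttention-style block management (not vLLM's
# actual implementation). The KV cache is split into fixed-size blocks; each
# request owns a block table mapping logical block index -> physical block,
# so physical blocks need not be contiguous and can be shared across requests.

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative granularity)

class BlockAllocator:
    """Free-list allocator over a fixed pool of physical KV blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = {}  # physical block -> number of sequences using it

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)

@dataclass
class Sequence:
    """Per-request page table: logical KV positions -> physical blocks."""
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> None:
        # Allocate a new physical block only when the last one fills up,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per request,
        # instead of pre-allocating for the maximum possible length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.alloc())
        self.num_tokens += 1

    def fork(self, allocator: BlockAllocator) -> "Sequence":
        # Fine-grained sharing (e.g. parallel sampling from one prompt):
        # the child reuses the parent's blocks by bumping refcounts. A real
        # system would copy-on-write the last, partially filled block before
        # either sequence appends to it.
        for block in self.block_table:
            allocator.share(block)
        return Sequence(block_table=list(self.block_table),
                        num_tokens=self.num_tokens)

# Usage: two sampling branches share their common prefix's blocks.
allocator = BlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):           # 40-token prompt -> ceil(40/16) = 3 blocks
    seq.append_token(allocator)
branch = seq.fork(allocator)  # prefix blocks now have refcount 2
```

The design point the sketch illustrates is that decoupling logical from physical KV layout lets the scheduler pack many requests into GPU memory at block granularity, which is what makes continuous batching at high occupancy feasible.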
Tags: inference serving language-models