PagedAttention: what the vLLM paper actually says
Reading Kwon et al. (UC Berkeley, SOSP 2023) while debugging why our serving throughput plateaued at 12 concurrent requests despite 40 GB of GPU memory remaining.
We were running LLaMA-13B on A100s and the batch size ceiling made no sense. GPU memory utilization sat at 70%, but adding the 13th concurrent request triggered OOM. We hadn't hit the model size limit — we'd hit the KV cache limit. And the reason for that limit, it turned out, was that our serving framework was wasting the majority of the memory it claimed to be using.
The paper that explained this — and fixed it — is "Efficient Memory Management for Large Language Model Serving with PagedAttention," Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, and Stoica from UC Berkeley, presented at SOSP 2023. The contribution is simultaneously obvious in retrospect and genuinely non-trivial in execution: apply OS virtual memory paging to the KV cache. The result, vLLM, became the de facto standard for production LLM serving within months of publication, which is fast for a systems paper.
The problem the paper is actually solving
To understand why KV cache memory management matters, you need to understand what's actually in the KV cache and why its growth is unpredictable.
During autoregressive decoding, the transformer computes key and value vectors for every token at every attention layer. Rather than recomputing these on every forward pass, they're cached. For a 13B parameter model with 40 attention layers, 40 KV heads, and 128 head dimension, the KV cache for a single token occupies:
2 (key + value) × 40 layers × 40 heads × 128 dims × 2 bytes (FP16)
= ~819 KB per token
A single 2,000-token LLaMA-13B sequence occupies roughly 1.7 GB of KV cache. Scale that to batches of dozens of concurrent requests and the memory pressure is immediate.
Here's the problem the paper targets: you don't know how long each request will be before it finishes. The request comes in with a 512-token prompt; the completion might be 10 tokens or 2,000. Standard serving frameworks handle this by pre-allocating a contiguous chunk of memory large enough for the maximum possible output length — because the KV cache must be contiguous in memory for standard attention kernels to work efficiently.
The result is predictable: if you allocate for max_length = 2048 and the average completion is 300 tokens, you've reserved 5× more KV cache memory than the request actually uses. The paper quantifies this: existing systems waste 60–80% of their KV cache memory through fragmentation (gaps between non-uniform-length sequences) and over-reservation (reserved-but-unused space within sequences). That's the ceiling on your batch size, and it's not coming from the model or the hardware — it's coming from the allocator.
The paging mechanism
PagedAttention's solution is drawn directly from OS virtual memory design.
In virtual memory, a process sees a contiguous address space that the OS maps to non-contiguous physical pages. The mapping is maintained in a page table: a data structure translating virtual page numbers to physical frame numbers. Physical memory is divided into fixed-size frames; pages are the same size; memory is allocated one frame at a time as needed.
PagedAttention applies this to the KV cache verbatim. The KV cache is divided into fixed-size blocks (the paper uses 16 tokens per block in most experiments). Each sequence maintains a block table: a mapping from logical block index to physical block index. The physical blocks can occupy any non-contiguous locations in GPU HBM.
For a sequence that has generated 48 tokens, the block table might look like:
Logical Block 0 → Physical Block 7 (tokens 0–15)
Logical Block 1 → Physical Block 23 (tokens 16–31)
Logical Block 2 → Physical Block 11 (tokens 32–47)
Physical block 7, 23, and 11 might be scattered anywhere in HBM. This is fine — the KV cache doesn't need to be contiguous for the attention computation, only for the specific access pattern of the attention kernel. PagedAttention modifies the attention kernel to gather KV values using the block table indirection at runtime.
Memory is now allocated one block at a time, as tokens arrive. A new request reserves nothing upfront. The last block of each sequence may have internal fragmentation (if the sequence ends mid-block), but the waste is bounded: at most one block per sequence, typically 16 tokens × 819 KB/token = ~13 MB, independent of total sequence length. The paper reports average waste drops from 60–80% to under 4%.
Copy-on-write for parallel sampling
The block table indirection enables something else that the paper exploits: sharing physical blocks across sequences.
Consider beam search, or sampling multiple completions from the same prompt. The prompt's KV cache is identical across all beams or samples — it only diverges after generation begins. With contiguous allocation, each beam needs its own copy of the prompt KV cache from the start, even though those copies are identical. For a 512-token system prompt with a 13B model, that's ~420 MB per beam, per request. Running 4-beam search × 8 concurrent requests = 13.4 GB just for prompt KV cache duplication.
PagedAttention implements copy-on-write: physical blocks are reference-counted. Multiple sequences can point to the same physical block. When a sequence needs to write new tokens to a shared block, it gets its own copy first — the write triggers allocation of a new physical block and a redirect in that sequence's block table. Blocks containing only prompt tokens never get written after the prefill phase, so they're never copied.
The paper reports this reduces memory usage by up to 55% for parallel sampling workloads, with up to 2.2× throughput improvement over a system with per-beam contiguous allocation.
What the performance numbers actually show
The throughput comparison in the paper is against HuggingFace Transformers (the naive baseline) and HuggingFace TGI (which had continuous batching but no paged KV cache). On LLaMA-13B:
- vs. HuggingFace Transformers: 2.2–2.5× throughput at the same latency
- vs. the LMSYS internal deployment: up to 30× throughput (reflecting how much worse the prior system was)
The gap widens with longer sequences, larger models, and beam search. For short sequences (<200 tokens), the memory fragmentation problem is less severe — sequences complete quickly, freeing memory before fragmentation accumulates — and the improvement is closer to 1.5×. For 4K-token sequences, the gain is larger because over-reservation waste grows with max_length.
The SOSP 2023 paper uses FasterTransformer as the other comparison point (a highly optimized NVIDIA serving system). vLLM achieves 2–4× higher throughput vs. FasterTransformer depending on workload. This is notable because FasterTransformer was considered the throughput ceiling for production-grade serving at the time.
Production tradeoffs no one mentions in the benchmark post
Block size is a tuning knob with non-obvious tradeoffs. The paper uses 16 tokens per block, but the right value depends on your workload. Smaller blocks (4–8 tokens) reduce internal fragmentation further but increase block table size and add indirection overhead in the attention kernel — each attention head needs to resolve more block table entries per sequence. Larger blocks (32–64 tokens) reduce kernel overhead but waste more memory on short completions. At 4 tokens per block, the block table lookup overhead measurably affects attention kernel throughput at high sequence count. At 32 tokens per block, short completions with 8-token outputs waste 24 tokens' worth of KV cache per sequence. Most production deployments I've seen use 16 tokens as the default and don't revisit it, which is fine until you're running highly homogeneous workloads where a different size is obviously better.
Preemption introduces latency spikes. When memory pressure increases — a new request arrives that needs blocks, but the pool is full — vLLM must preempt one or more in-flight sequences. It has two strategies: swap (write the sequence's KV blocks to CPU DRAM and restore them later) or recompute (discard the blocks and recompute the KV cache from the prompt when the request is resumed). Swap trades GPU memory for CPU memory and PCI-e bandwidth; recompute trades memory for computation. For long sequences with 2K-token contexts, recompute means redoing 2,000 forward passes. The p99 latency impact of preemption is real and often worse than expected in burstier traffic patterns. Monitoring KV cache utilization and setting conservative batch admission thresholds is the operational control — preemption at 99% utilization is a latency disaster.
Prefix caching changes the operational model for chat. Later vLLM versions (not the original paper) add automatic prefix caching: if two requests share the same leading tokens (e.g., identical system prompts), the physical blocks for the shared prefix are reused across requests without recomputation. For chat APIs where every request starts with the same 512-token system prompt, this eliminates 99% of prompt processing cost for non-first turns. The catch: the prefix cache must be invalidated when the model changes. If you have a hot deployment, rolling out a model update silently flushes the prefix cache, and the first few minutes after a deploy see elevated latency as the cache warms back up. Not monitoring TTFT (time to first token) after deploys means you'll miss this.
PagedAttention's kernel is slower per-operation than contiguous attention. The block table indirection in the attention kernel means KV values are gathered from non-contiguous memory locations. GPU memory access is most efficient when accesses are coalesced — adjacent threads reading adjacent addresses. Non-contiguous block layout works against coalescing. The throughput improvement comes entirely from fitting more requests into available memory (higher batch size), not from making individual attention operations faster. Per-request latency at batch size 1 may be slightly higher with PagedAttention than with contiguous allocation and FlashAttention. The win is at scale.
Chunked prefill is not in the original paper but is load-bearing in practice. The original PagedAttention paper focuses on the decode phase (sequential token generation). The prefill phase — processing the input prompt — is not paged in the original design; it's handled as a single contiguous operation. In production, long prompts (4K+ tokens) during prefill can stall all in-flight decode requests because the GPU is fully occupied. Chunked prefill (splitting prefill across multiple steps, interleaved with decode) was added to vLLM later and addresses this latency isolation problem. If you're serving mixed short-decode and long-prompt workloads without chunked prefill, your decode latency will be erratic.
Failure modes in practice
The most common production failure mode: KV cache exhaustion triggers cascading preemption under bursty load.
At steady state, your KV cache utilization sits at 80% and everything is fine. A traffic burst hits — 3× request rate for 30 seconds — and requests arrive faster than they complete. Each new request allocates blocks; completions free blocks, but the arrival rate exceeds the completion rate. Utilization hits 95%, preemption starts, preempted sequences recompute, recomputation uses GPU cycles that would have been used to advance other sequences, latency climbs, completions slow, utilization climbs further. The system is technically not deadlocked but is spending most of its cycles recomputing rather than generating.
The fix is admission control: reject or queue new requests when KV cache utilization exceeds a threshold (typically 90%), rather than accepting them and triggering preemption. This requires instrumenting KV cache utilization as a first-class serving metric and wiring it into your load balancer. Most teams add this after the first production incident.
The second failure mode: block table corruption from concurrent writes during beam search.
If you're implementing PagedAttention manually or modifying a fork, copy-on-write requires careful reference counting. A bug I've seen twice: the reference count for a shared prefix block is decremented when the first beam completes, freeing the physical block while other beams still have it in their block tables. Subsequent attention reads from that block produce garbage outputs — silently, with no error. Output quality degrades in a way that looks like a model regression. This isn't a problem in stock vLLM (which has tested CoW extensively), but any fork that touches the block allocator needs rigorous tests for this exact case.
When not to use PagedAttention
Single-request or very low-QPS serving. PagedAttention's benefit is throughput — fitting more concurrent requests by reducing memory waste. At batch size 1 or 2, there's no memory waste to reclaim, no sharing opportunities, and the block table indirection adds overhead without benefit. If you're running a local development server or a latency-sensitive endpoint with negligible concurrency, standard contiguous KV cache with FlashAttention is faster per request. vLLM's block table lookup is not free.
When you need CUDA graphs for every decode step. CUDA graphs capture a static computation graph and replay it with minimal CPU overhead. They require the input tensor shapes and memory addresses to be fixed at capture time. vLLM's dynamic block allocation means the physical addresses of KV cache blocks change across requests, making CUDA graph capture for the attention step difficult. vLLM works around this with a limited form of CUDA graph support (capturing a graph per batch size), but the coverage is incomplete. If you need maximum decode throughput through CUDA graph utilization — which matters at small batch sizes, latency-critical workloads — the PagedAttention memory indirection is in tension with CUDA graph requirements.
When your sequences are very short and uniform. If your workload is all 50-token completions from 100-token prompts, memory fragmentation is minimal even with contiguous allocation — sequences complete fast enough that freed memory is quickly reused. The operational complexity of paged allocation (block size tuning, preemption handling, CoW logic) isn't justified. PagedAttention's gains scale with sequence length and batch diversity.
When you're on hardware without good non-coalesced memory access. The non-contiguous block access pattern penalizes GPUs where random HBM access is disproportionately expensive. On most data-center GPUs (A100, H100), this is tolerable because the throughput gain from higher batch size more than compensates. On older or less capable hardware, benchmark first. The performance model doesn't hold uniformly across GPU generations.
What the paper actually gives you
PagedAttention is a demonstration that systems research still has significant room in the LLM serving stack. The KV cache memory problem wasn't a mathematical constraint — it was an implementation assumption. Serving frameworks assumed contiguous KV cache because attention kernels require contiguous memory, which was treated as a fixed requirement rather than a kernel implementation detail. The paper broke that assumption by modifying the kernel instead of the serving logic.
The OS virtual memory analogy isn't just pedagogical — it's load-bearing. Page tables, reference counting, copy-on-write, demand allocation, preemption and swap: these are 1970s OS concepts that transfer directly because the problem structure is the same. Physical memory is scarce and shared across processes with unpredictable lifetimes and unpredictable working set sizes. GPU HBM is scarce and shared across inference requests with unpredictable lengths and unpredictable concurrency. The mapping is not a coincidence.
For your specific situation: if you're running LLM serving at any nontrivial batch size and you're not using a PagedAttention-based serving system (vLLM, TGI with block allocation, TensorRT-LLM's equivalent), you're almost certainly leaving throughput on the table due to KV cache fragmentation. The operational cost is real — block size tuning, preemption monitoring, prefix cache invalidation discipline — but the baseline throughput improvement is substantial enough that it's the starting point, not an optimization.
The 40 GB A100 with 12-request ceiling? We switched to vLLM with 16-token blocks, set a 90% KV cache utilization admission threshold, and enabled prefix caching for our system prompt. Concurrent request capacity increased to 47 before hitting the same latency SLA. Same hardware. Same model. The GPU was always capable of it — we were just wasting 70% of its KV cache memory.
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica. SOSP 2023.