← All writing
Paper Breakdown

SGLang and RadixAttention: what the paper actually says

Reading Zheng, Yin, et al. (UC Berkeley / Stanford, 2023) while profiling why our document Q&A pipeline spent 70% of GPU time recomputing prompts we'd already paid for.

The number that got my attention was in a flamegraph: 70% of our total inference compute was prefill for a 4,000-token system prompt we included in every request. The remaining 30% was the actual generation — the part the user was waiting for. We were spending most of our GPU budget telling the model things it had already processed ten thousand times.

The naive fix is to cache the KV cache for the system prompt and reuse it. Most serving frameworks let you do this for a single fixed prefix. The problem is that real production LLM programs aren't that clean. A document Q&A system has a fixed system prompt, a per-document context, and per-question user turns — three levels of prefix that overlap differently for different requests. A code assistant has a system prompt, a per-repository context that changes per PR, and per-file content. A chatbot has a system prompt and a growing conversation history. The reuse pattern is a tree, not a flat prefix.

"Efficiently Programming Large Language Models using SGLang", Lianmin Zheng, Liangsheng Yin, Zhuohan Li, Simon Mo, Joseph E. Gonzalez, Ion Stoica, and others from UC Berkeley and Stanford, December 2023, addresses this with a runtime system that automatically caches and reuses KV cache across arbitrary multi-call LLM programs using a radix tree — what the paper calls RadixAttention.

The paper combines two contributions that are worth separating: SGLang as a domain-specific language for expressing LLM programs, and RadixAttention as the runtime mechanism that makes those programs efficient. The language is interesting. The runtime is what you should care about if you're running inference at any scale.

The problem: KV cache doesn't survive across requests

To understand what RadixAttention solves, you need a precise picture of where KV cache lives and dies in a standard serving stack.

During prefill, the transformer processes every input token in parallel: for each attention layer, it computes key and value vectors for every token and stores them in the KV cache. This cache persists for the duration of the request — each decode step reads it and appends new entries. When the request finishes, the KV cache is freed.

This per-request lifetime is the problem. If your next request has the same 4,000-token system prompt as the last request, you recompute its KV cache from scratch. The LLM doesn't "remember" processing that prompt. Each request arrives at the GPU as a blank slate.

The cost compounds fast. A 4,000-token prefill on a 13B model at FP16 costs roughly 4,000 × 40 layers × 2 (K+V) × 40 heads × 128 dims × 2 bytes = 1.6 GB of memory read/write, plus the attention and FFN compute. If you're running 100 requests per second all sharing that system prompt, you're paying 1.6 GB × 100/s of unnecessary HBM traffic — not counting compute.

Existing systems at the time of the paper (HuggingFace TGI, vLLM v0.1) had no mechanism for cross-request KV cache sharing. Each request allocated, populated, and freed its own KV cache independently. The cost was invisible in typical single-request benchmarks but dominant in production workloads with shared prefixes.

The radix tree data structure

RadixAttention maintains a single radix tree (compressed trie) of KV cache blocks in GPU HBM, shared across all active and recently completed requests.

A radix tree stores sequences of tokens where each node represents a contiguous run of tokens. The tree is indexed by token sequence — the path from root to any node spells out the prefix that node represents. Each node holds a reference to the corresponding KV cache blocks stored in HBM.

When a new request arrives with input [t₁, t₂, ..., tₙ]:

  1. Prefix lookup: walk the radix tree, following edges that match the request's token sequence. The walk continues as long as the request's tokens match stored nodes.
  2. Reuse: at the deepest matching node, the corresponding KV cache blocks are already in HBM and valid. The prefill computation starts from that point, not from the beginning.
  3. Insertion: after generating the KV cache for the unmatched suffix, new nodes are inserted into the radix tree for future reuse.

The memory-speed tradeoff is explicit: each node in the radix tree consumes HBM for its KV blocks. The tree grows as requests arrive and is pruned by LRU eviction when memory pressure builds. Leaf nodes — nodes with no children — are evicted first, because evicting a leaf invalidates no cached state for other nodes. Eviction removes the node from the tree and frees its HBM allocation.

Nodes that are actively referenced by in-flight requests are pinned — they cannot be evicted regardless of LRU order. The implementation uses reference counting: each node has a count of requests currently using its KV cache blocks. Eviction only touches nodes with a zero reference count.

Copy-on-write for concurrent modification

A subtlety the paper addresses: what happens when two requests share a prefix up to position K, then both extend the tree from the same node in different directions?

Node splits happen when a new request's tokens diverge from an existing node mid-sequence. Suppose the tree has a node encoding [A, B, C, D] and a new request has prefix [A, B, C, E, ...]. The system splits the existing node at position 3, creating a parent [A, B, C] and two children [D] and [E].

Splitting a node requires duplicating its KV cache blocks into two separate allocations (one for each child) — otherwise the two children would share the same HBM storage, which breaks when decode writes new tokens. This is structurally identical to the copy-on-write scheme in PagedAttention for beam search: physical blocks are reference-counted, and a copy is triggered on write.

The paper notes that the node split cost is bounded: each request triggers at most one split (the divergence point), and the split cost is proportional to the block size at the split point, not the total sequence length. In practice, splits happen rarely for workloads with clean prefix structure (shared system prompt, per-user suffix) and more often for workloads with fine-grained per-request variation.

How this interacts with continuous batching

The interesting systems problem is how RadixAttention integrates with the continuous batching scheduler.

In a standard continuous batching setup, the scheduler packs as many requests as possible into each forward pass. The constraint is total KV cache memory — once it's full, no new requests can be admitted. The scheduler controls admission by estimating KV cache footprint.

With RadixAttention, a request's actual memory footprint depends on what's already in the radix tree. A request with a 4,000-token system prompt needs zero new KV allocation for that prefix if it's already cached. The scheduler must account for this: overestimating footprint leads to under-utilization, underestimating leads to OOM.

The paper's scheduler uses a hit-aware admission policy: before admitting a request, it walks the radix tree to determine how much new KV cache the request will actually allocate. Only the new (cache-miss) tokens count against the memory budget. This lets the scheduler admit more requests concurrently when the cache hit rate is high.

There's a secondary scheduling optimization the paper implements: cache-locality aware routing. If multiple pending requests have overlapping prefixes, scheduling them consecutively maximizes cache hit rate — the first request populates the tree, subsequent requests hit it. The scheduler re-orders the queue to batch requests with common prefixes together, trading fairness for throughput.

What the performance numbers actually show

The paper benchmarks against HuggingFace TGI and evaluates several representative workloads:

Shared system prompt (chatbot-style): 4 backends, Llama-7B, 1,000-token system prompt, 200-token user turns, 100-token responses. SGLang with RadixAttention achieves 1.7× throughput vs TGI. The modest gain reflects that system prompt reuse is the only shared prefix here — per-user turn KV cache doesn't transfer.

Document Q&A (long shared context): 4,000-token document prefix shared across multiple question-answer pairs per document. 6.4× throughput vs TGI and 5× vs vLLM v0.1. This is the canonical use case for RadixAttention — a long, expensive prefix reused many times across short follow-up queries.

Tree-of-thought reasoning: Multi-branch expansion where parent reasoning steps are shared by child branches. 4.3× vs TGI. Each branch extends from a shared trunk that RadixAttention caches once and shares across all branches.

SQL generation: System prompt with database schema shared across queries. 5× vs TGI.

The pattern is clear: throughput improvement scales with the ratio of shared prefix length to per-request unique length. When the shared portion is large and reused frequently, the gain is multiplicative. When requests are short, unique, and non-overlapping, RadixAttention adds the overhead of a radix tree lookup with zero benefit.

The paper also reports latency numbers for interactive use: at the same throughput level, time-to-first-token for cache-hit requests drops proportionally with the number of tokens skipped in prefill. A request that hits 4,000 tokens of cache gets its first output token 4,000-tokens-of-prefill-compute faster.

What the paper doesn't tell you

Cache warmup is real. A freshly started serving instance has an empty radix tree. The first request for each shared prefix pays full prefill cost. In low-traffic or high-diversity workloads, the cache never fully warms and the hit rate stays low. The paper's benchmarks use stable workloads where the cache is already warm — they don't model the ramp-up period. In practice, you should expect 5–10 minutes of degraded performance after a restart.

Instance affinity breaks multi-node deployments. The radix tree lives in a single instance's GPU HBM. If you're load-balancing across a fleet of serving instances, a request routed to a different instance gets a cold cache. The paper acknowledges this requires routing requests by prefix hash — sticky routing for prefix groups. This conflicts with standard round-robin load balancing and complicates horizontal scaling.

Cache eviction causes silent latency spikes. When memory pressure forces LRU eviction, the next request that needs the evicted prefix pays full prefill cost. This is unpredictable — traffic patterns determine eviction timing, not request structure. In production, you can observe sudden latency spikes that correlate with cache evictions, which are difficult to attribute without instrumentation specific to RadixAttention internals.

The radix tree is per-dtype and per-model. KV cache blocks are stored in the model's compute dtype (FP16 or BF16). If you serve the same model at multiple quantization settings, or switch between quantized and full-precision serving, the KV cache is invalidated. Same for model weight updates — a new model version requires a full cache flush. Rolling updates become more expensive.

vLLM adopted prefix caching. vLLM v0.2+ added its own prefix caching implementation (not identical to RadixAttention but serving the same purpose). If you're evaluating SGLang vs vLLM for production, verify which version of vLLM you're comparing against — the 5× improvement numbers in the SGLang paper are against vLLM v0.1, which had no prefix caching. The gap narrows with modern vLLM.

Production tradeoffs you should measure

Memory budget allocation. RadixAttention stores KV cache indefinitely until eviction. This directly competes with active-request KV cache budget. If you over-allocate to the radix tree cache, you reduce the number of concurrent requests you can serve. The right balance depends on your hit rate and prefix length distribution. The paper recommends treating cache size as a configurable parameter and tuning based on workload measurement — not setting it and forgetting.

Prefix diversity kills hit rate. If each request has even slightly different system prompts (personalized prompts, per-tenant prefixes, A/B test variants), the cache hit rate drops precipitously. Personalized prefixes mean every user gets their own tree node — you're trading storage for zero computational reuse. Profile your prefix diversity before assuming RadixAttention will help.

The gain is compute, not memory. RadixAttention reduces prefill compute. It doesn't reduce total memory usage — cached nodes consume HBM for as long as they're in the tree. The total HBM footprint may increase if you retain KV cache across requests rather than freeing it. The improvement is throughput (more requests per second), not memory efficiency.

When NOT to use RadixAttention

When your prompts are unique per request. If your application generates prompts procedurally, includes per-request metadata in the system prompt, or otherwise has high prefix diversity, the radix tree accumulates dead weight without delivering hits. The overhead of prefix lookup, node management, and LRU tracking is pure cost with no benefit.

When you're sequence-length-constrained on short sequences. For applications where total sequence length is under 500 tokens, prefill is fast and cheap. RadixAttention's benefit is proportional to the number of prefill tokens it skips — at 500 tokens, even a 100% hit rate saves you less than PagedAttention's typical contribution.

When you're horizontally scaling behind a stateless load balancer. RadixAttention requires prefix-aware routing to deliver its hit rate. If your load balancing layer doesn't support sticky routing by prefix hash, you'll deploy the complexity of a radix tree and get cold-cache performance. Fix the routing or don't deploy the feature.

When model update cadence is high. Frequent model weight changes flush the entire cache. If you're doing online learning, RLHF updates, or frequent fine-tuning with hot model swaps, the cache hit rate approaches zero across model version boundaries. The tree management overhead is pure cost in that regime.

When latency SLAs are tight and traffic is low. At low QPS, you're unlikely to serve the same prefix twice in a short window before eviction. RadixAttention optimizes batch throughput, not single-request latency. For latency-sensitive, low-traffic endpoints, the lookup overhead isn't offset by cache savings.

Why this matters beyond the specific implementation

The paper's real insight is that KV cache management is a cross-request resource problem, not a per-request resource problem. Once you accept that framing, a radix tree is an obvious data structure — it's just a trie of past computations. What makes it non-obvious is recognizing that the computation is worth caching at all, which requires understanding that prefill is expensive and amortizable.

The same insight drove vLLM to add prefix caching, and has influenced basically every production serving system since. The radix tree doesn't have to be the specific implementation. The principle — identify shared computation across requests, cache it, share it, evict under pressure — applies to any system that runs the same computation repeatedly with minor variations.

For production inference engineers, the practical takeaways are: measure your prefix hit rate before and after deploying prefix caching; instrument cache evictions because they cause latency spikes; design your routing tier to support prefix affinity; and don't conflate memory savings with throughput savings — the two are orthogonal.

The 6.4× throughput improvement for document Q&A isn't magic. It's the consequence of not paying for work you've already done.


Paper: "Efficiently Programming Large Language Models using SGLang," Lianmin Zheng, Liangsheng Yin, Zhuohan Li, Simon Mo, Joseph E. Gonzalez, Ion Stoica, et al. arXiv:2312.07104, December 2023.

Related reading

  • Titans: What the Test-Time Memorization Paper Actually Says

    Attention is quadratic. SSMs compress everything into a fixed state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.

  • Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says

    DistServe disaggregates prefill from decode. SGLang caches KV cache per-instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes around — turning a fleet of 50 isolated caches into one coherent pool.

  • LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says

    INT8 quantization works perfectly for vision models. For LLMs above ~7B parameters it silently destroys quality — unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.

← All writing