Lesson 5 of 8 · 14 min
Mooncake: what the KV-cache-centric disaggregated serving paper actually says
Reading Qin et al. (Moonshot AI, 2024) while debugging why our prefix cache hit rate was 4% across a 40-instance fleet despite each instance showing 85% hits in isolation.
The monitoring dashboard was showing two contradictory numbers. Per-instance KV cache hit rate: 83%. End-to-end fleet hit rate — computed by dividing cached token prefills by total prefill tokens across all instances — was under 5%.
The explanation took about ten minutes to find, and another three months to fix properly. Our document QA pipeline prepended a 24,000-token research document to every query. SGLang's RadixAttention was working exactly as designed: the second query on a given instance that referenced the same document was nearly free. The problem was the load balancer. It distributed requests uniformly across 40 instances. Any given document arrived at any given instance maybe twice a day. Each of those 40 instances was maintaining its own radix tree, and the document landed in each one cold — paying the full 24K-token prefill cost — nearly every single time.
We had 40 independent caches where we needed one shared cache. The fleet was doing redundant KV computation on a scale proportional to the number of instances, not the number of users.
"Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving", Qin, Li, He, Zhang, Wu, Zheng, and Xu from Moonshot AI, 2024, is the paper that formalizes this problem and the architecture that solves it. It's deployed in production at Kimi — Moonshot AI's long-context LLM product — which runs contexts up to 128K tokens (and in some configurations longer). At that scale, the cost of redundant prefill isn't an operational nuisance; it's the dominant budget line.
The problem DistServe and SGLang didn't fully solve
To understand what Mooncake adds, you need to understand what the prior work left on the table.
DistServe (Zhong et al., OSDI 2024) disaggregates prefill and decode onto separate machine pools. A request arrives, goes to a prefill instance, gets its KV cache computed, and that KV cache is transferred over RDMA to a decode instance which handles generation. The prefill-decode interference that destroys TTFT p99 goes away. But the KV cache is still per-request: the prefill instance computes it, transfers it, and frees it. If the same document arrives again, the prefill instance recomputes it from scratch.
SGLang / RadixAttention (Zheng et al., 2023) maintains a global radix tree of cached KV blocks per serving instance, enabling prefix sharing across requests that hit the same instance. For a fixed system prompt that every request shares, cache hit rate within a single instance approaches 100%. But the radix tree is local to the instance. Route the same request to a different instance and you're computing from scratch. At 40 instances with round-robin load balancing, this locality is nearly zero.
The gap: both systems treat the serving instance as the unit of caching. Mooncake treats the KV cache segment as the unit of caching, and disaggregates it from the instances that computed it.
The architecture: conductor, prefill pool, and the KV store
Mooncake introduces three components that replace the monolithic serving instance:
Prefill pool: GPU instances that do nothing but compute KV cache for prompt tokens. When a prefill instance finishes processing a prompt, it writes the resulting KV blocks to the distributed KV store. It does not transfer KV directly to a specific decode instance the way DistServe does; it writes to the shared store.
Decode pool: GPU instances that do nothing but autoregressive generation. They read KV cache segments from the distributed KV store and advance generation by one token per step. They never run prefill.
KV cache store: a distributed memory pool that spans CPU DRAM and NVMe SSD across all nodes in the cluster. KV blocks are addressed by content hash (prefix hash), enabling deduplication. A request that shares a 10K-token prefix with a prior request doesn't re-trigger prefill; the decode instance reads the existing KV blocks for those 10K tokens from the store.
Conductor: the cluster-level scheduler that ties this together. When a new request arrives, the Conductor looks at the KV store's current state — which prefix hashes are cached, and on which nodes — and routes the request to the prefill instance best positioned to compute the uncached suffix. If 80% of the KV blocks for this request are already in the store, the Conductor assigns only the remaining 20% to a prefill instance. The decode instance can begin reading cached blocks while the uncached portion is still being computed.
The routing logic is locality-aware in two senses. First, it maximizes KV cache reuse: the Conductor scores candidate prefill instances by their overlap with existing cached content, not just their current queue depth. An instance that already has the relevant model weights loaded hot (though weights are typically all-loaded) and can compute the missing suffix without redundant work gets preference. Second, it considers network topology: if the KV store has cached blocks on nodes close to a particular decode instance (same rack, lower latency), the Conductor factors that into decode instance selection.
Why the KV store is the architectural pivot
The key property this architecture enables: a KV segment computed once is available to every subsequent request that shares that prefix, regardless of which instance handles it.
Consider a 128K-token context that's shared across all requests in a given session. In vLLM or SGLang without sticky routing, the first request computes 128K tokens of KV cache on instance A. The second request lands on instance B (round-robin), computes 128K tokens on B. The third on C. Eventually every instance has computed the same KV cache independently. Memory wasted: 40× duplication. Compute wasted: 39 redundant 128K-token prefills.
In Mooncake, the first request computes the 128K KV cache on any prefill instance and writes it to the KV store. Every subsequent request reads those blocks from the store. Prefill cost: paid once. Memory cost: one copy in the distributed store plus the working copies on decode instances.
For Kimi's production workload, long shared contexts are the norm, not the exception. A user working with a 50K-token document across multiple queries should pay prefill cost once per document, not once per query. This requires the KV cache to be a cluster resource, not an instance resource.
The KV store implementation and transfer path
The store itself is implemented as a hierarchy that mirrors the GPU memory hierarchy one level up:
- GPU HBM: working KV cache for active decode sequences (hot, expensive, ~80 GB per A100)
- CPU DRAM: recently computed KV blocks that aren't in active use (cooler, cheaper, ~512 GB–2 TB per node)
- NVMe SSD: evicted KV blocks for infrequently accessed prefixes (cold, cheap, ~10 TB per node)
KV blocks flow down this hierarchy as they age. A new prefill writes to GPU HBM (working memory on the prefill instance), then migrates to CPU DRAM on the same node as the prefill completes, then potentially to a storage tier if not reused within some eviction window.
Reads go up the hierarchy: decode instances first check GPU HBM (fastest), then request blocks from CPU DRAM on nearby nodes via RDMA, then fetch from NVMe on any node (slowest, but still faster than recomputing a 128K-token prefill).
RDMA is load-bearing here. The bandwidth math for large KV transfers:
KV cache per token (70B model) = 2 × 80 layers × 8 KV heads × 128 head_dim × 2 bytes (FP16)
= ~320 KB per token
A 50K-token context generates ~16 GB of KV cache. At 400 Gb/s InfiniBand HDR (~48 GB/s usable), reading this from CPU DRAM on a remote node takes ~330ms. That sounds slow, but compare to re-running prefill for 50K tokens: at ~500 tokens/second throughput for a 70B model, that's 100 seconds. Reading from the store is ~300× faster than recomputing.
For shorter contexts (under ~2K tokens), re-computation is often cheaper than the overhead of checking the store, managing block headers, and initiating RDMA. The paper acknowledges this and sets a threshold below which requests skip the KV store entirely.
What the production numbers show
The Mooncake paper reports evaluation on Kimi's production traffic, which is unusual — most LLM serving papers use synthetic workloads, often Poisson-distributed with ShareGPT as the prompt distribution. Kimi's traffic has structural properties (long shared documents, multi-turn sessions with growing context) that make prefix sharing particularly valuable.
The paper's headline result: under their production workload, the Conductor-based routing achieves substantially higher KV cache reuse than per-instance caching with load balancing, reducing the fraction of total prefill tokens that require actual computation. The effective TTFT at the p99 improves significantly at high load compared to a disaggregated system without cross-instance KV sharing. (I'll be specific about one figure I'm confident on: for a batch of requests sharing a common 100K-token prefix, Mooncake's first request pays the full prefill cost, and all subsequent requests in the same batch read cached KV — compute cost for the cached portion goes to zero after the first request.)
The decode throughput improvement is more nuanced. Decode instances that read from the distributed KV store rather than receiving fresh KV from a dedicated prefill instance have slightly higher latency per token for the first generation step (RDMA read latency vs. direct GPU-to-GPU transfer). But this is offset by higher decode instance utilization: decode GPUs that aren't stalled waiting for prefill-to-decode KV transfers can batch more sequences.
Production tradeoffs the paper handles carefully but you still need to think about
The store adds a new failure surface. A collocated vLLM instance has one failure domain. Mooncake has prefill instances, decode instances, KV store nodes, and the Conductor — four components that must all be operational for a request to complete. The KV store nodes need to handle writes from prefill, reads from decode, evictions, and block address resolution simultaneously. If a KV store node fails mid-request — after the prefill instance has written blocks but before the decode instance has read them — the request must restart from prefill. This is recoverable, but adds latency. Your availability model needs to account for partial store failures.
Cache pollution from one-off requests. The radix tree in SGLang handles prefix sharing by maintaining a tree of cached prefixes. Mooncake's global KV store has the same eviction problem at larger scale. A single request with a unique 200K-token context fills the store with blocks that will never be reused, evicting blocks for documents that have high reuse frequency. Cache eviction policy — global LRU, frequency-weighted, per-prefix TTL — is critical to hit rate under mixed traffic. The paper describes a frequency-aware eviction policy, but tuning it for your specific workload requires instrumentation that most teams don't have set up for KV-level cache behavior.
Network bandwidth is now shared infrastructure. In DistServe, the KV transfer is per-request: prefill-to-decode, once, done. In Mooncake, KV blocks are being read from and written to the store continuously by many prefill and decode instances simultaneously. The store nodes' network interface becomes a shared resource. Under high load, a burst of new requests that all miss the cache generate simultaneous prefill writes; a subsequent burst of similar requests generates simultaneous cache reads. These spikes can saturate the store nodes' network bandwidth. The paper addresses this with prioritized read/write queuing on store nodes, but dimensioning your network for peak combined read/write load is a capacity planning problem that's harder than DistServe's per-request transfer sizing.
Sticky routing is no longer the simple fix. The usual response to "our cache hit rate is low across instances" is to add sticky routing: hash the prefix to always send same-prefix requests to the same instance. Sticky routing is simple to implement and immediately effective. Mooncake makes this tradeoff: you give up strict sticky routing in exchange for a richer scheduling space where the Conductor balances between cache locality, instance utilization, and request priority. If your traffic is very skewed — a few prefixes account for 90% of requests — sticky routing gets you 80% of Mooncake's benefit at 5% of the implementation cost. Mooncake pays off when traffic is diverse enough that strict per-prefix affinity creates severe load imbalance.
Long-context workloads stress the hierarchy tiers. Kimi's 128K-token contexts generate ~40 GB of KV cache per context. If you're serving 1,000 concurrent 128K-context sessions, that's 40 TB of active KV cache. CPU DRAM and NVMe are much cheaper than GPU HBM, but 40 TB of NVMe with low-latency RDMA-accessible reads is not a trivial infrastructure investment. Before adopting a KV-hierarchical store, model out how much KV storage your workload actually needs at your target concurrency. For many teams, the answer is "we don't have 128K-token sessions, we have 4K-token sessions" — at which point the standard DistServe + SGLang stack is almost certainly sufficient.
Failure modes in practice
Conductor as a single point of coordination. The Conductor routes all requests, maintains the global view of cache state, and assigns prefill/decode instances. Under a request surge, it can become a bottleneck — particularly if evaluating prefix overlap against a large cache index is computationally expensive. The paper describes a two-level scheduling approach (coarse routing at request arrival, fine-grained scheduling at the instance level) to avoid making the Conductor a critical bottleneck, but production implementations need load testing of the Conductor specifically under request bursts.
Cold start on new document types. The KV store is only useful after the first request computes and caches the relevant blocks. For a fresh deployment or a cache that's been cleared, the first N requests for any given document pay full prefill cost regardless of how many identical requests follow. If your traffic has periodic "flush" patterns — new documents pushed daily, old ones becoming irrelevant — your effective cache hit rate may be much lower than peak steady-state numbers suggest. Warming the cache with synthetic prefill requests before shifting live traffic is worth considering for time-sensitive launches.
RDMA reliability under GPU cluster conditions. InfiniBand fabrics in GPU clusters see transient errors — link flaps, routing updates, congestion-induced packet drops — that software RDMA stacks handle through retransmit. A single failed RDMA read to retrieve a KV block can add tens of milliseconds of latency to that decode step. If the KV store is spread across 50+ nodes, the probability of hitting at least one transient error on a given request that reads from multiple nodes is non-negligible. The paper's RDMA implementation uses pipelining and hedged reads (send the same request to two nodes, use whichever responds first) for latency-critical paths, but this is implementation complexity that needs to be validated in your specific InfiniBand environment.
When NOT to use Mooncake's architecture
When your prefix sharing is low. If requests arrive with diverse, non-overlapping contexts — agent workloads where each prompt is unique, creative writing where there's no shared document — the KV store is all overhead with no benefit. Every write to the store is wasted; every cache miss is an RDMA round-trip to confirm the miss. For low-sharing workloads, DistServe without a global KV store outperforms Mooncake because the store adds latency without adding hit rate.
When you can solve it with sticky routing. If your traffic has a small number of highly-shared prefixes (a handful of system prompts, a few shared documents), implementing sticky routing at the load balancer level and using SGLang's per-instance RadixAttention is simpler, cheaper, and gets you most of the benefit. Mooncake's scheduling complexity is justified when your prefix space is large and diverse — enough that any single instance can't hold all relevant prefixes, but with enough repetition that global sharing is valuable.
When your cluster lacks high-bandwidth inter-node networking. CPU DRAM reads via RDMA at 200+ Gb/s are a different proposition from CPU DRAM reads over 25GbE. If your GPU cluster nodes communicate over standard Ethernet without RoCE (RDMA over Converged Ethernet) or InfiniBand, KV reads from remote nodes will be too slow to benefit TTFT. The architecture requires the network to behave like a fast storage layer, not a standard LAN.
When contexts are short and prefill cost is small. For workloads with under 2K-token average prompt lengths, prefill takes tens of milliseconds per request — not the hundreds of milliseconds that make KV reuse compelling. At short contexts, DistServe + chunked prefill is the right architecture, and the KV store's coordination overhead adds cost without corresponding benefit.
When you're running a single-node deployment. Mooncake's value proposition is fleet-level KV sharing. On a single node with 8 GPUs, all sharing NVLink-connected HBM, per-instance RadixAttention already has low-latency KV access to everything the other GPUs have computed (through CUDA IPC or shared memory). The distributed store solves a multi-node coordination problem; don't introduce that complexity when you don't have multiple nodes.
What the paper actually gives you
Mooncake's contribution is reframing the unit of LLM serving from "GPU instance" to "KV cache segment." Once you make that shift, the right architecture falls out fairly naturally: compute (prefill and decode) is stateless and can be placed anywhere, KV cache is stateful and should be managed as a cluster-wide resource, and the scheduler's job is to minimize expensive KV recomputation by routing computation to where the data already lives.
This is not a novel idea in distributed systems. It's exactly how distributed caches work — Memcached, Redis Cluster, CDNs. The content is the key; the server that computed it is incidental. What's novel is applying this to a situation where the "content" is hundreds of gigabytes of float16 tensors that must be read and written at memory bandwidth, not disk bandwidth, and where the computation that produces the content (prefill) is expensive enough to justify sophisticated routing logic to avoid it.
The paper also makes a case for production-first evaluation that's worth highlighting. Most LLM serving papers benchmark on synthetic workloads. Kimi's production traces have structural properties — long shared documents, multi-turn sessions, bursty arrivals correlated with business hours — that synthetic benchmarks miss. The results on real traffic are less clean than controlled experiments, but they're the ones that determine whether the architecture survives contact with actual usage.
For your specific situation: if you're running a multi-instance serving deployment where prefix cache hit rate per-instance is high but fleet-level reuse is low (the gap I described at the start), Mooncake's architecture addresses your exact problem. Before building it, instrument two things: your actual prefix overlap across the fleet (what fraction of total prefill tokens are repetitions of content computed elsewhere in the same time window), and your inter-node network bandwidth. The former tells you how much KV reuse is theoretically available; the latter tells you whether RDMA reads from the store will be fast enough to beat recomputation for your typical prefix lengths. If prefix overlap is above ~40% and your network supports RDMA at 100+ Gb/s, the KV-centric architecture pays off.
The 4% fleet-level cache hit rate that started this post? We ended up solving it with a simpler approach first — consistent hashing by document hash at the load balancer, combined with per-instance RadixAttention — which got us to ~65% effective hit rate without a distributed store. But the document set grew, instances scaled out, and load became skewed: popular documents sent all traffic to one instance, rare documents starved others. We needed a scheduler that understood both cache locality and load. That's the Conductor. When we got there, Mooncake's architecture was the map we were already navigating toward.
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving — Qin, Li, He, Zhang, Wu, Zheng, Xu. Moonshot AI, 2024.