SARATHI: what the chunked-prefill paper actually says

Reading Agrawal, Kedia, Mohan, Panwar, Kwatra, Gulavani, Ramjee, and Tumanov (Microsoft Research India, Georgia Tech, OSDI 2024) while investigating why our decode latency had periodic 3-second spikes even at 40% GPU utilization.

The dashboard made no sense. Our LLM serving cluster was at 40% GPU utilization. TPOT — time per output token — averaged 45ms, well inside our 150ms SLA. But every 20 to 30 seconds, p99 TPOT would spike to 3.2 seconds, users would see their streaming output freeze mid-sentence, and then it would recover. The spike was perfectly regular. It wasn't a GC pause. It wasn't a networking hiccup. It was a decode stall, and it happened every time a long prompt joined the batch.

This is the problem SARATHI fixes: in continuous batching systems, a long prefill request doesn't just take GPU time for itself — it freezes all decode requests sharing the GPU until the prefill completes. SARATHI breaks the prefill into chunks and interleaves those chunks with decode steps, so decodes never wait longer than the time it takes to process one chunk.

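The original paper is "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" (Agrawal et al., arXiv 2023). The OSDI 2024 paper by the authors listed above is the follow-up, "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve", which builds on the same chunked-prefill idea. Read them alongside the ORCA paper (iteration-level scheduling) and the DistServe paper (disaggregated prefill/decode); SARATHI sits between them in both complexity and infrastructure cost.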

The problem: continuous batching creates a decode stall

To understand SARATHI you need to understand what continuous batching (ORCA, 2022) actually does at the iteration level.

In a standard serving loop, the GPU executes one iteration at a time. An iteration processes whatever requests are currently scheduled. With continuous batching, that means:

Requests in prefill phase have their full prompt processed in one pass
Requests in decode phase generate one token each

The scheduler decides each iteration which requests to include. ORCA's contribution was allowing new requests to join and completed requests to leave at any iteration boundary instead of waiting for the whole batch to finish, hence "continuous." But the fundamental iteration structure remains: one big forward pass per step, and everything in that pass executes together.

Here's the problem. A request arrives with a 4,096-token prompt. The scheduler includes it in the next iteration. That iteration now has to process 4,096 tokens of prefill. On a LLaMA-70B at FP16, that prefill takes roughly 3.2 seconds — every attention layer, every FFN layer, all 80 layers, applied to a (4096 × 8192) token matrix.

While that iteration runs, every in-flight decode request produces zero tokens. They're scheduled but not executing. Their users stare at a frozen response for 3.2 seconds. When the iteration completes, the prefill finishes, the 4,096-token KV cache enters the pool, and decode resumes.

The spike in your TPOT p99 isn't a bug. It's a guarantee: any prefill that takes T seconds will stall all decode requests for exactly T seconds. At 40% GPU utilization with a Poisson arrival process, long prompts arrive regularly, and each one creates a T-second decode freeze for all concurrent users.

This is not a problem continuous batching was designed to solve. ORCA optimizes for throughput. The decode stall is a latency problem that only becomes visible when you mix short-output, decode-heavy requests (chat completions) with long-input, prefill-heavy requests (document analysis) in the same pool.

The insight: prefill and decode use different hardware resources

Before describing SARATHI's solution, it's worth being precise about why chunking helps from a hardware perspective.

A pure decode step is memory-bandwidth-bound. For LLaMA-70B at FP16, every decode step reads 140 GB of model weights plus the KV cache for all active sequences. Arithmetic utilization on the tensor cores is low — maybe 30-40%. Most of the time is spent moving data from HBM to the compute units. The flops are cheap; the bandwidth is the bottleneck.

A pure prefill step is compute-bound. Processing a large batch of input tokens is dense matrix multiplication: weight matrices are read once and applied to all N tokens in parallel. Tensor core utilization is high — 70-80%+. The KV cache writes are small relative to the compute. The flops dominate; bandwidth is not the bottleneck.

These two workloads have complementary resource profiles. Compute-bound prefill leaves bandwidth headroom unused. Memory-bandwidth-bound decode leaves compute headroom unused. If you could run them simultaneously on the same GPU, you'd fill both resource types more efficiently.

SARATHI can't do true simultaneity — the GPU still executes one forward pass per iteration — but it achieves the same effect by making each iteration a mix of prefill tokens and decode tokens.
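A back-of-the-envelope roofline check makes the complementary profiles concrete. The sketch below is only an estimate under stated assumptions: a dense model costing about 2 FLOPs per parameter per token, every FP16 weight read once per iteration, and KV cache traffic and attention FLOPs ignored. The A100 figures are approximate datasheet values, not measurements, and the function names are made up for illustration.

```python
# Rough roofline estimate: is a batch of N tokens compute- or bandwidth-bound?
# Assumptions (not from the paper): ~2 FLOPs per parameter per token,
# weights read once per iteration, KV-cache traffic and attention FLOPs ignored.

PARAMS = 70e9          # LLaMA-70B parameter count
BYTES_PER_PARAM = 2    # FP16
PEAK_FLOPS = 312e12    # approximate A100 FP16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12       # approximate A100-80GB HBM bandwidth, bytes/s


def arithmetic_intensity(tokens_in_batch: int) -> float:
    """FLOPs performed per byte of weights read in one iteration."""
    flops = 2 * PARAMS * tokens_in_batch
    bytes_read = PARAMS * BYTES_PER_PARAM
    return flops / bytes_read


def bound(tokens_in_batch: int) -> str:
    """Classify a batch by comparing its intensity with the hardware ridge point."""
    ridge = PEAK_FLOPS / PEAK_BW  # intensity at which compute and bandwidth balance
    return "compute" if arithmetic_intensity(tokens_in_batch) >= ridge else "bandwidth"


if __name__ == "__main__":
    for n in (32, 512, 512 + 32):
        print(n, "tokens ->", bound(n))
```

With these numbers the ridge point is about 156 FLOPs per byte, so a 32-token decode batch lands on the bandwidth side and a 512-token chunk lands on the compute side. Adding the 32 decode tokens to the chunk barely changes the chunk's intensity, which is the intuition behind piggybacking.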

The SARATHI mechanism: chunked prefill with piggybacking

SARATHI introduces two changes to the scheduler:

1. Chunked prefill. Instead of processing a full N-token prompt in one iteration, SARATHI splits it into fixed-size chunks of C tokens (typically 256–512). A request with a 4,096-token prompt requires ⌈4096/C⌉ iterations to complete prefill, spread over multiple scheduling rounds.

2. Piggybacking. In each iteration, the scheduler packs one prefill chunk plus decode tokens for all active decode-phase requests into the same forward pass. The "piggybacking" metaphor is literal: decode requests ride alongside the prefill chunk in every step.

The mechanics: in a transformer forward pass, the input is a batch of token vectors. With SARATHI, that batch contains:

[prefill_token_1, ..., prefill_token_C, decode_token_req_1, decode_token_req_2, ..., decode_token_req_B]

The attention computation handles these correctly: prefill tokens attend to each other (causal mask) and to the existing KV cache for that sequence; decode tokens each attend only to their own KV cache. The mixed batch is a standard operation — modern serving kernels handle variable-length attention in a single pass.
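The batch composition described above can be sketched as a toy scheduler step. This is a minimal illustration, not the paper's implementation: `Request`, `build_iteration`, and `step` are hypothetical names, only one request is prefilled at a time, and no model is actually run.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def build_iteration(requests, chunk_size):
    """Return (batch, prefill_req); batch entries are (req_id, kind, num_tokens)."""
    batch = []
    # Pick the first request that still has prompt tokens left (hypothetical policy).
    prefill_req = next((r for r in requests if r.in_prefill), None)
    if prefill_req is not None:
        n = min(chunk_size, prefill_req.prompt_len - prefill_req.prefilled)
        batch.append((prefill_req.req_id, "prefill", n))
    # Every decode-phase request rides along with exactly one token.
    for r in requests:
        if not r.in_prefill:
            batch.append((r.req_id, "decode", 1))
    return batch, prefill_req


def step(requests, chunk_size):
    """Build one mixed batch and advance the prefill cursor as if it had run."""
    batch, prefill_req = build_iteration(requests, chunk_size)
    if prefill_req is not None:
        prefill_req.prefilled += batch[0][2]
    return batch


if __name__ == "__main__":
    reqs = [Request("long", 1200), Request("a", 10, 10), Request("b", 5, 5)]
    for i in range(4):
        print(i + 1, step(reqs, 512))
```

In the usage example the 1,200-token prompt is consumed as chunks of 512, 512, and 176 tokens while requests `a` and `b` decode in every iteration; on the fourth iteration `long` has joined the decode set.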

For a 4,096-token prompt with chunk size 512:

Without SARATHI:
  Iteration 1: [4096 prefill tokens] — decode stalled for ~3.2s
  Iteration 2: [decode for all requests]
  Iteration 3: [decode for all requests]
  ...

With SARATHI (chunk = 512):
  Iteration 1: [512 prefill tokens] + [decode for all requests] — 0.4s
  Iteration 2: [512 prefill tokens] + [decode for all requests] — 0.4s
  Iteration 3: [512 prefill tokens] + [decode for all requests] — 0.4s
  ...
  Iteration 8: [512 prefill tokens] + [decode for all requests] — 0.4s
  → Prefill complete after iteration 8. Request enters decode phase at iteration 9.

No decode stall exceeds 0.4 seconds, regardless of how long the original prompt was. The worst-case decode pause is bounded by C tokens of prefill time — a tunable parameter.
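The timeline above follows from a linear cost model, which can be checked in a few lines. The sketch assumes prefill time scales linearly with token count (taken from the 4,096-token / 3.2 s figure) and ignores the cost of the piggybacked decode tokens; `iteration_times` is a hypothetical helper.

```python
import math

# Assumed linear cost: 4,096 prefill tokens take about 3.2 s.
PREFILL_S_PER_TOKEN = 3.2 / 4096


def iteration_times(prompt_tokens, chunk_size=None):
    """Prefill time spent in each iteration; chunk_size=None means unchunked."""
    if chunk_size is None:
        chunk_size = prompt_tokens
    n_iters = math.ceil(prompt_tokens / chunk_size)
    times = []
    remaining = prompt_tokens
    for _ in range(n_iters):
        n = min(chunk_size, remaining)  # the last chunk may be shorter
        times.append(n * PREFILL_S_PER_TOKEN)
        remaining -= n
    return times


if __name__ == "__main__":
    unchunked = iteration_times(4096)
    chunked = iteration_times(4096, 512)
    print("worst decode pause, unchunked:", max(unchunked))
    print("worst decode pause, chunked:  ", max(chunked))
    print("iterations to finish prefill: ", len(chunked))
```

Under these assumptions the longest single iteration drops from 3.2 s to 0.4 s while the summed prefill time stays the same; only the pause seen by decode requests changes.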

What this does to TTFT and TPOT

The tradeoff is straightforward: SARATHI trades TTFT for TPOT stability.

TTFT increases for the chunked request. A 4,096-token prompt without chunking generates its first token when prefill completes — after one iteration. With chunk size 512, the first token appears after 8 iterations. Each of those 8 iterations includes some decode overhead (KV cache reads for active decode sequences, routing logic). In practice, the TTFT overhead is 5–15%: the extra decode work per iteration is small compared to the prefill compute, but it's not zero.

The paper measures LLaMA-65B on A100s with a mixed workload (256-token prompts, 256-token completions). SARATHI increases average TTFT by 12% at chunk size 512. That's the cost.
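To see where a TTFT overhead in this range could come from, here is a simple additive estimate. It is a sketch, not the paper's cost model: it assumes each chunked iteration pays the chunk's prefill time plus a fixed `decode_overhead_s` for the piggybacked decodes, and that value (40 ms below) is a made-up illustrative number you would need to measure.

```python
import math


def ttft_unchunked(prompt_tokens, prefill_s_per_token):
    """First token arrives when the single prefill iteration finishes."""
    return prompt_tokens * prefill_s_per_token


def ttft_chunked(prompt_tokens, chunk_size, prefill_s_per_token, decode_overhead_s):
    """Same prefill work, plus a fixed decode cost in every chunked iteration."""
    n_chunks = math.ceil(prompt_tokens / chunk_size)
    return prompt_tokens * prefill_s_per_token + n_chunks * decode_overhead_s


def ttft_overhead(prompt_tokens, chunk_size, prefill_s_per_token, decode_overhead_s):
    """Relative TTFT increase from chunking, e.g. 0.10 means 10% slower."""
    base = ttft_unchunked(prompt_tokens, prefill_s_per_token)
    chunked = ttft_chunked(prompt_tokens, chunk_size, prefill_s_per_token, decode_overhead_s)
    return chunked / base - 1.0


if __name__ == "__main__":
    per_token = 3.2 / 4096  # linear estimate from the 4,096-token / 3.2 s figure
    print(f"{ttft_overhead(4096, 512, per_token, 0.040):.1%}")
```

The model also shows why smaller chunks cost more TTFT: halving the chunk size doubles the number of iterations that each pay the decode overhead.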

TPOT becomes bounded and predictable. Without chunking, p99 TPOT depends on the longest prompt in the arrival distribution — a single 8,192-token prompt can cause a 6-second decode pause for every concurrent user. With chunk size 512, the maximum pause is bounded by one 512-token prefill chunk: roughly 400ms on LLaMA-70B. The paper shows p99 TPOT reductions of 4.7× to 6.1× depending on workload, at comparable throughput.

Throughput stays close. The FLOPs required to complete a prefill are the same whether you chunk or not: you're processing the same number of tokens through the same model. What chunking adds is per-iteration overhead: model weights are read once per chunk instead of once per prompt, each later chunk re-reads the KV cache written by earlier chunks, and the scheduler does more batching work. The paper reports throughput within 5% of unchunked continuous batching.

The chunk size knob

Chunk size C is the critical parameter. It controls the TTFT-TPOT tradeoff directly.

Too small (C = 64): Each prefill chunk is tiny. The number of decode tokens in the batch far outnumbers the prefill tokens; the batch is compute-underutilized. Prefill completion takes many iterations, so TTFT grows substantially. Scheduling overhead per chunk becomes non-negligible. The GPU isn't doing bad work, but it isn't doing as much work per iteration as it could.

Too large (C = 2048): A single prefill chunk already dominates the iteration. Decode pauses approach those of unchunked prefill. You've chunked the prefill in principle, but a 2,048-token chunk still takes ~1.6 seconds — long enough to cause visible TPOT spikes. The bound on worst-case pause is C, and C is too large.

Sweet spot (C = 256–512): This range works well empirically across model sizes and hardware. At 512 tokens, a chunk takes roughly 400ms on LLaMA-70B/A100, which fits a relaxed TPOT SLA but not a tight one such as the 150ms target in our cluster; a tighter SLA pushes you toward the low end of the range or below it. TTFT overhead is moderate. Decode batch size grows large enough to keep memory bandwidth utilized.

The right value depends on your specific TPOT SLA. If you're targeting 200ms TPOT, your chunk size should be small enough that one chunk processes in under 200ms. That's a function of model size and hardware — you need to measure it on your serving stack.
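Once you have a measured per-token prefill cost, picking the largest chunk that fits the SLA is simple arithmetic. The helper below is a hypothetical sketch: it assumes chunk time scales linearly with chunk size, lets you reserve a measured per-iteration decode cost, and rounds down to a multiple of 64 as an arbitrary kernel-friendly choice.

```python
def max_chunk_for_sla(tpot_sla_s, prefill_s_per_token,
                      decode_overhead_s=0.0, multiple_of=64):
    """Largest chunk size whose estimated iteration time stays within the TPOT SLA.

    Assumes iteration time = chunk_tokens * prefill_s_per_token + decode_overhead_s.
    Returns 0 if even an empty chunk would exceed the SLA.
    """
    budget = tpot_sla_s - decode_overhead_s
    if budget <= 0:
        return 0
    raw = int(budget / prefill_s_per_token)
    return (raw // multiple_of) * multiple_of


if __name__ == "__main__":
    per_token = 3.2 / 4096  # linear estimate from the 4,096-token / 3.2 s figure
    print(max_chunk_for_sla(0.200, per_token))           # 200 ms target
    print(max_chunk_for_sla(0.200, per_token, 0.045))    # reserving 45 ms for decodes
```

With the linear estimate used throughout this post, a 200ms target allows a 256-token chunk, and reserving 45ms for the decode portion of the iteration shrinks that to 192 tokens.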

Interaction with KV cache memory management

Chunked prefill interacts with KV cache allocation in a subtle way.

In unchunked continuous batching, a prefill request allocates KV cache blocks for all N tokens at once: N × 2 × layers × kv_heads × head_dim × dtype_bytes of HBM (the factor of 2 covers keys and values), reserved before the first iteration. This reservation is why vLLM will not schedule a request until it can guarantee the KV cache for the full prompt will fit.

With chunked prefill, you only need to allocate KV cache for the tokens processed so far. After the first chunk of 512 tokens, you've allocated 512 × KV_per_token bytes, not 4096 × KV_per_token bytes. The remaining 3,584 tokens' KV cache will be allocated incrementally as chunks complete.

This can improve memory efficiency: a request that will eventually need 4,096 × 320 KiB = 1.25 GiB (about 1.3 GB) of KV cache doesn't need to reserve all of it upfront. Other requests can use that HBM in the meantime. The tradeoff is that a long-prompt request might be evicted partway through its prefill if memory pressure increases, wasting the compute already spent on earlier chunks. SARATHI requires a careful preemption policy: either you complete a request's prefill before evicting, or you restart it entirely.
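The per-token figure and the incremental allocation can be worked out directly. The sketch assumes a LLaMA-2-70B-style layout (80 layers, 8 KV heads from grouped-query attention, head dimension 128, FP16), which is an assumption about the model rather than something stated in the paper; the function names are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def allocated_after_chunks(prompt_tokens, chunk_size, chunks_done, per_token_bytes):
    """KV cache bytes held by a request after `chunks_done` prefill chunks."""
    tokens = min(prompt_tokens, chunk_size * chunks_done)
    return tokens * per_token_bytes


# Assumed LLaMA-2-70B-style shape with grouped-query attention.
PER_TOKEN = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)

if __name__ == "__main__":
    mib = 1024 ** 2
    print("per token:", PER_TOKEN // 1024, "KiB")
    print("after 1 chunk:", allocated_after_chunks(4096, 512, 1, PER_TOKEN) // mib, "MiB")
    print("after 8 chunks:", allocated_after_chunks(4096, 512, 8, PER_TOKEN) // mib, "MiB")
```

Under this layout a token costs 320 KiB, the first 512-token chunk holds 160 MiB, and the full 4,096-token prompt ends at 1,280 MiB (1.25 GiB).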

SARATHI vs. DistServe: when to use which

These papers solve related problems with different infrastructure costs.

SARATHI is a scheduler change: no additional hardware, no inter-node KV cache transfer, no separate GPU pools. Deploy it in your existing serving cluster. The cost is engineering complexity in the scheduler and a tunable TTFT penalty.

DistServe (disaggregated prefill/decode) gives you independent scaling: add prefill GPUs when long-prompt requests increase, add decode GPUs when concurrent users increase. This is architecturally cleaner and achieves larger gains for workloads with extreme prefill/decode imbalance. The cost is real: you need two GPU pools, high-bandwidth interconnect (InfiniBand or NVLink) for KV cache transfer, and a routing layer that manages placement across pools. At 4,096 tokens, KV cache transfer is ~1.3 GB: about 13ms if the link sustains roughly 100 GB/s, and proportionally longer over a single slower InfiniBand or PCIe link.
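The transfer cost is bytes divided by bandwidth, so it is easy to plug in your own link speed. The sketch ignores link latency and protocol overhead, and the two bandwidths in the example are illustrative assumptions rather than figures from either paper.

```python
def kv_transfer_ms(tokens, kv_bytes_per_token, link_bytes_per_s):
    """Milliseconds to move one request's KV cache over a link at the given rate."""
    return tokens * kv_bytes_per_token / link_bytes_per_s * 1e3


KV_BYTES_PER_TOKEN = 320 * 1024  # the 320 KB-per-token figure used earlier

if __name__ == "__main__":
    fast = kv_transfer_ms(4096, KV_BYTES_PER_TOKEN, 100e9)  # hypothetical 100 GB/s link
    slow = kv_transfer_ms(4096, KV_BYTES_PER_TOKEN, 25e9)   # hypothetical 25 GB/s link
    print(f"100 GB/s: {fast:.1f} ms, 25 GB/s: {slow:.1f} ms")
```

At 100 GB/s the 4,096-token cache moves in roughly 13 ms; at 25 GB/s it takes roughly four times as long, which starts to matter against a tight TTFT budget.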

Use SARATHI when:

You have a single GPU pool and want to stop decode stalls without adding hardware
Your prefill/decode ratio is variable and you don't know which to over-provision
Your cluster doesn't have the interconnect bandwidth for efficient KV cache migration

Use DistServe when:

You have stable, predictable workloads with distinct long-prompt and chat segments
You have high-bandwidth interconnect in your cluster
You need to scale prefill and decode capacity independently
TTFT SLO tightness requires dedicated prefill GPUs with no decode interference

They're not mutually exclusive. DistServe's prefill instances can themselves use chunked prefill internally to avoid stalling other requests in the prefill pool.

When NOT to use SARATHI

Pure prefill workloads. Batch document processing, offline summarization, index building — workloads with no concurrent decode requests. Chunking adds scheduling overhead (more iterations, more dispatch logic) with zero benefit: there are no decode requests to protect.

Very short prompts. If your median prompt is 64 tokens, the prefill takes ~50ms and doesn't stall decode noticeably. Chunking 64 tokens into chunks of 32 is pointless overhead. Chunking only matters when prefill duration approaches or exceeds your TPOT SLA.

Tight TTFT SLOs for long prompts. If users submitting 8,192-token prompts need a first token in under 500ms, chunked prefill won't help — you're spreading that prefill over 16+ iterations, each including decode overhead. DistServe's dedicated prefill instances are a better fit for strict TTFT requirements on long inputs.

Streaming inference at fixed decode batch sizes. Some serving configurations maintain a fixed decode batch for predictable latency. SARATHI's mixed batches vary in size each iteration — sometimes more decode tokens, sometimes fewer depending on which prefill chunks complete. This variability can conflict with latency predictability guarantees in certain production deployments.

What the paper actually measured

The evaluation is on LLaMA-65B and LLaMA-13B on A100-80GB clusters with synthetic workloads based on Azure LLM trace distributions.

Key numbers:

At 4,096-token prompt length, SARATHI reduces p99 TPOT from 6.1 seconds to 1.1 seconds — 5.5× improvement
Throughput penalty vs. unchunked continuous batching: ~5% at chunk size 512
TTFT overhead: 12% average increase at chunk size 512
GPU utilization improves from ~60% (unchunked, decode-heavy phase) to ~75% (SARATHI mixed batches), because mixed iterations better utilize both compute and bandwidth

The comparison baseline is vanilla continuous batching (no chunking). SARATHI doesn't compare favorably to DistServe at extreme scale — disaggregation wins when you can afford the hardware — but SARATHI requires no additional hardware, which is the relevant comparison for most teams.

Production deployment notes

A few things that don't appear prominently in the paper but matter operationally:

Chunk completion race conditions. When a prefill request spans many chunks, it occupies scheduler slots across many iterations. If a decode request finishes mid-prefill, the prefill request may be able to claim a larger chunk next iteration. Your scheduler needs to handle dynamic chunk sizing gracefully — a fixed chunk size policy is simpler to reason about.
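One way to make dynamic chunk sizing predictable is a per-iteration token budget: decode tokens are admitted first and the prefill chunk takes whatever is left. The helper below is a hypothetical sketch of that policy; `token_budget` is an illustrative parameter name, not a setting from the paper or any particular framework.

```python
def next_chunk_size(remaining_prompt_tokens, num_decodes, token_budget):
    """Prefill tokens to schedule this iteration under a fixed total token budget.

    Each decode request consumes one token of the budget; the chunk gets the rest,
    capped by how much of the prompt is left. Never returns a negative size.
    """
    free = token_budget - num_decodes
    return max(0, min(remaining_prompt_tokens, free))


if __name__ == "__main__":
    print(next_chunk_size(4096, 32, 512))  # 32 decodes in flight
    print(next_chunk_size(4096, 16, 512))  # half of them finished: chunk grows
    print(next_chunk_size(100, 16, 512))   # tail of the prompt: chunk shrinks
```

Because the total tokens per iteration stay capped, iteration time stays roughly bounded even though the chunk size itself moves as decode requests come and go.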

Attention mask complexity. Mixed batches with prefill tokens and decode tokens have more complex attention masks. Prefill tokens use a causal mask within the new tokens plus full attention to the existing KV cache. Decode tokens use only their KV cache. Efficient kernel implementation of this mixed-batch attention is non-trivial; off-the-shelf attention kernels may not handle it optimally. vLLM (v0.4+) and SGLang both ship chunked-prefill implementations with production-ready kernels (SGLang's RadixAttention is its prefix-caching layer, not the mixed-batch kernel itself); don't assume your framework handles this correctly without verification.
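The mask rule for a single sequence is easy to state in code, even though real kernels never materialize it this way. The sketch below builds the boolean mask for one prefill chunk and for one decode token; `chunk_mask` and `decode_mask` are hypothetical helpers, and tokens from different sequences in the batch are assumed never to attend to each other.

```python
def chunk_mask(cached_len, chunk_len):
    """mask[i][j] is True if chunk token i may attend to key position j.

    Key positions 0..cached_len-1 are the existing KV cache (always visible);
    positions cached_len.. are the chunk itself (causal: token i sees up to itself).
    """
    total = cached_len + chunk_len
    return [[j <= cached_len + i for j in range(total)] for i in range(chunk_len)]


def decode_mask(cached_len):
    """A decode token attends to its whole KV cache plus its own position."""
    return [[True] * (cached_len + 1)]


if __name__ == "__main__":
    for row in chunk_mask(cached_len=2, chunk_len=3):
        print("".join("x" if allowed else "." for allowed in row))
```

A decode token is just the degenerate case of a one-token chunk, which is why a variable-length attention kernel can treat both kinds of entries uniformly.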

Memory fragmentation. Incremental KV cache allocation across many chunks can fragment HBM differently than bulk allocation. Monitor KV cache fragmentation metrics if you implement SARATHI on PagedAttention-based systems; the block allocator behavior under chunked prefill may differ from what you're tuned for.

The core insight of SARATHI is simple enough that it's easy to underestimate: decode requests don't need the full KV cache of a long prefill to make progress — they just need the GPU to finish processing a manageable chunk first. Bounding that chunk size bounds the pause. Everything else is implementation.

The paper's lasting contribution is making this concrete: measuring exactly what chunk size buys, what it costs, and how the tradeoff behaves across model sizes and workloads. If you're running any system where long-prompt requests and streaming chat completions share GPU time, this paper describes why your p99 TPOT graph looks the way it does and what to do about it.