Continuous batching: what the Orca paper actually says
Reading Yu et al. (Seoul National University, OSDI 2022) while diagnosing why GPU utilization sat at 40% under load.
We had 20 concurrent users hammering a 13B model. GPU utilization: 38%. Not memory-bound — the model fit easily. Not compute-bound — the GPU was mostly idle. The queue was growing, latency was climbing, but the hardware sat there doing nothing for most of every second.
The problem was the serving strategy we'd inherited from the ML team: static batching. Collect N requests, run them all together until every single one finishes generating, then collect the next batch. It seemed sensible. It was destroying throughput.
The paper that diagnosed this precisely — and fixed it — is "Orca: A Distributed Serving System for Transformer-Based Generative Models," Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun, from Seoul National University, presented at OSDI 2022. The core contribution is called iteration-level scheduling, now commonly referred to as continuous batching. It's the technique that underpins every modern LLM serving system: vLLM, TGI, TensorRT-LLM, SGLang. If you're running LLMs in production, you're using this whether you know it or not.
The problem with static batching
To understand why static batching fails, you need a precise model of what "batching" means for autoregressive inference.
During the prefill phase, the model processes the entire input prompt in a single forward pass — this is embarrassingly parallelizable and GPU-efficient. During the decode phase, the model generates one token per forward pass, autoregressively, until it hits an end-of-sequence token or a maximum length. The decode phase is where serving systems spend most of their time.
Static batching groups requests together at the request level: wait until you have N requests, start them all at the same time, and run the decode loop until all N have completed. Then start the next batch.
The failure mode is immediate: requests generate different numbers of tokens. A request for a one-sentence summary might finish in 15 tokens. A request for a detailed analysis might run to 800. With static batching, every request in the batch must wait for the longest one to finish before any slot opens up. Your 15-token request, done in 200ms, holds its slot for 9 more seconds while the 800-token request grinds through. The GPU is still executing forward passes, but the slots of finished sequences either sit empty or are filled with padding that burns cycles. Both are waste.
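To put a number on that waste, here is a minimal sketch that counts useful slot-steps in a single static batch. It assumes every decode step costs the same and ignores prefill; the function name and the example lengths are illustrative, not taken from the paper.

```python
def static_batch_utilization(output_lengths):
    """Fraction of (slot, step) pairs that emit a real token in one static batch."""
    if not output_lengths:
        raise ValueError("batch must contain at least one request")
    steps = max(output_lengths)    # the batch runs until its longest request ends
    useful = sum(output_lengths)   # steps in which a slot produces a real token
    return useful / (steps * len(output_lengths))


# Hypothetical batch: three short requests stuck behind one 800-token request.
print(f"{static_batch_utilization([15, 40, 60, 800]):.1%}")  # 28.6%
```

Under these assumptions, one long request in a batch of four leaves more than 70% of the slot-steps idle.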
The paper quantifies this precisely. With a trace of real LLM serving workloads, the output length distribution is heavy-tailed: most requests are short, but some are very long. Static batching forces you to schedule around the tail. Mean GPU utilization under realistic workloads: roughly 40%.
There's a second failure mode: inter-batch gaps. While you're waiting to accumulate the next batch of N requests, the GPU sits idle. High-QPS systems minimize this, but it compounds the waste.
Iteration-level scheduling
Orca's fix is conceptually simple: instead of scheduling at the request boundary, schedule at the iteration boundary — after every single decoding step.
After each forward pass, the scheduler runs:
- Check which sequences just generated an EOS token or hit max_new_tokens. Remove them from the batch.
- Check how many slots just opened up.
- Pull that many requests from the waiting queue and add them to the active batch.
- Run the next forward pass with the updated batch.
No request waits for another request to finish. The moment a short request completes, a waiting request takes its slot. The GPU is never idle between batches because there's no batch boundary — just a continuous stream of iteration steps with slots being filled and emptied dynamically.
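The loop above can be sketched as a toy simulation. This is not Orca's scheduler: the `Request` class and `run_continuous` function are hypothetical names, each "forward pass" is reduced to incrementing a counter, and `max_new_tokens` stands in for both EOS and the length limit.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    max_new_tokens: int  # stand-in for "EOS generated or length limit hit"
    generated: int = 0


def run_continuous(requests, max_active):
    """Simulate iteration-level scheduling; return (iterations, finish step per id)."""
    waiting = deque(requests)
    active = []
    finish_step = {}
    step = 0
    while waiting or active:
        # Fill every free slot from the waiting queue before the next pass.
        while waiting and len(active) < max_active:
            active.append(waiting.popleft())
        # One "forward pass": every active sequence emits one token.
        step += 1
        for req in active:
            req.generated += 1
        # Retire finished sequences so their slots open up immediately.
        still_running = []
        for req in active:
            if req.generated >= req.max_new_tokens:
                finish_step[req.req_id] = step
            else:
                still_running.append(req)
        active = still_running
    return step, finish_step


reqs = [Request(i, n) for i, n in enumerate([3, 1, 2, 1])]
steps, finished = run_continuous(reqs, max_active=2)
print(steps)  # 4
```

With the same four requests and static batches of two, the first batch would run 3 steps and the second 2, for 5 in total; the iteration-level version finishes in 4 because request 2 starts as soon as request 1 leaves.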
The paper calls the size of the active set the batch size, which is now a live variable rather than a fixed hyperparameter. The scheduler can also enforce a maximum concurrent request count to bound KV cache memory usage.
The throughput improvement in the paper is substantial. Against NVIDIA's FasterTransformer (the state-of-the-art static-batching system at the time), Orca reports up to a 36.9x throughput improvement at the same latency level. This isn't a marginal improvement; it's a regime change. The reason is simple: static batching was wasting most of the hardware. Iteration-level scheduling stops wasting it.
The selective batching problem
Implementing this correctly is non-trivial because of a technical constraint the paper addresses carefully: you can't batch all transformer operations across sequences with different KV cache states.
Here's the issue. At iteration step T, the active batch contains, say, 32 sequences. Some have been running for 5 tokens, some for 200 tokens. Their KV caches have different lengths. For a standard batched attention implementation, you'd need to pad all sequences to the same KV cache length — but that would mean sequences at step 5 carrying 195 padding entries, negating the efficiency gains.
Orca introduces selective batching: different operations in the transformer are batched differently.
For the linear layers (QKV projections, the MLP/FFN layers, output projections), the computation is applied token by token: each operation takes the hidden state of a single token position and produces an output. These don't depend on the KV cache at all. So you can take the N current tokens, one from each of the N active sequences, stack them into a single (N × d_model) matrix, run one giant matmul, and split the result. These ops scale with the number of active sequences and are always fully batched.
For attention, you need the full KV cache for each sequence. Each sequence's attention scores are computed over its own history only — there's no cross-sequence interaction. Orca handles these with per-sequence attention computations: each of the N active sequences computes its own attention over its own KV cache. This is the only part that can't be naively batched across sequences, and the paper shows it's not the primary bottleneck for most workloads.
The result is that the expensive compute (the large matrix multiplications that dominate FLOPs) scales with the number of active sequences and runs efficiently, while the per-sequence attention adds linear overhead proportional to KV cache size — which is bounded by max sequence length.
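Here is a rough single-head sketch of that split, with plain Python lists standing in for tensors. The function names (`project`, `attend`, `decode_step`) and the cache layout are assumptions made for illustration; a real system would run fused GPU kernels rather than Python loops.

```python
import math


def project(X, W):
    """Batched linear layer: multiply every row of X (n x d) by W (d x k)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]


def attend(q, keys, values):
    """Softmax attention for ONE sequence over its own KV cache."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_step(X, Wq, Wk, Wv, kv_caches):
    """X holds one current-token vector per active sequence."""
    # Batched part: projections see all N tokens at once, whatever their history.
    Q, K, V = project(X, Wq), project(X, Wk), project(X, Wv)
    outputs = []
    # Per-sequence part: each sequence attends over its own cache, with no padding.
    for q, k, v, cache in zip(Q, K, V, kv_caches):
        cache["k"].append(k)
        cache["v"].append(v)
        outputs.append(attend(q, cache["k"], cache["v"]))
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
caches = [
    {"k": [[1.0, 0.0] for _ in range(3)], "v": [[0.0, 1.0] for _ in range(3)]},
    {"k": [], "v": []},  # a sequence that has just been admitted
]
decode_step([[0.5, 0.5], [1.0, 0.0]], identity, identity, identity, caches)
print([len(c["k"]) for c in caches])  # [4, 1]
```

The two sequences share the projection call but keep caches of different lengths, which is the point: nothing forces the shorter one to be padded up to the longer one.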
What continuous batching changed about production LLM serving
Before Orca, serious serving deployments used static batching with careful batch size tuning. The folklore was that you needed to match batch size to peak load and accept that utilization would vary. You'd set max_batch_size = 16, watch your GPU run at 40% under mixed-length workloads, and shrug.
After Orca (and its open-source implementations in vLLM, TGI, etc.), the model shifted. The meaningful system parameter is now maximum concurrent requests (bounded by KV cache memory), not batch size. Throughput scales close to linearly with the concurrent request count up to the memory limit, and the GPU stays busy because there's always something in the active set as long as there's demand.
This interacts directly with PagedAttention. Continuous batching increases the number of concurrent requests you want to run. More concurrent requests means more KV cache memory pressure. PagedAttention solves the KV cache fragmentation problem that would otherwise limit how many requests you can hold concurrently. The two papers are complementary: Orca fixes the scheduling inefficiency; PagedAttention fixes the memory inefficiency. vLLM implements both.
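To make the memory bound concrete, here is a back-of-the-envelope sketch. It assumes standard multi-head attention (no grouped-query sharing), fp16 values, and a worst case where every request reserves KV space for its full context. The 40-layer, 5120-wide shape is an assumed 13B-class configuration, and the 10 GiB budget is a made-up figure.

```python
def kv_bytes_per_token(n_layers, hidden_size, bytes_per_value=2):
    """KV cache bytes for one token: a K and a V vector in every layer."""
    return 2 * n_layers * hidden_size * bytes_per_value


def max_concurrent(kv_budget_bytes, tokens_per_request, n_layers, hidden_size,
                   bytes_per_value=2):
    """Requests that fit if each one reserves space for tokens_per_request tokens."""
    per_request = tokens_per_request * kv_bytes_per_token(
        n_layers, hidden_size, bytes_per_value)
    return kv_budget_bytes // per_request


# Assumed model shape and hypothetical memory budget.
print(kv_bytes_per_token(40, 5120))                  # 819200
print(max_concurrent(10 * 2**30, 2048, 40, 5120))    # 6
```

Under these assumptions each token costs about 0.8 MB of cache, so reserving 2,048 tokens per request caps concurrency at 6 in 10 GiB. Allocating cache in small blocks as tokens are actually produced, rather than reserving the full context up front, is what lets the same budget hold more requests.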
The prefill-decode interference problem
One failure mode the paper doesn't fully address: prefill pollution.
When a new request enters the active batch, it needs a prefill step: the model processes the entire input prompt in a single forward pass. For a 4,096-token system prompt, that's a massive computation — far larger than a single decode step. If you inject a new request's prefill into an active batch of decode steps, you've just made that iteration take 10x longer. Every sequence that was mid-decode experiences a latency spike.
This is not theoretical. In systems running continuous batching naively, you see P99 decode latency spikes every time a new long-context request is admitted. The prefill step isn't bounded; a user with a 32K-token prompt causes a 32K-token iteration that stalls all other concurrent sequences.
The solutions are:
- Chunked prefill: break the prefill into fixed-size chunks (e.g., 512 tokens per iteration). Each iteration processes a chunk of the new request's prefill alongside existing decode steps. The new request takes longer to start, but doesn't spike other sequences' latency.
- Disaggregated prefill/decode: run prefill and decode on separate GPU pools entirely. New requests go to prefill workers; once prefill is done, the KV cache is transferred and decode runs on dedicated decode workers. Splitwise (ISCA 2024) formalizes this and shows significant efficiency improvements.
Orca's original design doesn't implement either. Most modern frameworks add chunked prefill as a default option, and disaggregated architectures are becoming common at scale.
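A small sketch of what chunking does to per-iteration work, using token count per iteration as a crude proxy for iteration time. That proxy is an assumption: real cost also depends on attention over the growing context. The `plan_iterations` function is hypothetical and not taken from any framework.

```python
def plan_iterations(prompt_len, n_decoding, chunk_size=512):
    """Tokens processed in each iteration while one new prompt is prefilled in chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        # One prefill chunk plus one decode token per already-running sequence.
        plan.append(chunk + n_decoding)
        remaining -= chunk
    return plan


chunked = plan_iterations(prompt_len=4096, n_decoding=31, chunk_size=512)
unchunked = plan_iterations(prompt_len=4096, n_decoding=31, chunk_size=4096)
print(len(chunked), max(chunked))      # 8 543
print(len(unchunked), max(unchunked))  # 1 4127
```

The new request waits 8 iterations instead of 1 before its first output token, but no iteration handles more than 543 tokens, so the 31 running sequences never see the 4,127-token stall.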
Head-of-line blocking from long sequences
Continuous batching eliminates the worst of static batching's inefficiency, but it doesn't eliminate head-of-line blocking entirely. A single sequence generating 10,000 tokens still occupies a slot for its entire run. If your maximum concurrent count is 32, that one pathological request ties up about 3% of your capacity (1 of 32 slots) for minutes.
This matters less than it sounds for most workloads because:
- Long outputs usually come from high-latency use cases where users expect to wait anyway.
- A 10K-token request occupies one slot — 31 other requests are still running normally.
- Maximum token limits bound the worst case.
It matters more than it sounds for:
- Interactive use cases where you have output length variance but strict latency SLAs. A user expecting a sub-second response shouldn't be in the same pool as a user generating a 5K-word essay.
- Long-context inference where even "short" responses have expensive prefills. A 32K-token context requires 32K tokens of prefill regardless of output length.
The practical fix is request routing: separate queues for short-context and long-context requests, each with its own concurrency limit and serving stack.
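One way such routing might look, as a minimal sketch. The `Pool` class, the `route` function, the dictionary keys, and both cutoff values are hypothetical; real thresholds would come from your own latency targets and traffic mix.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    """One serving pool with its own waiting queue and concurrency limit."""
    name: str
    max_concurrent: int
    waiting: deque = field(default_factory=deque)


def route(request, short_pool, long_pool, prompt_cutoff=2048, output_cutoff=512):
    """Queue a request in the long pool if its prompt or its output cap is large."""
    is_long = (request["prompt_tokens"] > prompt_cutoff
               or request["max_new_tokens"] > output_cutoff)
    pool = long_pool if is_long else short_pool
    pool.waiting.append(request)
    return pool.name


short = Pool("short", max_concurrent=64)
long_ = Pool("long", max_concurrent=8)
print(route({"prompt_tokens": 300, "max_new_tokens": 128}, short, long_))     # short
print(route({"prompt_tokens": 32_000, "max_new_tokens": 128}, short, long_))  # long
```

Routing on the declared output cap rather than the actual output length is a deliberate simplification: the actual length isn't known at admission time, so the cap is the only bound available.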
When NOT to use continuous batching
Single-user, single-request latency: If you're serving one request at a time and care only about that request's latency, continuous batching adds scheduling overhead without benefit. You have no concurrency to exploit. This is the research or local-development case.
Homogeneous workloads with fixed output lengths: If every request generates exactly K tokens (e.g., a classification task that always produces a one-token label, or a fixed-format extraction task with enforced length limits), static batching is fine. Output lengths are identical; there's nothing for continuous batching's dynamic scheduling to optimize. The padding waste is zero.
Streaming audio or video token generation: Systems generating continuous media streams often need strict ordering and bounded latency per token, not maximum throughput. Iteration-level scheduling introduces variable inter-token latency as the batch composition changes dynamically. For speech synthesis or real-time video, this jitter matters.
Very low QPS with high per-token cost: At very low request rates, the active batch is usually size 1 anyway. Continuous batching degenerates to single-request serving and adds complexity without gain.
What the paper actually proved
The Orca paper's core argument is empirical but decisive: under serving workloads with variable output lengths, iteration-level scheduling outperforms request-level static batching, by more than an order of magnitude in the paper's best case. The argument isn't complex; it's a direct consequence of arithmetic. Static batching wastes the GPU cycles tied to slots whose short requests have already finished while they wait for long ones. Continuous batching recovers those cycles.
The implementation details — selective batching, the scheduler design, the KV cache management — are where the engineering lives. But the insight itself is almost embarrassingly simple in retrospect: don't wait for the batch to finish before scheduling the next request.
That simplicity is why it spread so fast. Within a couple of years of the OSDI 2022 paper, the major LLM serving frameworks had either implemented it directly or rebuilt their scheduler around the same concept. Static batching is now the legacy approach, found mainly in systems that were designed before 2023 and haven't been updated.
If you're debugging throughput on an LLM serving deployment, the first question is: are you using continuous batching? The second question is: are you using it correctly (concurrency limits, chunked prefill, request routing by length class)? If the answer to both is yes and you're still throughput-limited, read the PagedAttention post — you're probably hitting KV cache memory fragmentation, not scheduling inefficiency.