Speculative decoding: what the inference speed paper actually says
Reading Leviathan et al. (Google Research, ICML 2023) while chasing p50 latency in a streaming completions endpoint.
The request came in, the model started generating, and the first token took 180ms. Not because the model was slow — A100, tight batching, FlashAttention-2. Because the model was a 70B parameter transformer, and every token requires a full forward pass through all 70 billion parameters. No matter what you do to the hardware, you're memory-bandwidth limited on that pass. Generating a 200-token response means 200 sequential forward passes. Parallelism doesn't help here; each pass depends on the previous output.
This is the fundamental latency constraint of autoregressive decoding, and it doesn't yield to hardware scaling the way training does. Speculative decoding — "Fast Inference from Transformers via Speculative Decoding," Leviathan, Kalman, and Matias, Google Research, ICML 2023 — attacks it directly, without modifying the model or accepting approximate outputs. The output distribution is mathematically identical to running the large model alone.
The problem the paper is actually solving
Autoregressive LLM decoding has an unusual compute profile. During training, you process the entire sequence in one forward pass — all positions are computed in parallel. During inference, you can't: you don't know token N+1 until you've sampled token N. The KV cache helps by avoiding redundant recomputation of attention keys and values, but it doesn't change the sequential dependency. Every new token costs one full forward pass.
For large models at small batch sizes, this forward pass is memory-bandwidth bound, not compute bound. The arithmetic intensity — FLOPs per byte of memory accessed — is too low. You're reading 70 billion parameters from GPU HBM, doing relatively little compute per parameter, and outputting one token. The GPU's CUDA cores are underutilized. The bottleneck is how fast you can stream model weights through the memory bus, not how fast you can multiply.
The paper's key observation: if you're memory-bandwidth bound, adding a small amount of extra compute in that same pass costs almost nothing. Processing K+1 token positions in a single forward pass takes roughly the same wall-clock time as processing 1 position — because the binding constraint is memory bandwidth, not arithmetic operations. This is the exploit that speculative decoding uses.
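A back-of-envelope roofline estimate makes this concrete. The sketch below is not a measurement: it assumes 4× A100 80GB at roughly 2.0 TB/s of HBM bandwidth and 312 TFLOP/s of dense BF16 each, perfect scaling across GPUs, and about 2 FLOPs per parameter per token position. It ignores KV-cache reads, attention FLOPs, and interconnect traffic. The function name and every constant are illustrative assumptions.

```python
def step_time_ms(params, positions, n_gpus=4,
                 bytes_per_param=2,     # BF16 weights
                 hbm_bw=2.0e12,         # bytes/s per GPU (assumed A100 80GB)
                 peak_flops=312e12):    # dense BF16 FLOP/s per GPU (assumed)
    """Idealized roofline estimate for one forward pass over `positions` tokens."""
    # Time to stream every weight through the memory bus once.
    memory_s = params * bytes_per_param / (hbm_bw * n_gpus)
    # Time for the matmuls: ~2 FLOPs per parameter per token position.
    compute_s = 2 * params * positions / (peak_flops * n_gpus)
    return {
        "memory_ms": memory_s * 1e3,
        "compute_ms": compute_s * 1e3,
        "step_ms": max(memory_s, compute_s) * 1e3,
    }

one = step_time_ms(70e9, positions=1)
five = step_time_ms(70e9, positions=5)   # K = 4 draft tokens plus one more
print(one)
print(five)
```

Under these assumptions both passes are bounded by the same 17.5 ms of weight streaming, and the four extra positions add well under a millisecond of arithmetic. Real kernels fall short of peak numbers, so treat the ratio, not the absolute values, as the takeaway.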
The mechanism: draft, then verify
Speculative decoding uses two models: a small draft model (fast, cheap, imperfect) and the large target model (slow, expensive, the one you actually want).
The loop looks like this:
- Run the draft model autoregressively for K steps, producing K candidate tokens x̃_1, …, x̃_K and their probability distributions q(·) at each position.
- Run the target model once on the full context plus all K draft tokens, processing K+1 positions in one forward pass and producing target distributions p(·) at each position.
- For each draft token in order, accept or reject using the following rule:
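Accept x̃_i if: u < p(x̃_i | context) / q(x̃_i | context), where u ~ U[0,1] is drawn fresh for each token
If accepted, move to the next token. If rejected at position i, discard tokens i through K, sample a corrected token from a residual distribution, and start a new round. If all K draft tokens are accepted, sample one extra token from the target's distribution at position K+1. Every round therefore yields between 1 and K+1 tokens.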
The corrected distribution on rejection is:
p'(x) = max(0, p(x) - q(x)) / Σ_y max(0, p(y) - q(y))
This matters. The rejection correction ensures that the final output distribution is exactly p, the target model's distribution. The small model's approximation errors are mathematically canceled out, not hidden. This is not a heuristic or an approximation that happens to work in practice. It's a proof in the paper.
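The loop can be sketched in a few lines of Python over toy distributions. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, distributions are plain lists of probabilities, and the target is called once per position, where a real system would score all K+1 positions in a single batched pass. The sketch also includes the extra token sampled from the target when every draft token is accepted.

```python
import random

def sample(dist, rng):
    """Draw a token id from a list of probabilities."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_round(target, draft, context, k, rng):
    """One draft-then-verify round; returns between 1 and k + 1 new tokens."""
    context = tuple(context)
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted, q_dists = [], []
    ctx = context
    for _ in range(k):
        q = draft(ctx)
        x = sample(q, rng)
        drafted.append(x)
        q_dists.append(q)
        ctx = ctx + (x,)
    # 2. Target distributions at all k + 1 positions.
    p_dists = [target(context + tuple(drafted[:i])) for i in range(k + 1)]
    # 3. Accept or reject left to right.
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)                              # accepted
        else:
            out.append(sample(residual(p, q), rng))    # corrected token
            return out                                 # discard the rest
    out.append(sample(p_dists[k], rng))                # all accepted: extra token
    return out

def toy_target(ctx):
    # Hypothetical target: the distribution depends only on the last token.
    table = {None: [0.5, 0.3, 0.2], 0: [0.1, 0.6, 0.3],
             1: [0.3, 0.3, 0.4], 2: [0.7, 0.2, 0.1]}
    return table[ctx[-1] if ctx else None]

def toy_draft(ctx):
    # Hypothetical draft: a fixed, slightly wrong guess regardless of context.
    return [0.4, 0.4, 0.2]

print(speculative_round(toy_target, toy_draft, (), k=4, rng=random.Random(0)))
```

Running many rounds from the same context and counting the first emitted token gives a quick empirical check: the frequencies should converge to the target's distribution, not the draft's.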
What the performance numbers actually say
The expected number of tokens generated per speculation round is:
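E[tokens per round] = (1 - α^{K+1}) / (1 - α)
where α is the per-token acceptance rate (assumed independent across positions) and K is the speculation length. The count includes the final token sampled from the target, which is why the exponent is K+1. At α = 0.8 and K = 4, this gives approximately 3.4 tokens per round instead of 1.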
The wall-clock speedup depends on two things: how good the acceptance rate is, and how cheap the draft model is relative to the target. The paper defines a speedup factor:
speedup ≈ (1 - α^{K+1}) / ((1 - α) × (K × c + 1))
where c is the per-token cost of the draft model as a fraction of the target. For c close to 0 (draft is very cheap), speedup approaches the expected tokens per round directly. For c = 0.1, α = 0.8, and K = 4, you get roughly 2.4×, inside the 2–3× range the paper reports on T5-XXL (11B parameters) with T5-small (77M parameters) as the draft model.
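Both formulas are easy to evaluate directly, which helps when choosing K for a measured α and c. The helper names below are illustrative, and the model assumes a fixed, independent per-token acceptance rate, which real workloads only approximate.

```python
def expected_tokens(alpha, k):
    """Expected tokens produced per round: sum of alpha**i for i = 0..k."""
    if alpha == 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha, k, c):
    """Expected tokens per round divided by round cost in target-pass units."""
    return expected_tokens(alpha, k) / (k * c + 1)

def best_k(alpha, c, max_k=16):
    """Speculation length in 1..max_k that maximizes the estimated speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, c))

for alpha in (0.5, 0.6, 0.8):
    k = best_k(alpha, c=0.1)
    print(f"alpha={alpha}: K=4 gives {speedup(alpha, 4, 0.1):.2f}x, "
          f"best K={k} gives {speedup(alpha, k, 0.1):.2f}x")
```

Because the numerator saturates as K grows while the denominator keeps rising, there is always a finite best K for any c > 0; lower acceptance rates push it toward shorter speculation.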
In practice: the paper reports 2–3× end-to-end latency reduction on T5-XXL. Anecdotally, most production deployments I've seen on 70B models with a 7B draft land between 2× and 2.5×, heavily dependent on acceptance rate.
Production tradeoffs no one mentions in the benchmark post
Both models must fit in memory simultaneously. The standard framing is "draft model is small so it's fine." In practice, if you're running a 70B target on 4× A100s with tight memory budget — KV cache, activation memory, FlashAttention buffers — adding a 7B draft model is another 14 GB in BF16. On a 320 GB aggregate budget that's manageable; on a deployment where you've squeezed to exactly fit the target, it breaks things. The memory cost isn't theoretical.
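A rough budget shows where the memory goes once the draft model's own KV cache is counted. Everything below is an assumption for illustration: the layer and head counts are Llama-2-70B-like and Llama-2-7B-like shapes, the workload of 32 concurrent 2048-token sequences is hypothetical, and activation memory and framework overhead are left out.

```python
GB = 1e9

def weights_gb(n_params, bytes_per_param=2):
    """Weight footprint in GB (2 bytes per parameter for BF16)."""
    return n_params * bytes_per_param / GB

def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache footprint in GB: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / GB

# Hypothetical workload: 32 concurrent sequences of 2048 tokens each.
tokens_in_flight = 32 * 2048

budget = {
    "target weights": weights_gb(70e9),
    "draft weights": weights_gb(7e9),
    # Assumed Llama-2-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128.
    "target KV cache": kv_cache_gb(tokens_in_flight, 80, 8, 128),
    # Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128.
    "draft KV cache": kv_cache_gb(tokens_in_flight, 32, 32, 128),
}
for name, gb in budget.items():
    print(f"{name:>16}: {gb:6.1f} GB")
print(f"{'total':>16}: {sum(budget.values()):6.1f} GB of 320 GB")
```

Under these assumed shapes the draft's KV cache is not negligible: without grouped-query attention, a 7B model can need more cache per token than the 70B target, so the 14 GB of weights understates the real cost.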
Acceptance rates vary more than the headline numbers suggest. The 0.8 figure is a mean across a benchmark distribution. For specific query types, such as structured output (JSON, code), high-temperature sampling, and out-of-domain inputs, acceptance rates can be 0.5 or lower. At α = 0.5 and K = 4, the expected tokens per round fall to about 1.9, and once you pay for the draft model (c = 0.1, so each round costs 1.4 target passes) the speedup is about 1.4×. Acceptance rate monitoring is not optional; it's the core operational metric for speculative decoding.
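One way to track this metric is a rolling estimator fed by the number of draft tokens accepted in each round. The class below is a hypothetical sketch, not any serving framework's API. It deliberately avoids dividing accepted tokens by K × rounds, which would underestimate α because draft tokens after the first rejection are never evaluated.

```python
from collections import deque

class AcceptanceMonitor:
    """Rolling estimate of the per-token acceptance rate alpha."""

    def __init__(self, k, window=1000):
        self.k = k
        # Number of accepted draft tokens in each of the last `window` rounds.
        self.rounds = deque(maxlen=window)

    def record(self, n_accepted):
        if not 0 <= n_accepted <= self.k:
            raise ValueError("n_accepted must be between 0 and k")
        self.rounds.append(n_accepted)

    def alpha(self):
        """Accepted tokens divided by tokens actually evaluated, or None if empty."""
        # Every accepted token is a success; a round that stopped early
        # contributes exactly one failure (the rejected token).
        successes = sum(self.rounds)
        failures = sum(1 for n in self.rounds if n < self.k)
        trials = successes + failures
        return successes / trials if trials else None

monitor = AcceptanceMonitor(k=4)
for n in (4, 4, 2, 0):
    monitor.record(n)
print(monitor.alpha())
```

Keeping one monitor per query type or route makes it easier to spot the cases where a healthy global mean hides a badly performing slice.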
The draft model must share vocabulary and tokenizer with the target. This is stated in the paper and obvious in retrospect, but teams that swap out target models (upgrading Llama 2 to Llama 3, switching from one fine-tune to another) sometimes forget that the draft model must be re-validated against the new tokenizer. A mismatched tokenizer drives the acceptance rate toward 0: every round pays for K draft steps plus one target pass and yields a single token, so latency ends up worse than plain decoding by a factor of about K × c + 1 (1.4× at K = 4, c = 0.1).
Streaming changes the latency profile in non-obvious ways. Standard non-speculative streaming delivers tokens at a near-constant rate. Speculative decoding delivers tokens in bursts: when a round goes well (high acceptance), you emit up to K+1 tokens at once; when a round goes poorly (early rejection), you emit 1 token after paying for the whole round. The average latency improves, but the per-token jitter increases. If you're rendering a streaming UI, users may see occasional pauses that weren't there before, even though mean time-to-completion is lower.
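A small simulation shows the shape of the burst pattern. It is a simplified model under explicit assumptions: acceptance is independent with a fixed α, each round costs K × c + 1 target-pass units, and all tokens from a round reach the client together when the round finishes. Plain decoding in the same units would show a constant gap of 1.0.

```python
import random
import statistics

def simulate_gaps(alpha, k, c, n_rounds, seed=0):
    """Inter-token gaps (in target-pass units) seen by a streaming client."""
    rng = random.Random(seed)
    round_time = k * c + 1
    gaps = []
    for _ in range(n_rounds):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        emitted = accepted + 1                 # plus the corrected or extra token
        gaps.append(round_time)                # first token waits for the whole round
        gaps.extend([0.0] * (emitted - 1))     # the rest arrive in the same burst
    return gaps

gaps = simulate_gaps(alpha=0.8, k=4, c=0.1, n_rounds=10_000)
print("mean gap:", round(statistics.mean(gaps), 3))
print("max gap:", max(gaps))
print("share of tokens with no gap:", round(gaps.count(0.0) / len(gaps), 3))
```

The mean gap drops well below 1.0 while the longest gap rises above it, which is the jitter a streaming UI sees. Buffering and re-pacing tokens on the client is one way to smooth it, at the cost of some of the latency gain.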
Speculative decoding stacks with KV caching and FlashAttention; it replaces neither. FlashAttention cuts the memory traffic of the attention computation inside each forward pass; speculative decoding reduces the number of sequential target forward passes. Neither KV caching nor FlashAttention changes the sequential token generation constraint, which is what SD attacks. You want all three, but they're solving different bottlenecks.
Failure modes in practice
The most common production failure mode: acceptance rate silently degrades after a fine-tune.
You fine-tune the target model on domain-specific data. The fine-tune shifts the target distribution — that's the point. But the draft model is still the base model. Acceptance rate drops because the fine-tuned target now assigns different probabilities to the tokens the base draft model predicts. Latency degrades. The monitoring alert is on p95 latency, which rises slowly over weeks of fine-tune iterations, and the root cause isn't obvious because the model outputs are still correct.
Fix: always measure acceptance rate before and after any target model change. If it drops below a threshold (rough heuristic: below 0.65 for K=4 makes SD marginal), fine-tune the draft model on the same data or accept the regression.
The second failure mode: forced sampling breaks the acceptance guarantee.
Speculative decoding's theoretical guarantee assumes you're sampling from the target distribution at each token. If you're using constrained decoding — forcing specific output formats by zeroing out logits for invalid tokens — the acceptance calculation no longer holds cleanly. You can implement constrained speculative decoding, but it requires applying the same constraints to both models' distributions when computing the acceptance ratio. Many implementations don't do this. The output is still valid under your constraints, but the output distribution is no longer exactly the target model's constrained distribution. In practice this is usually fine; conceptually it's a property violation worth knowing about.
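A minimal sketch of the consistent version applies one mask to both distributions before the acceptance test. The function names are hypothetical, distributions are plain lists, and the sketch assumes the draft token was itself sampled from the constrained draft distribution.

```python
import random

def mask_and_renormalize(dist, allowed):
    """Zero out disallowed token ids and renormalize over the allowed set."""
    masked = [p if i in allowed else 0.0 for i, p in enumerate(dist)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("constraint leaves no probability mass")
    return [p / total for p in masked]

def constrained_verify(p, q, x, allowed, rng):
    """Accept draft token x or replace it, using constrained p and q."""
    if x not in allowed:
        raise ValueError("draft token must come from the constrained draft")
    p_c = mask_and_renormalize(p, allowed)
    q_c = mask_and_renormalize(q, allowed)
    if rng.random() < min(1.0, p_c[x] / q_c[x]):
        return x
    # Residual over the constrained distributions; weights need not sum to 1.
    residual = [max(0.0, a - b) for a, b in zip(p_c, q_c)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0]

rng = random.Random(0)
p = [0.4, 0.3, 0.2, 0.1]          # toy target distribution
q = [0.25, 0.25, 0.25, 0.25]      # toy draft distribution
allowed = {0, 2, 3}               # token 1 is forbidden by the output format
q_c = mask_and_renormalize(q, allowed)
x = rng.choices(range(4), weights=q_c, k=1)[0]
print(constrained_verify(p, q, x, allowed, rng))
```

With the mask applied on both sides, the emitted token follows the target's constrained distribution. Masking only the target, or only the final output, is the shortcut that breaks the guarantee.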
When not to use speculative decoding
High-throughput batch inference. Speculative decoding helps latency by reducing the number of sequential target model passes. But if you're running at batch size 64 or 128 for maximum throughput, you're already efficiently utilizing GPU compute. The memory bandwidth constraint that makes SD effective at low batch sizes isn't the binding constraint at large batch sizes — you're compute-bound. SD adds draft model overhead without proportional benefit. Measure with your actual batch size before deploying.
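The same idealized roofline reasoning gives a rough crossover point. The sketch assumes A100-like peak figures (about 2.0 TB/s of HBM bandwidth and 312 TFLOP/s of dense BF16 per GPU) and about 2 FLOPs per parameter per token position, and it ignores KV-cache reads, which grow with batch size. The function names are illustrative.

```python
def compute_bound_positions(bytes_per_param=2, hbm_bw=2.0e12, peak_flops=312e12):
    """Token positions per pass at which weight streaming time equals matmul time."""
    # memory time  = params * bytes_per_param / hbm_bw
    # compute time = 2 * params * positions / peak_flops
    # Setting them equal, the parameter count cancels out.
    return peak_flops * bytes_per_param / (2 * hbm_bw)

def max_batch_before_compute_bound(k, **hw):
    """Largest batch whose (k + 1)-position verification pass stays memory-bound."""
    return int(compute_bound_positions(**hw) // (k + 1))

print(compute_bound_positions())             # positions per pass at the crossover
print(max_batch_before_compute_bound(k=4))   # batch size limit for K = 4
```

With these peak numbers the crossover sits at 156 positions per pass, so a K = 4 verification pass stops being free above a batch of about 31. Real kernels reach less than peak FLOP/s and KV-cache traffic adds to the memory side, so measured crossovers will differ, which is the reason to benchmark at your actual batch size.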
When you can't afford two models in VRAM. If you're already memory-constrained on the target model — quantized to 4-bit to fit, using aggressive KV cache compression — adding a draft model may not be feasible. Quantizing the draft model (to 4-bit or 8-bit) can help, but the acceptance rate impact of quantizing the draft versus the target is asymmetric: quantization errors in the draft just reduce acceptance rate (bad but not catastrophic), whereas quantization errors in the target change the output distribution (potentially catastrophic for quality).
Low speculation depth with low acceptance rates. If your use case produces acceptance rates below 0.6 (structured output, strict instruction following, highly constrained domains), the math stops working in your favor. Expected tokens per round at α = 0.6, K = 3 is about 2.2. If the draft model takes 15% of the target model's time per token, the round costs 1.45 target-model-equivalent passes, so you're getting 2.2 tokens per 1.45 passes. That's a 1.5× speedup in the best case. Whether that justifies the operational complexity of running two models depends on your latency SLA.
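Turning the question around is often more useful: given K, c, and the speedup your SLA needs, what acceptance rate do you have to sustain? The sketch below inverts the speedup formula by bisection. The helper names are illustrative and the model assumes a fixed, independent per-token acceptance rate.

```python
def speedup(alpha, k, c):
    """Estimated speedup: expected tokens per round over round cost."""
    tokens = (k + 1) if alpha == 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    return tokens / (k * c + 1)

def min_alpha_for(target_speedup, k, c, tol=1e-6):
    """Smallest alpha in [0, 1] reaching target_speedup, or None if unreachable."""
    if speedup(1.0, k, c) < target_speedup:
        return None  # even perfect acceptance is not enough at this K and c
    lo, hi = 0.0, 1.0
    # speedup() increases monotonically with alpha, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if speedup(mid, k, c) >= target_speedup:
            hi = mid
        else:
            lo = mid
    return hi

for goal in (1.25, 1.5, 2.0, 3.0):
    print(goal, min_alpha_for(goal, k=3, c=0.15))
```

A `None` result means no acceptance rate can reach the goal at that K and c, so the only levers left are a cheaper draft model or a longer speculation length.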
When you need bitwise reproducibility across token-by-token decoding. The output distribution is identical to target-only decoding, but the specific tokens sampled in any given run are not. Speculative decoding changes which random draws produce which outputs. If you're relying on seed-based reproducibility for testing or debugging — using a fixed seed to reproduce a specific completion — speculative decoding breaks that. The distribution is the same; the seed-specific sample path is different.
What the paper actually gives you
Speculative decoding is an argument that sequential decoding isn't as fundamental a constraint as it looks. The sequential dependency is real — you can't sample token N+1 without token N. But you can get a cheap guess at token N+1 (and N+2, N+3...) from a small model, validate the guess in one large-model pass, and on average generate significantly more than one token per large-model pass. The mathematical proof that this doesn't change the output distribution is the critical property that separates it from all the "faster but approximate" approaches.
The parallel work from DeepMind — "Accelerating Large Language Model Decoding with Speculative Sampling," Chen et al., arXiv 2023 — arrives at the same mechanism independently, which is worth noting: two teams at different organizations developed the same algorithm. That convergence suggests the approach is load-bearing rather than incidental.
For your specific situation: if you're serving a large model at low to medium batch sizes with latency as the binding constraint, and you have a compatible draft model available, speculative decoding is likely worth the integration cost. The 2× latency improvement is real on the right workloads. The maintenance cost is real too — you now have two model versions to track, two quality bars to maintain, and an acceptance rate metric that needs to be in your dashboards before you deploy, not after.
The 180ms first-token latency from the opening? We deployed a 7B draft model with K=4 and measured 0.76 acceptance rate on our query distribution. Mean time to first token dropped to 92ms. P95 degraded slightly — the burst pattern occasionally produces a slow round at the tail. We added acceptance-rate alerts, tied them to the fine-tune pipeline, and called it production.
Leviathan, Kalman, Matias. Fast Inference from Transformers via Speculative Decoding. ICML 2023.
Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper. Accelerating Large Language Model Decoding with Speculative Sampling. arXiv 2023.