
Speculative decoding: what the paper actually says

Reading Leviathan, Kalman, and Matias (Google Research, ICML 2023) while profiling why a 70B model had worse latency than expected on an H100.

The H100 wasn't the problem. The H100 was mostly idle.

We had a 70B LLaMA-based model deployed for a low-latency completion use case: one request at a time, target of under 500ms for a 200-token response. The H100 has 989 TFLOP/s of dense BF16 compute. Our inference traces showed we were using about 80 TFLOP/s. The rest sat unused. You can't batch your way out of this: the use case was single-request, latency-sensitive, and batching to fill the GPU would defeat the point.

The reason this happens is fundamental to how autoregressive transformers work. During the decode phase, when output tokens are generated one at a time, the forward pass at batch size 1 is a series of matrix-vector multiplications. The weight matrices are large (a 70B model has ~140GB of parameters in BF16), but you're multiplying them by a single vector of dimension 8192 or so. GPUs are designed for matrix-matrix multiplication. A matrix-vector product uses a tiny fraction of the available FLOPs; the bottleneck is loading weights from HBM, not doing arithmetic. You're memory-bandwidth bound: the decode rate tops out at roughly HBM bandwidth divided by the bytes of weights read per token, regardless of how much compute you throw at it.

Speculative decoding — "Fast Inference from Transformers via Speculative Decoding," Leviathan, Kalman, and Matias, Google Research, ICML 2023 — doesn't change the memory bandwidth situation. What it does is radically change how many tokens you get per target model forward pass. The insight is that if you can generate plausible candidate tokens cheaply, you can verify many of them in a single parallel forward pass of the large model — and if the candidates are right, you've generated K tokens for roughly the cost of one.

The problem the paper is actually solving

To see why speculative decoding works, you need a precise model of where the time goes during autoregressive decode.

A single forward pass through a 70B transformer processes one token (during decode) and produces probabilities for the next. Loading the model weights from HBM to do this takes roughly:

time = bytes_to_load / HBM_bandwidth
     = 140 GB / 3.35 TB/s (H100 SXM5)
     ≈ 42ms per token

That puts a hard lower bound of ~42ms/token before you even account for compute, KV cache reads, or overhead. For a 200-token response, that's 8.4 seconds minimum — driven almost entirely by weight loading time.
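The arithmetic above is easy to reproduce. This is a minimal sketch that assumes the only cost is streaming every weight from HBM once per token; the helper name and constants are mine, not from the paper.

```python
def min_decode_latency_ms(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency from weight loading alone."""
    return weight_bytes / hbm_bytes_per_s * 1e3


PARAMS = 70e9          # 70B parameters
BYTES_PER_PARAM = 2    # BF16
HBM_BW = 3.35e12       # H100 SXM5 peak HBM bandwidth, bytes/s

per_token_ms = min_decode_latency_ms(PARAMS * BYTES_PER_PARAM, HBM_BW)
response_s = per_token_ms * 200 / 1e3   # 200-token response

print(f"{per_token_ms:.1f} ms/token, {response_s:.1f} s for 200 tokens")
# prints: 41.8 ms/token, 8.4 s for 200 tokens
```

Real traces will come in above this bound, since peak HBM bandwidth is rarely sustained and KV cache reads are ignored here.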

The key observation: loading the 140GB of weights costs the same whether the forward pass processes 1 token or 32. The arithmetic for 32 tokens is 32× that for 1 token, but arithmetic is cheap on an H100. The weight load is the expensive part, and it doesn't scale with the number of tokens in a single forward pass.
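To put rough numbers on "arithmetic is cheap", here is a back-of-the-envelope comparison of weight-load time and compute time per forward pass. It assumes the common approximation of 2 FLOPs per parameter per token, peak hardware figures, and no attention or KV cache cost, so treat the crossover point as an order of magnitude, not a measurement.

```python
PARAMS = 70e9
WEIGHT_BYTES = PARAMS * 2   # BF16
HBM_BW = 3.35e12            # bytes/s, H100 SXM5 peak
PEAK_FLOPS = 989e12         # dense BF16 FLOP/s, H100 SXM5 peak


def pass_time_ms(tokens: int) -> tuple[float, float]:
    """Return (weight_load_ms, compute_ms) for one forward pass over `tokens` positions."""
    load_ms = WEIGHT_BYTES / HBM_BW * 1e3
    compute_ms = 2 * PARAMS * tokens / PEAK_FLOPS * 1e3   # assumption: ~2 FLOPs/param/token
    return load_ms, compute_ms


for n in (1, 32, 256, 512):
    load_ms, compute_ms = pass_time_ms(n)
    bound = "memory" if load_ms >= compute_ms else "compute"
    print(f"{n:>4} tokens: load {load_ms:.1f} ms, compute {compute_ms:.1f} ms -> {bound}-bound")
```

Under these assumptions the pass stays memory-bound until a few hundred tokens are processed together, which is the headroom speculative decoding spends.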

So: if you could feed the model 32 tokens in one forward pass during decode and it would produce useful output for all 32 positions, you'd get 32 tokens in approximately the time it currently takes to get 1. The problem is that autoregressive generation requires each token to be conditioned on all previous ones — you can't know token 2 until you've generated token 1. The sequence is inherently sequential.

Speculative decoding breaks this dependency by introducing a cheap draft model to speculate on what the next K tokens probably are, and then verifying (or rejecting) those guesses with a single parallel pass of the large target model.

The algorithm

The setup requires two models: a small draft model Mq (fast, lower quality) and the large target model Mp (slow, high quality). The goal is to generate samples that are distributed identically to what Mp would produce on its own.

Step 1: Draft. Run Mq autoregressively to generate K speculative tokens x̃₁, x̃₂, ..., x̃_K. This is fast because Mq is small — if the draft model is 7B and the target is 70B, each draft token costs roughly 1/10 of a target token.

Step 2: Verify. Run Mp once over the full input plus all K draft tokens in a single forward pass. Because transformers compute all positions in parallel given a sequence, this one pass produces K+1 sets of probabilities:

  • p(x̃₁ | prefix) — target probability for the first draft token
  • p(x̃₂ | prefix, x̃₁) — target probability for the second, given the first
  • ... continuing through K positions
  • p(x_{K+1} | prefix, x̃₁, ..., x̃_K) — target probability for the bonus token after all K drafts

One target forward pass costs roughly as much as one token-at-a-time generation step, but now you have information about K+1 token positions.

Step 3: Accept or reject. For each draft token x̃ᵢ, sample u ~ Uniform[0,1] and:

  • Accept x̃ᵢ if u ≤ p_target(x̃ᵢ) / p_draft(x̃ᵢ)
  • Reject x̃ᵢ otherwise: sample a replacement from the adjusted distribution max(0, p_target - p_draft), normalized. Discard all draft tokens after the first rejection.

If all K draft tokens are accepted, take the bonus token from the target's K+1 position as an additional free token.

The adjusted distribution in the rejection case — max(0, p_target - p_draft) normalized — ensures you're sampling from the portion of the target's distribution that the draft model underweights. It's not an approximation; it's the exact correction needed to guarantee the output distribution is Mp's.
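The three steps can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the paper's reference code: distributions are plain lists of probabilities, `speculative_step` is a name I made up, and the draft and target forward passes are assumed to have already produced `p_draft` and `p_target`.

```python
import random


def speculative_step(p_target, p_draft, draft_tokens, rng):
    """One accept/reject round.

    p_target:     K+1 target distributions (lists of probabilities), one per position
    p_draft:      K draft distributions that the draft tokens were sampled from
    draft_tokens: K token ids proposed by the draft model
    Returns the tokens emitted this round (between 1 and K+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], p_draft[i]
        # Accept with probability min(1, p(tok) / q(tok)); q(tok) > 0 since tok was drawn from q.
        if rng.random() <= p[tok] / q[tok]:
            out.append(tok)
            continue
        # First rejection: resample from max(0, p - q) (random.choices normalizes the weights)
        # and discard every later draft token.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # All K drafts accepted: take the bonus token from the target's K+1-th distribution.
    bonus = p_target[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

A round always emits at least one token, so even a round where the first draft is rejected makes the same progress as an ordinary decode step.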

The distribution equivalence proof

This is the part that makes speculative decoding compelling rather than merely a heuristic.

The paper proves that the tokens produced by the accept/reject algorithm are distributed identically to tokens sampled directly from Mp. The argument is a variant of classical rejection sampling (von Neumann, 1951) for discrete distributions, modified so that a rejected draft is replaced by a sample from a corrected distribution instead of being retried.

The standard rejection sampling result: if you want to sample from a target distribution p, and you have a proposal distribution q, you can sample from q and accept with probability p(x) / (M·q(x)) where M = max_x p(x)/q(x). The accepted samples are distributed as p.

Speculative sampling accepts a draft token x with probability min(1, p_target(x) / p_draft(x)). When p_target(x) ≥ p_draft(x), the target assigns at least as much probability as the draft and the ratio is ≥ 1, so we always accept. When p_target(x) < p_draft(x), the draft overweights x and we accept only with probability p_target(x) / p_draft(x) < 1. When we reject, we need to make up the "missing" probability mass on the tokens the draft underweights. That's exactly what the adjusted distribution max(0, p_target - p_draft) / Z provides, where Z = 1 - sum_x min(p_target(x), p_draft(x)).

The result is not approximate. There's no quality degradation by construction. The outputs of speculative decoding are statistically indistinguishable from outputs of the target model running unmodified.
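The claim can be checked numerically for a single position. The sketch below computes, exactly and without sampling, the probability that each token is emitted: either drafted and accepted, or chosen by the resampling step after a rejection. The two distributions are made-up examples.

```python
def output_distribution(p, q):
    """Exact distribution of the emitted token for one draft-and-verify step."""
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x))
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)                    # probability the draft is rejected
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)                                  # equals reject_mass up to rounding
    if z == 0.0:                                       # p == q: nothing is ever rejected
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]


p = [0.6, 0.3, 0.1]   # target distribution (illustrative)
q = [0.2, 0.5, 0.3]   # draft distribution (illustrative)
print(output_distribution(p, q))   # matches p up to floating-point rounding
```

The cancellation is visible in the last line of the function: min(p, q) + max(0, p - q) is just p, whatever q looks like.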

What the performance numbers actually show

The speedup depends on two parameters: the acceptance rate α (average probability of accepting a draft token) and the ratio c of draft model cost to target model cost.

Expected tokens generated per target forward pass, assuming each draft token is accepted independently with probability α:

E[tokens/pass] = (1 - α^(K+1)) / (1 - α)

This count already includes the final token supplied by the target itself: the bonus token when every draft is accepted, or the resampled replacement at the first rejection. With K=4 and α=0.85: E[tokens/pass] ≈ 3.7. You've turned one target forward pass into nearly 4 tokens.

The wall-clock speedup is roughly:

speedup ≈ E[tokens/pass] / (1 + K·c)

The 1 + K·c term accounts for the cost of running K draft tokens before each target pass. If c = 0.1 (7B draft : 70B target) and K=4: denominator = 1.4. With E[tokens/pass] ≈ 3.7: speedup ≈ 2.6×.
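Both formulas are small enough to keep around as helpers. A sketch, with function names of my own choosing, evaluated at the K=5 and α=0.78 from the deployment described at the end of this post, and assuming c=0.1:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, per the formula above."""
    if alpha >= 1.0:
        return float(k + 1)   # limit of the formula as alpha -> 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized wall-clock speedup; c is draft cost relative to one target pass."""
    return expected_tokens(alpha, k) / (1 + k * c)


print(f"{expected_tokens(0.78, 5):.2f} tokens/pass, {speedup(0.78, 5, 0.1):.2f}x")
```

That works out to about 3.5 tokens per pass and a little over 2.3×. The model ignores scheduling overhead, cache management, and variation in α across requests, so measured numbers should land somewhat below it.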

The paper reports measured speedups of 2–3× on T5-XXL using T5-Small as draft. The companion DeepMind paper (Chen et al., 2023) reports 2–2.5× on Chinchilla 70B using a 4B draft model. For code generation tasks with highly predictable token sequences, acceptance rates above 90% can push speedups toward 3.5×. For open-ended creative writing with more varied distributions, acceptance rates of 65–70% produce closer to 2×.

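The speedup is real wall-clock time, not a theoretical calculation. Because the draft tokens are cheap and one target verification pass costs about the same as one ordinary decode step, the easy tokens get proposed by the draft model and merely confirmed by the target. You're not trading quality for this, since every emitted token still passes the target's accept/reject test; you're trading extra draft compute and memory for a large reduction in expensive target model calls.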

Production tradeoffs no one mentions in the benchmark post

You're running two models simultaneously. Draft model weights live in GPU memory the entire time. For a 7B draft + 70B target at BF16: ~14GB + ~140GB = ~154GB. A single H100 SXM5 has 80GB HBM. This requires at minimum a 2-GPU setup just for weights, and two 80GB cards leave only ~6GB for KV cache. Teams that benchmark speculative decoding on a well-provisioned research cluster often discover this only during deployment planning.

Quantization helps but changes the math. A 4-bit 70B target ≈ 35GB, 4-bit 7B draft ≈ 3.5GB, total ≈ 38.5GB — fits on one H100. But the acceptance rate α depends on the relationship between draft and target distributions. When both are quantized, this relationship holds reasonably well. When only the target is quantized (as sometimes happens when a quantized target is fine-tuned on domain data but the draft is the original-scale base model), distribution mismatch increases.
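The weight-memory arithmetic for both configurations, as a quick sketch. It counts raw weight bits only and leaves out quantization scales and zero-points, KV cache, and activations, so real footprints are somewhat larger.

```python
def weight_gb(params_billions: float, bits_per_param: int) -> float:
    """Raw weight size in GB, ignoring quantization metadata."""
    return params_billions * bits_per_param / 8


H100_HBM_GB = 80

for label, bits in (("BF16", 16), ("4-bit", 4)):
    total = weight_gb(70, bits) + weight_gb(7, bits)   # 70B target + 7B draft
    print(f"{label}: {total:.1f} GB of weights, fits on one H100: {total < H100_HBM_GB}")
# prints: BF16: 154.0 GB of weights, fits on one H100: False
#         4-bit: 38.5 GB of weights, fits on one H100: True
```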

Batch size kills the speedup. Speculative decoding's benefit comes from filling the idle compute of memory-bandwidth-bound single-token decode with parallel verification. At batch size 1 or 2, the target model is underutilized and speculative decoding helps enormously. At batch size 16+, the target model's decode is already well-utilized through batching, and the forward passes are closer to matrix-matrix than matrix-vector multiplication. Adding speculative decoding in this regime adds draft model overhead and memory without improving throughput. vLLM's documentation describes speculative decoding as a way to improve inter-token latency in memory-bound inference; as a rough rule of thumb, expect the benefit to fade somewhere around effective batch sizes of 4-8.

Optimal K is not static. The performance-optimal K depends on α, which depends on the input. For a coding assistant with predictable completions, K=8 might be optimal. For a general chat endpoint with varied prompts, K=4 might be better. Static K tuned for the average case leaves performance on the table for high-α inputs (where you could speculate further) and wastes compute on low-α inputs (where you're burning draft budget on tokens that will be rejected). Production implementations ideally tune K dynamically based on recent acceptance rates, but this adds serving infrastructure complexity.
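One way to make K follow the traffic is to keep a running estimate of α and pick the K that maximizes the idealized speedup formula from earlier. The sketch below is a hypothetical controller, not something from the paper or any framework: the class name, the moving-average decay, and the cap on K are all assumptions, and the accepted/proposed ratio is only a crude estimate of α.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup for acceptance rate alpha, draft length k, relative draft cost c."""
    return (1 - alpha ** (k + 1)) / (1 - alpha) / (1 + k * c)


def best_k(alpha: float, c: float, k_max: int = 16) -> int:
    """K in [1, k_max] that maximizes the idealized speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))


class AdaptiveK:
    """Tracks an exponential moving average of acceptance and recommends K."""

    def __init__(self, c: float, alpha0: float = 0.7, decay: float = 0.9):
        self.c, self.alpha, self.decay = c, alpha0, decay

    def update(self, accepted: int, proposed: int) -> int:
        observed = accepted / proposed
        self.alpha = self.decay * self.alpha + (1 - self.decay) * observed
        return best_k(min(self.alpha, 0.99), self.c)   # clamp so the formula stays defined


print(best_k(0.6, 0.1), best_k(0.9, 0.1))   # prints: 3 10
```

With c = 0.1 the recommended K moves from 3 at α = 0.6 to 10 at α = 0.9, which is the gap a static K leaves on the table.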

Speculative decoding and continuous batching pull in opposite directions. PagedAttention + continuous batching fills GPU utilization by batching many concurrent requests. Speculative decoding improves per-request latency by getting more tokens per pass. If your GPU is well-utilized through batching, the speculative decoding benefit shrinks. If you apply speculative decoding and it lowers effective batch size (because each request now completes faster), throughput may actually decrease. You need to measure the composite effect on your specific workload; the benchmarks in the paper are single-request.

KV cache rollback is an implementation footgun. When draft tokens are rejected, the target model's KV cache must be rolled back to the position of the first rejection. This is a correctness requirement: if you accept draft tokens 1-3, reject token 4, and sample a new token 4, the KV cache must not contain anything computed from the rejected token 4 or draft tokens 5-K. Framework implementations handle this correctly, but teams that implement speculative decoding on top of a custom inference stack frequently get cache rollback wrong in edge cases — particularly around sequence endings and EOS token handling.
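The rollback rule itself is simple to state in code. This toy stands in for a real KV cache with a plain list holding one entry per cached position; the class and function names are illustrative only.

```python
class ToyKVCache:
    """Stand-in for a real KV cache: one entry per cached sequence position."""

    def __init__(self, prefix_tokens):
        self.entries = list(prefix_tokens)

    def extend(self, tokens):
        self.entries.extend(tokens)

    def truncate(self, length):
        del self.entries[length:]


def verify_round(cache, draft_tokens, num_accepted):
    """Simulate one verification pass followed by rollback; returns positions kept."""
    prefix_len = len(cache.entries)
    cache.extend(draft_tokens)                   # the target pass writes K new positions
    cache.truncate(prefix_len + num_accepted)    # drop the first rejected draft and everything after
    return len(cache.entries)


cache = ToyKVCache(prefix_tokens=range(10))
remaining = verify_round(cache, draft_tokens=[101, 102, 103, 104, 105], num_accepted=3)
print(remaining, cache.entries[-3:])   # prints: 13 [101, 102, 103]
```

The replacement (or bonus) token is deliberately not in the cache yet: it is fed as input to the next round, which is when its keys and values get computed. The draft model's cache needs the same truncation.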

Failure modes in practice

OOD inputs crater acceptance rate silently. The speedup you measured in benchmarks is over your evaluation dataset. In production, you encounter inputs outside that distribution — unusual languages, unusual formats, system prompts that steer the target model's distribution away from what the draft model expects. Acceptance rates can drop from 80% to 30% without any error signal. You're now running both models and achieving minimal benefit. Production systems should monitor acceptance rate as a metric and fall back to direct decoding when it stays below a threshold (40% is a reasonable starting point).
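A monitor for this does not need to be elaborate. The following is a sketch of a rolling-window acceptance tracker with a fallback decision; the class name, window size, and minimum sample count are assumptions, and the 0.40 threshold is the starting point suggested above.

```python
from collections import deque


class AcceptanceMonitor:
    """Rolling acceptance rate over the last `window` draft-and-verify rounds."""

    def __init__(self, window: int = 200, threshold: float = 0.40, min_samples: int = 50):
        self.samples = deque(maxlen=window)   # (accepted, proposed) per round
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, accepted: int, proposed: int) -> None:
        self.samples.append((accepted, proposed))

    def rate(self):
        proposed = sum(p for _, p in self.samples)
        if proposed == 0:
            return None
        return sum(a for a, _ in self.samples) / proposed

    def use_speculation(self) -> bool:
        """False once enough rounds have been seen and the rate sits below the threshold."""
        if len(self.samples) < self.min_samples:
            return True
        rate = self.rate()
        return rate is None or rate >= self.threshold
```

One caveat with this design: once speculation is switched off, no new samples arrive, so a real deployment would need to probe periodically with speculation enabled to notice when acceptance recovers.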

Temperature and sampling parameters must be applied consistently. The acceptance criterion u ≤ p_target(x̃) / p_draft(x̃) assumes p_draft is exactly the distribution the draft token was sampled from and p_target is exactly the distribution you want the output to follow, both after any temperature scaling, top-p, or top-k. If the probabilities plugged into the ratio differ from the ones actually used for sampling, the output distribution guarantee breaks. This is subtle: you might apply temperature to the draft model when sampling but forget to apply the same temperature to the target model's verification probabilities, producing a distribution that neither matches the original target nor has any theoretical guarantee.
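One way to make this mistake hard to commit is to route both models' logits through the same transform before anything touches the acceptance ratio. A minimal sketch with temperature only; the function names are hypothetical, top-p and top-k would need the same treatment, and temperature is assumed to be strictly positive.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax; assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_probability(target_logits, draft_logits, token, temperature):
    """min(1, p/q) with the same temperature applied to both models' logits."""
    p = softmax_with_temperature(target_logits, temperature)
    q = softmax_with_temperature(draft_logits, temperature)
    return min(1.0, p[token] / q[token])
```

The draft token must also have been sampled from that same `q`; computing `q` one way for sampling and another way for the ratio reintroduces the bug.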

The draft model must stay domain-matched to the target. If your target model is a fine-tuned domain specialist and you're using the base checkpoint as the draft, the distributions diverge exactly where your fine-tuning matters most — on your domain's vocabulary and patterns. For a medical LLM fine-tuned on clinical notes, a generic 7B base will draft common English well but produce low acceptance rates on medical terminology and clinical reasoning patterns. You need a draft model fine-tuned on the same domain, which means double the fine-tuning infrastructure and data.

Latency can increase on short outputs. For responses of 5-10 tokens, the overhead of running K draft tokens plus one target verification pass is comparable to or greater than just running the target 5-10 times directly. The break-even point depends on draft model size and K, but speculative decoding typically has a minimum-output-length below which it's slower. This is particularly relevant for classification and short-answer tasks disguised as generation.

Variants worth knowing

Medusa (Cai et al., 2024) takes a different approach: add multiple speculative heads directly to the target model instead of using a separate draft model. Each head predicts one future token position independently. No separate model needed, no memory overhead for a second model. The tradeoff is that the heads need to be fine-tuned (usually a few hours on task data), and acceptance rates are slightly lower than a well-matched separate draft model. For teams that don't have a suitable draft model but can afford a short fine-tuning run, Medusa is often the better operational choice.

Prompt Lookup Decoding uses n-gram matching against the input prompt itself to generate draft tokens: zero additional parameters, trivially implemented. For tasks where the output heavily mirrors the input (summarization, extraction, retrieval-heavy Q&A), acceptance rates of 60-80% are achievable with no draft model at all. If you're building a RAG system where responses frequently quote or closely paraphrase retrieved passages, this is worth trying before standing up any additional model infrastructure.
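The core of prompt lookup fits in one function. This is a sketch of the idea only, with a made-up function name and defaults: find the most recent earlier occurrence of the trailing n-gram and propose whatever followed it as the draft.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence.

    tokens: list of token ids for the prompt plus everything generated so far.
    Returns up to num_draft token ids, or [] when there is no earlier match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards from the most recent position, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + num_draft]
    return []


print(prompt_lookup_draft([1, 2, 3, 4, 5, 6, 1, 2, 3]))   # prints: [4, 5, 6, 1, 2]
```

The proposed tokens then go through the usual verification pass. Since the "draft" here is deterministic, the draft distribution puts all its mass on the proposed token, which is simplest to reason about under greedy decoding.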

SpecInfer (Miao et al., 2023) extends speculative decoding from a linear draft to a tree of drafts — exploring multiple branching hypotheses simultaneously and verifying all branches in one target pass. Higher hardware utilization, better expected acceptance, but significantly more implementation complexity. Used internally by some production systems but not widely available in off-the-shelf frameworks.

When not to use speculative decoding

Throughput-optimized serving. If your production system maximizes GPU utilization through large batch sizes (>8 concurrent requests in steady state), speculative decoding likely reduces throughput. The right metric is requests/second per GPU (equivalently, requests per GPU-hour), not tokens/second on a single request. Measure the composite effect before deploying.

Memory-constrained deployments. If you're already at 70-80% GPU memory running the target model with a reasonable KV cache, adding a draft model will either require a smaller draft (lower acceptance rate) or additional GPUs (higher cost). The latency improvement needs to justify the memory cost.

When you don't have a well-matched draft model. Once draft overhead, rollback, and scheduling costs are counted, an acceptance rate below ~55% usually means speculative decoding is at best neutral, at worst slower than direct decoding. If you're not willing to invest in a domain-matched draft model (or Medusa fine-tuning), and prompt lookup decoding doesn't fit your task structure, the realistic benefit may be minimal.

Streaming applications with partial-output rendering. Speculative decoding delivers tokens in bursts — several tokens appear at once when a batch of drafts is accepted, then silence while the next draft-and-verify round happens. For UIs that render a character-by-character "typing" effect, the irregular token delivery is perceptible. You can smooth it with a token queue and output pacing, but that adds infrastructure and partially defeats the latency improvement.

When you can't instrument acceptance rate. Going into speculative decoding blind means you don't know if it's helping or hurting on your actual traffic. The acceptance rate is the leading indicator for everything. Before deploying, confirm you have observability into per-request acceptance rates. Without it, you can't tune K, can't detect distribution drift, and can't make the deployment decision with real data.

What the paper actually gives you

Speculative decoding solves a real problem: GPU compute waste during autoregressive decode. The proof that output quality is unchanged is genuine — not approximately unchanged, exactly unchanged, by construction. That matters because it means you're not trading quality for speed; you're changing where the time goes.

The practical shape of the technique is: it works well for latency-optimized, low-batch-size deployments where you have a good draft model and predictable input distribution. The 2-3× speedup numbers from the paper are reproducible under those conditions. They don't transfer cleanly to throughput-oriented high-batch deployments, and the acceptance rate is sensitive to domain mismatch in ways that benchmark datasets don't capture.

The latency profiling session that started this post ended with a speculative decoding implementation using a 7B draft model fine-tuned on domain data alongside our 70B target. With K=5 and acceptance rates around 78% on production traffic, we got roughly 2.2× median latency reduction — 180ms average for 200-token responses, down from 395ms. GPU utilization went from ~8% to ~22%. The H100 is still mostly idle, but that's the memory bandwidth limit asserting itself. We're using more of the available compute than before, and we got there with no change to output quality by construction.


Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias. ICML 2023.
Accelerating Large Language Model Decoding with Speculative Sampling — Chen, Borgeaud, Irving, Lespiau, Sifre, Jumper. DeepMind 2023.