EAGLE: speculative decoding with feature-level prediction — what the paper actually says

Reading Li, Wei, Zhang, and Zhang (2024) while trying to understand why our speculative decoding deployment was hitting a ceiling at 2.1x despite a well-matched draft model.

The acceptance rate was stuck. We were running LLaMA2-70B with a 7B draft model, decent domain overlap, and careful temperature matching. Acceptance rate: 78%. Wall-clock speedup: 2.1x. By the math in the Leviathan et al. paper, with acceptance rate α ≈ 0.78 and a draft length of γ = 5, the expected number of tokens per target forward pass is (1 - α^{γ+1}) / (1 - α) ≈ 3.5, but that figure assumes drafting is free. In practice, the draft model's forward passes aren't. With a 7B draft, each draft step cost us roughly 10% of a target forward pass, so five draft steps per round cut well into the theoretical gain.
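
A quick numerical check of that arithmetic, as a sketch. The speedup expression is the one from Leviathan et al.; the relative draft cost c = 0.1 (one 7B draft step costing about a tenth of a 70B forward pass) is an assumption for illustration, not a measurement, and the function names are ours.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (Leviathan et al.)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected tokens per round divided by the cost of that round.

    c is the cost of one draft step relative to one target forward pass
    (an assumed value, see the lead-in); a round costs gamma * c + 1.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1)


if __name__ == "__main__":
    print(round(expected_tokens(0.78, 5), 2))        # 3.52
    print(round(expected_speedup(0.78, 5, 0.1), 2))  # 2.35
```

Under that assumed cost the estimate lands near 2.35x, in the same neighbourhood as the 2.1x we measured.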

The deeper problem is that 78% acceptance is close to a ceiling for token-level prediction. Given that the draft model must predict the exact token the target model would sample (in speculative decoding's rejection sampling framework), and given that natural language has genuine ambiguity at each position, getting above 85% acceptance on a domain-general task requires a draft model that's nearly as capable as the target — which negates the latency gain.

"EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty", Li, Wei, Zhang, and Zhang, January 2024, is built on a diagnosis of why this ceiling exists and a design that avoids it. The core argument: token-level prediction is hard because it requires committing to one of thousands of vocabulary entries before knowing the full distribution. Feature-level prediction — predicting the hidden state vector the target model would have produced — is a much smoother regression problem. EAGLE predicts features, not tokens.

Why token-level prediction is fundamentally uncertain

To understand EAGLE's argument, you need a precise picture of what makes token-level draft prediction fail.

In standard speculative decoding, the draft model produces a probability distribution q(x) over the vocabulary, the target model produces p(x), and rejection sampling accepts a draft token x with probability min(1, p(x)/q(x)). Averaged over draft tokens, the acceptance probability at a position is Σ_x min(p(x), q(x)) = 1 - TV(p, q), where TV is the total variation distance between the two distributions.

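Even a well-trained draft model will have high TV(p, q) at positions where the target model is uncertain, because those are the positions where matching the full distribution is hardest. Take the extreme case: the target assigns probability 0.2 to each of five plausible tokens, and the draft commits to a single one of them (q puts all its mass on one token). That draft is accepted with probability 0.2, so rejection sampling throws it out 80% of the time. A draft that reproduced the flat distribution exactly would always be accepted, but that is exactly what a much smaller model struggles to do. This isn't a failure of the draft model; it's a consequence of the discrete, multinomial structure of token prediction.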
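
The five-token case can be checked directly. This is a small sketch using the standard identity that the acceptance probability equals Σ min(p, q); the function names are ours and the distributions are toy values.

```python
import numpy as np


def acceptance_probability(p: np.ndarray, q: np.ndarray) -> float:
    """Probability that a token drawn from q survives rejection sampling against p.

    Equals sum_x min(p(x), q(x)), which is 1 - TV(p, q).
    """
    return float(np.minimum(p, q).sum())


def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance between two distributions on the same vocabulary."""
    return 0.5 * float(np.abs(p - q).sum())


if __name__ == "__main__":
    target = np.full(5, 0.2)                        # five equally plausible tokens
    one_hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # draft commits to one token
    print(acceptance_probability(target, one_hot))  # about 0.2
    print(1 - total_variation(target, one_hot))     # same value via TV
    print(acceptance_probability(target, target))   # about 1.0 when q matches p
```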

The EAGLE paper quantifies this differently: they measure the L2 distance between the second-to-top-layer feature vectors produced by the target model for the ground truth and those predicted by different approaches. Token-level prediction (where you decode to a token then re-embed) produces high L2 error. Feature-level regression (predicting the continuous vector directly) produces substantially lower error on the same evaluation data.

The intuition is that feature vectors are dense, continuous, and structured by the model's learned representations. Adjacent positions in the sequence have features that vary smoothly in most dimensions — the geometry of the representation space is regular. Tokens, by contrast, are discrete indices into a 32,000-entry vocabulary where "the" and "a" might be adjacent high-probability tokens but their embeddings are structurally unrelated. Predicting in feature space is a simpler regression problem than predicting in token space.

The architecture: one transformer layer and a frozen LM head

EAGLE's draft head is deliberately minimal: a single transformer decoder layer — full self-attention and FFN — applied to a joint representation of the target model's previous-position features and the current token embedding.

The input to the draft head at position t is:

input_t = concat(embed(token_t), feature_{t-1})

where feature_{t-1} is the second-to-top-layer hidden state produced by the target model at position t-1. This concatenation (or a learned projection thereof) is what makes EAGLE work: the draft head always has access to the target model's representation of the context up to t-1, not just the token history.

The draft head passes its output through the target model's LM head (frozen weight reuse, not a separate vocabulary projection) to produce token logits. This is an important detail: because the target model's LM head is shared, the draft head doesn't need to learn to map features to vocabulary — it only needs to predict features that the shared LM head will correctly translate to token probabilities.

At inference time the draft head runs autoregressively: it generates feature_{t}, uses it to predict token_{t+1} and feature_{t+1}, then uses those to predict further positions. The target model's LM head bridges each feature prediction to a token distribution without any separate learned component.
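
The data flow of that loop, as a toy sketch. Everything here is a stand-in: the weights are random, the sizes are made up, and the single transformer decoder layer is replaced by one tanh-activated matrix with no attention over earlier positions. It shows only the indexing (token and previous feature in, next feature out, shared LM head to get the next token), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 50, 16  # toy sizes, hypothetical

# Frozen pieces borrowed from the target model (random stand-ins here).
embed = rng.normal(size=(VOCAB, HIDDEN))
lm_head = rng.normal(size=(HIDDEN, VOCAB))

# Trainable draft-head pieces. W_fc projects the concatenation back to HIDDEN;
# W_layer is a crude stand-in for the single transformer decoder layer.
W_fc = rng.normal(size=(2 * HIDDEN, HIDDEN)) / np.sqrt(2 * HIDDEN)
W_layer = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)


def draft_step(token: int, prev_feature: np.ndarray) -> np.ndarray:
    """Predict feature_t from concat(embed(token_t), feature_{t-1})."""
    x = np.concatenate([embed[token], prev_feature])
    return np.tanh((x @ W_fc) @ W_layer)


def draft_chain(token: int, prev_feature: np.ndarray, steps: int) -> list[int]:
    """Greedy drafting: feature -> shared LM head -> token -> next feature."""
    drafted = []
    for _ in range(steps):
        feature = draft_step(token, prev_feature)
        token = int(np.argmax(feature @ lm_head))  # logits from the frozen LM head
        drafted.append(token)
        prev_feature = feature                     # feed the predicted feature back in
    return drafted


if __name__ == "__main__":
    # In a real system prev_feature comes from the target model's last verified position.
    print(draft_chain(token=3, prev_feature=np.zeros(HIDDEN), steps=5))
```

Note that after the first step the head consumes its own predicted features, which is why feature prediction error compounds with draft depth.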

Parameter count for the draft head: roughly 0.24B for Vicuna-7B and 1.05B for LLaMA2-70B. For the 70B model, the draft head is 1.5% of the target model's size. For the 7B model, it's 3.4%. Compare to standard speculative decoding where the "small" draft model is often 1/5 to 1/10 the target size — EAGLE's draft is another order of magnitude smaller.

Tree-structured draft generation

A linear draft sequence — predict tokens t+1, t+2, ..., t+γ as a chain — is the simplest approach but not the most efficient. EAGLE generates a tree of draft candidates and verifies all paths in a single target model forward pass.

The tree generation works as follows: at each depth d, the draft head generates the top-k most likely next tokens, branching the tree. Depth 1 produces k candidates; depth 2 produces up to k² paths (k branches per depth-1 candidate); and so on. Full expansion grows exponentially, so the tree has to be kept sparse. The original EAGLE does this with a fixed, predetermined tree shape; pruning by the draft head's confidence (dropping branches with low cumulative probability) is what EAGLE-2 adds.
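
A minimal sketch of that expansion. The next_probs callback is a hypothetical stand-in for the draft head (it returns a probability per token id for a given path), and the min_path_prob threshold is an illustrative confidence-pruning rule, not the paper's exact criterion.

```python
from typing import Callable, Sequence

Path = tuple[int, ...]


def expand_tree(
    next_probs: Callable[[Path], Sequence[float]],
    depth: int,
    k: int,
    min_path_prob: float = 0.0,
) -> list[tuple[Path, float]]:
    """Breadth-first top-k expansion; returns (path, cumulative probability) per node."""
    frontier: list[tuple[Path, float]] = [((), 1.0)]
    nodes: list[tuple[Path, float]] = []
    for _ in range(depth):
        next_frontier = []
        for path, path_prob in frontier:
            probs = next_probs(path)
            top_k = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:k]
            for token in top_k:
                prob = path_prob * probs[token]
                if prob >= min_path_prob:  # drop low-confidence branches early
                    next_frontier.append((path + (token,), prob))
        nodes.extend(next_frontier)
        frontier = next_frontier
    return nodes


if __name__ == "__main__":
    toy = lambda path: [0.6, 0.3, 0.1]  # same toy distribution at every node
    print(len(expand_tree(toy, depth=2, k=2)))                     # 6 nodes: 2 + 4
    print(len(expand_tree(toy, depth=2, k=2, min_path_prob=0.1)))  # 5: path (1, 1) is pruned
```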

Verification uses tree attention: the target model processes every node of the draft tree in one forward pass, with an attention mask that allows each node to attend only to the already-accepted prefix and its own ancestors in the tree, not to sibling branches. This is a standard trick from SpecInfer (Miao et al., 2023), but EAGLE's smaller draft head makes the tree much cheaper to generate than a full draft model would.
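
The mask itself is easy to build from a parent array. The sketch below covers only the draft-tree part of the mask; in a real kernel every node also attends to the full accepted prefix, which is omitted here. The parents encoding (index of each node's parent, -1 for nodes hanging directly off the prefix) is our convention for illustration.

```python
import numpy as np


def tree_attention_mask(parents: list[int]) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 for a depth-1 node.
    Each node sees itself and its ancestors, never siblings or cousins.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i, j] = True
            j = parents[j]
    return mask


if __name__ == "__main__":
    # Nodes 0 and 1 are two depth-1 candidates; 2 and 3 extend node 0; 4 extends node 1.
    parents = [-1, -1, 0, 0, 1]
    print(tree_attention_mask(parents).astype(int))
```

For that five-node tree, row 2 has ones in columns 0 and 2 only: node 2 sees its parent and itself, but not its sibling (node 3) or the other branch.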

The tree approach improves expected accepted tokens per target forward pass by exposing more hypotheses. With a linear draft at depth γ=5, a single wrong token at depth 3 invalidates the remaining 2. With a tree at the same computational budget, alternative branches at depth 3 may be accepted even if the primary path fails.

Training procedure

The draft head is trained entirely offline, with the target model's weights frozen. You need access to the target model's hidden states on a training corpus, specifically the second-to-top-layer features at each position.

Training steps:

1. Run the target model on a text corpus and save (feature_t, token_t) pairs for each position t.
2. Train the draft head on those pairs: given concat(embed(token_t), feature_{t-1}) as input, predict feature_t, then pass the prediction through the shared LM head and compare the result against the actual token at position t+1. The paper's loss combines a regression term on the predicted feature (Smooth L1) with this cross-entropy term on the token.
No reinforcement learning, no rejection sampling, no interaction with the target model during training. Standard supervised sequence learning.
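
The objective can be sketched in a few lines. As we read the paper, it pairs a Smooth L1 regression term on the features with a cross-entropy term computed through the frozen LM head, with the classification term down-weighted; the weight of 0.1 below is our recollection of the paper's setting and should be treated as an assumption, as should the function names.

```python
import numpy as np


def smooth_l1(pred: np.ndarray, target: np.ndarray, beta: float = 1.0) -> float:
    """Mean Smooth L1 distance between predicted and true features."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(loss.mean())


def cross_entropy(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean token-level cross-entropy, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())


def draft_head_loss(
    pred_features: np.ndarray,  # (positions, hidden): draft head outputs
    true_features: np.ndarray,  # (positions, hidden): saved target-model features
    lm_head: np.ndarray,        # (hidden, vocab): frozen target LM head
    next_tokens: np.ndarray,    # (positions,): token at t+1 for each feature_t
    w_cls: float = 0.1,         # assumed weight on the classification term
) -> float:
    """Feature regression plus token classification through the frozen LM head."""
    reg = smooth_l1(pred_features, true_features)
    cls = cross_entropy(pred_features @ lm_head, next_tokens)
    return reg + w_cls * cls
```

Only the draft head's parameters would receive gradients from this loss; the embedding and LM head stay frozen.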

The training corpus doesn't need to match your deployment domain exactly — the draft head is learning to predict the target model's feature dynamics, which are governed by the target model's weights, not the specific content. In practice, a general instruction-following dataset (ShareGPT, open-source dialogue data) is sufficient for a draft head that works well across code, math, and conversational tasks.

Training cost, as reported in the paper: the draft head for the 70B target trains in one to two days on 4×A100 (40G) GPUs, and the heads for the 7B to 33B targets can be trained on an RTX 3090 node in a similar amount of time. These numbers are from the paper's experimental setup; your dataset size and hardware will vary.

What the numbers actually say

The paper benchmarks on MT-bench, HumanEval, and GSM8K across Vicuna-7B/13B, LLaMA2-Chat-13B, LLaMA2-Chat-70B, and Mixtral-8x7B-Instruct. Against vanilla autoregressive decoding:

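Vicuna-7B: 3.0x speedup (EAGLE) vs ~1.9x (Medusa), ~2.0x (standard speculative with a smaller model)
Vicuna-13B: 3.2x
LLaMA2-Chat-13B: 3.0x
LLaMA2-Chat-70B: 3.5x on math tasks, 2.7x on dialogue
Mixtral-8x7B-Instruct: about 1.5x (mixture-of-experts targets benefit less, because verifying several tokens at once can touch more experts than decoding one token does)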

The EAGLE-2 variant (which adjusts the draft tree dynamically based on the draft head's confidence) reports roughly 3x to 4.3x, depending on model and task.

Against Medusa specifically: EAGLE is approximately 1.6x faster than Medusa at the same model size. The gap is larger on tasks with higher output diversity — math and code — where token-level predictors struggle and feature-level predictors maintain accuracy.

Against standard speculative decoding with a larger draft model: EAGLE with its tiny 1.05B draft head matches or exceeds the acceptance rate of a 7B draft model paired with the 70B target, while using far less GPU memory and draft computation time.

The acceptance rate numbers explain why: EAGLE's feature-based acceptance rate is 85–92% depending on task and model, versus 75–82% for well-matched token-level draft models on the same benchmarks.

What the paper doesn't tell you

Feature alignment requires model-specific training. The EAGLE draft head is trained on the features of a specific target model at a specific checkpoint. If you update the target model weights, even with a minor fine-tune, the draft head needs retraining. The second-to-top-layer features change with every gradient step. Unlike standard speculative decoding (where any draft model that shares the target's tokenizer will at least run, and a stale one merely loses some acceptance rate), EAGLE ties the draft head to the exact target model it was trained on.

Quantization breaks feature alignment. The EAGLE draft head is trained on the target model's full-precision features. When you serve the target model in INT4 or INT8, its second-to-top-layer features differ from the FP16 baseline — quantization error shifts the feature distribution. Using an FP16-trained draft head with an INT4 target model degrades acceptance rate measurably. You either need to retrain the draft head on the quantized model's features (which requires a custom training pipeline on quantized activations) or accept the quality hit.

Tree attention adds implementation complexity. The verification step requires tree attention masking, which isn't in most off-the-shelf attention kernels. vLLM and SGLang both have tree attention support, but if you're running a custom serving stack, you're writing this yourself. The masking logic is straightforward but the batching semantics are non-trivial: each "request" in the batch has a different tree structure, making padding and kernel dispatch harder than linear sequence batching.

Memory overhead is non-trivial for large models. The 1.05B draft head for LLaMA2-70B adds ~2GB to your GPU footprint in BF16. When the 140GB of target weights are spread across several 80GB A100s with tensor parallelism, this 2GB is probably fine. On a configuration where weights and KV cache already fill nearly all of each GPU, you're squeezing. Profile your serving configuration before deploying: a single-layer head cannot be spread across pipeline stages, and if your framework replicates it on every rank instead of sharding it, each GPU pays the full 2GB.

Draft tree depth interacts with your P99 latency budget. Deeper trees improve average speedup but add variance. A path that's accepted at depth 7 is a big win; a path rejected at depth 1 after generating 7 draft tokens has paid draft overhead for one accepted token. For workloads with tight P99 latency SLAs, constrain tree depth based on measured latency distribution, not theoretical expected speedup.

Production tradeoffs

Getting the draft head. Unless you're running a Llama-family model for which community draft heads exist, you'll need to train your own. The training pipeline requires storing the target model's layer-{N-1} activations on a training corpus, which means running inference on that corpus once to generate the dataset. For a 70B model on a 100K-sample corpus at 512 tokens per sample: roughly 50M tokens of inference, about 10-20 hours on 8×A100. One-time cost, but it's real.
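
Sizing that one-time pass, as a back-of-the-envelope sketch. The hidden size of 8192 is LLaMA2-70B's; storing one 16-bit vector per token with no compression is an assumption about how you would dump the features, and the function names are ours.

```python
def corpus_tokens(samples: int, tokens_per_sample: int) -> int:
    """Total tokens the target model has to process once."""
    return samples * tokens_per_sample


def feature_storage_bytes(tokens: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one hidden-state vector per token (2 bytes = FP16/BF16)."""
    return tokens * hidden_size * bytes_per_value


if __name__ == "__main__":
    tokens = corpus_tokens(100_000, 512)
    size = feature_storage_bytes(tokens, hidden_size=8192)  # LLaMA2-70B hidden size
    print(tokens)                  # 51200000
    print(round(size / 1e9, 1))    # 838.9 (GB of saved features)
```

The inference pass is tens of millions of tokens, but the saved features run to several hundred gigabytes at this corpus size, so disk throughput and storage deserve as much planning as GPU time.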

Dynamic draft length. The basic EAGLE implementation uses a fixed draft tree. EAGLE-2's insight is to adjust the tree based on draft confidence: when the draft head is confident, go deeper; when uncertain, stop early and verify. This makes the speedup more robust across different input types. EAGLE-2 is implemented in the official repository and in recent vLLM versions, so prefer it over fixed-tree EAGLE if you're deploying from scratch.

Serving framework integration. At the time of writing, vLLM has EAGLE support in its speculative decoding path. SGLang has partial support. TGI does not. If your serving stack is custom, budget a week for the tree attention implementation and a day for performance validation against baseline.

Batch throughput vs. per-request latency. EAGLE, like all speculative decoding methods, helps only with latency-bound workloads. If you're batch-filling your GPU (throughput-maximizing), the draft head adds compute overhead with no benefit. EAGLE is the right optimization for serving long responses to latency-sensitive single users, not for maximizing tokens-per-second on a batch queue.

When NOT to use EAGLE

When your responses are short. The startup cost — generating the draft tree, running verification — pays off over long generations. For responses under 50 tokens, the overhead of tree attention setup and draft generation is a meaningful fraction of total inference time. Measure on your actual P50 response length; don't assume the speedup numbers generalize.

When you can't maintain the draft head lifecycle. EAGLE's tight coupling between draft head and target model weights means every model update requires a draft head retrain. If your team does weekly fine-tuning on the serving model, you need a pipeline that automatically retrains the draft head on each update. If that operational complexity isn't acceptable, standard speculative decoding with a separately maintained draft model has more separation of concerns.

When serving quantized models. INT4 or INT8 target models need draft heads trained on their quantized features. Training on quantized model activations is not a standard workflow and requires careful calibration. If quantization is necessary (you're fitting a 70B model on a single GPU), you may be better off with a separately maintained draft model, which only has to agree with the target at the token level. Medusa is only a partial escape: its heads also read the target's final hidden state, though they never feed predicted features back in the way EAGLE does.

When you need exact output reproducibility across deployments. EAGLE's tree structure and acceptance sampling introduce non-determinism beyond the usual temperature sampling. Two identical requests with identical seeds may produce different results depending on draft tree branch selection. If your application depends on exact token-level reproducibility (regression testing, deterministic generation), speculative decoding in general — and EAGLE specifically — requires careful seeding and tree state management.

When your workload is throughput-dominated at high batch sizes. At batch size 16+ on a GPU that's compute-bound rather than memory-bandwidth-bound, the target model is already utilizing the hardware well. The draft head's overhead exceeds its benefit. Profile your GPU utilization at your operating batch size — if it's above 70%, don't bother with speculative decoding.

Why the feature-level insight matters beyond EAGLE

The paper's real contribution is a diagnosis, not just an architecture. Token-level prediction uncertainty is structural: no amount of draft model scaling makes it easy to predict the exact token from a high-entropy distribution. By shifting prediction to feature space, EAGLE escapes this constraint.

This insight carries into follow-up work. EAGLE-2 and EAGLE-3 extend it with dynamic tree sizing and improved training objectives. Medusa's heads, which also read the target model's final hidden state instead of running a separate model, sit on the same side of the trend. The broader direction in speculative decoding research is away from "find a smaller model that mimics the big model" and toward "use the big model's own internal representations to guide drafting." EAGLE is the clearest implementation of that principle.

For engineers deciding between speculative decoding strategies: if you can afford the draft head training pipeline and the serving complexity of tree attention, EAGLE is among the best current options for latency reduction without quality compromise. The roughly 3x speedup over vanilla autoregressive decoding holds across the dense model sizes the paper tests, with dialogue toward the lower end of the range and math and code toward the upper end. The 1.6x improvement over Medusa is meaningful when per-request latency is your binding constraint.

The acceptance rate ceiling that limited our original deployment — the 78% we couldn't break through with a token-level draft model — disappears at the feature level. The problem was never the draft model's size. It was predicting in the wrong space.

Paper: "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty," Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang. arXiv:2401.15077, January 2024.