← All writing
Paper Breakdown

H2O: what the heavy-hitter oracle paper actually says

Reading Zhang et al. (2023) the week we moved from 8K to 128K context windows and found ourselves serving one concurrent request per GPU.

The math was brutal. LLaMA-70B at INT4: roughly 35 GB of model weights per A100. An 80 GB A100 leaves ~45 GB for KV cache. At 128K context, the KV cache for a single sequence works out to about 40 GB:

2 (K+V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes = 320 KB per token
128,000 tokens × 320 KB = 40 GB

One concurrent request. That's it. The GPU could handle the compute; it couldn't hold two people's 128K contexts simultaneously. Our options were: use shorter contexts, buy more GPUs, or find a way to shrink the KV cache without destroying output quality.

We'd already read the StreamingLLM paper — keep the first few sink tokens, slide a window over the rest. For chat sessions it worked fine. For the document analysis use case where a user uploads a 100K-token contract and asks questions throughout, it was inadequate. The document's relevant clauses weren't always at the beginning or the end. Answers required recall from positions scattered across the full document. Sliding windows evict the middle. That middle is where the answer lives.

The paper that addresses this more carefully is "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models", by Zhang, Yang, Ibrahim, Theobald, Wang, Sun, Ali, Shah, Fatahalian, Farooqui, and Papailiopoulos, from the University of Texas at Austin and other institutions (NeurIPS 2023). The core claim: most tokens in the KV cache can be discarded without significant quality loss, but which tokens matter depends on the content, not just the position. Keeping the top 20% of tokens by cumulative attention score performs dramatically better than keeping the most recent 20% for tasks that require genuine long-range recall.

The observation that drives the whole paper

Before the algorithm, there's a finding worth sitting with. The paper visualizes attention score distributions across layers and heads for several LLMs: OPT-6.7B, OPT-30B, GPT-NeoX-20B. The pattern is consistent: at every layer and head, attention mass concentrates on a small fraction of the context tokens. Most tokens receive negligible attention from subsequent queries. A few tokens receive substantial, repeated attention from nearly every subsequent position.

These high-attention tokens the paper calls heavy hitters (H₂).

The paper quantifies this: in a typical 512-token sequence, 20% of the tokens (roughly 100) receive approximately 80% of the total cumulative attention. The remaining 80% of tokens are attending-but-mostly-ignored — the model computes their contribution, stores their KV cache, but subsequent queries give them almost no weight. You're paying HBM for computation that barely affects the output.

This isn't just an average statement. The paper shows that heavy hitters are persistent — a token that becomes a heavy hitter early in generation typically stays a heavy hitter for the entire remainder of the sequence. A token that's ignored by the first 20 decode steps will almost certainly continue to be ignored. The cumulative attention score is a useful signal for future importance because past importance predicts future importance.

The connection to attention sinks (StreamingLLM's key insight) is real but partial. Initial tokens — particularly the BOS token and first few content tokens — do tend to be heavy hitters. But they're heavy hitters for a specific reason (softmax mass conservation), and they're not the only heavy hitters. Subject nouns, named entities, quantifiers, and structurally important clause tokens also accumulate high attention throughout the sequence. StreamingLLM's positional heuristic captures the sink phenomenon but misses these semantically-loaded heavy hitters in the body of the document.

The algorithm

H2O maintains a KV cache of fixed maximum size B = H + R where:

  • H: budget allocated to heavy hitters (content-selected)
  • R: budget allocated to recent tokens (recency window)

After each decode step, the algorithm:

  1. Updates cumulative attention scores. For the token generated at this step, collect the attention weights assigned to each position in the KV cache. Add these weights to a running cumulative score vector — one scalar per cached token — that tracks how much total attention each position has received since it entered the cache.

  2. Applies the eviction policy. If the current cache size exceeds B, evict the token with the lowest cumulative attention score among non-recent tokens. Recent tokens (the last R positions) are exempt from eviction regardless of their scores. Eviction removes the KV cache entry and frees the HBM allocation.

  3. Admits the new token. The KV cache entry for the newly decoded token is added. It starts with an attention score of 0 (or the weight it received in this step) and is provisionally in the recent window.

The eviction cost per step is O(H + R) for finding the minimum score — a linear scan. In practice this is negligible compared to the attention computation itself. Score bookkeeping requires one additional float per cached token, totaling B × 4 bytes of extra SRAM — orders of magnitude smaller than the KV cache itself.

The heavy hitter set is dynamic. Tokens move in and out as their cumulative scores change relative to their competitors. This is different from StreamingLLM's static sink set. A token that entered the cache with high initial attention might later be displaced by a token that accumulates more consistent attention over time. The oracle is greedy and local — it doesn't predict future attention, only reflects past — but the persistence property makes this a surprisingly good approximation.

What the performance numbers actually show

The paper evaluates H2O against three baselines: (1) full KV cache (upper bound), (2) window attention (keep only the most recent B tokens), and (3) StreamingLLM-style (keep first 4 tokens + most recent B-4 tokens).

On language modeling perplexity (OpenWebText, OPT-6.7B, 1024-token sequences, 20% budget — keeping 200 of 1024 tokens):

| Method | Perplexity | vs. Full Cache | |---|---|---| | Full KV cache | 11.3 | — | | Window only | 14.2 | +25.7% | | StreamingLLM | 11.9 | +5.3% | | H2O | 11.5 | +1.8% |

The perplexity gap between H2O and full cache at 20% budget is 1.8%. Window-only at the same budget is +25.7%. H2O's content-aware selection recovers most of full-cache performance that positional methods can't reach.

On long-document QA (NarrativeQA, where questions require recall of events scattered throughout novels):

| Method | F1 Score (20% budget) | |---|---| | Full KV cache | 52.3 | | Window only | 23.1 | | StreamingLLM | 31.7 | | H2O | 47.8 |

This is the clearest differentiation. Window-only drops 55% of full-cache F1 because answers aren't always in the recent window. StreamingLLM drops 39% because answers aren't always near the beginning. H2O drops 8.6% because heavy hitters — the entities and events the question is about — tend to receive sustained attention throughout the document and survive eviction.

On text summarization (XSum, 1024-token articles): all three methods are within 3% of full-cache ROUGE-L at 20% budget. Summarization doesn't require precise recall of specific positions; it needs semantic coverage, and heavy hitters plus recent context provide that.

On code generation (HumanEval, function completion): H2O at 20% budget achieves 94% of full-cache pass@1. Window-only achieves 81%, StreamingLLM 87%. The function signature and first few lines of context (heavy hitters) matter more than intermediate docstring content.

The paper also shows a memory-quality Pareto frontier: H2O dominates window and StreamingLLM approaches at every budget level between 10% and 80%. If you're willing to use a 40% budget (keep 40% of KV cache), H2O achieves parity with full cache on most tasks. The minimum viable budget for production depends heavily on your task distribution.

Where the score-tracking math goes wrong

The cumulative attention score s_i = Σ_t a_{t→i} aggregates attention received by token i across all decode steps t. There's a subtle issue: this sums raw attention weights, which have different denominators at different steps (softmax over different context lengths). Early in generation when context is short, attention weights are higher because they're distributed over fewer tokens. A token at position 50 of a 100-token sequence will accumulate higher scores than a token at position 50 of a 1000-token sequence, even if both receive identical relative attention importance.

The paper acknowledges this and uses raw cumulative sum anyway. The practical effect: tokens that appear in shorter contexts accumulate faster scores and are harder to evict. For most natural language inputs this is a reasonable approximation. For structured inputs where token importance genuinely varies inversely with context length — legal documents with dense initial definitions, code with long imports — this can over-retain early tokens that aren't semantically central and under-retain later tokens that are.

There are normalized variants (divide by sequence length at each step) that correct for this, but they're not in the original H2O paper. If you're seeing over-retention of early non-sink tokens in profiling, normalized scoring is worth trying.

Production tradeoffs that aren't in the paper

Memory is where you actually win. H2O reduces KV cache size, not model weights. For a 70B model at INT4, going from 100% to 20% KV budget turns a 40 GB KV cache for 128K context into an 8 GB KV cache. Combined with the 35 GB model weights, you go from 75 GB (fits barely, one concurrent request) to 43 GB (fits with ~37 GB remaining — enough for four concurrent requests at the 8 GB per-sequence budget). The concurrency gains are real and directly proportional to the budget reduction.

Quantizing the KV cache is complementary. H2O reduces the number of KV entries stored. KV cache quantization (FP8 or INT4 KV cache) reduces the size per entry. These are orthogonal: H2O + KV quantization can achieve 4× or 8× combined reduction. The interaction is straightforward: quantize the entries H2O selects to keep, evict the rest. No interference between the two techniques.

Integration complexity is low. H2O modifies only the attention layer: add a score accumulator, add eviction logic after each decode step. No changes to the scheduler, the memory allocator, or the model weights. Most modern serving frameworks (vLLM, SGLang) already expose hooks for custom attention backends; H2O fits cleanly into them. The total implementation is roughly 100 lines of code on top of a standard PagedAttention implementation.

Batch eviction decisions are independent. KV cache eviction runs per-sequence — each request in the batch has its own cumulative score vector and its own eviction decisions. This is important: heavy hitters for one request's contract analysis are not heavy hitters for another request's code generation. There's no cross-contamination in a batched serving setup. The per-sequence overhead (one score vector per sequence in the batch) scales linearly with batch size, which is fine.

You need to tune the H/R split. The paper uses H = 0.7B, R = 0.3B as a default (70% of budget to heavy hitters, 30% to recent window). This is not universally optimal. For conversational tasks with short turns, a larger recent window performs better. For document analysis with long gaps between relevant passages, a larger heavy hitter budget performs better. In practice, profile your workload with a few split ratios before deploying — the optimal split can differ by 10–15% in quality between workloads.

Eviction is irreversible at inference time. Once a token's KV entry is evicted, it's gone. If that token turns out to be relevant to a later query, the model can't access it. This is fundamentally different from StreamingLLM's streaming scenario (where "relevance" is just recency) and from PagedAttention's preemption (which swaps to CPU for later restoration). H2O's evictions are permanent within a request. If a token is evicted at step 200 and becomes relevant at step 500, the model will produce subtly wrong output and you won't know why.

Failure modes in practice

Needle-in-a-haystack collapse. The canonical failure case: a question requires recalling a specific fact that appeared once in the middle of a long document. The fact was mentioned briefly, attended to for a few decode steps when the model initially processed it, and then largely ignored as the model moved on to other topics. Its cumulative attention score stayed low. H2O evicted it. The model then answers the question from its parametric knowledge — which may be wrong — rather than the document. On synthetic needle-in-haystack benchmarks, H2O at 20% budget degrades 30–40% compared to full cache. The paper shows this; don't miss it in the ablations.

Instruction drift over long generation. System prompt tokens (instructions like "answer concisely" or "do not mention competitors") tend to receive high initial attention but declining attention as generation continues. Over a 2,000-token output, some instruction tokens' cumulative scores can fall below document-entity heavy hitters and get evicted. The model then silently stops following the instruction mid-generation. This is intermittent and hard to debug because the output still looks coherent — it's just not following the rules anymore.

Multi-turn attention score reset. H2O scores are per-request. In a multi-turn conversation, each turn starts with an empty score vector. The KV cache from previous turns (if retained across turns via a shared cache) doesn't carry over its historical scores — those were from a different request context. This means at the start of each turn, the model has no signal about which earlier-conversation tokens are heavy hitters, and falls back to retaining everything until the new turn's generation builds up scores. The eviction oracle is blind for the first several decode steps of a new turn.

Adversarial prompt injection via attention score manipulation. If an attacker knows you're using H2O, they can craft inputs where safety-relevant tokens (e.g., "do not generate harmful content") receive low attention (by surrounding them with high-attention distractors that dominate the softmax). Those instructions get evicted. The attacker's actual malicious request then proceeds without the safety instructions in the KV cache. This is a realistic threat model for any system that processes user-controlled content with H2O. The mitigation is to pin safety-critical tokens — mark them as non-evictable regardless of score — and treat H2O as applying only to the "body" content.

When NOT to use H2O

When your task requires exact or near-exact recall. Legal document review (specific clause citations), medical records (exact lab values), financial analysis (specific numbers from tables), compliance checking (exact regulatory language) — these require reproducing specific tokens from specific positions. H2O's eviction is probabilistic: it usually retains important tokens, but "usually" isn't acceptable when getting the clause number wrong constitutes malpractice. Use full KV cache or apply H2O only to parts of the context you can afford to summarize.

When context length is short enough to fit without eviction. If your typical sequence fits comfortably within available HBM at full KV cache, don't evict. H2O's benefit is purely in memory-constrained regimes. Adding eviction overhead to a system that doesn't need it trades a small compute cost for zero gain and introduces the failure modes above.

When generation is very short. H2O's eviction decisions require the model to generate enough tokens that cumulative scores have time to differentiate important from unimportant positions. For generation under ~50 tokens from a long context, there isn't enough signal — every position has scores near zero, and eviction degenerates to arbitrary deletion. For short-output use cases (classification, extraction with fixed output format, short answers), keep the full KV cache.

When streaming with continuous prefill additions. H2O was designed for fixed-context inference: the input is set, generation proceeds, eviction happens during decode. If your serving setup adds new context dynamically during generation (live document feeds, tool call results injected mid-stream), the score vector doesn't reflect the newly-added content's history and the eviction decisions become unreliable. StreamingLLM's positional approach handles streaming better because it doesn't depend on historical attention data.

When you need reproducible outputs. H2O's eviction decisions depend on the cumulative attention scores at each step, which in turn depend on the exact sequence of tokens generated. Two runs with identical inputs but different initial decoding (different temperature samples) will make different eviction decisions, retaining different subsets of the KV cache, and subsequently generate different completions in ways that are difficult to reason about. For deterministic, reproducible inference pipelines — test suites, regression testing, output diffing — full KV cache is significantly easier to reason about.

What this changes about how you think about KV cache

The conventional mental model: KV cache is the thing you want as big as possible, bounded by HBM. More cache = better quality. The only question is memory management efficiency (PagedAttention) and whether you can avoid recomputing it (RadixAttention).

H2O adds a different axis: the content you keep matters as much as how much you keep. An 80% full KV cache containing the wrong 80% of tokens will perform worse than a 20% cache containing the right 20%. Eviction is not degradation — selective eviction is an optimization if your selection is good.

This reframing has practical consequences. Instead of asking "how do I increase my KV cache budget," the better question is "what's the minimum KV budget I need for my task distribution?" For a conversational assistant mostly handling 2K-token contexts, 80% of full cache is overkill — the heavy hitters in a 2K context are already well within a 50% budget. For a document analysis pipeline with 100K-token inputs, the question is whether H2O's oracle is good enough for your specific documents and questions, which is measurable.

The measurement is the important part. Run your actual workload — real documents, real questions — through H2O at 20%, 40%, and 60% budget. Record F1 or pass rate on held-out questions that require specific recall. Find the knee of the quality-vs-budget curve. That knee is probably somewhere between 30-50% for typical mixed workloads. That's the number you actually deploy at, not a guess from the paper's benchmarks.

Our document analysis pipeline ended up at 35% KV budget after profiling. TTFT dropped by 60% (less KV to load on each decode step). Throughput went from 1 concurrent 128K request per GPU to 3. Quality on our internal benchmark: 97.2% of full-cache F1. The 2.8% degradation was entirely from very long contracts (>100K tokens) where we'd expected some loss.

The one failure we caught before production: a clause in one contract that defined a technical term used frequently later in the document. The definition itself had low attention (it appeared once, briefly), but the term it defined became a heavy hitter later. H2O evicted the definition, the model answered questions about the term from parametric knowledge rather than the contract-specific definition. We pinned any sentence containing definitional keywords ("means," "defined as," "shall mean") as non-evictable. Problem solved.

That's the operating lesson from H2O: it mostly gets the right answer about what to keep, but the places where it's wrong are systematically predictable — definitional content, one-time mentions of critical facts, safety instructions. Know your failure modes and pin your exceptions.


H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models — Zhang, Yang, Ibrahim, Theobald, Wang, Sun, Ali, Shah, Fatahalian, Farooqui, Papailiopoulos. NeurIPS 2023.

Related reading

  • Mamba: What the Selective State Space Paper Actually Says

    Transformers scale quadratically with sequence length and carry a growing KV cache that never shrinks. Mamba proposes a different trade: compress context into a fixed-size state, process tokens recurrently at inference, and do it faster than attention at long sequences. Here's the mechanism, what it actually buys you, and where it quietly fails.

  • Mixture of Depths: What the Paper Actually Says

    Every transformer layer processes every token, even when 90% of that work does nothing useful. Mixture of Depths lets the model skip layers for tokens that don't need them — and the compute budget stays completely predictable.

  • SARATHI: What the Chunked-Prefill Paper Actually Says

    Continuous batching fixed GPU utilization but created a new problem: long prefills stall every in-flight decode for seconds. SARATHI splits prefill into chunks and interleaves them with decode, eliminating the stall without disaggregating your cluster. Here's the mechanism, the math, and where it breaks down.

← All writing