Tag

Inference

10 posts tagged Inference.

July 6, 2026 · 15 min read

Ring Attention: What the Near-Infinite Context Paper Actually Says

Extending context beyond what fits on one GPU isn't just a memory problem; it's a communication design problem. Ring Attention passes K/V blocks around a ring of devices and hides the transfers behind computation. Here's what that actually costs in production.

AI, Distributed Systems, Training, Production, LLMs, Inference

June 27, 2026 · 14 min read

H2O: Heavy-Hitter Oracle for KV Cache Eviction — What the Paper Actually Says

StreamingLLM keeps the first tokens and a sliding window — a positional bet. H2O makes a different bet: keep the tokens that actually received attention, not the ones that happened to appear early. The paper shows you can cut KV cache by 5× with a greedy per-step eviction policy that costs almost nothing to implement.

AI, Systems, Production, LLMs, Inference

June 19, 2026 · 11 min read

Mamba: What the Selective State Space Paper Actually Says

Transformers scale quadratically with sequence length and carry a growing KV cache that never shrinks. Mamba proposes a different trade: compress context into a fixed-size state, process tokens recurrently at inference, and do it faster than attention at long sequences. Here's the mechanism, what it actually buys you, and where it quietly fails.

AI, Systems, Production, LLMs, Inference

June 13, 2026 · 14 min read

Mixture of Depths: What the Paper Actually Says

Every transformer layer processes every token, even when 90% of that work does nothing useful. Mixture of Depths lets the model skip layers for tokens that don't need them — and the compute budget stays completely predictable.

AI, Systems, Production, LLMs, Inference

June 8, 2026 · 11 min read

SARATHI: What the Chunked-Prefill Paper Actually Says

Continuous batching fixed GPU utilization but created a new problem: long prefills stall every in-flight decode for seconds. SARATHI splits prefill into chunks and interleaves them with decode, eliminating the stall without disaggregating your cluster. Here's the mechanism, the math, and where it breaks down.

AI, Systems, Production, LLMs, Inference

June 3, 2026 · 11 min read

SGLang and RadixAttention: What the Paper Actually Says

Multi-call LLM programs recompute the same KV cache thousands of times per day. RadixAttention fixes this by maintaining a global radix tree of cached KV blocks — turning redundant prefill into a lookup. Here's the mechanism, the numbers, and where it falls apart.

AI, Systems, Production, LLMs, Inference

May 27, 2026 · 9 min read

Titans: What the Test-Time Memorization Paper Actually Says

Attention is quadratic. SSMs compress everything into a fixed state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.

AI, Systems, Production, LLMs, Inference, Architecture

May 18, 2026 · 14 min read

Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says

DistServe disaggregates prefill from decode. SGLang keeps a separate KV cache on each instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes around, turning a fleet of 50 isolated caches into one coherent pool.

AI, Systems, Production, LLMs, Inference

May 13, 2026 · 10 min read

LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says

INT8 quantization works perfectly for vision models. For LLMs above ~7B parameters it silently destroys quality — unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.

AI, Systems, Production, LLMs, Inference

May 7, 2026 · 12 min read

EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says

Standard speculative decoding tops out at 2–2.4x speedup because token-level draft prediction is hard. EAGLE sidesteps this by predicting at the feature level instead — the second-to-top-layer hidden states — which are far more predictable than discrete tokens. The result is 3–4x with a draft head that's 0.3% the size of the target model.

AI, Systems, Production, LLMs, Inference