Tag
AI
6 posts tagged AI.
Titans: What the Test-Time Memorization Paper Actually Says
Attention is quadratic. SSMs compress everything into a fixed state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.
Switch Transformers: What the Sparse MoE Scaling Paper Actually Says
Many of today's largest models, Mixtral, DeepSeek, and Gemini among them, route tokens through sparse experts. Much of their design traces back to one 2021 Google Brain paper. The paper is worth reading because the failure modes live in the routing logic, not the math.
Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says
DistServe disaggregates prefill from decode. SGLang keeps a separate KV cache on each instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes requests around, turning a fleet of 50 isolated caches into one coherent pool.
LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says
INT8 quantization works well for vision models. For LLMs above ~7B parameters, naive INT8 quantization silently destroys quality, unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.
The Llama 3 Herd of Models: What the Paper Actually Says
Llama 3's 405B benchmark numbers are fine. The paper is actually about something more useful: what decisions you make when you can train on 15T tokens across 16K H100s, and which of those decisions transfer to your deployment.
EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says
Standard speculative decoding tops out at 2–2.4x speedup because token-level draft prediction is hard. EAGLE sidesteps this by predicting at the feature level instead — the second-to-top-layer hidden states — which are far more predictable than discrete tokens. The result is 3–4x with a draft head that's 0.3% the size of the target model.