Tag: Systems

8 posts tagged Systems.

Titans: What the Test-Time Memorization Paper Actually Says

Attention is quadratic in sequence length. SSMs compress everything into a fixed-size state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.

Switch Transformers: What the Sparse MoE Scaling Paper Actually Says

Many of the largest modern models, including Mixtral, DeepSeek, and Gemini, route tokens through sparse experts. The design decisions in all of them trace back to one 2021 Google Brain paper. The paper is worth reading because the failure modes live in the routing logic, not the math.

Pregel: What the Large-Scale Graph Processing Paper Actually Says

PageRank in MapReduce reloads the full dataset from disk on every iteration. Pregel fixes this by keeping the graph in memory across iterations and replacing disk I/O with message passing. The 'think like a vertex' model is the insight; bulk synchronous parallel (BSP) is the implementation.

Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says

DistServe disaggregates prefill from decode. SGLang keeps its KV cache local to each instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes around, turning a fleet of 50 isolated caches into one coherent pool.

LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says

INT8 quantization works well for vision models. For LLMs above ~7B parameters it silently degrades quality, unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.

The Llama 3 Herd of Models: What the Paper Actually Says

Llama 3's 405B benchmark numbers are fine. The paper is actually about something more useful: what decisions you make when you can train on 15T tokens across 16K H100s, and which of those decisions transfer to your deployment.

EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says

Standard speculative decoding tops out at 2–2.4x speedup because token-level draft prediction is hard. EAGLE sidesteps this by predicting at the feature level instead — the second-to-top-layer hidden states — which are far more predictable than discrete tokens. The result is 3–4x with a draft head that's 0.3% the size of the target model.

Cassandra: What the Paper Actually Says

We had a Cassandra cluster where DELETE operations made reads progressively slower until queries timed out. Adding more disk space made it worse. The root cause is described precisely in the 2009 paper, but only if you understand that Cassandra cannot actually delete data in place.