Tag: AI

16 posts tagged AI.

July 9, 2026 · 14 min read

MegaScale: What ByteDance's 12,288-GPU Training Paper Actually Says

At 12,288 GPUs, you see roughly one hardware failure per day. Standard training frameworks have no answer for this. MegaScale is ByteDance's account of what actually breaks when you scale LLM training to 10,000+ GPUs, and of the algorithm-system co-design changes that kept Model FLOPs Utilization (MFU) above 55%.

AI, Distributed Systems, Training, Production, LLMs, Infrastructure

July 6, 2026 · 15 min read

Ring Attention: What the Near-Infinite Context Paper Actually Says

Extending context beyond what fits on one GPU isn't just a memory problem; it's a communication design problem. Ring Attention passes K/V blocks around a ring of devices and hides the transfers behind computation. Here's what that actually costs in production.

AI, Distributed Systems, Training, Production, LLMs, Inference

July 3, 2026 · 11 min read

Toolformer: What the Paper Actually Says

A 6.7B model beats GPT-3 175B on math by learning to use a calculator. Toolformer's self-supervised training pipeline is the interesting part — and it's more constrained than the demos imply.

AI, LLMs, Tools, Fine-tuning, Production

June 29, 2026 · 13 min read

DeepSeek-V3: What the Frontier-on-a-Budget Paper Actually Says

DeepSeek-V3 trained a 671B-parameter frontier model for ~$5.5M. The paper is less about model quality and more about whether the training stack itself is the bottleneck — and how to engineer around it.

AI, LLMs, Training, Systems, Production

June 27, 2026 · 14 min read

H2O: Heavy-Hitter Oracle for KV Cache Eviction — What the Paper Actually Says

StreamingLLM keeps the first tokens and a sliding window — a positional bet. H2O makes a different bet: keep the tokens that actually received attention, not the ones that happened to appear early. The paper shows you can cut KV cache by 5× with a greedy per-step eviction policy that costs almost nothing to implement.

AI, Systems, Production, LLMs, Inference

June 19, 2026 · 11 min read

Mamba: What the Selective State Space Paper Actually Says

Transformers scale quadratically with sequence length and carry a growing KV cache that never shrinks. Mamba proposes a different trade: compress context into a fixed-size state, process tokens recurrently at inference, and do it faster than attention at long sequences. Here's the mechanism, what it actually buys you, and where it quietly fails.

AI, Systems, Production, LLMs, Inference

June 13, 2026 · 14 min read

Mixture of Depths: What the Paper Actually Says

Every transformer layer processes every token, even when 90% of that work does nothing useful. Mixture of Depths lets the model skip layers for tokens that don't need them — and the compute budget stays completely predictable.

AI, Systems, Production, LLMs, Inference

June 8, 2026 · 11 min read

SARATHI: What the Chunked-Prefill Paper Actually Says

Continuous batching fixed GPU utilization but created a new problem: long prefills stall every in-flight decode for seconds. SARATHI splits prefill into chunks and interleaves them with decode, eliminating the stall without disaggregating your cluster. Here's the mechanism, the math, and where it breaks down.

AI, Systems, Production, LLMs, Inference

June 3, 2026 · 11 min read

SGLang and RadixAttention: What the Paper Actually Says

Multi-call LLM programs recompute the same KV cache thousands of times per day. RadixAttention fixes this by maintaining a global radix tree of cached KV blocks — turning redundant prefill into a lookup. Here's the mechanism, the numbers, and where it falls apart.

AI, Systems, Production, LLMs, Inference

June 1, 2026 · 13 min read

T5: What the Text-to-Text Paper Actually Says

Every instruction-tuned model today owes something to T5's core idea: every NLP task is just sequence-to-sequence. But the paper's real contribution is a systematic ablation of what actually helps in transfer learning — and several of the answers are counterintuitive.

AI, LLMs, Fine-tuning, Production, Transfer Learning

May 27, 2026 · 9 min read

Titans: What the Test-Time Memorization Paper Actually Says

Attention is quadratic. SSMs compress everything into a fixed state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.

AI, Systems, Production, LLMs, Inference, Architecture

May 22, 2026 · 13 min read

Switch Transformers: What the Sparse MoE Scaling Paper Actually Says

Every modern large model — Mixtral, DeepSeek, Gemini — routes tokens through sparse experts. The design decisions in all of them trace back to one 2021 Google Brain paper. The paper is worth reading because the failure modes live in the routing logic, not the math.

AI, Systems, Production, LLMs, Training, MoE

May 18, 2026 · 14 min read

Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says

DistServe disaggregates prefill from decode. SGLang keeps a separate KV cache per instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes around, turning a fleet of 50 isolated caches into one coherent pool.

AI, Systems, Production, LLMs, Inference

May 13, 2026 · 10 min read

LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says

INT8 quantization works perfectly for vision models. For LLMs above ~7B parameters it silently destroys quality — unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.

AI, Systems, Production, LLMs, Inference

May 10, 2026 · 11 min read

The Llama 3 Herd of Models: What the Paper Actually Says

Llama 3's 405B benchmark numbers are fine. The paper is actually about something more useful: what decisions you make when you can train on 15T tokens across 16K H100s, and which of those decisions transfer to your deployment.

AI, LLMs, Training, Systems, Production

May 7, 2026 · 12 min read

EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says

Standard speculative decoding tops out at 2–2.4× speedup because token-level draft prediction is hard. EAGLE sidesteps this by predicting at the feature level instead, using the second-to-top-layer hidden states, which are far more predictable than discrete tokens. The result is a 3–4× speedup with a draft head that's 0.3% the size of the target model.

AI, Systems, Production, LLMs, Inference