Tag

Training

4 posts tagged Training.

July 6, 2026 · 15 min read

Ring Attention: What the Near-Infinite Context Paper Actually Says

Extending context beyond what fits on one GPU isn't just a memory problem — it's a communication design problem. Ring Attention sequences the K/V data through a ring of devices and hides the transfers behind computation. Here's what that actually costs in production.

AI, Distributed Systems, Training, Production, LLMs, Inference

June 29, 2026 · 13 min read

DeepSeek-V3: What the Frontier-on-a-Budget Paper Actually Says

DeepSeek-V3 is a 671B-parameter frontier model trained for a reported ~$5.5M in GPU time. The paper is less about model quality and more about whether the training stack itself is the bottleneck, and how to engineer around it.

AI, LLMs, Training, Systems, Production

May 22, 2026 · 13 min read

Switch Transformers: What the Sparse MoE Scaling Paper Actually Says

Many modern large models, including Mixtral, DeepSeek, and Gemini, route tokens through sparse experts. Much of their design traces back to one 2021 Google Brain paper. The paper is worth reading because the failure modes live in the routing logic, not the math.

AI, Systems, Production, LLMs, Training, MoE

May 10, 2026 · 11 min read

The Llama 3 Herd of Models: What the Paper Actually Says

Llama 3's 405B benchmark numbers are fine. The paper is actually about something more useful: what decisions you make when you can train on 15T tokens across 16K H100s, and which of those decisions transfer to your deployment.

AI, LLMs, Training, Systems, Production