Writing

Blog

Technical writing by Ashwani Jha on distributed systems and AI engineering: production LLM infrastructure, agent observability, RAG, and system design.

Toolformer: What the Paper Actually Says

A 6.7B model beats GPT-3 175B on math by learning to use a calculator. Toolformer's self-supervised training pipeline is the interesting part — and it's more constrained than the demos imply.

Dapper: What Google's Distributed Tracing Paper Actually Says

Every distributed tracing tool you use — Jaeger, Zipkin, OpenTelemetry — descends from one design decision Google made in 2010: sample at the trace root, not per-span. The paper explains why, and the failure modes it didn't fully solve.

DeepSeek-V3: What the Frontier-on-a-Budget Paper Actually Says

DeepSeek-V3 trained a 671B-parameter frontier model for ~$5.5M. The paper is less about model quality and more about whether the training stack itself is the bottleneck — and how to engineer around it.

H2O: Heavy-Hitter Oracle for KV Cache Eviction — What the Paper Actually Says

StreamingLLM keeps the first tokens and a sliding window — a positional bet. H2O makes a different bet: keep the tokens that actually received attention, not the ones that happened to appear early. The paper shows you can cut KV cache by 5× with a greedy per-step eviction policy that costs almost nothing to implement.

Kafka: What the Original Paper Actually Says

The original Kafka paper from 2011 had no replication. A broker failure made every unconsumed message stored on that broker unavailable, and permanently lost if the broker's storage was damaged. The paper treats this as a limitation to fix later, not a deal-breaker. Understanding why explains more about Kafka's design philosophy than any architecture diagram.

Mamba: What the Selective State Space Paper Actually Says

Transformers scale quadratically with sequence length and carry a growing KV cache that never shrinks. Mamba proposes a different trade: compress context into a fixed-size state, process tokens recurrently at inference, and do it faster than attention at long sequences. Here's the mechanism, what it actually buys you, and where it quietly fails.

MapReduce: What the Google Paper Actually Says

The 2004 Google paper that gave us Hadoop — and everything that replaced it — is worth reading not for the map/reduce abstraction itself, but for the fault tolerance model and the straggler insight. The failure modes are still the failure modes.

Mixture of Depths: What the Paper Actually Says

Every transformer layer processes every token, even when 90% of that work does nothing useful. Mixture of Depths lets the model skip layers for tokens that don't need them — and the compute budget stays completely predictable.

SARATHI: What the Chunked-Prefill Paper Actually Says

Continuous batching fixed GPU utilization but created a new problem: long prefills stall every in-flight decode for seconds. SARATHI splits prefill into chunks and interleaves them with decode, eliminating the stall without disaggregating your cluster. Here's the mechanism, the math, and where it breaks down.

SGLang and RadixAttention: What the Paper Actually Says

Multi-call LLM programs recompute the same KV cache thousands of times per day. RadixAttention fixes this by maintaining a global radix tree of cached KV blocks — turning redundant prefill into a lookup. Here's the mechanism, the numbers, and where it falls apart.

T5: What the Text-to-Text Paper Actually Says

Every instruction-tuned model today owes something to T5's core idea: every NLP task is just sequence-to-sequence. But the paper's real contribution is a systematic ablation of what actually helps in transfer learning — and several of the answers are counterintuitive.

Titans: What the Test-Time Memorization Paper Actually Says

Attention is quadratic. SSMs compress everything into a fixed state. Titans takes a third path: a neural memory module whose weights are the memory, updated via gradient descent at inference time. Here's the mechanism, what the benchmark numbers actually say, and why you shouldn't deploy this without thinking carefully about inference cost.

Switch Transformers: What the Sparse MoE Scaling Paper Actually Says

Every modern large model — Mixtral, DeepSeek, Gemini — routes tokens through sparse experts. The design decisions in all of them trace back to one 2021 Google Brain paper. The paper is worth reading because the failure modes live in the routing logic, not the math.

Pregel: What the Large-Scale Graph Processing Paper Actually Says

PageRank in MapReduce reloads the full dataset on every iteration. Pregel fixes this by keeping the graph in memory across iterations and replacing disk I/O with message passing. The 'think like a vertex' model is the insight; BSP is the implementation.

Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says

DistServe disaggregates prefill from decode. SGLang caches KV blocks per instance. Mooncake goes further: it treats the KV cache as a shared, distributed resource that the scheduler routes around, turning a fleet of 50 isolated caches into one coherent pool.

LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says

INT8 quantization works perfectly for vision models. For LLMs above ~7B parameters it silently destroys quality — unless you understand why outlier features emerge at scale and how mixed-precision decomposition works around them.

The Llama 3 Herd of Models: What the Paper Actually Says

Llama 3's 405B benchmark numbers are fine. The paper is actually about something more useful: what decisions you make when you can train on 15T tokens across 16K H100s, and which of those decisions transfer to your deployment.

EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says

Standard speculative decoding tops out at 2–2.4x speedup because token-level draft prediction is hard. EAGLE sidesteps this by predicting at the feature level instead: it drafts the second-to-top-layer hidden states, which are far more predictable than discrete tokens. The result is a 3–4x speedup with a draft head that's 0.3% of the size of the target model.

Cassandra: What the Paper Actually Says

We had a Cassandra cluster where DELETE operations made reads progressively slower until queries timed out. Adding more disk space made it worse. The root cause is described precisely in the 2009 paper — but only if you understand that Cassandra cannot actually delete data.