Self-paced course

LLM Inference & Serving

How large language models actually run in production — KV-cache management, batching, speculative decoding, quantization, and the systems work that turns a model into a serveable endpoint.

10 lessons · ~2h total · Free

EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says (12 min read)
LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says (10 min read)
Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says (14 min read)
Titans: What the Test-Time Memorization Paper Actually Says (9 min read)
SGLang and RadixAttention: What the Paper Actually Says (11 min read)
SARATHI: What the Chunked-Prefill Paper Actually Says (11 min read)
Mixture of Depths: What the Paper Actually Says (14 min read)
Mamba: What the Selective State Space Paper Actually Says (11 min read)
H2O: Heavy-Hitter Oracle for KV Cache Eviction — What the Paper Actually Says (14 min read)
Ring Attention: What the Near-Infinite Context Paper Actually Says (15 min read)

Curriculum