Self-paced course

LLM Inference & Serving

How large language models actually run in production: KV-cache management, batching, speculative decoding, quantization, and the systems work that turns a model into a servable endpoint.

4 lessons · ~1h total · Free

Curriculum

  1. EAGLE: Speculative Decoding with Feature-Level Prediction — What the Paper Actually Says (12 min read)
  2. LLM.int8(): What the 8-bit Matrix Multiplication Paper Actually Says (10 min read)
  3. Mooncake: What the KV-Cache-Centric Disaggregated Serving Paper Actually Says (14 min read)
  4. Titans: What the Test-Time Memorization Paper Actually Says (9 min read)