Chinchilla: what the compute-optimal training paper actually says

Reading Hoffmann et al. (DeepMind, 2022) while trying to decide whether to scale up a model or train it longer.

The question came up during a pretraining run planning meeting: given a fixed compute budget, do you build a bigger model or feed more data to a smaller one? The Kaplan scaling laws from 2020 had an answer: favor model size, and grow parameters faster than training tokens. It was clean, citable, and, as DeepMind would show about two years later, substantially wrong.

"Training Compute-Optimal Large Language Models," Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, and Sifre — 22 authors, DeepMind, 2022, informal title Chinchilla — reversed the Kaplan recommendation, demonstrated the reversal with a working model, and reset how serious training operations allocate compute.

What Kaplan got wrong, and why it matters

To understand Chinchilla, you need to understand what Kaplan (OpenAI, 2020) actually said and where the methodology broke.

Kaplan's central finding on compute-optimal allocation:

N_opt ∝ C^0.73
D_opt ∝ C^0.27

For every 10x increase in compute budget, grow parameters by roughly 5.4x and training tokens by only 1.9x. The implication: parameters are more valuable than data. Invest in model size.
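As a quick check on those multipliers, here is a minimal Python sketch. The function name growth_factors is hypothetical; the exponents are the ones quoted above.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """Return (parameter growth, token growth) for a given compute multiplier."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent


# Kaplan-style exponents quoted above: N_opt ~ C^0.73, D_opt ~ C^0.27.
n_growth, d_growth = growth_factors(10, 0.73, 0.27)
print(f"params x{n_growth:.2f}, tokens x{d_growth:.2f}")
```

The two factors multiply back to the compute multiplier because the exponents sum to 1. The same function works for any other pair of exponents.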

This drove GPT-3 (175B parameters, 300B tokens), Gopher (280B parameters, 300B tokens), and Megatron-Turing NLG (530B parameters, 270B tokens). All made the same bet: big models, relatively few training tokens.

Chinchilla's methodology critique is subtle but important. Kaplan ran experiments by fixing model size and varying training duration, or by fixing compute and varying model size across a narrow range. Both approaches sample sparsely from the (N, D, C) space and fit power laws to those samples. The issue: Kaplan's experiments were heavily concentrated in the low-token regime, so the power law fit was extrapolating into a region with little empirical support. The paper also points to a more specific culprit: Kaplan's runs shared a fixed token count and learning rate schedule, so a loss read off partway through a long cosine schedule overstates what a run planned for that shorter length would reach, which biases the fit toward larger models.

Chinchilla ran what the authors call IsoFLOP profiles: fix the total training compute C, then systematically vary N and D across a wide range while holding the compute approximately constant (since C ≈ 6ND for transformer training FLOPs). Instead of extrapolating from sparse samples, they measured loss directly across the parameter-token tradeoff for each compute level. They also ran two other independent analyses, one based on smoothed training loss curves and one that fits a parametric loss function and finds its minimum, and all three approaches converge on roughly the same answer.
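To make the IsoFLOP idea concrete, here is a minimal sketch that sweeps model size at fixed compute and picks the lowest-loss point. It runs no training: the loss function is an assumed parametric form, L(N, D) = E + A/N^alpha + B/D^beta, and its constants are illustrative placeholders rather than values to rely on. The function names are hypothetical.

```python
# Assumed loss model: L(N, D) = E + A / N**ALPHA + B / D**BETA.
# These constants are illustrative placeholders, not authoritative fits.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Stand-in for the measured final loss of a training run."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def isoflop_profile(compute, n_min=1e8, n_max=1e12, steps=200):
    """Sweep model size at fixed compute; tokens follow from C = 6 * N * D."""
    profile = []
    for i in range(steps):
        # Log-spaced model sizes between n_min and n_max.
        n_params = n_min * (n_max / n_min) ** (i / (steps - 1))
        n_tokens = compute / (6 * n_params)
        profile.append((n_params, n_tokens, loss(n_params, n_tokens)))
    return profile


def best_point(profile):
    """Return the (N, D, loss) triple with the lowest loss on the profile."""
    return min(profile, key=lambda point: point[2])


for compute in (1e20, 1e21, 1e22):
    n_params, n_tokens, best_loss = best_point(isoflop_profile(compute))
    print(f"C={compute:.0e}: N~{n_params:.2e}, D~{n_tokens:.2e}, loss~{best_loss:.3f}")
```

In a real study each point on the profile is a separate training run, and the minimum is read from a curve fitted through the measured losses. Under this assumed loss the minimum always falls in the interior of the sweep, which is the property the IsoFLOP method relies on.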

The Chinchilla finding

The compute-optimal allocation, by Chinchilla's measurement:

N_opt ∝ C^0.50
D_opt ∝ C^0.50

Parameters and training tokens should scale equally as compute grows. For every 10x compute increase, grow both model size and token count by roughly 3.16x. The ratio of tokens to parameters at the optimum:

D_opt ≈ 20 × N_opt

Twenty training tokens per parameter, approximately, across the compute budgets the paper studies. For a 10B parameter model, the compute-optimal training run is roughly 200B tokens. For a 70B parameter model, about 1.4T tokens.
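The rule of thumb is simple enough to write down directly. A minimal sketch, with a hypothetical function name and the ratio treated as an approximate constant:

```python
TOKENS_PER_PARAM = 20  # approximate ratio quoted above, not an exact constant


def compute_optimal_tokens(n_params):
    """Approximate compute-optimal token count for a given parameter count."""
    return TOKENS_PER_PARAM * n_params


for n_params in (10e9, 70e9):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params / 1e9:.0f}B params -> {tokens / 1e12:.1f}T tokens")
```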

This is the root of the headline result: Gopher (280B parameters, 300B tokens) saw roughly 4-5x fewer tokens than the compute-optimal allocation for its budget. At Gopher's compute budget, the Chinchilla-optimal model would be around 70B parameters trained on 1.4T tokens.

So that's what they built. Chinchilla: 70B parameters, 1.4T tokens, same compute as Gopher. It outperformed Gopher on most benchmarks, and also matched or beat GPT-3, Megatron-Turing NLG, and Jurassic-1, with 4x fewer parameters than Gopher.
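The "same compute" claim can be sanity-checked with the C ≈ 6ND approximation. This is a rough sketch, not the paper's own FLOP accounting; the function name is hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate transformer training compute: C = 6 * N * D."""
    return 6 * n_params * n_tokens


gopher = training_flops(280e9, 300e9)      # 280B params, 300B tokens
chinchilla = training_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio:      {chinchilla / gopher:.2f}")
```

Under this approximation the two budgets land within about 17% of each other, which is "the same compute" at the precision scaling laws work at.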

Why the methodology difference produced different results

The key insight is that the Kaplan experiments systematically undersampled the data-rich regime. When you train small experiments mostly with few tokens, the fitted power law captures the relationship in that regime, but the curvature of the loss surface is different in the token-rich regime. Specifically, loss has diminishing returns to both N and D, but Kaplan's data mostly captured the regime where returns to N were still high and returns to D were not well-measured.

Chinchilla's IsoFLOP approach forces experiments across the full (N, D) frontier for a given C, so the curvature is measured directly. The shape of the loss surface in token-rich settings turned out to be more favorable to data than Kaplan's extrapolations predicted.

There's an analogy to classic bias in experimental design: Kaplan optimized for smooth power law fits from a well-clustered set of data points; Chinchilla optimized for coverage across the full parameter space. The second approach found a different optimum because it actually measured where Kaplan was extrapolating.

Concrete numbers from the paper

The two approximations above, C ≈ 6ND and D ≈ 20N, are enough to turn a compute budget into a back-of-envelope allocation. Substituting one into the other gives C ≈ 120 × N^2, so:

N_opt ≈ 0.091 × C^0.50   (N in parameters, C in FLOPs)
D_opt ≈ 1.83 × C^0.50   (D in tokens, C in FLOPs)

These prefactors follow from the 20:1 heuristic; they are not fitted constants quoted from the paper. For a training budget of about 5.76 × 10^23 FLOPs (roughly the cost of Gopher):

N_opt ≈ 0.091 × (5.76 × 10^23)^0.50 ≈ 0.091 × 7.59 × 10^11 ≈ 69B parameters
D_opt ≈ 1.83 × (5.76 × 10^23)^0.50 ≈ 1.83 × 7.59 × 10^11 ≈ 1.39T tokens

The paper cautions that its fits are estimated from models up to ~16B parameters trained on up to ~500B tokens. Extrapolating beyond that range extends the power law into unmeasured territory, a caveat that gets dropped in almost every downstream citation.
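The allocation arithmetic can be sketched in a few lines. This version derives the prefactors from C ≈ 6ND and D ≈ 20N instead of hard-coding them, so it is only as good as those two approximations. The function name is hypothetical, and the 5.76e23 FLOP budget is an assumed input in the range usually quoted for Gopher-scale training.

```python
import math

TOKENS_PER_PARAM = 20       # approximate optimum ratio, D = 20 * N
FLOPS_PER_PARAM_TOKEN = 6   # training compute approximation, C = 6 * N * D


def compute_optimal_allocation(compute_flops):
    """Solve C = 6 * N * (20 * N) for N, then set D = 20 * N."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Assumed budget for illustration.
n_params, n_tokens = compute_optimal_allocation(5.76e23)
print(f"N ~ {n_params / 1e9:.0f}B params, D ~ {n_tokens / 1e12:.2f}T tokens")
```

With this budget the sketch lands near 69B parameters and 1.39T tokens, close to the configuration Chinchilla was actually trained with.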

What this means for inference cost

The production implication of Chinchilla isn't just "train differently." It's that undertrained large models are strictly dominated on cost by compute-optimal smaller models.

Serving a 280B parameter model costs roughly 4x more per token than serving a 70B parameter model: in autoregressive decoding every weight is read for each generated token, so memory traffic scales roughly linearly with parameter count. If Chinchilla-70B matches Gopher-280B on your task, you've cut your serving bill by roughly 4x with no quality loss, without any architecture changes, quantization, or system-level tricks.
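A rough sketch of where the 4x comes from, assuming a dense model, 16-bit weights, batch size 1, and ignoring KV-cache traffic. All of these are simplifying assumptions, and the function name is hypothetical.

```python
def weight_bytes_per_decoded_token(n_params, bytes_per_param=2):
    """Bytes of weights streamed from memory per decoded token.

    Assumes a dense model whose full weights are read once per token,
    16-bit weights by default, and no KV-cache traffic.
    """
    return n_params * bytes_per_param


large = weight_bytes_per_decoded_token(280e9)
small = weight_bytes_per_decoded_token(70e9)
print(f"280B: {large / 1e9:.0f} GB/token, 70B: {small / 1e9:.0f} GB/token")
print(f"ratio: {large / small:.1f}x")
```

Batching amortizes the weight reads across requests, so the absolute numbers change with serving setup, but the ratio between the two models stays the same under these assumptions.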

This is why Chinchilla landed hard in production AI. It wasn't an abstract scaling law result. It was a concrete, reproducible demonstration that the best models of 2021 had left 4x inference efficiency on the table by over-parameterizing and under-training.

The Llama extension: beyond compute-optimal

Chinchilla identifies the compute-optimal allocation — the model size and token count that minimizes loss for a given training compute budget. But compute-optimal training isn't inference-optimal serving.

Meta's Llama papers make this explicit. Llama-3-8B is trained on roughly 15 trillion tokens, nearly two orders of magnitude more than the roughly 160B tokens the 20:1 ratio suggests for an 8B model. This deliberately over-trains the model past the compute-optimal point: you spend more training compute than necessary to achieve a given loss level, but you achieve that loss level in a smaller model.

Why? Because if you're going to serve the model at massive scale, the amortization flips. The extra training compute is a one-time cost. The inference cost savings from using an 8B model instead of a 30B model are paid on every token generated, forever. At Meta's serving volumes, training past Chinchilla-optimal produces enormous cumulative savings.
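The amortization argument can be put into numbers with a break-even sketch. It assumes training costs about 6ND FLOPs and inference about 2N FLOPs per generated token (a common forward-pass approximation), and it compares a hypothetical pair of models that are simply assumed to reach the same quality. Function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: C = 6 * N * D."""
    return 6 * n_params * n_tokens


def inference_flops_per_token(n_params):
    """Assumed forward-pass cost per generated token: about 2 * N FLOPs."""
    return 2 * n_params


def break_even_served_tokens(small, large):
    """Served tokens at which the smaller, over-trained model pays back.

    small and large are (n_params, n_training_tokens) pairs that are
    assumed, not shown, to reach the same quality.
    """
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = (inference_flops_per_token(large[0])
                        - inference_flops_per_token(small[0]))
    return extra_training / saving_per_token


# Hypothetical pair: an 8B model on 15T tokens vs a 30B model on 600B tokens.
tokens = break_even_served_tokens(small=(8e9, 15e12), large=(30e9, 600e9))
print(f"break-even after ~{tokens:.2e} served tokens")
```

For this pair the break-even is on the order of 10^13 served tokens. FLOPs are only a proxy for cost, since training and serving hardware run at different utilization, so treat the result as an order-of-magnitude guide.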

Combined with serving economics, Chinchilla's result implies a practical rule: past a certain serving scale, it's nearly always worth spending extra training compute to reduce the inference model size. The question is just how far past compute-optimal to push. The answer depends on your specific inference volume and cost structure, but many major model families released since 2023 have moved in this direction.

Production tradeoffs

Chinchilla tells you where to put training compute, not whether to train. If you're fine-tuning a base model, not pretraining from scratch, Chinchilla doesn't directly apply. You're starting from a non-random initialization, using a much smaller dataset, and optimizing for a different objective. The compute-optimal framework assumes pretraining from scratch. Fine-tuning has its own scaling properties, and the 20:1 token-to-parameter ratio has no direct analog there.

Data quality changes the constants, not the structure. Chinchilla was trained on MassiveText, a heavily filtered, deduplicated corpus drawn from web pages, books, news, and code. The 20 tokens-per-parameter heuristic is calibrated to that data quality level. With noisier data, the effective number of training tokens is lower: a duplicated or low-quality token contributes less signal than a clean one. Teams that apply Chinchilla directly to an uncleaned data budget can find their models underperforming the prediction, then discover that a large share of the corpus (30% near-duplicates, say) was redundant. Deduplication is load-bearing here, not a nicety.

The 20:1 ratio is an optimum for a fixed compute budget, not a stopping rule. Chinchilla identifies the compute-optimal point: where, for a given training compute, loss is minimized. For a fixed model size, loss continues to decrease (slowly) if you spend more compute and train past that point. Chinchilla doesn't say "stop at 20 tokens per parameter." It says "compute-optimal is around 20:1." Whether to continue training depends on your serving economics and data availability, not on the paper's formula.

The paper's compute range may not generalize to current scale. Chinchilla's scaling experiments range from roughly 70M to 16B parameters, with training data up to ~500B tokens, and the exponents are fit from that range. Extending the compute-optimal formula to 70B, 100B, or 400B parameter models is extrapolation. The industry has largely adopted "20 tokens per parameter" as a practical heuristic while knowing the underlying fit came from smaller experiments. The directional claim is robust; the precise constants at scale are less so.

Failure modes in practice

Treating Chinchilla as a hard stop rather than an optimum. Some teams interpret "compute-optimal at 20 tokens per parameter" as "stop training at 20 tokens per parameter." If you have more data available and a serving-dominated cost structure, stopping there is leaving quality on the table. The compute-optimal allocation minimizes loss for a given training budget; it doesn't maximize quality subject to a data budget. Those are different optimizations.

Applying the token count without thinking about data diversity. If your training corpus has 1T tokens but 200B of them are near-duplicates, your effective token count is closer to 800B than to the nominal 1T. Models trained on highly repeated data tend to memorize rather than generalize. The Chinchilla paper's token counts assume deduplicated data; treating a 1T-token raw crawl as equivalent to a 1T-token deduplicated corpus will produce worse models than the formula predicts.
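A crude way to see the effect on sizing, assuming near-duplicate tokens contribute nothing at all (a deliberately pessimistic simplification) and using the 20:1 heuristic. Function names are hypothetical.

```python
def effective_token_budget(nominal_tokens, duplicate_tokens):
    """Tokens left after discarding near-duplicates.

    Crude assumption: a duplicated token counts for zero, which is a
    lower bound on its real contribution.
    """
    return nominal_tokens - duplicate_tokens


def supported_params(n_tokens, tokens_per_param=20):
    """Model size the token budget supports at the 20:1 heuristic."""
    return n_tokens / tokens_per_param


nominal = 1e12
effective = effective_token_budget(nominal, 200e9)
print(f"nominal:   {supported_params(nominal) / 1e9:.0f}B params")
print(f"effective: {supported_params(effective) / 1e9:.0f}B params")
```

Sizing against the nominal count suggests a 50B model; sizing against the deduplicated count suggests 40B. The gap is the part of the parameter budget the duplicates cannot support.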

Ignoring within-domain coverage. Chinchilla measures aggregate next-token prediction loss across the full training distribution. If your task is concentrated in a narrow subdomain — clinical records, legal text, a specific programming language — the aggregate loss optimum may not correspond to your within-domain optimum. A model that performs well on the broad pretraining distribution may still underperform in your target domain. Domain-specific fine-tuning or deliberate data mixture adjustments are often necessary even after achieving compute-optimal pretraining.

When not to use Chinchilla

When data is the binding constraint. Chinchilla's optimum assumes you can acquire as many training tokens as the formula requires. For specialized domains — clinical notes, proprietary codebases, low-resource languages — you may have 5B or 10B tokens of domain data with no path to more. In that case, the Chinchilla-optimal model for your data budget may be smaller than the task requires. The relevant constraint is data availability, not compute. Train the largest model that doesn't overfit your actual dataset, then augment with general data or accept domain coverage limitations.
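Running the heuristic in the other direction shows how small the data-limited optimum gets. A minimal sketch, assuming a single pass over the data, the 20:1 ratio, and C ≈ 6ND; the function name is hypothetical.

```python
def data_limited_plan(available_tokens, tokens_per_param=20):
    """Model size and training compute implied by a fixed token budget.

    Assumes one pass over the data and the approximate 20:1 ratio.
    """
    n_params = available_tokens / tokens_per_param
    compute_flops = 6 * n_params * available_tokens
    return n_params, compute_flops


for tokens in (5e9, 10e9):
    n_params, compute_flops = data_limited_plan(tokens)
    print(f"{tokens / 1e9:.0f}B tokens -> {n_params / 1e6:.0f}M params, "
          f"{compute_flops:.1e} FLOPs")
```

A 5B-token domain corpus supports roughly a 250M-parameter model under these assumptions, and 10B tokens roughly 500M. If the task needs more capacity than that, the constraint to relax is data, not compute.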

When capabilities require minimum model scale. Some capabilities — complex multi-step reasoning, certain coding tasks, specific forms of instruction following — appear to be threshold functions of parameter count rather than smooth functions of loss. A Chinchilla-optimal 70B model may outperform an undertrained 280B model on most tasks but fall short on specific tasks that require reasoning capacity that only emerges above a certain scale. If your application depends on those capabilities, compute-optimal allocation may not deliver them, and you need empirical evaluation at multiple model sizes rather than formula-driven selection.

When serving scale doesn't justify the training premium. The "train past Chinchilla-optimal to shrink inference cost" argument only holds when cumulative inference savings exceed the extra training cost. For a prototype, an internal tool, or a low-traffic service, this math doesn't work. The right model for a 1M-token-per-day service is the one that meets the quality bar with the lowest deployment cost — which may be a large pre-trained base model accessed via API at a few dollars per million tokens, not a custom-trained inference-optimal model.

When you're evaluating fine-tune suitability rather than base model quality. Chinchilla optimizes pretraining loss. Pretraining loss is a noisy proxy for fine-tune performance. Larger, undertrained base models sometimes retain more internal representational capacity for task-specific adaptation, even if their pretraining loss is higher than a Chinchilla-optimal alternative. The relationship between Chinchilla loss and fine-tuned task performance is real but not monotone across all settings and task types. Run both candidates through your fine-tuning pipeline and evaluate on your actual task before treating base model loss as the decision criterion.

What the paper actually gives you

Chinchilla is a measurement paper. It doesn't derive compute-optimal allocation from first principles — it measures it empirically across a well-designed set of experiments and fits a functional form to the results. What it gives you is a calibration: for a given training compute budget, the loss-minimizing model is considerably smaller and better-trained than the Kaplan recommendation would produce.

The practical upshot: if someone on your team suggests training a model significantly larger than (training_tokens / 20) parameters, they're in Kaplan territory and likely paying for parameters that aren't earning their inference cost. The burden of proof should be on why the task requires that model size, not why you'd consider a smaller one.

The 20:1 heuristic is not a law. It's an empirically fitted optimum at a particular data quality level and compute scale. But it's far better-calibrated than the pre-2022 intuition that model size was the primary lever. The industry's shift from GPT-3-era giant-models-and-sparse-data toward Llama-era medium-models-and-heavy-training is Chinchilla in production. You're already running its recommendations whether you've read the paper or not.

Training Compute-Optimal Large Language Models — Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, Rutherford, de las Casas, Hendricks, Welbl, Clark, Hennigan, Noland, Millican, van den Driessche, Damoc, Guy, Osindero, Simonyan, Elsen, Rae, Vinyals, Sifre. DeepMind, arXiv:2203.15556, 2022.