Scaling laws are not just about research budgets

Reading Kaplan et al. (OpenAI 2020) while trying to explain why our fine-tuned 7B model outperformed the larger one in production.

You've picked a model. GPT-4 or Llama-3-70B for quality, Llama-3-8B for cost. You might be fine-tuning. You've probably benchmarked. But somewhere in that decision process, a question didn't get a rigorous answer: why does this model size, for this task, with this much training data, perform the way it does?

Kaplan et al., "Scaling Laws for Neural Language Models," OpenAI 2020, is the paper that tries to answer that systematically. It's cited constantly and read rarely. The blog post version — "bigger is better" — is true but useless. The paper version has production implications that most engineering teams I've worked with have had to rediscover the hard way.

What the paper actually measures

The core finding is that cross-entropy loss on language modeling scales as a smooth power law with three resources: model size (N, parameter count), dataset size (D, training tokens), and compute (C, FLOPs):

L(N) ≈ (N_c / N)^0.076
L(D) ≈ (D_c / D)^0.095
L(C) ≈ (C_c / C)^0.050

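Each relationship holds when the other two resources are not the bottleneck, and the trends run across many orders of magnitude: the models range from roughly 10^3 to 10^9 non-embedding parameters, from single GPU experiments to the largest models they could train, and some of the trends span more than seven orders of magnitude. The log-log plots are straight lines. No kinks, no inflection points, no saturation at the boundaries they studied.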
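To get a feel for how shallow these exponents are, here is a minimal sketch that takes the three exponents above at face value and assumes the other two resources are not the bottleneck. It computes what a 10x increase in each resource buys in loss terms. The function and dictionary names are mine, not the paper's.

```python
# Exponents as quoted above; each law is treated in isolation.
EXPONENTS = {"parameters": 0.076, "tokens": 0.095, "compute": 0.050}

def loss_ratio(scale_up, alpha):
    """Factor the loss is multiplied by when one resource grows by `scale_up`."""
    return scale_up ** (-alpha)

for name, alpha in EXPONENTS.items():
    ratio = loss_ratio(10.0, alpha)
    print(f"10x {name}: loss x {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")
```

The constants N_c, D_c, and C_c cancel out of the ratio, so only the exponents matter here. The takeaway is that an order of magnitude of any single resource moves the loss by a modest percentage, which is why these curves only look dramatic on log-log axes.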

That's the empirical claim. Here's what makes it useful: you can predict where you'll end up before you finish training. Plot your early loss curve, fit the power law, extrapolate. If your model is going to underperform what the law suggests, you'll know at 10% of compute spent, not 100%.
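The fit-and-extrapolate step can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the paper's fitting procedure: it fits a pure power law loss = a * compute^(-b) by least squares in log-log space and ignores any irreducible loss floor. The checkpoint values are made up for the example.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space. Returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def extrapolate(a, b, x):
    """Predicted loss at resource level x under the fitted power law."""
    return a * x ** (-b)

# Hypothetical early checkpoints: compute spent (arbitrary units) and eval loss.
early_compute = [1.0, 2.0, 4.0, 8.0]
early_loss = [4.00, 3.86, 3.73, 3.61]

a, b = fit_power_law(early_compute, early_loss)
print(f"fitted exponent: {b:.3f}")
print(f"predicted loss at 10x the last checkpoint: {extrapolate(a, b, 80.0):.2f}")
```

If the fitted exponent drifts as you add checkpoints, or the extrapolated loss lands above your baseline, that is the early warning. For real runs you would likely want a fit that includes a constant offset for irreducible loss, which needs a nonlinear optimizer rather than a straight-line fit.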

The compute-optimal tradeoff the paper gets wrong (and why that's interesting)

Given a fixed compute budget C, how should you allocate between model size and training steps? The paper derives:

N_opt ∝ C^0.73
D_opt ∝ C^0.27

Parameters should scale faster than data as you increase compute. For a 10x larger compute budget, scale parameters by ~5.4x and training tokens by only ~1.9x. The optimal model is larger and undertrained relative to convergence.
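The arithmetic behind those multipliers is short enough to check directly. A small sketch, assuming training compute is proportional to parameters times tokens (which is why the two exponents sum to 1); the function name is illustrative:

```python
def scale_allocation(compute_multiplier, n_exponent=0.73, d_exponent=0.27):
    """Return (parameter multiplier, token multiplier) for a larger compute budget."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

n_mult, d_mult = scale_allocation(10.0)
print(f"10x compute: {n_mult:.2f}x parameters, {d_mult:.2f}x tokens")

# Sanity check: the two multipliers should multiply back to the compute multiplier.
print(f"product: {n_mult * d_mult:.2f}")
```

Passing different exponents lets you compare other allocation rules with the same function.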

This was the accepted wisdom for about two years. Then Hoffmann et al. published Chinchilla (DeepMind 2022) and reached a different conclusion: parameters and tokens should scale roughly equally with compute. On that reading, the Kaplan exponents were off, and the GPT-3-era models were trained on far too few tokens for their size.

Both papers can be right simultaneously, which took me a while to internalize. Kaplan and Chinchilla ask the same question but measure it differently, and a third line of work changes the question.

Kaplan minimizes loss for a fixed training compute budget. If you have a fixed number of FLOPs to spend on a training run, Kaplan gives you an allocation that favors parameters over tokens.

Chinchilla keeps the same objective but adjusts the measurement. The Chinchilla experiments used a wider range of training token counts and found that Kaplan undersampled the data-rich regime. With more data points, the optimal allocation shifts toward more tokens per parameter.

The Llama approach optimizes for inference cost, not training compute. If your constraint is "I need to serve one billion tokens per day at $X," you want the smallest model that achieves acceptable quality. Smaller models have lower per-token inference cost. You deliberately over-train them far past compute-optimal to squeeze more quality out of fewer parameters. Llama-3-8B on 15 trillion tokens is nowhere near the Kaplan compute-optimal point, or the Chinchilla one; it sits far on the overtrained side of both. That's a feature, not a mistake.

Three different optimal points, depending on what you're actually trying to optimize.
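To put a number on "far past compute-optimal", here is a rough sketch. It leans on two approximations that are not from the Kaplan paper's headline results: the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter, and the usual estimate of about 2 FLOPs per parameter per generated token for a forward pass. Both are ballpark figures and the helper names are mine.

```python
# Commonly cited rule of thumb, an approximation rather than an exact constant.
CHINCHILLA_TOKENS_PER_PARAM = 20.0

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def inference_flops_per_token(n_params):
    """Rough forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params

ratio_8b = tokens_per_param(8e9, 15e12)
overtraining = ratio_8b / CHINCHILLA_TOKENS_PER_PARAM
serving_gap = inference_flops_per_token(70e9) / inference_flops_per_token(8e9)

print(f"Llama-3-8B: {ratio_8b:.0f} tokens per parameter")
print(f"that is about {overtraining:.0f}x the rule-of-thumb ratio")
print(f"a 70B model costs about {serving_gap:.2f}x more FLOPs per served token")
```

The extra training compute is paid once; the per-token serving gap is paid on every request, which is the whole argument for over-training small models.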

Why larger models are more sample-efficient

One finding in the paper gets less attention than it deserves: for a fixed number of training tokens, larger models achieve lower loss. They're more sample-efficient.

The paper's learning curves show this directly: at a fixed token budget the larger models sit at lower loss, and a smaller model needs several times more tokens to match a larger one, if it can match it at all. If you have limited data, a larger model extracts more information from each training example.

This has a direct implication for fine-tuning on small domain-specific datasets. If you have 100k high-quality examples of a specialized task, fine-tuning a larger base model will typically outperform fine-tuning a smaller one — not because large models are magically better, but because they extract more signal per training example. The floor on "enough data" is lower for larger models than smaller ones.

The corollary is also true: if you have abundant data, a smaller model trained longer can close the gap. This is exactly what the Llama paper exploits.
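Both directions of this tradeoff fall out of the paper's joint fit for loss as a function of N and D. The sketch below uses that functional form with constants as I recall them from the paper's fit table; treat the specific values as assumptions and check them against the paper before relying on them. The helper names are hypothetical.

```python
# Assumed constants for the joint fit L(N, D); approximate, verify against the paper.
ALPHA_N = 0.076
ALPHA_D = 0.103
N_C = 6.4e13  # non-embedding parameters
D_C = 1.8e13  # tokens

def joint_loss(n_params, n_tokens):
    """L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    size_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (size_term + data_term) ** ALPHA_D

def tokens_to_match(n_params, target_loss):
    """Tokens a model of this size needs to reach target_loss, or None if it never can."""
    size_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    budget = target_loss ** (1.0 / ALPHA_D) - size_term
    if budget <= 0:
        return None  # target is below this model's infinite-data floor
    return D_C / budget

big, small, tokens = 1e9, 1e8, 1e9
target = joint_loss(big, tokens)
needed = tokens_to_match(small, target)
print(f"1B model on 1B tokens: loss {target:.2f}")
print(f"100M model needs {needed / tokens:.1f}x the tokens to match")
```

The None branch matters: once the larger model has enough data, its loss drops below what the smaller model can reach with any amount of data, so "train the small one longer" only closes the gap while the target is still above the small model's floor.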

Overfitting is a dataset-size problem, not a model-size problem

Classic machine learning intuition says large models overfit. The paper shows this is the wrong frame for language models at this scale.

Overfitting shows up when D is too small for N, so that you end up consuming your dataset too many times. The paper finds that the overfitting penalty stays small as long as the dataset grows sublinearly with model size:

D ≳ (5 × 10^3) · N^0.74

Extrapolated to a 7B parameter model (beyond the sizes the paper trained), that threshold is roughly 100 billion tokens. Above it, you're not in an overfitting regime at all; the model is limited by its size or by how long you train it. Below it, the dataset is the bottleneck and repeated passes over the same tokens start to hurt.

The practical consequence: if your fine-tuning dataset has 1 million tokens and you're fine-tuning a 7B model, you hit the overfitting threshold almost immediately. Multiple epochs over the same data expose you to memorization rather than generalization. The scaling laws say the solution is either more data or fewer parameters — not more regularization.
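For reference, the threshold is easy to evaluate yourself. A minimal sketch, using the prefactor of about 5 × 10^3 that the paper reports alongside the N^0.74 exponent, with N counted as non-embedding parameters. The rule was fit on pretraining runs up to roughly a billion parameters, so the 7B row is an extrapolation, and applying it to fine-tuning is an analogy rather than something the paper measured.

```python
def min_tokens_to_avoid_overfitting(n_params, prefactor=5e3, exponent=0.74):
    """Dataset size (tokens) above which the paper's fit predicts little overfitting."""
    return prefactor * n_params ** exponent

for n in (1e8, 1e9, 7e9):
    d = min_tokens_to_avoid_overfitting(n)
    print(f"N = {n:.0e}: D >= {d:.2e} tokens ({d / n:.0f} tokens per parameter)")
```

Because the exponent is below 1, the required tokens per parameter falls as models grow, but the absolute requirement still dwarfs a typical fine-tuning set by several orders of magnitude.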

The disconnect between perplexity and task performance

The paper measures cross-entropy loss on language modeling. That's not what you're evaluating your model on.

The relationship between pretraining loss and downstream task performance is real but nonlinear. For many tasks, loss improves continuously while benchmark accuracy stays flat, then jumps discontinuously when loss crosses some threshold. The paper briefly notes this and doesn't resolve it; the later literature on "emergent" capabilities is the ongoing attempt to explain these jumps.

For production, this means you can use scaling laws to predict pretraining loss, but you cannot directly use them to predict whether a model will pass your accuracy threshold on a specific task. A model with loss L+ε might fail your benchmark while one with loss L passes it. The only way to close the loop is empirical: build evaluation benchmarks that matter for your task, not just perplexity.

This is especially sharp for reasoning-heavy tasks. GPT-3's pretraining loss follows the scaling laws beautifully. Its ability to do multi-step arithmetic doesn't — it appears almost discontinuously at a certain scale and then degrades in predictable ways with prompt format. Loss is a leading indicator but not a sufficient one.

Production tradeoffs

Early loss curves predict final performance. The power law holds throughout training, not just at convergence. If you're training a custom model, compute the implied final loss from your early trajectory. If the extrapolation is worse than your baseline, you've identified a problem at 10% of training time rather than 100%.

The right model size depends on your serving constraint. If you're paying per-inference, the smallest model that meets quality thresholds is the right model, regardless of what the compute-optimal formula says. Fine-tune aggressively, evaluate empirically. Don't let a training-time formula make inference-time decisions for you.

Data quality matters more than the power law implies. The paper trains on high-quality web text, filtered heavily. Real production data pipelines include noise, duplicates, format artifacts, and domain shift. The scaling laws describe what's achievable with clean data. Dirty data bends the curve downward. Before blaming model size or compute budget, look at data quality — the paper's baseline assumes cleaner data than most teams have.

Don't treat loss as a proxy for your actual metric. Set up task-specific evals before you start training or fine-tuning. Perplexity-optimal is not the same as your-task-optimal. The curves look similar but the optimal stopping points differ.

When not to use scaling laws

When your task has known phase transitions. If your task shows emergent capability at specific scale thresholds — multi-step reasoning, structured output, certain coding tasks — the smooth power law is the wrong model. You're looking for a threshold, not a minimum on a continuous curve. Empirical evaluation at a few scale points beats extrapolation.

When your compute constraint is inference, not training. Scaling laws are a training-time tool. Once you have trained models, the decision is about serving cost versus quality, and the relevant framework is batching efficiency, quantization headroom, and token throughput — not parameter scaling.

When your domain is sufficiently out-of-distribution. The laws were derived on English web text. If you're building on a highly specialized domain — genomics, legal contracts, low-resource languages — the base relationships hold directionally but the constants shift. You'll need domain-specific data to recalibrate, and you should assume the computed scaling constants are off.

When data availability is the actual constraint. If you have 10 million domain-specific tokens and can't get more, the scaling laws tell you the model will overfit beyond a certain size. Below that size, the laws describe diminishing returns on model parameters. The optimal strategy isn't to follow the compute-optimal formula — it's to enumerate the feasible parameter counts given your data and pick empirically.

What the paper actually gives you

Scaling laws don't tell you which model to buy. They tell you the shape of the relationship between resources and quality, which lets you reason about whether you're on the efficient frontier or not.

If a team tells me their 70B model performs worse than their 7B fine-tune on a specific task, scaling laws immediately suggest two hypotheses: either the 7B was trained on dramatically more task-relevant data (enough to offset the larger model's sample-efficiency advantage), or the 70B was evaluated on something far from its pretraining distribution (loss doesn't predict task performance). Without the framework, that result looks mysterious. With it, it's predictable.

The specific numbers (the fitted exponents and the loss values at given compute scales) aged badly once Chinchilla revised them. The methodology didn't. You can still fit power laws to your own training curves, identify whether you're data-limited or compute-limited, and make resource allocation decisions from empirical measurements rather than vibes.

Most teams are flying blind on model selection. Scaling laws give you a flashlight. It won't illuminate everything, but it's better than the alternative.