The Llama 3 Herd of Models: what the paper actually says

Reading Dubey et al. (Meta, 2024) while debugging why our 70B fine-tune was underperforming a 13B baseline on domain-specific retrieval.

The benchmark tables said we should win. Llama 3 70B on MMLU, HumanEval, MATH — competitive with GPT-4 in many categories. What the benchmarks didn't tell us was that our fine-tuning data was too small and too synthetic, and that the degradation pattern we were seeing had a name: reward model hacking. That name came from the paper.

Most Llama 3 coverage reads the abstract and the leaderboard. "The Llama 3 Herd of Models" (Dubey et al., Meta AI Research, July 2024) is 92 pages of engineering decisions: how they built the data pipeline that processed 15 trillion tokens, why they chose 8B/70B/405B specifically, how post-training works at this scale, and where things go wrong. That's what I want to cover.

The problem Meta was actually solving

Llama 2 was competitive-but-not-frontier in 2023. By the time Llama 3 was released, GPT-4o and Claude 3 Opus had raised the bar substantially. The paper's stated goal is "to develop a model that is competitive with the best closed-source models available." But the implicit goal, the one that makes this paper useful beyond benchmark chasing, is figuring out how to build a systematic training pipeline that produces reliable improvements from scale.

The architecture is deliberately conservative. Llama 3 uses the same basic transformer as Llama 2: RoPE positional embeddings, SwiGLU activations, RMSNorm instead of LayerNorm, and grouped query attention (GQA), which Llama 2 used only in its larger models and Llama 3 uses at every size. No major architectural innovation. The paper explicitly says: "Rather than introduce new architectural innovations, we focused on maximizing the effectiveness of standard components."

This is a deliberate bet. The risk of novel architecture at 405B scale is that bugs, instabilities, and unexpected behaviors are expensive to discover. Standard components have known failure modes. Meta chose known failure modes at scale over unknown ones.

What changed between Llama 2 and 3: data volume (2T → 15.6T tokens), data quality (more aggressive filtering), context length (4K → 128K), and the post-training pipeline (substantially more sophisticated). Every significant quality improvement traces back to one of these four.

Data is the architectural decision

The 15.6T token training corpus is what drives most of the capability jump, and the paper spends significant space on how they built it — which is the part most blog posts skip.

The filtering pipeline is layered:

Heuristic filters: Remove exact duplicates, near-duplicates (MinHash), text with high proportions of non-linguistic tokens, pages dominated by boilerplate. These are cheap and catch the easy garbage.

Model-based quality classifiers: A fastText classifier trained on curated reference data (high-quality books, Wikipedia, academic papers) assigns a quality score to each document. The classifier is trained to predict "would a knowledgeable human consider this useful?" rather than "is this grammatical?" The distinction matters: grammatical but low-information text — SEO content, template-generated pages — passes heuristic filters but fails the quality classifier.

Deduplication: MinHash at document level and n-gram matching at line level. The goal is reducing memorization risk and preventing the model from over-indexing on repeated content (which skews its distribution toward whatever appears most often online, not whatever is most useful).

Domain mixing: The final corpus isn't just "everything that passed the filter." It's a weighted mixture: web pages, code, books, scientific papers, multilingual content in 30+ languages. The weights were determined empirically — they trained small proxy models on different mixtures and measured downstream task performance. Code gets a higher weight than its raw web frequency because code-heavy training improves the model's structured reasoning beyond coding tasks.
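The near-duplicate step in the pipeline above can be sketched in a few lines. This is a minimal illustration of MinHash, not Meta's implementation: the shingle size, signature length, similarity threshold, and all function names are assumptions chosen for readability, and the pairwise comparison is quadratic where a real pipeline would use locality-sensitive hashing.

```python
import hashlib
import re

NUM_HASHES = 64    # signature length (illustrative, not from the paper)
SHINGLE_SIZE = 3   # words per shingle (illustrative)


def shingles(text: str, k: int = SHINGLE_SIZE) -> set:
    """Split text into overlapping k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list:
    """For each hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    if not items:
        return [0] * num_hashes
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def drop_near_duplicates(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it is not too similar to any kept document."""
    kept_docs, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept_docs.append(doc)
            kept_sigs.append(sig)
    return kept_docs
```

The useful property is that two documents sharing most of their shingles agree on most signature slots, so similarity can be estimated from fixed-size signatures without comparing full texts.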

The practical takeaway for anyone building domain-specific models: the filtering decisions matter as much as the scale. Meta's paper found that removing aggressive quality filtering and training on 2× more data gave worse results than using the filtered corpus. More tokens from worse sources don't help. Your fine-tuning data quality decisions follow the same principle at smaller scale.

The "overtrain small models" decision

Chinchilla scaling laws (Hoffmann et al., 2022) say the compute-optimal training configuration for an 8B parameter model is roughly 160B tokens. Meta trained Llama 3 8B on 15 trillion tokens — roughly 94× more than Chinchilla-optimal.
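The arithmetic behind those two numbers is short enough to check directly. The sketch assumes the common rule-of-thumb reading of Chinchilla (about 20 training tokens per parameter); the constant and function name are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb reading of Hoffmann et al. (2022); an assumption


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params


optimal_8b = chinchilla_optimal_tokens(8e9)  # about 160B tokens
actual_8b = 15e12                            # tokens Llama 3 8B was trained on
overtrain_ratio = actual_8b / optimal_8b     # about 94x

print(f"optimal: {optimal_8b:.3g} tokens, overtraining ratio: {overtrain_ratio:.1f}x")
```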

This sounds like waste. It isn't.

Chinchilla optimizes for training compute efficiency — the point where you get the most performance per FLOP spent during training. But if you're building models that will be served at scale for months, you're not optimizing for training FLOPs. You're optimizing for inference cost per request. A smaller, more-trained model is cheaper to serve than a larger, less-trained one at equivalent capability. The break-even point depends on how many requests you expect over the model's lifetime.

Meta's calculation: at internet scale, you serve billions of requests. Training compute is one-time. The marginal cost of extra training tokens is low compared to the total inference cost savings from a smaller model. "Inference-optimal" rather than "training-optimal" is the right objective when you're building products, not papers.
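A back-of-the-envelope version of that calculation, using the standard approximations of about 6·N·D FLOPs to train a model with N parameters on D tokens and about 2·N FLOPs per token served. The pairing in the usage note is hypothetical: it assumes an overtrained small model and a Chinchilla-optimal large model reach equal quality, which is an assumption for illustration and not a result from the paper.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate serving cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * served_tokens


def break_even_served_tokens(small_params: float, small_train_tokens: float,
                             large_params: float, large_train_tokens: float) -> float:
    """Served tokens after which the small model's total cost drops below the large one's.

    Assumes (hypothetically) that both models are equally good at the task.
    """
    if small_params >= large_params:
        raise ValueError("small_params must be smaller than large_params")
    extra_training = (training_flops(small_params, small_train_tokens)
                      - training_flops(large_params, large_train_tokens))
    if extra_training <= 0:
        return 0.0  # the small model is cheaper from the first token
    saved_per_token = 2.0 * (large_params - small_params)
    return extra_training / saved_per_token


# Hypothetical pairing: 8B on 15T tokens versus 70B on a Chinchilla-style 1.4T tokens.
t_star = break_even_served_tokens(8e9, 15e12, 70e9, 1.4e12)
print(f"break-even after about {t_star:.2e} served tokens")
```

Under these assumptions the break-even lands at roughly a trillion served tokens, which a widely deployed model passes quickly. Past that point, every additional request favors the smaller model.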

For the 70B model, training on 15T tokens (versus roughly 1.4T at the Chinchilla-optimal 20 tokens per parameter) also produces a model that performs significantly better on downstream benchmarks than a Chinchilla-optimal 70B would. The model uses its parameters more efficiently because it has seen more diverse data.

Concretely: if you're deciding between training a 7B model on your dataset for 2 epochs versus a 1B model for 14 epochs (roughly the same training compute, since 7 × 2 = 1 × 14), the 1B model often wins on inference latency while matching quality. The paper provides empirical backing for this tradeoff.

Post-training is a pipeline, not a step

The capability gap between a pre-trained Llama 3 base model and the final instruction-tuned release is enormous, and it comes from a carefully sequenced post-training pipeline. The paper describes six rounds of post-training, each building on the last.

The overall structure is: SFT → reward model training → rejection sampling → DPO → online preference learning. Repeat.

Supervised fine-tuning (SFT): Human-annotated demonstrations of correct behavior. The data here is expensive and carefully curated. The paper is explicit that a small amount of high-quality SFT data beats a large amount of mediocre SFT data. They use roughly 10 million instruction-response pairs after filtering.

Reward model (RM): A separate model trained to score responses. The RM is trained on pairwise preference judgments — given two responses to the same prompt, which is better? A well-calibrated RM is the foundation for everything that follows; a miscalibrated RM produces models that "game" the training signal rather than improve.

Rejection sampling: Generate multiple completions for each prompt using the current model, score them with the RM, keep only the top-scored ones as additional SFT data. This is cheap synthetic data that compounds improvements from the current model.

Direct Preference Optimization (DPO): Rather than running full PPO (which requires four models in memory simultaneously — policy, reference, reward, value), DPO directly fine-tunes on preference pairs using a supervised objective. It's more stable and less resource-intensive than PPO while achieving comparable results for most preference alignment goals.

Online preference learning: In later rounds, they generate responses with the current policy during training and score them in real time. This avoids the distribution shift problem where the RM was trained on one version of the model but is scoring a significantly improved version.
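Two of the stages above are compact enough to sketch: rejection sampling (best-of-n selection under a reward model) and the per-pair DPO loss. This is a toy illustration, not the paper's code. `generate` and `score` are hypothetical callables standing in for the policy and the reward model, and `beta=0.1` is an illustrative default. The DPO function takes summed log-probabilities of each response under the policy and a frozen reference model.

```python
import math
from typing import Callable, List


def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     score: Callable[[str, str], float],
                     n_samples: int = 8,
                     keep_top: int = 1) -> List[str]:
    """Sample n completions, rank them with the reward model, keep the best."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
    return ranked[:keep_top]


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy favors the chosen response over
    the rejected one, relative to the reference model.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return _softplus(-beta * margin)
```

The loss needs only log-probabilities from the policy and the reference model, which is why DPO avoids keeping separate reward and value networks in memory during optimization.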

The paper reports that the 405B model serves as a labeling engine for the smaller models — it generates candidate responses that human annotators evaluate, which produces higher-quality training signal than asking humans to write responses from scratch. The 405B model's outputs are better starting points for human correction than a human writing cold.

Production tradeoffs no one mentions in the announcement post

The 128K context window has real latency and memory costs. Extending from 4K to 128K context requires a substantially larger KV cache during inference. At 128K tokens, the KV cache for a single request in Llama 3 70B is roughly 40 GB in BF16 (per token: 80 layers × 8 KV heads × 128-dimensional heads × 2 for keys and values × 2 bytes). That is well short of the roughly 140 GB of BF16 weights, but it is a per-request cost, so a handful of concurrent long-context requests can need more memory than the model itself. Serving 128K-context requests requires either model parallelism across multiple GPUs (to fit the cache) or KV cache eviction (which degrades quality). In practice, most production deployments support 128K in the API but route long-context requests to a separate serving cluster. If you're building on Llama 3 and expecting dense 128K utilization, plan your infrastructure for this before you encounter it at load.

405B requires at minimum 8×H100s to serve in FP8. In BF16, you need 16. This isn't a Llama-specific problem — it's physics — but the 405B model's use case in production is narrower than the benchmark tables suggest. It's expensive per token and has higher latency than 70B. Where 405B wins in production: tasks where quality is the constraint and throughput isn't, or as an offline batch processing engine (the labeling use case from training). For real-time user-facing applications, 70B is usually the actual choice.
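The GPU counts follow from weight memory alone. The sketch below assumes 80 GB of memory per H100 and assumes the tensor-parallel degree is rounded up to a power of two, which is a common deployment convention and an assumption here. It ignores KV cache and activation memory, so these are lower bounds.

```python
import math

H100_MEMORY_GB = 80  # memory per GPU, assuming the 80 GB H100 variant


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return n_params * bytes_per_param / 1e9


def min_gpus_for_weights(n_params: float, bytes_per_param: float,
                         gpu_gb: float = H100_MEMORY_GB) -> int:
    """Smallest power-of-two GPU count whose combined memory fits the weights."""
    raw = math.ceil(weight_memory_gb(n_params, bytes_per_param) / gpu_gb)
    return 1 << (raw - 1).bit_length()  # round up to a power of two


fp8_gpus = min_gpus_for_weights(405e9, bytes_per_param=1)   # 405 GB of weights
bf16_gpus = min_gpus_for_weights(405e9, bytes_per_param=2)  # 810 GB of weights
print(f"FP8: {fp8_gpus} GPUs, BF16: {bf16_gpus} GPUs")
```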

GQA at 70B significantly reduces KV cache memory. Llama 3's use of grouped query attention (8 key-value heads instead of 64) reduces the KV cache by 8× compared to multi-head attention, so a 70B model with full MHA would require roughly 8× the KV cache of Llama 3 70B at the same context length. At the memory budgets discussed above, this matters. This is one of the decisions that makes 70B actually serveable at reasonable cost.
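The KV cache numbers in this section can be checked with a short calculation. The sketch assumes the published Llama 3 70B shape (80 layers, 64 query heads, 8 key-value heads, head dimension 128) and a BF16 cache at 2 bytes per value. The function name is illustrative, and a quantized cache would shrink these figures.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values at every layer."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_tokens


GIB = 2 ** 30
context = 128 * 1024  # 128K tokens

# Llama 3 70B with GQA: 8 key-value heads.
gqa_bytes = kv_cache_bytes(context, n_layers=80, n_kv_heads=8, head_dim=128)
# Hypothetical MHA variant: one key-value head per query head (64).
mha_bytes = kv_cache_bytes(context, n_layers=80, n_kv_heads=64, head_dim=128)

print(f"GQA: {gqa_bytes / GIB:.0f} GiB, MHA: {mha_bytes / GIB:.0f} GiB, "
      f"ratio: {mha_bytes // gqa_bytes}x")
```

Under these assumptions a single 128K request needs about 40 GiB of cache with GQA, and eight times that with full multi-head attention.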

Multilingual quality is uneven. The 15.6T corpus has roughly 8% non-English data. For languages with high representation (German, French, Spanish, Portuguese, Italian), quality is reasonable. For lower-resource languages, the model's performance is substantially worse than headline MMLU numbers suggest. If your application serves non-English users outside the top ~10 languages, benchmark on your target language before assuming Llama 3's numbers transfer.

Failure modes in practice

Reward model hacking. This is where our fine-tuning problem traced back. If your RM is trained on a narrow distribution of prompts and responses, the policy during DPO or rejection sampling will find responses that score well on the RM but don't actually improve quality — responses that are verbose but vague, or that pattern-match the RM's training data in surface ways. The fix in the paper is running multiple rounds with RM retraining between them. In practice: if your fine-tuned model sounds confident but wrong, or is producing overly long hedged responses that weren't there before, check whether your RM is calibrated on your actual use case.
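One cheap diagnostic for the "verbose but vague" symptom is to check whether reward-model scores track response length. This check is not from the paper. It is a hedged sketch with hypothetical function names, and a high correlation is a cue to audit the RM, not proof of hacking.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(responses: List[str], rm_scores: List[float]) -> float:
    """Correlation between response length (in words) and reward-model score."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, rm_scores)
```

Running this on held-out responses from before and after a fine-tuning round gives a rough signal: if the correlation climbs while human-judged quality stays flat, the policy may be learning length rather than quality.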

Synthetic data collapse. The paper uses 405B to generate SFT and preference data for 8B and 70B. This works when the larger model's output is verified by humans before use. If you build a pipeline that generates synthetic fine-tuning data from the model you're fine-tuning (or a model of similar capability), you get mode collapse: the model reinforces its own existing patterns, including its errors. The improvement ceiling is the model's current quality. Human verification in the loop, even at 10% sample rate, meaningfully prevents this.
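A minimal way to wire a fixed review rate into a synthetic-data pipeline is to route a deterministic sample of examples to human reviewers. This is a sketch under assumptions: the function names and ID format are hypothetical, and the 10% default mirrors the rate mentioned above.

```python
import hashlib
from typing import Iterable, List, Tuple


def needs_human_review(example_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select about sample_rate of examples by hashing their ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2 ** 64)


def split_for_review(example_ids: Iterable[str],
                     sample_rate: float = 0.10) -> Tuple[List[str], List[str]]:
    """Partition IDs into (send to human review, accept automatically)."""
    review, auto = [], []
    for example_id in example_ids:
        target = review if needs_human_review(example_id, sample_rate) else auto
        target.append(example_id)
    return review, auto
```

Hashing the ID instead of calling a random number generator keeps the selection stable across pipeline reruns, so the same examples are reviewed each time.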

Long context degradation at the boundary. Llama 3's training extended context length progressively — long-context data was introduced in later training stages. The model's performance at exactly 128K tokens isn't as strong as at 32K or 64K. The "lost in the middle" phenomenon (Liu et al., 2023) applies: information in the middle of a 100K+ context receives less effective attention weight than information at the start or end. Don't assume that 128K context means the model processes all 128K tokens equally.

Fine-tuning disrupts instruction following. Full fine-tuning on domain-specific data frequently degrades the model's safety behaviors and instruction following quality. The paper's post-training pipeline is designed to preserve these while adding capability. When you add a fine-tuning stage on top of the released Llama 3 weights without going through the full post-training pipeline, you're overwriting some of what the post-training bought. LoRA mitigates this more than full fine-tuning, but it doesn't eliminate it. Test instruction following and edge case behavior on your fine-tuned model, not just task performance.

When not to use Llama 3

When you need a model smaller than 8B. The 8B model is already overbuilt for many classification and extraction tasks. Phi-3 Mini (3.8B), Gemma 2B, and similar smaller models have been explicitly trained for edge deployment and on-device inference. If you're targeting latency under 20ms or running on CPU, Llama 3 8B is not your model regardless of what the benchmarks say.

When you need dense multilingual support. If your application serves users in Arabic, Hindi, Korean, Japanese, or any language that isn't in the top 10 by web presence, you will likely get better results from models trained with multilingual-first data pipelines (mT5, BLOOM, or domain-specific fine-tunes from academic groups focused on those languages).

When context length isn't your constraint. Llama 3's 128K context comes with the KV cache costs described above. If your application uses contexts under 8K and you're paying for serving infrastructure, there are more cost-efficient choices. The 128K capability isn't free even when you don't use it: the model's attention patterns were shaped by training on long contexts, so part of its capacity went to a capability your application never exercises.

When you can't verify your fine-tuning pipeline. The released Llama 3 models are the output of a carefully sequenced multi-round post-training pipeline. If you're fine-tuning without the ability to audit reward model calibration, run rejection sampling, or do even lightweight human preference labeling, you may get surface-level capability improvements that mask degradation on safety, edge cases, and instruction following. Fine-tuning a frontier model on narrow task data isn't free — it has costs that only show up when you evaluate comprehensively.

What the paper actually gives you

Llama 3 is important less because it achieved frontier performance and more because it showed that getting there is a systems engineering problem as much as a research problem. The architecture didn't change. What changed was data quality infrastructure, training scale, and a multi-round post-training pipeline that treats capability and alignment as jointly optimized objectives rather than sequential steps.

The "overtrain small models" insight is the one I return to most. The instinct in production ML is to use the largest model you can afford to serve. Llama 3 argues the opposite for inference-intensive workloads: train a smaller model to a much higher token count, and you get competitive capability at a fraction of the serving cost. The 8B model trained to 15T tokens beats a lot of models that were trained at 4–5× the parameter count with 10× fewer tokens.

For the debugging situation that sent me back to this paper: the fine-tuning failure was synthetic data without human verification, combined with an RM trained on out-of-distribution prompts. Both are problems the Llama 3 paper explicitly describes and addresses in its post-training pipeline. The paper won't give you the pipeline — Meta didn't open-source that — but it tells you enough about the failure modes to design your own.

The Llama 3 Herd of Models — Dubey et al., Meta AI Research. arXiv:2407.21783, July 2024.