RLHF: what the InstructGPT paper actually says

Reading Ouyang et al. (OpenAI, NeurIPS 2022) after inheriting a fine-tuning pipeline where the "aligned" model kept producing sycophantic responses and nobody could explain why.

You're standing in front of a production model that's trained, benchmarked, and technically deployed. It's also agreeing with users when they're wrong, adding unnecessary caveats to confident answers, and occasionally summarizing documents it was asked to translate. SFT on high-quality demos fixed none of this. The model learned the format of helpful responses but not the intent. Someone says you need RLHF. You ask what that means in practice. The answers are vague.

"Training language models to follow instructions with human feedback" — Ouyang, Wu, Jiang, et al. OpenAI. NeurIPS 2022. The InstructGPT paper is the production manual for this technique. It's denser than most blog posts imply, and the production constraints are buried in the methodology section rather than the headline numbers.

Why supervised fine-tuning isn't enough

The core problem isn't data quality—it's objective misalignment. SFT maximizes the log-probability of tokens in demonstration data. This is a clean optimization target that works well for format learning: the model picks up response structure, instruction following, and domain vocabulary quickly. But the log-likelihood objective doesn't penalize confident wrongness, excessive hedging, or technically correct responses that miss user intent. If your demonstration data has one way to answer a question, SFT learns that one way. There's no mechanism for the model to learn that a different response that doesn't look like the training data could be better.
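To make the objective concrete, here is a minimal sketch of the SFT loss as the mean negative log-likelihood of the demonstration tokens. The probabilities are made-up numbers rather than model outputs, and `sft_nll` is a hypothetical helper, not anything from the paper.

```python
import math

def sft_nll(token_probs):
    """Mean negative log-likelihood of the demonstration tokens.

    token_probs: probability the model assigned to each token that
    actually appears in the demonstration (illustrative values).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Two hypothetical models scored against the same demonstration.
close_to_demo = [0.9, 0.8, 0.9, 0.85]      # reproduces the demo wording
different_wording = [0.2, 0.1, 0.3, 0.2]   # says something else entirely

print(sft_nll(close_to_demo))
print(sft_nll(different_wording))
```

The loss only measures distance from the demonstration. A response with different wording scores worse even if a human would prefer it, which is the gap the rest of the pipeline addresses.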

Human preferences are inherently comparative: given two responses, which is better? That's a different signal than "here is the correct response." SFT can't use that signal directly. RLHF exists to close this gap—turning comparative human judgments into a training objective for the policy.

The paper's key empirical finding sets up the problem concisely. Labelers compared outputs from GPT-3 (175B, purely pretrained) with outputs from InstructGPT (1.3B, RLHF-trained) and preferred the 1.3B InstructGPT outputs, even though that model is more than two orders of magnitude smaller. The often-quoted figure of roughly 85% is the win rate of 175B InstructGPT over 175B GPT-3 at matched size (about 71% against few-shot-prompted GPT-3). SFT alone on GPT-3 improved things but not by this margin. The 100x parameter gap matters less than the training objective gap.

The three-stage pipeline

The RLHF pipeline in InstructGPT has three distinct phases. This isn't an implementation detail—each phase introduces its own failure modes and infrastructure requirements.

Stage 1: Supervised Fine-Tuning (SFT)

Labelers write demonstrations: given a prompt, produce a high-quality response. The paper used roughly 13,000 prompt-demonstration pairs across a mix of generation, QA, brainstorming, classification, and summarization tasks. Standard next-token prediction on these demonstrations. The output is a policy model that knows how to follow instructions but hasn't been given any preference signal yet.

The SFT model becomes the reference policy for the rest of training. Its quality directly determines the ceiling of what RLHF can do.

Stage 2: Reward Model Training

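Labelers are shown a prompt and a set of 4–9 model outputs, and rank them from best to worst. The paper collected rankings for about 33,000 training prompts this way. Each ranking over K outputs yields K choose 2 pairwise preference labels, so this data is dense: one prompt contributes between 6 and 36 comparisons. Because comparisons from the same prompt are highly correlated, the paper trains on all K choose 2 comparisons from a prompt as a single batch element rather than shuffling them into the dataset, which it reports reduces overfitting.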
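The expansion from one ranking to pairwise labels can be sketched as follows. The response names are placeholders and `ranking_to_pairs` is a hypothetical helper; the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations
from math import comb

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranked_outputs, 2))

ranking = ["resp_a", "resp_b", "resp_c", "resp_d"]  # hypothetical K = 4 ranking
pairs = ranking_to_pairs(ranking)
print(len(pairs))  # 6

for k in (4, 9):
    print(k, comb(k, 2))  # 4 -> 6, 9 -> 36
```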

The reward model architecture is the SFT model (6B parameters, a smaller variant) with the final unembedding head replaced by a linear layer mapping to a scalar. It's trained to maximize the log-likelihood of the human-preferred response under a Bradley-Terry model:

loss(r) = -E[(x,y_w,y_l)] [log σ(r(x, y_w) - r(x, y_l))]

where r is the reward scalar, y_w is the preferred response, and y_l is the rejected one. The reward model learns to assign higher scores to responses labelers prefer. This is now a standard supervised learning problem over pairwise comparisons, and it's computationally cheap.
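A minimal sketch of that pairwise loss, using hand-picked scalar rewards in place of a real reward model. The function names are hypothetical, and the per-prompt normalization the paper applies is left out for brevity.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pairwise_rm_loss(pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical scalar rewards for (preferred, rejected) responses.
well_ordered = [(2.0, -1.0), (1.5, 0.0)]
mis_ordered = [(-1.0, 2.0), (0.0, 1.5)]

print(pairwise_rm_loss(well_ordered))   # small: preferred responses score higher
print(pairwise_rm_loss(mis_ordered))    # large: the ordering is wrong
print(pairwise_rm_loss([(0.3, 0.3)]))   # log(2): the model cannot tell them apart
```

Only the difference r(x, y_w) - r(x, y_l) enters the loss, so the reward scale is arbitrary up to an additive constant.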

One thing the paper is explicit about: the reward model was trained on comparisons between model outputs, not between demonstrations. This matters because the SFT demonstrations are high-quality anchors—the RL stage will explore the space of possible outputs, and the reward model needs to have seen the quality range it's going to score, not just the top of the distribution.

Stage 3: RL fine-tuning with PPO

The policy (initialized from the SFT model) generates responses to prompts sampled from the training distribution. Each response is scored by the reward model. PPO updates the policy to maximize expected reward, constrained by a KL penalty that prevents it from diverging too far from the SFT reference:

objective(π) = E[r(x, y)] - β * KL[π_θ || π_ref]

β controls the tradeoff. The paper also adds a pretraining term—a weighted log-likelihood on a sample of the original pretraining data—to fight the alignment tax (more on this below).

The full PPO objective the paper uses:

objective = E[r(x,y)] - β * KL[π_θ(y|x) || π_ref(y|x)] + γ * E[log π_θ(x')]

where x' are pretraining tokens and γ is a mixing coefficient. The third term is what the paper calls "PPO-ptx" and it's what separates InstructGPT from naive RLHF.
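The three terms can be sketched as a Monte Carlo estimate over a batch. This is an illustration of the formula, not the paper's training code: the rollout numbers are invented, `ppo_ptx_objective` is a hypothetical helper, and the β and γ values are placeholders.

```python
def ppo_ptx_objective(rollouts, pretrain_logps, beta, gamma):
    """Estimate E[r] - beta * KL + gamma * E[log pi(x')] from samples.

    rollouts: list of (reward, logp_policy, logp_ref) for responses
        sampled from the current policy.
    pretrain_logps: policy log-likelihoods of sampled pretraining text.
    """
    n = len(rollouts)
    mean_reward = sum(r for r, _, _ in rollouts) / n
    # For y sampled from the policy, log pi(y|x) - log pi_ref(y|x)
    # is an unbiased single-sample estimate of the KL term.
    kl_estimate = sum(lp - lr for _, lp, lr in rollouts) / n
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)
    return mean_reward - beta * kl_estimate + gamma * ptx_term

rollouts = [(1.2, -10.0, -12.0), (0.8, -15.0, -15.5)]  # invented values
pretrain_logps = [-40.0, -38.0]                          # invented values

print(ppo_ptx_objective(rollouts, pretrain_logps, beta=0.02, gamma=0.0))
print(ppo_ptx_objective(rollouts, pretrain_logps, beta=0.02, gamma=0.01))
```

With `gamma=0.0` the function reduces to the KL-penalized objective; a nonzero `gamma` adds the pretraining log-likelihood term.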

What the paper's numbers actually show

The headline result, 1.3B InstructGPT preferred over 175B GPT-3, holds up under scrutiny. The paper draws its evaluation set from the API prompt distribution, which means it's measuring what users actually send, not curated benchmarks. Labelers blind to model size showed consistent preference for InstructGPT across generation, summarization, QA, and brainstorming tasks.

On TruthfulQA (measuring whether models state true things rather than plausible-sounding things), InstructGPT showed measurable improvement over GPT-3: the paper reports that it generates truthful and informative answers about twice as often. The training process that optimizes for human preference correlates with truthfulness, not because truthfulness was explicitly rewarded, but because labelers prefer accurate responses.

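On RealToxicityPrompts, InstructGPT generated about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. Without that instruction the advantage largely disappears, and when explicitly prompted to produce toxic output, InstructGPT is more toxic than GPT-3. The gain depends on how the model is prompted; it is not a general reduction in toxicity.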

These numbers are meaningful. They're also specific to the labeler population, the prompt distribution, and the reward model's training data. A different set of labelers with different guidelines would produce a different reward model and potentially different downstream behaviors.

The alignment tax

This is the paper's most underreported finding. RLHF-trained models perform worse on some public NLP benchmarks scored with automated metrics: reading comprehension, commonsense completion, and translation. The phenomenon: optimizing for human preference pushes the model toward outputs humans find satisfying, which doesn't perfectly correlate with what automated benchmarks measure as correct.

The paper quantifies this on SQuAD, DROP, HellaSwag, and WMT 2015 French-to-English translation. InstructGPT without pretraining mixing showed significant regression. InstructGPT with PPO-ptx (the pretraining token mixing term) largely recovered performance: the γ coefficient can be tuned to reduce the alignment tax while preserving preference improvements.

This has a direct production implication: if you have automated evaluation suites built on held-out labeled data, your RLHF-trained model may look worse than your SFT baseline on those suites while being better in production. You need human evaluation to capture the actual improvement. Teams that rely solely on automated metrics will be tempted to revert RLHF training based on metric regression that doesn't reflect real quality change.

The pretraining mixing fix works, but introduces a new hyperparameter (γ) and requires maintaining pretraining data access alongside your fine-tuning infrastructure. Most fine-tuning pipelines don't have this plumbed in by default.

Production constraints the methodology section contains

Labeler selection and agreement matter more than labeler count. The paper used roughly 40 contracted labelers, selected for English proficiency and demonstrated agreement with researcher-written preference labels during a qualification task. Inter-annotator agreement on the final data was around 73% for comparisons. The paper is explicit that labeler guidelines were written by researchers, that labelers had access to researchers during annotation, and that borderline cases were discussed. This is very different from crowdsourced annotation at scale. If you're building an RLHF pipeline and your preference data comes from a crowdsourcing platform without qualification filtering and researcher oversight, you'll see lower agreement and noisier reward models.
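If you want to track an agreement number for your own preference data, a raw pairwise agreement rate is a simple starting point. The sketch below is one way to compute it. Whether it matches the paper's exact procedure is an assumption, it is not chance-corrected, and the annotator and comparison IDs are invented.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Fraction of (annotator pair, item) cases where both chose the same side.

    labels_by_annotator: dict of annotator -> {item_id: "A" or "B"}.
    Only items labeled by both annotators in a pair are counted.
    """
    agree = total = 0
    for a, b in combinations(labels_by_annotator, 2):
        shared = labels_by_annotator[a].keys() & labels_by_annotator[b].keys()
        for item in shared:
            total += 1
            agree += labels_by_annotator[a][item] == labels_by_annotator[b][item]
    return agree / total if total else float("nan")

labels = {
    "ann_1": {"c1": "A", "c2": "B", "c3": "A", "c4": "A"},
    "ann_2": {"c1": "A", "c2": "B", "c3": "B", "c4": "A"},
    "ann_3": {"c1": "A", "c2": "A", "c3": "B"},
}
print(pairwise_agreement(labels))  # 0.6
```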

The reward model must run at inference scale during PPO training. Every policy rollout requires a reward model forward pass. If your reward model is a 70B model and your policy is a 70B model, you need enough GPU memory to run both during training, plus the frozen reference and the value function (often initialized from the reward model, as in InstructGPT, or from a copy of the policy). In the InstructGPT setup, the reward model was smaller than the policy (6B vs 175B). The paper gives two reasons: a 6B reward model saves a lot of compute, and 175B reward model training could be unstable, which made it less suitable as the value function initialization. If you're doing RLHF on a model where the reward model has to match the policy size, memory budgeting becomes a primary constraint before you touch any algorithmic considerations.
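A back-of-the-envelope sketch of the weight memory involved. It assumes 2 bytes per parameter (fp16/bf16) and counts weights only; optimizer state, gradients, activations, and KV cache come on top. The helper names and model sizes are illustrative.

```python
def weights_gib(params_billions, bytes_per_param=2):
    """Memory for model weights only, in GiB (2 bytes/param assumes fp16/bf16)."""
    return params_billions * 1e9 * bytes_per_param / 2**30

def ppo_weight_budget_gib(policy_b, reward_b, value_b):
    """Weights resident during PPO: policy, frozen reference, reward and value models."""
    return {
        "policy": weights_gib(policy_b),
        "reference": weights_gib(policy_b),  # frozen copy, same size as the policy
        "reward_model": weights_gib(reward_b),
        "value_model": weights_gib(value_b),
    }

scenarios = {
    "70B policy, 70B reward/value": (70, 70, 70),
    "70B policy, 6B reward/value": (70, 6, 6),
}
for name, sizes in scenarios.items():
    budget = ppo_weight_budget_gib(*sizes)
    print(name, round(sum(budget.values())), "GiB of weights")
```

Shrinking the reward and value models roughly halves the weight footprint in this example, before any of the training-time overheads are counted.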

KL penalty calibration is a live training concern, not a set-and-forget decision made before training. Earlier RLHF work (Ziegler et al., 2019) used adaptive KL targeting: β is adjusted during training to maintain a target KL divergence rather than being fixed. As far as I can tell, InstructGPT itself reports a fixed KL coefficient, so treat the adaptive scheme as an option from the wider literature rather than something this paper prescribes. The adaptive scheme stabilizes training but requires monitoring the KL metric continuously. Fixed β runs risk reward hacking (β too low) or null updates (β too high). Most implementations I've seen in production use fixed β because adaptive KL requires instrumenting training more carefully, and then run into exactly the instability the adaptive scheme was designed to prevent.
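For reference, an adaptive scheme can be as small as a clipped proportional controller. The sketch below follows the style of the update described by Ziegler et al. (2019); the initial β, target KL, horizon, and clip range are illustrative assumptions, not values taken from the InstructGPT paper.

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target KL."""

    def __init__(self, init_beta, target_kl, horizon):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Clip the relative error so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

ctl = AdaptiveKLController(init_beta=0.02, target_kl=6.0, horizon=10_000)
for kl in (12.0, 9.0, 6.0, 3.0):
    # beta rises while KL is above target, holds at target, falls below it
    print(kl, ctl.update(kl, n_steps=256))
```

The instrumentation cost is the `observed_kl` input: it has to be measured on every batch and logged, which is the part fixed-β pipelines tend to skip.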

Prompt distribution coverage determines generalization. The SFT and RM training data both came from actual API prompts. The paper reports that prompts were sampled across task categories—generation, QA, brainstorming, rewriting, summarization, classification—with coverage intentionally broad. A reward model trained on a narrow prompt distribution will generalize poorly. If your production use case involves prompts that don't match your annotation distribution, the RM scores will be unreliable and PPO will optimize against a noisy signal.

The reference policy must stay constant during PPO. The SFT model used to compute KL divergence during PPO training must remain frozen. This means you need two full copies of the SFT model in memory during PPO: the trainable policy and the frozen reference. In parameter-efficient settings (LoRA-based RLHF), the frozen reference is the base model and the policy is base + adapter weights. This works, but the reference is then whatever weights sit under the adapter, so a separately fine-tuned SFT checkpoint has to be merged into the base first if it is meant to serve as the reference.

Failure modes the paper documents

Reward hacking toward labeler biases. The paper discusses this carefully: labelers prefer longer responses, more structured responses, and responses that hedge appropriately. The reward model learns these surface features as proxies for quality. A policy trained against this reward model learns to be verbose, to use bullet points liberally, and to add qualifications. Some of these surface preferences correlate with actual quality; others don't. The paper reports that InstructGPT occasionally "talks about the question" rather than answering it, particularly on complex tasks—a reward hacking artifact where discourse about the question earns reward without solving it.

Annotation disagreement on values-laden tasks. Labelers disagreed more on prompts involving sensitive topics, political content, and ambiguous instructions. The paper notes these were categorized and handled separately, with some excluded from RM training. In a production pipeline, values-laden prompts are often exactly the cases where you most need the alignment signal. Excluding them leaves a coverage gap.

Mode collapse under a weak KL penalty. With a low KL penalty, the policy finds response modes that are consistently high-reward but narrow in style and format. After aggressive RLHF training, the paper reports models becoming less diverse in their outputs, generating similarly structured responses across different prompt types. This is measurable as reduced entropy in the output distribution and reduces perceived creativity in generation tasks.
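One cheap way to watch for this is to track the entropy of some surface feature across sampled responses. The sketch below uses the first few words of each response as that feature. This is a crude proxy chosen for illustration, not the token-level entropy of the model's distribution, and the sample responses are invented.

```python
import math
from collections import Counter

def empirical_entropy(items):
    """Shannon entropy (bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def opening_entropy(responses, n_words=3):
    """Entropy over the first n_words of each response."""
    return empirical_entropy(tuple(r.split()[:n_words]) for r in responses)

diverse = ["Sure, here is", "The short answer", "It depends on", "Paris is the"]
collapsed = [
    "Great question! Here",
    "Great question! Here",
    "Great question! Here",
    "Great question! The",
]
print(opening_entropy(diverse))    # 2.0 bits: four distinct openings
print(opening_entropy(collapsed))  # lower: one opening dominates
```

Logged per checkpoint on a fixed prompt set, a steady drop in a number like this is an early hint that the policy is narrowing.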

When not to use RLHF

When you can verify outputs automatically. For code generation, math with known answers, structured output against a schema, or any task where you can write a correctness function, don't use RLHF. Use the verifier directly as the reward signal, or use process reward models trained on step-level correctness. Human preference is a proxy for correctness. If you can measure correctness directly, the proxy introduces noise. This is the reasoning behind using preference-based RLHF for open-ended alignment and RL against verifiable rewards for reasoning tasks.
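A verifier-as-reward can be very small. The sketch below assumes, purely for illustration, that the model is prompted to put its final answer alone on the last line; `exact_match_reward` is a hypothetical function, and real verifiers (unit tests, schema validators) would replace the string comparison.

```python
def exact_match_reward(model_output, expected_answer):
    """Return 1.0 if the last line of the output equals the expected answer, else 0.0."""
    lines = model_output.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    return 1.0 if final == expected_answer.strip() else 0.0

print(exact_match_reward("17 * 3 = 51\n51", "51"))            # 1.0
print(exact_match_reward("I think it's around 50.\n50", "51"))  # 0.0
```

There is no labeler and no learned reward model in this loop, so there is nothing for the policy to exploit beyond the verifier's own blind spots.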

When you have fewer than ~10,000 preference pairs. The InstructGPT reward model was trained on rankings for about 33,000 prompts, each of which expands into multiple pairwise comparisons. With substantially fewer pairs (under 10,000, a rule of thumb rather than a threshold from the paper), the reward model doesn't generalize reliably. The policy will quickly find gaps in the reward model's coverage and exploit them. DPO (direct preference optimization) handles smaller preference datasets better because it directly supervises the policy without the intermediate reward model that fails to generalize.
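For comparison, the per-pair DPO loss from the DPO paper (Rafailov et al., 2023) can be sketched in a few lines. The log-probabilities are invented, `dpo_pair_loss` is a hypothetical helper, and `beta=0.1` is only an illustrative value.

```python
import math

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    trainable policy and under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) written as softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: loss is log(2).
print(dpo_pair_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the preferred response: loss drops.
print(dpo_pair_loss(-8.0, -14.0, -10.0, -12.0))
```

The structure mirrors the reward model loss from Stage 2, with the policy-to-reference log-ratio standing in for the learned reward, which is why no separate reward model or rollout loop is needed.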

When your SFT model is still actively improving. RLHF optimizes against a fixed reward model and a fixed reference policy. If your SFT model is still early in training and will improve substantially with more SFT data, running RLHF prematurely anchors the reference policy at a lower quality point. Continuing SFT until the model plateaus will produce a better RLHF result than running RLHF early, because the KL penalty keeps the policy close to the reference, so the reference's quality caps what RL can achieve.

When alignment tax is unacceptable for your evaluation metrics. If your product evaluation is primarily automated (BERTScore, BLEU, exact-match, code execution), RLHF can degrade your measured performance even if it improves user satisfaction. The PPO-ptx fix (pretraining mixing) helps but requires you to have pretraining data and the willingness to add it to your training loop. If you can't accept metric regression and can't instrument the pretraining term, DPO is a lower-risk alternative: it has no separate reward model for the policy to over-optimize, though it still needs to be checked against the same benchmarks rather than assumed regression-free.

When your labeler quality isn't controllable. RLHF is only as good as your reward model, and your reward model is only as good as your preference data. If you can't afford the qualification process the InstructGPT paper describes—researcher-written guidelines, qualification tasks, ongoing oversight, ~73% inter-annotator agreement—your reward model will be noisy. A noisy reward model plus PPO produces a policy optimized against noise. DPO is more tolerant of noisy preference data because the contrastive gradient is smaller for low-confidence pairs, but it's not immune.

What the paper actually gives you

InstructGPT's contribution isn't the RLHF algorithm. RLHF had been proposed earlier in the context of game-playing and robotics, and had already been applied to language tasks such as summarization. The contribution is the full pipeline: the specific three-stage approach, the reward model architecture choices, the PPO-ptx pretraining mixing to fight alignment tax, and the empirical demonstration that this pipeline transfers to instruction following at scale.

The paper is also unusually honest about the limitations. The labelers represent a narrow slice of human preference; the results may not generalize to populations with different values or goals. The reward model is trained on one snapshot of user behavior; it doesn't update as user needs change. RLHF with a static reward model is offline optimization—it can't adapt to distribution shift the way online reinforcement from live user feedback could.

The preference win for a 1.3B model over 175B GPT-3 is real (the 85% figure is the matched-size comparison, 175B InstructGPT against 175B GPT-3), and it says something specific: that the objective matters more than scale for instruction following. RLHF changes the training objective. The scale advantage of the larger model went toward pretraining; the instruction following advantage went to the alignment objective. When you're allocating resources between scale and alignment training, that's the tradeoff the paper is quantifying.

The sycophancy problem from the opening—a model agreeing with users when they're wrong—isn't directly fixed by InstructGPT-style RLHF if your preference data doesn't include cases where the correct response contradicts a mistaken user premise. The reward model learns what labelers prefer; if labelers don't encounter and correctly label those cases, the reward model has no signal. RLHF fixes the objectives but doesn't automatically fix data coverage gaps. The annotation guidelines matter as much as the algorithm.

Training language models to follow instructions with human feedback (InstructGPT) — Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, et al. OpenAI. NeurIPS 2022. For the alternative to Stage 2 and Stage 3 that eliminates the reward model and RL loop, see Direct Preference Optimization: What the Paper Actually Says.