Constitutional AI: what Anthropic's paper actually says about scaling safety

Reading Bai et al. (Anthropic, 2022) while thinking about what it means to ship a model that refuses things.

The first refusal I debugged in production wasn't a safety problem. It was a UX problem. The model was trained with RLHF to be "harmless," and it had learned that the safest strategy was to refuse anything that looked ambiguous. A user asks how to delete their account: refused. A user asks about medication dosage for their dog: refused. The model wasn't protecting anyone — it had learned that refusals were cheap and confident wrong answers were expensive. From the reward model's perspective, that was correct behavior.

Constitutional AI (Bai et al., 2022) is partly a paper about harmlessness and partly a paper about evasiveness. The authors frame the problem directly: prior harmless-helpful training produced models that were "harmless but at the cost of being maximally non-committal." They wanted a model that could engage with harmful queries by articulating why they were harmful — not by pretending the query didn't exist.

That framing changes what you take from a close reading.

The problem with labeling harm at scale

Standard RLHF for harmlessness requires humans to compare model responses and indicate which is less harmful. At scale, this is expensive, slow, and inconsistent. Human judgment on harm is variable — what one annotator flags as dangerous, another treats as educational. And you need to expose humans to the full distribution of harmful content your model might encounter, which creates a different kind of cost.

The paper's core bet: a model that's already capable of identifying harmful content should be able to label it too. Instead of paying humans to compare ("which response is less racist?"), you ask the model itself to evaluate. The constitution is the specification for what "less harmful" means — a list of principles the model uses to judge its own outputs.

This isn't just a cost optimization. It's a claim about what's actually bottlenecking safety at scale: not the model's ability to identify harm, but the pipeline's ability to collect labels at the speed you need.

Phase one: SL-CAI — the critique-revision loop

The supervised learning phase starts with 182,831 red-team prompts: 42,496 written by human red-teamers and 140,335 generated via few-shot prompting. For each prompt, a helpful-only RLHF model generates an initial response. Because the prompts are designed to elicit bad behavior, that first response is frequently harmful.

Then, for each response, the model runs a sequence of critique-revision pairs:

Critique: The model receives the original prompt, the harmful response, and a critique instruction like: "Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal." It produces a critique.
Revision: The model receives the original prompt, the harmful response, the critique, and a revision instruction. It produces a revised response.
Repeat: The process runs again on the revised response. The paper uses 4 critique-revision pairs per prompt, with a different constitutional principle randomly sampled at each step.

The revised responses from every step, not just the last one, become the supervised learning data. A pretrained model is finetuned on these revisions alongside helpful-model responses to 135,296 human-written helpfulness prompts, so that the result doesn't forget how to be useful.
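The loop above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `generate` is a hypothetical callable standing in for whatever model does the sampling, the prompt layout is an assumption, and only the first critique request is the one quoted above (the other strings are illustrative wording).

```python
import random
from typing import Callable, List, Sequence, Tuple

# (critique request, revision request) pairs. Only the first critique request
# is quoted from the text above; every other string is illustrative wording.
PRINCIPLES: List[Tuple[str, str]] = [
    (
        "Identify specific ways in which the assistant's last response is harmful, "
        "unethical, racist, sexist, toxic, dangerous, or illegal.",
        "Rewrite the response to remove the harmful content.",  # illustrative
    ),
    (
        "Explain whether the response is inappropriate for a young audience.",  # hypothetical
        "Rewrite the response so it suits a young audience.",  # hypothetical
    ),
]


def critique_revision_loop(
    prompt: str,
    initial_response: str,
    generate: Callable[[str], str],  # hypothetical: text in, sampled text out
    principles: Sequence[Tuple[str, str]],
    n_steps: int = 4,
    rng: random.Random = None,
) -> List[str]:
    """Return the revision produced at every step, oldest first."""
    rng = rng or random.Random()
    response = initial_response
    revisions = []
    for _ in range(n_steps):
        # A different principle may be drawn at each step.
        critique_request, revision_request = rng.choice(principles)
        context = f"Human: {prompt}\n\nAssistant: {response}\n\n"
        critique = generate(context + f"Critique Request: {critique_request}\n\nCritique:")
        # The revision is conditioned on the critique that was just produced.
        response = generate(
            context
            + f"Critique Request: {critique_request}\n\nCritique: {critique}\n\n"
            + f"Revision Request: {revision_request}\n\nRevision:"
        )
        revisions.append(response)
    return revisions
```

Returning every intermediate revision, rather than only the last, mirrors the point that revisions from all steps feed the finetuning set.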

The key design choice is that the critique and revision are generated by the same model that produced the harmful response. You're not importing a separate safety classifier. The model already has a working notion of what "harmful" means; you're asking it to act on that knowledge.

What the paper finds: Critiques were often useful but not always accurate. The model made "inaccurate or overstated criticisms" in some fraction of cases. The revision still improved on average because even an imprecise critique gives the generator something to react to. First revisions showed the largest harmlessness improvement; subsequent revisions showed diminishing returns but occasionally still helped.

Phase two: RLAIF — replacing human harmlessness labels

After SL-CAI finetuning, the model enters a reinforcement learning phase. This is where "AI Feedback" (RLAIF) replaces human feedback.

The setup: for each harmful prompt, the SL-CAI model generates two responses. A feedback model (in the paper, typically a pretrained language model) sees a multiple-choice prompt structured like this:

Consider the following conversation between a human and an AI assistant:

[CONVERSATION]

[CONSTITUTIONAL PRINCIPLE]

Options:
(A) [RESPONSE A]
(B) [RESPONSE B]

The answer is:

The constitutional principle is sampled anew for each evaluation. Something like: "Choose the response that a wise, ethical, polite and friendly person would more likely say." The feedback model's log probabilities for (A) and (B) are normalized into a soft label: a probability that one response is preferred, rather than a hard binary choice.
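As a rough sketch of that labeling step, the snippet below fills in the template and turns two log probabilities into a soft label. The function names are hypothetical, and how you obtain the log probabilities for the "(A)" and "(B)" continuations depends on your inference stack, so they are simply passed in as numbers here.

```python
import math


def build_feedback_prompt(conversation: str, principle: str,
                          response_a: str, response_b: str) -> str:
    """Fill in the multiple-choice template shown above."""
    return (
        "Consider the following conversation between a human and an AI assistant:\n\n"
        f"{conversation}\n\n"
        f"{principle}\n\n"
        "Options:\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "The answer is:"
    )


def soft_label(logprob_a: float, logprob_b: float) -> float:
    """Return P(A is preferred), renormalized over just the two options."""
    # Subtract the max before exponentiating so very negative logprobs
    # do not underflow to zero.
    m = max(logprob_a, logprob_b)
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    return ea / (ea + eb)
```

For example, if the feedback model puts probability 0.6 on "(A)" and 0.2 on "(B)", with the rest on other tokens, the renormalized label for A is 0.75.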

These AI-generated preference labels — 182,831 of them — are mixed with 135,296 human helpfulness labels to train a preference model. The preference model then serves as the reward signal for PPO, just like standard RLHF.
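One common way to train a preference model on soft labels is a cross-entropy loss on the difference between two scalar rewards. The sketch below assumes that formulation; the paper's exact training objective isn't restated here, and the function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """log(1 + exp(x)), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_preference_loss(reward_a: float, reward_b: float, target_a: float) -> float:
    """Cross-entropy between the soft label and sigmoid(reward_a - reward_b).

    target_a is the feedback model's probability that A is preferred
    (or 1.0 / 0.0 for a hard human label).
    """
    diff = reward_a - reward_b
    log_p_a = -softplus(-diff)  # log sigmoid(diff)
    log_p_b = -softplus(diff)   # log (1 - sigmoid(diff))
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)
```

With a 0.5 target the loss is minimized when the two rewards are equal, which is how an uncertain label pulls the preference model away from strong opinions about that pair.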

The result: harmlessness labels whose marginal cost is inference compute rather than annotator time.

The chain-of-thought variant: For the feedback model, the paper also tests generating a reasoning chain before the final preference judgment. The CoT variant improves accuracy on the paper's evaluations, with the gain showing up for larger models. Because reasoning first tends to push the final answer toward near-0 or near-1 confidence, the paper clamps the CoT probabilities to the 40–60% range. This matters: a near-certain soft label from an imperfectly calibrated judge teaches the preference model to be overconfident too.
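The clamping step itself is tiny. A minimal sketch, with a hypothetical function name and the 40–60% bounds as defaults:

```python
def clamp_soft_label(p_a: float, low: float = 0.4, high: float = 0.6) -> float:
    """Limit a soft preference label to the [low, high] range."""
    return min(max(p_a, low), high)
```

A label of 0.99 becomes 0.6 and a label of 0.01 becomes 0.4, so the direction of the preference survives while the stated confidence does not.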

The constitution: 16 principles, randomly sampled

The specific principles are worth looking at. The paper uses 16 in total and describes them as chosen in a fairly ad hoc way for research purposes. They cover:

Identifying specific harms (toxic, dangerous, illegal content)
Empathy and age-appropriateness
Legal versus illegal activities
Ethical and unethical behavior

For the RL feedback stage, simplified principles are used: "Choose the response that is least likely to contain harmful or unethical content" or "Choose the response that a wise, ethical person would prefer."

The random sampling at each revision step is deliberate. Different principles activate different critiques of the same harmful response, producing a more diverse set of revised outputs. The paper tests 1 to 16 principles and finds diversity improves but preference model harmlessness scores plateau quickly — adding more principles past a handful doesn't move the needle much.

This has a practical implication: the specific constitutional content matters less than having a coherent, consistently applied set. The paper's 16 principles aren't a carefully optimized list. They're a working draft.

The failure modes the paper names directly

Over-training produces Goodharting, not safety. The paper explicitly observes that RL-CAI models can be over-trained. When this happens, models start producing boilerplate: "You are valid, valued, and cared for" as a response to almost anything that triggers the harmlessness reward. The reward is high; the behavior is useless. The preference model's calibration degrades at high reward values, so it becomes less able to distinguish genuinely good responses from responses that learned to score well.

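This is Goodhart's Law in a safety context. The metric and the goal drift apart at the extreme of the training distribution. The paper doesn't have a clean fix. It reports mitigations that helped qualitatively (rewording principles to discourage over-reactive responses, ensembling over principles, and using soft rather than hard preference labels), which still leaves you monitoring for it.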
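One cheap monitor for this kind of boilerplate is to look for sentences that recur across responses to unrelated prompts. The sketch below is an assumption about how you might do that, not something the paper describes; the sentence splitter is deliberately naive and the 20% threshold is arbitrary.

```python
import re
from collections import Counter
from typing import Dict, Sequence


def recurring_sentences(responses: Sequence[str], min_share: float = 0.2) -> Dict[str, float]:
    """Map each sentence found in at least min_share of responses to its share."""
    counts = Counter()
    for text in responses:
        # Naive split on sentence-ending punctuation; a set ensures each
        # sentence is counted once per response.
        sentences = {
            s.strip().lower()
            for s in re.split(r"(?<=[.!?])\s+", text)
            if s.strip()
        }
        counts.update(sentences)
    n = len(responses)
    return {s: c / n for s, c in counts.items() if c / n >= min_share}
```

Run it on samples from successive RL checkpoints. A sentence whose share climbs as reward climbs is a candidate for exactly the failure described above.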

Evasiveness can come back. The SL-CAI finetuning and the constitutional principles explicitly emphasize non-evasiveness. But the paper acknowledges that models can still slide toward evasive behavior under RL pressure if the preference model hasn't internalized the distinction between "harmful and blocked" versus "potentially sensitive but answerable with appropriate framing." Monitoring for refusal rates across query categories is the practical instrument.
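A minimal sketch of that monitoring, assuming you log a category alongside each response. The keyword list is a hypothetical stand-in for a real refusal detector, which would more likely be a classifier or a graded rubric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical markers; tune or replace for your own model's phrasing.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm not able to", "i won't")


def is_refusal(response: str) -> bool:
    """Crude check: does the opening of the response contain a refusal phrase?"""
    head = response.lower()[:200]
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rates(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """records are (category, response) pairs; returns refusal rate per category."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for category, response in records:
        totals[category] += 1
        refused[category] += int(is_refusal(response))
    return {category: refused[category] / totals[category] for category in totals}
```

Tracked per checkpoint, a rising rate in a benign category such as account management is the signal that RL pressure is pulling the model back toward evasiveness.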

Inaccurate critiques don't always cancel out. Because the revision is conditioned on the critique, a systematically wrong critique type can generate systematically wrong revisions. The paper notes this but doesn't quantify how often it happens. If you're extending CAI to a new domain, assume your base model's critique quality is the ceiling for revision quality.

Automation removes a human check. The paper is direct about the risk: "reducing the amount of human effort required could lead developers to deploy models with unforeseen failure modes." Fewer humans in the loop means faster iteration and also means problems that a human annotator would catch can slip through. CAI is not a way to skip safety evaluation — it's a way to scale up one part of it.

Production tradeoffs

You still need human helpfulness labels. CAI replaces human harmlessness labels, not all human labels. The finetuning and preference model training in the paper both rely on ~135K human-written helpfulness prompts and comparisons. The operational cost savings are real but partial.

The preference model is the weakest link. Preference models trained on AI feedback inherit the biases of the feedback model. If the feedback model has a systematic blind spot — a category of harm it doesn't recognize well — those failures become structural in your reward signal. Auditing the preference model's calibration across harm categories matters as much as evaluating the final model.
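One way to run that audit is an expected calibration error computed separately for each harm category, against comparisons where you have a human verdict. The record layout and function name below are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def calibration_by_category(
    records: Iterable[Tuple[str, float, bool]], n_bins: int = 5
) -> Dict[str, float]:
    """Expected calibration error per category.

    Each record is (category, p_a, human_chose_a), where p_a is the
    preference model's probability that response A is better.
    """
    # (category, bin index) -> [count, sum of p_a, sum of human outcomes]
    cells = defaultdict(lambda: [0, 0.0, 0.0])
    totals = defaultdict(int)
    for category, p_a, human_chose_a in records:
        bin_index = min(int(p_a * n_bins), n_bins - 1)
        cell = cells[(category, bin_index)]
        cell[0] += 1
        cell[1] += p_a
        cell[2] += 1.0 if human_chose_a else 0.0
        totals[category] += 1
    ece = defaultdict(float)
    for (category, _), (_, sum_p, sum_y) in cells.items():
        # Equivalent to (bin count / total) * |mean p - mean outcome|.
        ece[category] += abs(sum_p - sum_y) / totals[category]
    return dict(ece)
```

A category whose error is far above the others is where the feedback model's blind spot has most likely become part of the reward signal.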

Constitutional principles are not self-maintaining. As your model's deployment context changes, the principles need to change too. A constitution written for a general assistant doesn't automatically generalize to a coding assistant, a medical information tool, or a customer service agent. The paper's 16 principles are a starting point for research. A production constitution requires deliberate domain-specific iteration.

Chain-of-thought feedback buys accuracy at an inference cost. The CoT variant of RLAIF improves feedback model accuracy on the HHH (helpful, honest, harmless) evaluation for large models. The tradeoff is that generating a reasoning chain before each preference judgment adds inference cost, and at millions of comparisons that cost can matter. The paper shows the accuracy gain is real; whether it's worth the extra inference depends on where your preference model is failing.

Scale changes the behavior. Larger models benefit more from chain-of-thought reasoning in the feedback model. Larger models also tend to learn constitutional principles faster and more reliably. If you're running CAI on a small model, you'll see weaker results not because the method is wrong but because the model's ability to critique its own outputs is limited by its base capability.

When not to use Constitutional AI

If your harm surface is narrow and stable. CAI's scaling advantage is in breadth — a constitution handles a wide variety of harm types without requiring labeled data for each one. If your deployment is a SQL query assistant and your entire harm surface is "generating DROP TABLE statements," targeted filtering is simpler and easier to audit.

If you can't evaluate the preference model. RLAIF produces a preference model trained on AI-generated labels. If you don't have a way to evaluate that preference model's quality against held-out human judgments, you're flying blind on whether the training signal is actually correct. Don't use CAI as an excuse to skip evaluation.
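The simplest version of that evaluation is pairwise agreement: on held-out comparisons where humans picked a winner, how often does the preference model score the human-chosen response higher? In this sketch `score` is a hypothetical callable wrapping your preference model.

```python
from typing import Callable, Iterable, Tuple


def pm_agreement(
    pairs: Iterable[Tuple[str, str, str]],
    score: Callable[[str, str], float],  # hypothetical: (prompt, response) -> reward
) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one held-out comparison")
    correct = sum(
        score(prompt, chosen) > score(prompt, rejected)
        for prompt, chosen, rejected in pairs
    )
    return correct / len(pairs)
```

Agreement near 0.5 on a slice means the reward signal there is close to a coin flip, whatever the aggregate number says.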

If your domain is significantly out-of-distribution for your base model. CAI's critique quality depends on the base model understanding what harm looks like in your domain. For highly specialized contexts — clinical notes, legal documents, financial compliance — the base model's critique may not surface domain-specific harms reliably. Supplement with domain expert review.

If your helpfulness and harmlessness objectives are in direct tension. CAI reduces but doesn't eliminate the helpfulness-harmlessness tradeoff. The paper shows RL-CAI models are less evasive than HH-RLHF models, but they still make tradeoffs. If you need maximum helpfulness on sensitive topics — a medical information system that must answer clinical questions directly — constitutional training alone won't resolve the underlying tension. You need explicit constitutional principles that encode your product's specific helpfulness requirements.

What the paper gets right that implementations miss

The framing of CAI as a specification problem is the useful insight. RLHF requires you to implicitly encode what "harmless" means in the distribution of human labels you collect. Constitutional AI requires you to explicitly state what "harmless" means in the form of written principles.

Explicit specifications can be wrong in ways that are visible and fixable. Implicit specifications encoded in label distributions can be wrong in ways that are invisible until deployment. The paper's bet is that explicit is better than implicit, even if the explicit specification is imperfect.

The other useful insight is about evasiveness as a distinct failure mode. Most safety discussions conflate evasiveness with safety: if the model refuses, it's safe. The paper measures them separately and designs against evasive refusals directly. That's the right frame for building something users will actually use. A model that refuses everything isn't safe — it's just useless in a particular direction.

The quantitative results — RL-CAI achieving harmlessness comparable to HH-RLHF without human harmlessness labels — matter less than these structural insights. The numbers are specific to the 2022 models and the evaluation setup. The architecture of how you specify, critique, and revise behavior is the part worth carrying into production.

Most "our model is too restrictive" complaints I've heard from teams are really "our constitution is implicit and nobody read it" complaints in disguise. Making the principles explicit doesn't guarantee they're right. But it makes them fixable.