Chain-of-thought prompting: what the Wei et al. paper actually says

Reading Wei et al. (Google Brain, NeurIPS 2022) after watching our LLM agent answer a multi-step math question correctly, then wrong, then correctly again — with no code changes.

We had a customer support agent that needed to compute prorated refunds. Simple arithmetic: days remaining, original price, days in billing period. The model got it right 60% of the time with a standard few-shot prompt. We tuned the prompt, added more examples, adjusted temperature — still 60%. The failure mode was strange: the model would output the correct numbers in its reasoning, then state the wrong final answer.
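For reference, the arithmetic the agent had to get right is small. Here is a minimal sketch, assuming a simple linear proration rounded to the nearest cent; the function name `prorated_refund` and the rounding rule are my assumptions, not a description of any particular billing system.

```python
from decimal import Decimal, ROUND_HALF_UP


def prorated_refund(price: str, days_remaining: int, days_in_period: int) -> Decimal:
    """Refund = price * days_remaining / days_in_period, rounded to cents.

    Assumption: linear proration. A real refund policy may differ.
    """
    if days_in_period <= 0:
        raise ValueError("days_in_period must be positive")
    if not 0 <= days_remaining <= days_in_period:
        raise ValueError("days_remaining must be between 0 and days_in_period")
    raw = Decimal(price) * days_remaining / days_in_period
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Three inputs, one multiplication, one division, one rounding step. That is exactly the kind of short sequential computation where a model can state the right intermediate numbers and still land on the wrong final figure.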

The fix was chain-of-thought prompting. Not magic, just: show the model the intermediate reasoning steps in your few-shot examples. Accuracy jumped to 91%. But then I read the paper that coined the term — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, and Zhou from Google Brain, NeurIPS 2022 — and realized we had gotten lucky in exactly the ways the paper predicts you should be lucky, and were about to hit the failure modes it predicts you'll hit.

The problem the paper is actually solving

Standard few-shot prompting shows the model input-output pairs: input → answer. For factual retrieval this works fine. For multi-step reasoning it falls apart.

The paper's motivation, as I read it, is this: a model asked to go straight from question to answer has to do all of the intermediate reasoning implicitly, inside the computation that produces the answer tokens. It is trying to output a token that approximates the correct final answer given the training distribution, not running an explicit reasoning procedure. For simple queries, that works. For problems that require several sequential logical steps, the implicit computation often isn't expressive enough.

Chain-of-thought prompting sidesteps this by changing the output distribution you're sampling from. Instead of input → answer, your few-shot examples are input → [reasoning steps] → answer. The model is now generating reasoning tokens that serve as an intermediate working memory. Each token it generates becomes context for the next, which is exactly the mechanism transformers are built for.

This isn't a training technique. It's an inference technique. The model hasn't changed. You've changed the structure of what it's being asked to produce.

What the paper actually shows — and what it doesn't

The headline result is striking: chain-of-thought prompting on PaLM 540B (Google's largest model at the time) achieves 56.9% on GSM8K (grade-school math word problems), compared to 17.9% for standard few-shot prompting. That's a 3x lift with only a prompting change.

But that number hides the most important finding in the paper.

The authors test chain-of-thought across several model families and scales: GPT-3 (350M to 175B parameters), LaMDA (422M to 137B), PaLM (8B to 540B), plus UL2 20B and Codex. The result is consistent: chain-of-thought prompting does not help smaller models, and often hurts them.

For the smaller GPT-3 variants (6.7B parameters and below), CoT prompting does no better than standard few-shot on arithmetic tasks and is often worse. The paper's diagnosis is that small models produce chains that are fluent but illogical, so the extra tokens introduce noise rather than useful signal. The technique only reliably beats standard few-shot at roughly 100 billion parameters and above.

The paper frames this as an "emergent ability": a capability that is absent in smaller models and appears only once models are large enough, rather than improving smoothly with scale. Below the threshold, CoT gives no benefit and often costs accuracy. Above it, the gains are large and consistent.

This has a direct implication for production that most engineering teams ignore: if you're running a 7B or 13B model, CoT prompting may be making your reasoning accuracy worse, not better, and you won't know until you measure it. The GPT-3-class models that made CoT famous are 175B parameters. Your fine-tuned Llama-3-8B is not the same animal.

The mechanics of a chain-of-thought exemplar

The paper uses eight few-shot exemplars. Each consists of a question, a chain of reasoning steps written in natural language, and a final answer. Example from the paper:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

The reasoning chain is free-form natural language, not structured output, not code, not explicit step numbering. This is intentional. The paper shows that the specific format of the chain matters less than the presence of intermediate reasoning. You don't need to write perfect step-by-step algebra — you need to show the model what "reasoning through a problem" looks like in your domain.
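To make the difference between the two prompt formats concrete, here is a minimal sketch of a prompt builder. The `Exemplar` class, `build_prompt` function, and the refund question are hypothetical names and data of my own; only the Roger exemplar comes from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Exemplar:
    question: str
    reasoning: str  # free-form natural-language steps
    answer: str


def format_exemplar(ex: Exemplar, with_reasoning: bool) -> str:
    """Render one exemplar, with or without its reasoning chain."""
    if with_reasoning:
        body = f"{ex.reasoning} The answer is {ex.answer}."
    else:
        body = f"The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: {body}"


def build_prompt(exemplars: List[Exemplar], question: str, with_reasoning: bool = True) -> str:
    """Join the exemplars and leave the final answer slot open for the model."""
    blocks = [format_exemplar(ex, with_reasoning) for ex in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


ROGER = Exemplar(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    reasoning="Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.",
    answer="11",
)

# Hypothetical query, in the spirit of the refund task.
QUESTION = "A customer paid 30.00 for a 30-day plan and cancels with 10 days left. What is the refund?"

cot_prompt = build_prompt([ROGER], QUESTION, with_reasoning=True)
standard_prompt = build_prompt([ROGER], QUESTION, with_reasoning=False)
```

The only difference between `cot_prompt` and `standard_prompt` is the reasoning text inside each exemplar. Nothing else about the request changes.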

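The paper runs three ablations: equation only (the model writes just the equation before answering), variable compute only (the model emits a run of dots as long as the equation, to test whether extra tokens alone help), and chain of thought after the answer (the reasoning comes after the final answer instead of before it). On GSM8K, none of them does much better than standard prompting; only the full natural-language chain, produced before the answer, gets the large gain. The critical ingredient is intermediate reasoning in natural language that precedes the answer, not extra tokens and not a particular structure.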

Why the reasoning chain is not actually the reasoning

Here's the production-critical thing the paper mentions carefully but most implementations miss: the chain of thought may not faithfully represent the model's internal computation.

The model doesn't actually execute the reasoning steps you see in the output. It predicts tokens. When those tokens form a reasoning chain, they constrain the distribution of subsequent tokens in ways that increase the probability of correct final answers. But the model has no intrinsic mechanism that "runs" the chain and then outputs an answer — it's all next-token prediction, all the way down.

This has a specific failure mode: confabulation chains. The model produces a fluent, coherent chain of reasoning that arrives at a wrong answer. When you read the chain, each step looks plausible. But somewhere in the middle, the model has drifted from accurate reasoning into generating tokens that sound like reasoning while being semantically incorrect.

I've watched this happen on percentage calculations: "15% of 240 is 24" (wrong — it's 36), then the rest of the chain proceeds from 24 and arrives at a wrong but internally consistent conclusion. The chain is grammatically and structurally correct. It's just wrong. And unlike standard prompting, where the error is contained in the final answer, CoT errors can be harder to catch because the reasoning looks convincing.
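A cheap guard against this particular failure is to recompute any arithmetic claim you can pattern-match in the chain. This is a minimal sketch that only recognizes the literal phrasing "X% of Y is Z"; the regex and the function name `check_percent_claims` are my own assumptions, and real chains will phrase things in ways this misses.

```python
import re

# Matches claims such as "15% of 240 is 24".
PERCENT_CLAIM = re.compile(r"(\d+(?:\.\d+)?)% of (\d+(?:\.\d+)?) is (\d+(?:\.\d+)?)")


def check_percent_claims(chain: str, tol: float = 0.01):
    """Return (claim_text, stated, expected) for every wrong percentage claim."""
    wrong = []
    for match in PERCENT_CLAIM.finditer(chain):
        pct, base, stated = (float(g) for g in match.groups())
        expected = pct * base / 100
        if abs(expected - stated) > tol:
            wrong.append((match.group(0), stated, expected))
    return wrong


chain = "15% of 240 is 24. So the discount is 24 and the new price is 216."
errors = check_percent_claims(chain)
# errors == [("15% of 240 is 24", 24.0, 36.0)]
```

It catches the bad step even though everything after it is internally consistent, which is the point: the check does not trust the chain's fluency.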

Failure modes in production

Error propagation. Each step in the chain conditions the next. An arithmetic error in step 2 cascades through steps 3, 4, and 5. Standard prompting fails locally; chain-of-thought failures can be deeply embedded in apparently coherent reasoning. Debugging requires reading the entire chain, not just the final answer.

Latency and cost multiplication. Chain-of-thought increases output token count by 3–10x for typical reasoning tasks. At GPT-4 pricing, a question that costs $0.002 to answer with standard prompting costs $0.015 with a full chain-of-thought. At scale, this isn't a rounding error. And because output tokens are generated sequentially, CoT directly increases the time until the final answer arrives. Time-to-first-token depends mostly on prompt length, which few-shot exemplars with written-out reasoning also make longer.
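To see what that per-query difference means at volume, here is a small sketch using the two per-query figures quoted above. The monthly query count is a hypothetical number, and `cot_overhead` is a name I made up for illustration.

```python
from decimal import Decimal

# Per-query costs quoted in the text above.
STANDARD_COST = Decimal("0.002")
COT_COST = Decimal("0.015")


def cot_overhead(queries_per_month: int) -> dict:
    """Monthly spend with and without CoT, plus the difference."""
    standard = STANDARD_COST * queries_per_month
    cot = COT_COST * queries_per_month
    return {
        "standard": standard,
        "cot": cot,
        "extra": cot - standard,
        "multiplier": COT_COST / STANDARD_COST,
    }


# Hypothetical volume: one million queries a month.
report = cot_overhead(1_000_000)
```

At that volume the sketch gives $2,000 a month for standard prompting against $15,000 with CoT, a 7.5x multiplier. Swap in your own measured per-query costs before drawing conclusions.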

Exemplar sensitivity. The paper's 8 exemplars aren't special — but the quality of your exemplars is. Poorly constructed reasoning chains in your few-shot examples reliably produce poor reasoning chains in model outputs. I've seen teams copy chain-of-thought examples from blog posts and wonder why their accuracy is poor. The reasoning chains need to reflect the actual domain and problem structure of your queries.

Few-shot vs zero-shot CoT. The paper is about few-shot CoT (you write the exemplars). "Let's think step by step" zero-shot CoT (Kojima et al., 2022) came out shortly after and is what most engineers actually use. The mechanisms are related but not identical. The Wei et al. results don't fully transfer — zero-shot CoT has different accuracy profiles and different model-scale dependencies. Don't assume the paper's benchmarks apply to your zero-shot prompt.
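For comparison, zero-shot CoT is usually run as two calls: one to elicit the reasoning, one to extract the answer. This is a minimal sketch of that two-stage pattern. The `generate` callable is a stand-in for whatever model client you use, and the exact answer-extraction phrase is my assumption about a reasonable trigger for numeric tasks, so adjust it to your answer format.

```python
from typing import Callable, Tuple

REASONING_TRIGGER = "Let's think step by step."
# Assumption: an extraction phrase suited to numeric answers.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    """Two-stage zero-shot CoT: elicit reasoning, then extract the answer."""
    # Stage 1: no exemplars, just the question and the reasoning trigger.
    first_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(first_prompt).strip()

    # Stage 2: feed the reasoning back and ask for the final answer only.
    second_prompt = f"{first_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = generate(second_prompt).strip()
    return reasoning, answer
```

Note there are no hand-written exemplars anywhere in this flow, which is why results from the few-shot setting should not be assumed to carry over.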

Self-consistency as error correction. With sampling enabled, the same model with the same prompt can produce different reasoning chains and different answers. Self-consistency (Wang et al., 2022) exploits this: it samples multiple chains and majority-votes the final answer. It works: Wang et al. report gains over greedy CoT decoding of roughly 4 to 18 percentage points depending on the benchmark, with the largest on GSM8K. The cost is N× inference compute. If you're using CoT for high-stakes reasoning and not using self-consistency, you're leaving accuracy on the table, but only if your budget allows generating 5–20 samples per query.
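The voting step itself is a few lines. Here is a minimal sketch, assuming a hypothetical `sample_chain` callable that returns one sampled chain per call (temperature above zero) and chains that end with the phrase "The answer is N"; both are my assumptions about your setup.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumes chains end with a phrase like "The answer is 11."
ANSWER_RE = re.compile(r"The answer is\s*(-?\d+(?:\.\d+)?)")


def extract_answer(chain: str) -> Optional[str]:
    """Return the last numeric answer stated in the chain, if any."""
    matches = ANSWER_RE.findall(chain)
    return matches[-1] if matches else None


def self_consistent_answer(
    prompt: str, sample_chain: Callable[[str], str], n: int = 5
) -> Optional[str]:
    """Sample n chains and return the most common final answer."""
    votes = Counter()
    for _ in range(n):
        answer = extract_answer(sample_chain(prompt))
        if answer is not None:  # chains with no parseable answer do not vote
            votes[answer] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Ties are broken by whichever answer was seen first, which is arbitrary. For a high-stakes path you may prefer to treat a tie or a narrow margin as "no confident answer" and escalate.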

When NOT to use chain-of-thought

When you're below the scale threshold. If your production model is smaller than roughly 70B parameters (by today's instruction-tuned model standards), test carefully before assuming CoT helps. With sufficiently fine-tuned smaller models, CoT can sometimes work at smaller scales, but the gains are less predictable and often negative on reasoning-heavy tasks.

When latency is your primary constraint. CoT fundamentally increases time-to-complete-answer. For conversational interfaces where users see streaming output, the intermediate reasoning can be a UX feature. For backend API calls where you're timing a pipeline stage, it's a tax. Measure before adding it.

For single-step factual retrieval. Chain-of-thought adds value for problems that decompose into multiple steps. "What is the capital of France?" does not benefit from CoT. Adding reasoning exemplars to factual queries doesn't improve accuracy and adds token cost. Save CoT for problems that actually require sequential reasoning.

When the reasoning chain becomes the input to downstream systems. If you're parsing the model's output programmatically, the free-form reasoning chain before the answer creates a parsing problem. You need either structured output (JSON with reasoning fields), a reliable answer delimiter, or a two-pass approach: generate the chain, extract the answer. All of these work, but they add complexity the standard approach doesn't have.
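The delimiter option is the simplest of the three. Here is a minimal sketch, assuming the prompt instructs the model to end with the literal marker "Final answer:"; the marker and the name `split_chain` are my choices for illustration.

```python
from typing import Optional, Tuple

# Assumption: the prompt tells the model to finish with this exact marker.
DELIMITER = "Final answer:"


def split_chain(output: str, delimiter: str = DELIMITER) -> Tuple[str, Optional[str]]:
    """Split model output into (reasoning, answer) at the last delimiter.

    Returns (output, None) when the delimiter is missing, so the caller
    can retry or escalate instead of parsing garbage.
    """
    reasoning, sep, answer = output.rpartition(delimiter)
    if not sep:
        return output.strip(), None
    return reasoning.strip(), answer.strip()
```

Splitting on the last occurrence matters: a model sometimes mentions the marker phrase mid-chain, and the real answer is the one that follows the final marker.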

When you need faithful explanations. Chain-of-thought reasoning looks like an explanation but isn't one in the causal sense. If you're using reasoning chains for auditability (showing users why the model made a decision), understand that the chain may not reflect the actual basis for the output. For regulated domains where explanations must be causally accurate, CoT chains should not be treated as sufficient on their own.

What the paper gets right that most implementations miss

The paper tests CoT on benchmarks where ground truth exists: arithmetic (GSM8K, SVAMP, ASDiv, AQuA, MAWPS), commonsense (CommonsenseQA, StrategyQA, and others), and symbolic manipulation tasks (last-letter concatenation, coin flip). These have objective correct answers. You can measure accuracy precisely.

Most production CoT deployments don't have this luxury. The "reasoning" you care about — drafting a support response, analyzing a contract, summarizing a technical document — doesn't have a ground truth. You can't compute a 91% accuracy figure. This means the gains the paper demonstrates don't directly apply, and you can't easily verify whether your CoT setup is helping or not.

The right answer is to instrument your own evaluation dataset for your specific task before committing to a CoT strategy. Sample 100–200 production queries, annotate correct answers, and measure accuracy with and without CoT. The paper's benchmarks tell you the mechanism works in principle. They don't tell you it works for your domain, your model, and your token budget.
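The comparison itself needs very little machinery. This is a minimal sketch, assuming exact-match scoring on a final answer string and two hypothetical callables, `standard_fn` and `cot_fn`, that each take a question and return the model's final answer. Exact match is an assumption; many tasks need a more forgiving comparison.

```python
from typing import Callable, Dict, Iterable, Tuple

Example = Tuple[str, str]  # (question, annotated correct answer)


def accuracy(examples: Iterable[Example], answer_fn: Callable[[str], str]) -> float:
    """Fraction of examples where the model's answer exactly matches the label."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for question, gold in examples if answer_fn(question).strip() == gold.strip())
    return correct / len(examples)


def compare(
    examples: Iterable[Example],
    standard_fn: Callable[[str], str],
    cot_fn: Callable[[str], str],
) -> Dict[str, float]:
    """Score both prompting strategies on the same labeled set."""
    examples = list(examples)
    return {
        "n": len(examples),
        "standard": accuracy(examples, standard_fn),
        "cot": accuracy(examples, cot_fn),
    }
```

With only 100 to 200 examples, a gap of a few points is within noise. Treat small differences as "no evidence either way" rather than a win.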

The production version

In practice, chain-of-thought prompting in production looks like this:

Identify the tasks that actually need it. Multi-step math, logical deduction, multi-constraint filtering. Not factual lookup, not summarization, not classification.
Write domain-specific exemplars. Don't copy from papers or blog posts. Write 6–10 examples that reflect your actual query distribution. The reasoning style matters more than its structure.
Profile the token cost. Measure average output length with and without CoT on a sample of real queries. Budget accordingly before enabling in production.
Add a delimiter. The model needs to know where the reasoning ends and the answer begins. A fixed phrase such as "Therefore, the answer is:" or "\n\nFinal answer:" works. Structured output with a reasoning field and an answer field works better for parsing.
Consider self-consistency for high-stakes paths. If a CoT error in this flow has real consequences (wrong calculation, wrong policy decision), generate 3–5 chains and majority-vote. Expensive, but the accuracy gain is real and predictable.
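For the structured-output variant mentioned in the delimiter step above, the parsing side might look like the sketch below. It assumes you have instructed the model to return a JSON object with string fields named `reasoning` and `answer`; those field names and the function `parse_structured` are my assumptions, not a standard.

```python
import json
from typing import Optional, Tuple


def parse_structured(output: str) -> Optional[Tuple[str, str]]:
    """Parse {"reasoning": "...", "answer": "..."}; return None if malformed."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    reasoning = data.get("reasoning")
    answer = data.get("answer")
    # Reject missing or non-string fields instead of guessing.
    if not isinstance(reasoning, str) or not isinstance(answer, str):
        return None
    return reasoning, answer
```

Returning `None` on anything malformed keeps the failure visible, so the caller can retry or fall back rather than passing a half-parsed answer downstream. Put the reasoning field first in the schema you request, so the model still generates its reasoning before the answer.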

The Wei et al. paper is about six pages of core content and worth reading directly. The main insight — that reasoning chains let the model use intermediate tokens as working memory, and this only reliably works above a scale threshold — is crisp and actionable. Everything else in the ecosystem of CoT techniques builds on that insight, so starting from the source is worth the time.