Teaching LLMs to reach for a calculator

Reading Toolformer (Schick et al., Meta AI, 2023) while building agent pipelines.

The thing that keeps breaking my agents isn't hallucination in the general sense; it's arithmetic. An agent running a cost analysis will confidently tell me that 400 out of 1400 is 27%, not 28.6%. It's a one-line Python call away from the right answer, but the model doesn't reach for it. You add a calculator tool with a system prompt, it works in testing, and then it stops using the calculator at exactly the wrong moment in production.

Toolformer (Schick et al., 2023) takes a fundamentally different angle: instead of telling the model to use tools, train it to decide for itself—using its own judgment about when a tool call actually helps. A 6.7B parameter model trained this way outperforms GPT-3 175B on several math benchmarks. That's the headline. The mechanism is what's interesting.

The core idea: filter by whether the result helps prediction

The self-supervised training pipeline has three steps. None of them require humans labeling when tool use is appropriate.

Step 1: Sample candidates. For each tool, you write a handful of demonstrations (e.g., "Joe Biden was born in [QA('Where was Joe Biden born?')] Scranton"). The LM uses these as a prompt to annotate a large text corpus with potential API calls at positions in the text where a tool call might fit. You keep up to k candidate positions per text, restricted to positions where the model assigns p(⟨API⟩ | context) > τ_s, and then sample up to m candidate API calls at each kept position.
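Here is a minimal sketch of the position-selection part of Step 1, assuming you already have the model's probability of starting an API call at each position. The function name and the example probabilities are mine, not the paper's, and tie-breaking by earlier position is an arbitrary choice.

```python
from typing import List


def select_candidate_positions(
    api_start_probs: List[float],  # hypothetical input: p(<API> | context) per position
    tau_s: float,
    k: int,
) -> List[int]:
    """Keep positions whose <API> probability exceeds tau_s, at most k of them."""
    above = [(p, i) for i, p in enumerate(api_start_probs) if p > tau_s]
    # Highest probability first; ties go to the earlier position (an assumption).
    above.sort(key=lambda pair: (-pair[0], pair[1]))
    # Return the surviving positions in reading order.
    return sorted(i for _, i in above[:k])


# Made-up probabilities for a six-token text.
probs = [0.01, 0.20, 0.03, 0.07, 0.40, 0.06]
positions = select_candidate_positions(probs, tau_s=0.05, k=2)
print(positions)  # [1, 4]
```

Four positions clear τ_s = 0.05 here, but only the two most likely survive the cap of k = 2.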

Step 2: Execute. You actually call the APIs and collect results. This can be any tool that takes text in and returns text out—a calculator, a BM25 search index, another neural network. The constraint is just that the response is a text string.
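To make the "text in, text out" constraint concrete, here is a toy sketch of a tool registry. Everything in it is an assumption for illustration: the `TOOLS` dict, the `execute` helper, a calculator that only understands "a op b", and the choice to round to two decimals and return an empty string on failure.

```python
import operator
from typing import Callable, Dict

# A tool is anything that maps a text input to a text output.
Tool = Callable[[str], str]

_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculator(expr: str) -> str:
    """Toy calculator: handles exactly 'a op b', returns text with 2 decimals."""
    left, op, right = expr.split()
    return f"{_OPS[op](float(left), float(right)):.2f}"


# Hypothetical registry; a search index or another model would slot in the same way.
TOOLS: Dict[str, Tool] = {"Calculator": calculator}


def execute(tool_name: str, tool_input: str) -> str:
    """Run a tool; any failure yields an empty string so the candidate can be dropped."""
    try:
        return TOOLS[tool_name](tool_input)
    except Exception:
        return ""


result = execute("Calculator", "400 / 1400")
print(result)  # 0.29
```

The point of the uniform signature is that the filtering step never needs to know which tool produced a result; it only sees a string.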

Step 3: Filter by loss. Here's the key decision criterion. For each candidate API call at position i with result r, you compute:

L_i⁺ = weighted cross-entropy loss over tokens x_i...x_n, prefixed with the API call AND result
L_i⁻ = min(loss with no API call, loss with API call but no result)

Keep the API call only if L_i⁻ - L_i⁺ ≥ τ_f. In plain English: the API result must reduce the model's loss on predicting future tokens by at least τ_f compared to both (a) not calling the tool, and (b) calling the tool but ignoring the result. If the tool call doesn't help prediction, it's filtered out.

This is elegant. You're not asking humans what "useful" means—you're asking the model: does this information make you better at predicting what comes next? The weighting function also decays with distance (w_t = max(0, 1 - 0.2t)), so the filter favors API calls that are relevant to what immediately follows, not just somewhere later in the text.
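The filter is small enough to sketch directly. This version assumes you can get per-token log-probabilities for the continuation under each of the three prefixes; the function names and the example numbers are mine. Normalising the weights so they sum to 1 is also an assumption of this sketch.

```python
from typing import List


def weights(n: int) -> List[float]:
    """Decaying weights max(0, 1 - 0.2 t), normalised to sum to 1 (normalisation assumed)."""
    if n < 1:
        raise ValueError("need at least one continuation token")
    raw = [max(0.0, 1.0 - 0.2 * t) for t in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_loss(token_logprobs: List[float]) -> float:
    """Weighted cross-entropy over the continuation tokens."""
    w = weights(len(token_logprobs))
    return -sum(wi * lp for wi, lp in zip(w, token_logprobs))


def keep_call(
    lp_with_result: List[float],     # prefix: API call AND its result
    lp_no_call: List[float],         # prefix: nothing
    lp_call_no_result: List[float],  # prefix: API call without the result
    tau_f: float,
) -> bool:
    """Keep the call only if it beats the better of the two baselines by tau_f."""
    l_plus = weighted_loss(lp_with_result)
    l_minus = min(weighted_loss(lp_no_call), weighted_loss(lp_call_no_result))
    return l_minus - l_plus >= tau_f


# Made-up log-probs for a three-token continuation.
with_result = [-0.1, -0.2, -0.3]
no_call = [-2.0, -1.5, -0.3]
call_no_result = [-1.8, -1.5, -0.3]

print(keep_call(with_result, no_call, call_no_result, tau_f=1.0))  # True
print(keep_call(with_result, no_call, call_no_result, tau_f=2.0))  # False
```

In this toy case the loss gap is about 1.14, so the call survives a threshold of 1.0 and is dropped at 2.0. Taking the minimum over both baselines is what stops the model from getting credit for the call syntax alone.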

What the numbers look like

Toolformer is GPT-J (6.7B parameters) finetuned on a subset of CCNet augmented with API calls. Baselines include the same GPT-J without finetuning, GPT-J finetuned without API calls, OPT (66B), and GPT-3 (175B), all evaluated zero-shot.

Math word problems (ASDiv / SVAMP / MAWPS):

GPT-J baseline: 7.5 / 5.2 / 9.9
GPT-3 175B: 14.0 / 10.0 / 19.8
Toolformer: 40.4 / 29.4 / 44.0

The model calls the calculator on 97.9% of math examples. In the remaining 2.1% it answers without a calculator call, so the improvement comes almost entirely from the tool rather than from the finetuning itself.

Factual lookup (LAMA T-REx):

GPT-J baseline: 31.9
GPT-3 175B: 39.8
Toolformer: 53.5

Here it routes to the QA tool in 98.1% of cases. The filtering step is selecting the right tool per context, not just learning "always call something."

Question answering (WebQS / NQ / TriviaQA):

Toolformer: 26.3 / 17.7 / 48.8
GPT-3 175B: 29.0 / 22.6 / 65.9

On QA, Toolformer still trails GPT-3. The paper attributes this partly to the simplicity of the BM25 search engine—it returns results that often aren't a good match—and partly to a deeper issue: the model can't reformulate its query if the first result is bad. More on that in limitations.

The inference-time knob you need to know about

During decoding, there's a hyperparameter k: call the API if ⟨API⟩ is among the top-k most likely tokens. At k=1 (pure greedy), the model only calls a tool when it's already the most likely token. At k=10, the model calls much more aggressively.

This matters in production because the model is somewhat calibrated at k=1—it reaches for tools most often on examples where it would do worst without them. But k=1 is too conservative for many tasks; the paper uses k=10 to get full benefit. The tradeoff is a higher false-call rate at higher k. In a latency-sensitive system, k is a real dial you'd tune against your SLA.

The paper also enforces a maximum of one API call per input to avoid loops. That's conservative—useful for safety but it prevents chaining.
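The decoding rule from this section, including the one-call cap, fits in a few lines. This is a sketch under my own assumptions: the next-token distribution is a plain dict, `"<API>"` is a placeholder spelling for the special token, and `calls_made` is a counter the caller maintains.

```python
from typing import Dict

API_TOKEN = "<API>"  # placeholder spelling for the special call-start token


def should_start_call(
    next_token_probs: Dict[str, float],
    k: int,
    calls_made: int,
    max_calls: int = 1,
) -> bool:
    """Start an API call if <API> is among the k most likely next tokens
    and the per-input call budget has not been used up."""
    if calls_made >= max_calls:
        return False
    ranked = sorted(
        next_token_probs, key=lambda tok: next_token_probs[tok], reverse=True
    )
    return API_TOKEN in ranked[:k]


# Made-up distribution where <API> is the third most likely token.
probs = {"the": 0.41, "about": 0.22, API_TOKEN: 0.09, "roughly": 0.05}

print(should_start_call(probs, k=1, calls_made=0))  # False
print(should_start_call(probs, k=3, calls_made=0))  # True
print(should_start_call(probs, k=3, calls_made=1))  # False
```

Raising `max_calls` is the obvious lever if you want more than one call per input, but nothing in the training data teaches the model to use a second call well.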

Production tradeoffs

The filtering threshold τ_f directly controls precision vs. recall of tool use. At τ_f = 0.5, the Wikipedia search tool produces ~207K training examples. At τ_f = 2.0, it drops to ~14K. Higher threshold means only keeping the clearest cases where the tool helped—more reliable behavior but less coverage. In practice you'd tune this per tool based on how bad false positives are (a calculator that fires spuriously is annoying; a translation call that fires spuriously on English text is expensive and wrong).
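If you have the loss-reduction scores for a tool's candidate calls, a threshold sweep is cheap to run before committing to a value. A sketch, with entirely made-up scores and a helper name of my own:

```python
from typing import Dict, List


def survivors_by_threshold(
    scores: List[float], thresholds: List[float]
) -> Dict[float, int]:
    """Count how many candidate calls would pass the filter at each tau_f."""
    return {tau: sum(1 for s in scores if s >= tau) for tau in thresholds}


# Made-up loss-reduction scores (L_minus - L_plus) for eight candidate calls.
scores = [-0.3, 0.1, 0.6, 0.9, 1.2, 1.4, 2.5, 3.1]

counts = survivors_by_threshold(scores, [0.5, 1.0, 2.0])
print(counts)  # {0.5: 6, 1.0: 4, 2.0: 2}
```

Running this per tool makes the precision/coverage tradeoff visible as a curve rather than a guess.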

Sample efficiency varies dramatically by tool. Processing over a million documents generates only 994 useful calculator examples at τ_f = 1.0. Compare that to ~60K Wikipedia search examples at the same threshold. This reflects how rarely raw web text contains contexts where a calculation actually helps predict what comes next. If you're adding a tool with a narrow trigger domain, plan for the training set to be sparse and for the model to underuse the tool.

Tool use capability doesn't emerge below ~775M parameters. Smaller models trained with the same approach show no improvement from tool calls. The model needs enough capacity to learn when calling is appropriate, not just how to format a call. This is relevant if you're building on a distilled or quantized model for latency reasons.

Language modeling performance is preserved. Finetuning on the augmented dataset with API calls disabled at inference gives the same perplexity as finetuning without any API calls (both ~10.3 on WikiText). The tool use is additive, not a tradeoff against core capabilities.

The failure modes the paper doesn't hide

Prompt sensitivity. The model is "often sensitive to the exact wording of its input when deciding whether or not to call an API." This is consistent with everything we know about LLMs and prompting—but it means tool-call reliability degrades on inputs that look different from training distribution. If users are paraphrasing or your input preprocessing changes, you'll see inconsistent tool activation.

No chaining. Because API calls for each tool are generated independently during training, there are no examples of using one tool's output as another tool's input. The model can't "call calendar, then pass the date to the QA system." For workflows that naturally require this—temporal factual queries, multi-step calculations—you'd need a different architecture, or to explicitly construct chained examples.

Search quality is the ceiling. On open QA tasks, Toolformer's performance is bounded by what the BM25 search engine returns. The model can't browse multiple results or reformulate a bad query. With a better retrieval backend this likely improves substantially—the paper acknowledges this as an open direction.

The calendar tool learns stale patterns. The model learns temporal awareness from CCNet, which has a fixed cutoff. The calendar API helps for date arithmetic (DATESET: 27.3 vs 0.8 for GPT-3), but on TEMPLAMA—temporal facts about named entities—the calendar is barely called (0.2% of examples). The model routes to QA and search instead, correctly recognizing that knowing today's date doesn't help you recall what team a footballer plays for.

False positives still slip through. Table 10 in the paper lists example API calls alongside their L_i⁻ - L_i⁺ scores, and some calls that clear the filtering threshold aren't semantically useful. A Wikipedia search for "Fast train success" returns an April Wine discography entry, which reduces perplexity on subsequent tokens that happen to mention "success" without adding real information. The authors argue this noise can be useful, since it keeps the model from blindly trusting every result, but it still means you're training on some garbage examples, with the hoped-for side effect that the model becomes resilient to noisy tool outputs.

When not to use this approach

If you have a well-defined task with known tool requirements, prompt engineering or task-specific finetuning is simpler. Toolformer's value is in general-purpose use: the model should pick the right tool from a menu without being told which one applies.

If your tools have non-trivial costs or side effects, be aware that the model doesn't account for API cost when deciding to call. Invoking another LLM, writing to a database, or hitting a paid third-party API on every generation step is a different problem from querying a calculator. Cost-aware tool selection isn't solved here.

If you're working below roughly 775M parameters, the capability doesn't reliably emerge. The scaling analysis is clear: smaller models produce similar output with and without tools. You'd be adding training complexity for no inference benefit.

If your reliability bar is "never wrong on math", tool use reduces errors dramatically but doesn't eliminate them. The model still makes mistakes on which tool to call and what arguments to pass. For mission-critical calculations, you want to validate the model's tool call arguments before executing, not trust the generated input.
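One way to do that validation for a calculator tool is to parse the generated argument and only evaluate a whitelist of arithmetic nodes. This is my own sketch, not something from the paper; `safe_eval` is a hypothetical helper built on Python's standard `ast` module.

```python
import ast
import operator

# Only these binary operators are allowed in a generated expression.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals; raise ValueError on anything else.

    Division by zero is not masked: ZeroDivisionError propagates to the caller.
    """

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"disallowed expression: {expr!r}")

    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid expression: {expr!r}") from exc
    return walk(tree)


print(round(safe_eval("400 / 1400") * 100, 1))  # 28.6
```

Anything that isn't plain arithmetic, such as a function call or an attribute access, is rejected before it runs. This checks that the argument is safe and well-formed; it can't tell you whether the model picked the right numbers.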

What this means for how I'm building

The pattern I've taken from Toolformer isn't about replicating their training pipeline—most teams aren't finetuning 6.7B models on CCNet. It's about the filtering criterion.

In Extremis, when an agent is deciding whether to retrieve from memory, I've started applying a version of the same check: does adding this retrieved context reduce uncertainty on the next step, or is it noise? The perplexity-based filter is just formalized intuition that should apply anywhere you're deciding whether to add latency for context.

The k hyperparameter also changed how I think about tool call thresholds. There's a real tradeoff between the model that almost always uses a tool when it could help (k=10) and the model that's more conservative but more calibrated (k=1). Neither is universally right. Your call budget, latency tolerance, and false-positive cost decide where on that curve you should be.

Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al., Meta AI Research, 2023. arXiv:2302.04761