How LLMs learn to use tools

Digging into Toolformer (Schick et al., 2023) while thinking about tool calling in production agents.

The demo looked great. The agent answered questions, ran code, pulled live data. Then someone asked it what 7% compound interest on $43,000 looks like over 8 years, and it confidently computed the wrong number three different ways without calling the calculator once.

This is the persistent embarrassment of LLMs doing math — or really, of any task where precision matters more than fluency. The model is excellent at sounding like it knows, which is exactly the wrong property when the answer needs to be correct.

Tool use is the fix. But there's a question underneath the fix that most frameworks skip: how does the model know when to call a tool, and when to just answer? That boundary is where things break in production.

The routing problem no one talks about

If you've shipped an agent that uses function calling or tool use, you've hit one of two failure modes.

Over-calling: the agent invokes a search or calculator for things it knows perfectly well, burning latency and cost, sometimes introducing lookup failures where there were none. Users notice when a simple question takes three seconds because the agent went out to fetch something it could have answered from weights.

Under-calling: the agent generates plausible-sounding answers when it should be querying a database or running computation. Silent wrong answers are worse than explicit failures — your users don't know to distrust them.

The standard fixes are either prompting ("always use the calculator for numbers") or deterministic routing rules (regex patterns, an intent classifier in front of the model). Both are brittle. Blanket prompt instructions leak into unrelated behaviors, and routing rules don't generalize to phrasing you didn't anticipate.

Toolformer (Schick et al., Meta AI, 2023) takes a different angle: instead of telling the model when to call tools, teach it to figure that out from data using only a handful of demonstrations per tool.

The bootstrapping pipeline

The core insight is elegant: a tool call is only worth making if the result helps predict what comes next. If you can measure that — and you can, using cross-entropy loss — you don't need humans to label "this is a good place to call the calculator."

The pipeline has three stages:

Stage 1: Candidate position sampling. The model reads through a large corpus of text and at each token position computes the probability it would start an API call there. Positions exceeding a threshold (τ_s = 0.05) become candidates. The paper caps this at the top 5 positions per document, which keeps the search tractable.
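A minimal sketch of that selection step, assuming you already have one API-start probability per token position. The function name, the argument names, and the example probabilities are mine, not the paper's; only the threshold (0.05) and the cap (5) come from the description above.

```python
from typing import List, Sequence


def select_candidate_positions(
    api_start_probs: Sequence[float],
    tau_s: float = 0.05,
    k: int = 5,
) -> List[int]:
    """Return up to k positions whose API-start probability exceeds tau_s."""
    # Keep only positions strictly above the sampling threshold.
    above = [i for i, p in enumerate(api_start_probs) if p > tau_s]
    # If more than k survive, keep the k most probable ones.
    top = sorted(above, key=lambda i: api_start_probs[i], reverse=True)[:k]
    # Return them in document order.
    return sorted(top)


# Made-up probabilities for a nine-token document.
probs = [0.01, 0.30, 0.04, 0.07, 0.50, 0.06, 0.20, 0.09, 0.02]
print(select_candidate_positions(probs))
```

Six positions clear the threshold here, so the cap drops the weakest of them (index 5).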

Stage 2: API call generation. For each candidate position, the model generates up to 5 candidate API calls using few-shot in-context examples — just 3-5 demonstrations of the API format is enough. This is the only point where human input enters: writing those initial demonstrations, not labeling every training example.
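To make "API call" concrete: each call ends up as a short text span spliced into the document, either with or without its result. The sketch below uses a bracketed `[Tool(args) -> result]` form, which is how I remember the paper writing calls out; treat the exact delimiters as an assumption, and the helper name as hypothetical.

```python
from typing import Optional


def linearize_call(tool: str, argument: str, result: Optional[str] = None) -> str:
    """Render an API call as inline text, with or without its result."""
    if result is None:
        # Call only: what the model emits before the tool has been executed.
        return f"[{tool}({argument})]"
    # Call plus result: what the text looks like once the tool has responded.
    return f"[{tool}({argument}) -> {result}]"


prefix = "Out of 1400 participants, 400 (or "
suffix = "29%) passed the test."

call_only = linearize_call("Calculator", "400 / 1400")
call_with_result = linearize_call("Calculator", "400 / 1400", "0.29")

print(prefix + call_only + " " + suffix)
print(prefix + call_with_result + " " + suffix)
```

The two variants matter in the next stage, where the filter compares the text with the result against the text without it.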

Stage 3: Loss-based filtering. Here's the mechanism that does the real work. For each generated call, compare two loss values:

L⁺: loss with the API call and its result both present
L⁻: loss without any API call (or with the call but no result — whichever is lower)

Keep the call only if L⁻ - L⁺ ≥ τ_f, where τ_f is the filtering threshold (1.0 here). In other words: only keep tool invocations where using the tool's output actually improved the model's ability to predict the tokens that followed. If the calculator result didn't help predict the rest of the sentence, the call gets filtered out.
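The filter itself is only a few lines once the three loss values exist. Here is a sketch, assuming the losses have already been computed elsewhere; the function and argument names are illustrative, and the example loss values are invented.

```python
def keep_api_call(
    loss_no_call: float,
    loss_call_no_result: float,
    loss_call_with_result: float,
    tau_f: float = 1.0,
) -> bool:
    """Keep a call only if its result lowers the loss by at least tau_f."""
    # L-minus: the better (lower) of "no call at all" and "call without result".
    l_minus = min(loss_no_call, loss_call_no_result)
    # L-plus: loss when the call and its result are both in the context.
    l_plus = loss_call_with_result
    return (l_minus - l_plus) >= tau_f


# The result cuts the loss from 3.2 to 1.9: a gain of 1.3, so the call is kept.
print(keep_api_call(3.2, 3.4, 1.9))
# Merely writing the call already gets the loss to 2.5, so the result adds only 0.6.
print(keep_api_call(3.2, 2.5, 1.9))
```

Taking the minimum for L⁻ is what stops a call from getting credit when the call text alone, without any result, would have helped just as much.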

The loss is weighted to emphasize nearby tokens — w̃_t = max(0, 1 - 0.2·t) where t is token distance from the call — because tool results should primarily help with what comes immediately after them.
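A sketch of that weighting, assuming t starts at 0 for the first token after the call and that the raw weights are normalized to sum to 1 before use (the normalization is my reading of the paper, not something stated above). The function names are mine.

```python
from typing import List, Sequence


def raw_weight(t: int) -> float:
    """Unnormalized weight for the token at distance t from the call."""
    return max(0.0, 1.0 - 0.2 * t)


def weighted_loss(token_logprobs: Sequence[float]) -> float:
    """Weighted negative log-likelihood of the tokens following a call."""
    if not token_logprobs:
        raise ValueError("need at least one token after the call position")
    raw: List[float] = [raw_weight(t) for t in range(len(token_logprobs))]
    total = sum(raw)  # always > 0 because raw_weight(0) == 1.0
    return -sum((w / total) * lp for w, lp in zip(raw, token_logprobs))


# Weights fall linearly and reach zero at distance 5.
print([round(raw_weight(t), 1) for t in range(7)])
```

With this schedule only the first five tokens after the call contribute to the loss; anything further away is ignored entirely.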

Fine-tune on the filtered dataset and you have a model that calls tools where they help.

The numbers that surprised me

The filtering is aggressive. Processing over a million documents yields:

60,974 Wikipedia search examples
20,587 calendar examples
18,526 QA examples
994 calculator examples
1,034 machine translation examples

That's not a typo. A million documents and you get fewer than a thousand usable calculator training examples, because precise arithmetic doesn't appear that often in naturally-occurring text. The model learned to use a calculator reliably from under a thousand examples, which tells you something about how efficiently the loss-based filter concentrates useful signal.

The benchmark results are worth sitting with. On math datasets:

Dataset	GPT-3 (175B)	Toolformer (6.7B)
ASDiv	14.0%	40.4%
SVAMP	10.0%	29.4%
MAWPS	19.8%	44.0%

A 6.7B-parameter model outperforming GPT-3 by roughly 2-3x on math, with calculator use learned from under a thousand training examples. On temporal reasoning (Dateset, a benchmark of questions about dates and durations), Toolformer hits 27.3% where GPT-3 manages 0.8%.
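For the record, here is the arithmetic behind the "2-3x" figure, computed directly from the table above. Nothing here is new data; the variable names are just for illustration.

```python
# (GPT-3 175B accuracy, Toolformer 6.7B accuracy), in percent, from the table.
scores = {
    "ASDiv": (14.0, 40.4),
    "SVAMP": (10.0, 29.4),
    "MAWPS": (19.8, 44.0),
}

# Ratio of Toolformer accuracy to GPT-3 accuracy per dataset.
ratios = {name: toolformer / gpt3 for name, (gpt3, toolformer) in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios land between about 2.2x and 2.9x, so "2-3x" is a fair summary.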

The QA results are more nuanced: GPT-3 (175B) still leads on WebQuestions (29.0% vs 26.3%) and NaturalQuestions (22.6% vs 17.7%). On those question distributions, the knowledge stored in a much larger model's weights beats what the smaller model retrieves through its tools.

Production tradeoffs

What you gain: The loss-based filtering gives you something valuable — tool calls that are actually load-bearing. The model calls a tool when the output matters to what it's about to say, not just when a rule fires. This produces more coherent behavior than deterministic routing because the tool call is causally connected to the response.

What you trade away:

No chained tool use. API calls are sampled independently, so you can't pipe the output of a search into a calculator, or use a fact lookup to inform a translation. Each call is atomic. This is the hardest limitation for complex agents — the patterns that matter in production (retrieve → reason → compute → respond) require chaining that Toolformer doesn't support.

Sample inefficiency at the data generation step. Getting to a useful calculator dataset from scratch requires processing enormous amounts of text to find the positions where arithmetic actually happens. If your domain has sparse tool-relevant content (highly specialized medical calculations, financial derivatives pricing), generating enough signal to fine-tune on may require domain-specific corpora, not a general web crawl.

Prompt sensitivity. The few-shot demonstrations used to generate candidate API calls have significant influence on what kinds of calls get proposed. Small changes in how you write the demonstration examples (different phrasing, different argument formats) propagate to the training data and then to model behavior. This matters if you're extending to new tools — spend time on those demonstrations.

Cost-agnostic selection. The filtering criterion is pure loss: does this call help predict what follows? It has no concept of latency budget, API pricing, or retry risk. A model trained this way will happily invoke an expensive external API in situations where a local lookup would have been sufficient.

What this tells you about modern tool calling

When you use function calling in GPT-4 or tool use in Claude, the underlying problem is the same one Toolformer addresses: the model needs to know when to invoke external capabilities vs. when to generate from weights. The paper applied the approach to GPT-J, a 6.7B-parameter model; frontier models have presumably internalized a version of this through much larger training pipelines with much richer tool-use data.

But the failure modes are structurally identical. Over-calling and under-calling are still the dominant bugs in production tool-using agents. The loss-based framing is useful for thinking about why: a tool call should only happen when the result changes what you'd say next. When it doesn't — when the agent is calling search to look up something it already knows — that's the over-calling failure mode, and it's the model not having learned the precise boundary.

For teams building agents today: this is why few-shot examples in your tool definitions matter so much. The demonstrations you provide are effectively doing what the Toolformer pipeline's in-context examples do — setting the prior for when calls are appropriate. Vague descriptions produce vague call patterns. Specific examples of when not to call (the model should answer directly if it already has the information) are as valuable as examples of when to call.
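As an illustration, here is what a tool definition with both positive and negative examples might look like. The schema is hypothetical: the `examples` field and the `render_tool_guidance` helper are my own invention for this sketch, not part of any vendor's function-calling API.

```python
calculator_tool = {
    "name": "calculator",
    "description": (
        "Evaluate an arithmetic expression exactly. "
        "Call it for multi-step or high-precision arithmetic. "
        "Do not call it for numbers already stated in the conversation "
        "or for trivial sums."
    ),
    # Hypothetical field: one example where a call is right, one where it is not.
    "examples": [
        {
            "user": "What does $43,000 grow to at 7% compounded over 8 years?",
            "call": {"expression": "43000 * 1.07 ** 8"},
        },
        {"user": "What is 2 + 2?", "call": None},
    ],
}


def render_tool_guidance(tool: dict) -> str:
    """Flatten a tool definition into prompt text, negative examples included."""
    lines = [f"Tool: {tool['name']}", tool["description"], "Examples:"]
    for example in tool["examples"]:
        if example["call"] is None:
            action = "answer directly"
        else:
            action = f"call {tool['name']}({example['call']['expression']})"
        lines.append(f"- {example['user']} -> {action}")
    return "\n".join(lines)


guidance = render_tool_guidance(calculator_tool)
print(guidance)
```

The point of the second example is the `None`: it tells the model where the boundary is, which a description alone tends to leave fuzzy.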

When not to use this

If your agent's tool use is dominated by a small number of well-defined patterns — "always look up account balance from the database before responding to billing questions" — deterministic routing is simpler, faster, and more auditable. Learned routing adds complexity without benefit when the rules are clear.
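For comparison, a deterministic router for that billing case can be this small. The pattern, the tool name `lookup_account_balance`, and the function name are all hypothetical placeholders.

```python
import re
from typing import Optional

# Hypothetical rule: anything that looks like a billing question gets a lookup.
BILLING_PATTERN = re.compile(
    r"\b(bill|billing|invoice|balance|charge[sd]?)\b", re.IGNORECASE
)


def route(query: str) -> Optional[str]:
    """Return the tool to call before answering, or None to answer directly."""
    if BILLING_PATTERN.search(query):
        return "lookup_account_balance"
    return None


print(route("Why was I charged twice this month?"))
print(route("What is the capital of France?"))
```

It is easy to audit and costs nothing at runtime. It also fires on "work-life balance", which is the brittleness trade-off in miniature.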

If you need tool chaining — feeding one tool's output to another — the Toolformer approach needs significant extension. Modern agent frameworks handle this at the orchestration layer rather than in the base model, and that's probably the right architectural choice until the underlying self-supervised methods are extended to multi-step sequences.

If you're in a high-stakes domain where a bad tool call (wrong API invoked, wrong arguments) causes user-visible damage rather than just a suboptimal response, you want deterministic constraints on when tools can be invoked, not a learned soft boundary. The loss criterion is a heuristic, not a guarantee.

Implementation notes

Threshold tuning matters more than it looks. The filtering threshold τ_f = 1.0 is a significant design choice: it sets how much a tool result must reduce the loss before the call is kept as a training example. A higher threshold means fewer, more confident tool calls; a lower one gives you more calls with more noise. The value that worked in the paper is a starting point, not necessarily the optimum for your corpus or your production system.
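One cheap way to get a feel for the threshold is to sweep it over the loss gains of your candidate calls and watch how many survive. A sketch with invented gain values (each gain is L⁻ - L⁺ for one candidate call); the function name is mine.

```python
from typing import Sequence


def count_kept(gains: Sequence[float], tau_f: float) -> int:
    """Number of candidate calls whose loss gain meets the threshold."""
    return sum(1 for gain in gains if gain >= tau_f)


# Made-up loss gains for seven candidate calls.
gains = [0.1, 0.4, 0.6, 0.9, 1.0, 1.3, 2.5]

for tau in (0.5, 1.0, 2.0):
    print(f"tau_f={tau}: kept {count_kept(gains, tau)} of {len(gains)}")
```

On real data the interesting question is what the calls between two thresholds look like: reading a sample of the ones you would gain or lose is usually more informative than the count.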

The demonstration examples are your lever. You're writing 3-5 examples per tool to bootstrap the candidate generation. These examples set the vocabulary for how the model thinks about calling that tool. Inconsistency in argument format between demonstrations will surface as inconsistency in production calls — the model won't pick a canonical format, it'll interpolate between yours.

Monitor call distribution, not just accuracy. Because Toolformer learns from its own filtered output, systematic gaps in the training data (sparse calculator examples, for instance) can create blind spots. Track what fraction of tool-appropriate queries actually trigger a call, not just whether the calls that do fire are correct.
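A sketch of those two metrics, assuming you have logs where each query carries a label for whether a tool call was appropriate (from human review or a heuristic; getting that label is the hard part and is not shown). The record type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class LoggedQuery:
    tool_appropriate: bool  # assumed label: should a tool have been called?
    tool_called: bool       # observed: did the agent call one?


def _rate(flags: List[bool]) -> Optional[float]:
    """Fraction of True values, or None when there is nothing to measure."""
    return sum(flags) / len(flags) if flags else None


def call_coverage(logs: Sequence[LoggedQuery]) -> Optional[float]:
    """Share of tool-appropriate queries that triggered a call (under-calling check)."""
    return _rate([q.tool_called for q in logs if q.tool_appropriate])


def over_call_rate(logs: Sequence[LoggedQuery]) -> Optional[float]:
    """Share of queries needing no tool that triggered one anyway."""
    return _rate([q.tool_called for q in logs if not q.tool_appropriate])


logs = [
    LoggedQuery(True, True), LoggedQuery(True, True),
    LoggedQuery(True, True), LoggedQuery(True, False),
    LoggedQuery(False, True), LoggedQuery(False, False),
    LoggedQuery(False, False), LoggedQuery(False, False),
]
print(call_coverage(logs), over_call_rate(logs))
```

Breaking both numbers down per tool is what surfaces blind spots like a calculator that almost never fires.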

What's next

Toolformer demonstrates that tool-use behavior can be learned from a handful of demonstrations using loss-based selection — no per-example annotation, no reinforcement learning signal, no human labels at decision time. That's a meaningful result.

The open problem is tool chaining: the real work agents do requires composing tools, and the loss-based approach as described doesn't extend naturally to multi-step sequences. ReAct, chain-of-thought with tools, and more recent agent frameworks are all working on this, but it's still an area where the theoretical grounding is weaker than the empirical results.

I'm thinking about this in the context of Multica's tool dispatch layer — specifically whether a lighter version of the Toolformer filtering idea (did this call change the response?) is useful as a runtime signal for identifying unnecessary tool invocations, without the full training pipeline.
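The crudest version of that runtime signal would be to generate the response with and without the tool result and check whether they differ. The sketch below uses character-level similarity from Python's `difflib` as a stand-in for "changed"; the function name and the 0.2 cutoff are arbitrary assumptions, and a real check would probably need something semantic rather than textual.

```python
from difflib import SequenceMatcher


def call_changed_response(
    response_with_tool: str,
    response_without_tool: str,
    min_change: float = 0.2,
) -> bool:
    """Flag a tool call as useful if it changed the response enough."""
    # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
    similarity = SequenceMatcher(
        None, response_with_tool, response_without_tool
    ).ratio()
    return (1.0 - similarity) >= min_change


# Same answer either way: the call was probably unnecessary.
print(call_changed_response("Paris is the capital.", "Paris is the capital."))
# Different answer: the tool result mattered.
print(call_changed_response("abc", "xyz"))
```

This doubles generation cost, so it only makes sense as a sampled offline audit, not on every request.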

References:

Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761