4 May 2026 · 5 min read

AI Engineering · LLM Agents · Tool Use

The Tool-Use Tax: When Adding Tools Makes Your LLM Agents Worse

New research shows tool-augmented LLM agents can underperform plain chain-of-thought in the presence of distractors, with the cost paid in latency, tokens, and accuracy. Here is what engineering leaders should measure before scaling agentic systems.

Anystack Engineering

The prevailing assumption in agentic AI is straightforward: give the model more tools, and it will reason more reliably. A new paper, Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents, challenges that assumption with a Factorized Intervention Framework that separates three distinct effects: prompt-formatting overhead, tool-calling protocol overhead, and the actual gain from executing tools. The headline finding is uncomfortable for anyone who has been pitching agent platforms to their board. In the presence of semantic distractors, tool-augmented reasoning does not necessarily outperform native chain-of-thought, and sometimes underperforms it.
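
The framework is easiest to grasp as three experiment arms run over the same questions. Below is a minimal sketch of how you might reproduce the decomposition against your own stack, assuming an OpenAI-style chat client; TOOL_SCHEMAS, run_full_agent, and score are placeholders for your own tool registry, your existing agent, and your accuracy metric, not anything from the paper.

```python
import json
from openai import OpenAI

# Hypothetical reconstruction of the three intervention arms, not the paper's code.
# Placeholders you must supply: TOOL_SCHEMAS (your tool definitions),
# run_full_agent (your existing tool-augmented agent), score (your accuracy metric).
client = OpenAI()

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def arm_native_cot(question: str) -> str:
    """Arm 1: plain chain-of-thought, no tools mentioned anywhere."""
    return ask("Answer the question. Reason step by step.", question)

def arm_schema_only(question: str) -> str:
    """Arm 2: tool schemas described in the prompt but never callable.
    The gap to arm 1 isolates the prompt-formatting overhead."""
    system = ("Answer the question. Reason step by step.\n\n"
              "Available tools:\n" + json.dumps(TOOL_SCHEMAS, indent=2))
    return ask(system, question)

def arm_full_tools(question: str) -> str:
    """Arm 3: the real agent, schemas plus the tool-call/observation loop.
    The gap to arm 2 is protocol overhead plus execution gain."""
    return run_full_agent(question)

def decompose(question: str, gold: str) -> dict:
    a1 = score(arm_native_cot(question), gold)
    a2 = score(arm_schema_only(question), gold)
    a3 = score(arm_full_tools(question), gold)
    return {"formatting_tax": a1 - a2,          # what merely describing tools costs
            "protocol_and_execution": a3 - a2,  # what actually calling them buys
            "net_vs_native_cot": a3 - a1}       # the number that should justify the agent
```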

This matters because most enterprise agentic deployments today look exactly like the failure case the paper describes: a single LLM wrapped in a dozen or more tools, prompted with verbose JSON schemas, operating over messy real-world inputs that contain plenty of irrelevant context. If you have a team building copilots, internal agents, or customer-facing assistants, you are almost certainly paying the tool-use tax without measuring it.

What the research actually found

The authors decomposed agent performance into measurable components rather than treating tool-augmented reasoning as a single black box. Three findings stand out for engineering leaders.

First, the prompt-formatting cost alone is non-trivial. Simply describing tools in the system prompt — before any tool is ever called — degrades reasoning on tasks the model could otherwise solve natively. The model spends attention budget parsing schemas it does not need. This is consistent with what teams running production agents have been reporting anecdotally: adding a tool to the registry slightly worsens performance on unrelated tasks.
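
One way to put a number on that formatting cost is simply to count the tokens your tool descriptions add to every request before any call happens. A sketch assuming tiktoken and OpenAI-style function schemas; the search_tickets tool is purely illustrative.

```python
import json
import tiktoken

# Replace with your real registry; one illustrative tool shown.
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "search_tickets",
        "description": "Search the ticketing system by keyword and date range.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"},
                           "since": {"type": "string", "format": "date"}},
            "required": ["query"],
        },
    },
}]

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family tokenizer
schema_tokens = len(enc.encode(json.dumps(TOOL_SCHEMAS, indent=2)))
print(f"Tool descriptions add ~{schema_tokens} tokens to every request, "
      f"before a tool is ever called.")
```

Multiply that by your request volume and the per-token price of your frontier tier, and the formatting tax stops looking free even before any accuracy effect.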

Second, the tool-calling protocol itself imposes overhead distinct from the actual tool execution. Routing a request through a tool-call/observation loop changes how the model reasons, and not always for the better. Even when the tool returns the correct answer, the surrounding scaffolding can introduce errors at the parse, plan, or synthesis steps.

Third, the net gain from executing tools is highly task-dependent and shrinks dramatically when inputs contain distractors. On clean, well-scoped queries, tools help. On realistic inputs with ambiguity, partial information, or irrelevant context, the gain often goes negative. The model is better off reasoning from its parametric knowledge than navigating a tool surface that amplifies its uncertainty.

A related paper published the same week, AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?, reinforces the point from a different angle. Production agentic systems make many model calls per request, most of them short and structured. The authors show that smaller models handle the lower rungs of the capability ladder competently, and that only a fraction of calls genuinely need frontier intelligence. Routing every step to GPT-class models is, in most agent workflows, paying for capability you do not use.

Why this contradicts the dominant vendor narrative

Agent framework vendors have a strong incentive to encourage tool proliferation. More tools means more integrations, more lock-in, more reason to pay for the platform. The research suggests the opposite design principle: minimise the tool surface, route aggressively to smaller models for routine steps, and reserve frontier models for the steps that genuinely require reasoning over ambiguity.

This is not an argument against agents. It is an argument against measuring agent quality solely by capability and ignoring the cost structure. Most enterprise teams cannot tell you, for a given workflow, what fraction of latency is prompt formatting overhead, what fraction is tool protocol overhead, and what fraction is actual useful work. Without that decomposition, optimisation is guesswork.

Three actions for this week

  • Audit your tool registry per agent. For each production agent, list the tools currently exposed to it and ask: when was each tool last actually called by this agent on real traffic? Tools that are present but rarely invoked are paying the formatting tax on every request without contributing to outcomes. Cut anything below a usage threshold (5% of requests is a reasonable starting point) or move it behind a router that only exposes it conditionally (see the audit sketch after this list).
  • Run an A/B between tool-augmented and native CoT on your real traffic. Take a sample of last week's agent requests. Replay each through two configurations: your current tool-augmented agent, and a stripped-down native chain-of-thought variant with no tools. Score outputs on accuracy and measure tokens and latency. The paper predicts a non-trivial fraction of your traffic will perform better, faster, and cheaper without tools. Identify which task categories those are and route them differently.
  • Decompose agent latency into the three components. Instrument your agent runtime to separately log time spent on (1) prompt construction including tool descriptions, (2) tool-call parsing and protocol, and (3) tool execution itself. Most teams only measure end-to-end latency. Without the breakdown, you cannot tell whether to invest in prompt compression, protocol simplification, or tool optimisation (an instrumentation sketch also follows this list).
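
A sketch of the registry audit from the first item, assuming your runtime writes one JSON line per request with a tools_called list; the log format and the 5% threshold are illustrative.

```python
import json
from collections import Counter

USAGE_THRESHOLD = 0.05  # tools invoked on fewer than 5% of requests are candidates to cut

def audit_tool_registry(log_path: str, registered_tools: list[str]) -> None:
    """Flag tools that pay the formatting tax on every request but rarely do any work."""
    calls, total = Counter(), 0
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            calls.update(set(record.get("tools_called", [])))
    for tool in sorted(registered_tools, key=lambda t: calls[t]):
        rate = calls[tool] / total if total else 0.0
        verdict = "keep" if rate >= USAGE_THRESHOLD else "cut, or gate behind a conditional router"
        print(f"{tool:30s} {rate:6.1%}  {verdict}")
```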
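
And a minimal instrumentation sketch for the third item, assuming you control the agent loop; build_messages, call_model_and_parse, execute_tools, TOOL_SCHEMAS, and log_metrics stand in for whatever your runtime actually does.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

@contextmanager
def timed(breakdown: dict, stage: str):
    """Accumulate wall-clock milliseconds for one named stage into `breakdown`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        breakdown[stage] += (time.perf_counter() - start) * 1000

def handle_request(query: str):
    breakdown = defaultdict(float)
    with timed(breakdown, "prompt_construction"):  # (1) building the prompt, incl. tool schemas
        messages = build_messages(query, TOOL_SCHEMAS)
    with timed(breakdown, "tool_protocol"):        # (2) model call plus parsing tool-call arguments
        plan = call_model_and_parse(messages)
    with timed(breakdown, "tool_execution"):       # (3) the tools doing the actual work
        result = execute_tools(plan)
    log_metrics(stage_ms=dict(breakdown))          # ship the breakdown to your observability stack
    return result
```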

The cost-quality frontier most teams are missing

The AgentFloor work suggests a related action that pairs well with the audit: model routing. If 70% of your agent calls are short, structured, and routine, sending them to a frontier model is wasted spend. A tiered architecture — small open-weight model for instruction-following and tool-use steps, frontier model only for planning and synthesis — typically cuts inference cost by 60-80% with minimal quality loss, provided you measure the loss rather than assume it.
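
A minimal sketch of that tiered routing, assuming an OpenAI-compatible client in front of both tiers; the step taxonomy and model names are illustrative, not a recommendation for specific providers.

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible gateway fronting both tiers

SMALL_MODEL = "llama-3.1-8b-instruct"   # placeholder small open-weight tier
FRONTIER_MODEL = "gpt-4o"               # placeholder frontier tier

# Routine, well-scoped steps stay on the small tier; only steps that reason
# over ambiguity are escalated.
ROUTINE_STEPS = {"classify_intent", "extract_fields", "format_tool_call",
                 "summarise_observation"}
FRONTIER_STEPS = {"plan", "resolve_ambiguity", "synthesise_answer"}

def pick_model(step_type: str) -> str:
    if step_type in ROUTINE_STEPS:
        return SMALL_MODEL
    if step_type in FRONTIER_STEPS:
        return FRONTIER_MODEL
    return FRONTIER_MODEL  # unmeasured step types default to the expensive, safe tier

def run_step(step_type: str, messages: list[dict]) -> str:
    resp = client.chat.completions.create(model=pick_model(step_type), messages=messages)
    return resp.choices[0].message.content
```

The routing table is the easy part; the discipline is measuring the quality loss per step type before a step is allowed to move down a tier.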

This is the same discipline engineering leaders already apply to cloud workloads: not every workload deserves the most expensive instance type. Treat your model fleet the same way. The teams that get this right are the ones that will be able to scale agent deployments to the volumes the business is asking for, without the unit economics breaking.

There is also a quality dimension. Smaller models hallucinate less on narrow, well-scoped tasks because they have less surface area to drift across. A 7B model asked to extract three fields from a structured document is often more reliable than a 400B model asked to do the same, because the larger model brings more background knowledge that can interfere with a narrow extraction task. The engineering decision is empirical, not aspirational.


At Anystack, our AI integration practice helps teams instrument agentic systems to expose exactly these cost-quality trade-offs, and our platform reliability work extends the same observability discipline that engineering leaders already apply to backend services into the LLM runtime layer. The teams that will win the next two years of agentic AI are not the ones with the most tools or the largest models. They are the ones who can answer, with data, why each tool exists and which model tier each step belongs on.

Free engineering audit

Want a structured assessment of where this applies to your stack? Our 30-minute tech audit is free.

Request audit →

Start a conversation

Facing a version of this in your organisation? We scope engagements in a single call.

Book a 30-min call →

See the evidence

Read how we've delivered these outcomes for clients in fintech, healthcare, and telecom.

Browse case studies →