5 May 2026
5 min read
AI Integration · LLM Agents · Cost Optimisation
Most of Your Agent's LLM Calls Don't Need a Frontier Model
New benchmark research shows that small open-weight models handle the majority of routine agent calls competently. The implication for engineering leaders: a routing strategy can cut inference spend dramatically without degrading user-facing quality.
A production agent rarely makes one model call per user request. It makes dozens. A planner decomposes the task, a router picks a tool, a formatter shapes the arguments, a validator checks the output, a summariser composes the response. Most teams point all of those calls at the same frontier model — typically the most expensive one their procurement team has approved — and absorb the cost as a fixed line item.
The AgentFloor benchmark, published last week (arXiv:2605.00334), provides the first deterministic evidence that this default is wasteful. The authors built a 30-task, six-tier capability ladder spanning instruction following, tool use, multi-step coordination, and long-horizon planning. They then measured how far up the ladder small open-weight models can climb before quality collapses. The headline finding: the lower three tiers — which account for the bulk of calls in real agent workflows — are reliably handled by models orders of magnitude cheaper than the frontier.
This matters because agent inference cost has become a significant slice of cloud spend at any company running production copilots. And it compounds: a poorly designed agent loop can balloon a single user query into 40+ model calls, each billed at frontier rates.
What the research actually shows
Three findings deserve attention from engineering leaders.
First, capability is not uniform across tiers. Small open-weight models perform near-parity with frontier models on instruction following, structured tool calling, and short-horizon coordination. They degrade sharply on long-horizon planning under persistent constraints — the top of the ladder. The gap is real, but it is concentrated in a narrow band of tasks.
Second, most production agent calls live in the bottom three tiers. AgentFloor's task taxonomy aligns with what we see in client codebases: the majority of model calls in a working agent are routing decisions, JSON formatting, tool argument construction, and short summarisation. These are the calls a 7B–13B parameter model handles competently.
Third, the cost differential is not marginal. Endpoint-level benchmarks like TokenArena, released the same week, show workload-blended price differences of 20–50× between frontier and well-served small-model endpoints. Even allowing for routing overhead and occasional escalation, the savings are material at any non-trivial scale.
The corollary is uncomfortable: if your agent uses one model for everything, you are almost certainly overpaying — and you may also be slower than you need to be, since smaller models typically have lower time-to-first-token latency.
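To see how the arithmetic plays out, here is a rough sketch. The per-call prices, the request volume, and the 80/15/5 traffic split are illustrative assumptions rather than figures from either benchmark; the 40-call loop is the worst case described above. Substitute the numbers from your own telemetry.

```python
# Illustrative blended-cost estimate for a tiered routing policy.
# Prices, request volume, and traffic split are assumptions for the sake of example.

FRONTIER_COST_PER_CALL = 0.020    # assumed average frontier call cost (USD)
MID_COST_PER_CALL = 0.002         # assumed mid-tier model cost
SMALL_COST_PER_CALL = 0.0005      # assumed small open-weight endpoint cost

CALLS_PER_REQUEST = 40            # the worst-case agent loop described above
REQUESTS_PER_DAY = 100_000        # hypothetical volume

def daily_cost(split):
    """Blended daily cost for a given fraction of calls routed to each tier."""
    per_call = (
        split["small"] * SMALL_COST_PER_CALL
        + split["mid"] * MID_COST_PER_CALL
        + split["frontier"] * FRONTIER_COST_PER_CALL
    )
    return per_call * CALLS_PER_REQUEST * REQUESTS_PER_DAY

baseline = daily_cost({"small": 0.0, "mid": 0.0, "frontier": 1.0})
routed = daily_cost({"small": 0.80, "mid": 0.15, "frontier": 0.05})
print(f"baseline ${baseline:,.0f}/day -> routed ${routed:,.0f}/day "
      f"({baseline / routed:.0f}x cheaper)")
```

With these assumed numbers the routed blend comes out roughly an order of magnitude cheaper than the all-frontier baseline, before any tuning to your actual traffic.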
What this means in practice
The action is not "swap your frontier model for a small one." That trade fails on the long-horizon tasks where capability genuinely matters. The action is to introduce a routing layer that sends each call to the cheapest model that can handle it reliably, and escalates when it cannot.
Three concrete steps for this week.
Instrument before you optimise. Most teams cannot answer the question "what fraction of our agent calls are tier-1 routing versus tier-5 planning?" Add structured logging to every model call: task type, prompt token count, output token count, latency, and a coarse capability tag (routing, formatting, tool-arg, summarisation, planning). Two weeks of this data will tell you where the spend actually goes. In our experience, the distribution is heavily skewed toward routine calls — which is exactly where small models perform best.
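A minimal sketch of that instrumentation, assuming a generic call_model client already exists in your codebase; the field names and the tag taxonomy below are illustrative, not a standard.

```python
import json
import time
from dataclasses import dataclass, asdict

# Coarse capability tags; the names are illustrative, not a fixed taxonomy.
TAGS = {"routing", "formatting", "tool-arg", "summarisation", "planning"}

@dataclass
class ModelCallRecord:
    tag: str
    model: str
    prompt_tokens: int
    output_tokens: int
    latency_ms: float

def logged_call(call_model, model: str, prompt: str, tag: str) -> str:
    """Wrap a model call and emit one structured log line per call."""
    assert tag in TAGS, f"unknown capability tag: {tag}"
    start = time.monotonic()
    output = call_model(model=model, prompt=prompt)   # your existing client
    record = ModelCallRecord(
        tag=tag,
        model=model,
        prompt_tokens=len(prompt) // 4,   # rough estimate; swap in a real tokeniser
        output_tokens=len(output) // 4,
        latency_ms=(time.monotonic() - start) * 1000,
    )
    print(json.dumps(asdict(record)))     # or ship to your logging pipeline
    return output
```

Two weeks of these records, grouped by tag, is enough to show where the spend and the latency actually sit.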
Build a tiered routing policy with a fallback contract. Define explicit tiers in your agent framework. Tier 1 (routing, classification, JSON shaping) goes to a small open-weight model on a fast endpoint. Tier 2 (multi-step tool coordination, retrieval re-ranking) goes to a mid-tier model. Tier 3 (long-horizon planning, ambiguous reasoning) goes to a frontier model. Critically, define a fallback contract: if the small model returns invalid JSON, fails schema validation, or expresses low confidence, the call escalates automatically. Measure escalation rates per tier — they should sit in the low single-digit percentages if your tiering is calibrated.
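A sketch of what that contract can look like in code. The model names, the validate callback, and the two-entry ladders are placeholders for whatever your framework actually provides; the point is the shape of the escalation path, not a specific API.

```python
import json
from typing import Callable

# Hypothetical tier ladders: cheapest plausible model first, escalation target after.
TIER_MODELS = {
    1: ["small-open-weight", "frontier"],   # routing, classification, JSON shaping
    2: ["mid-tier", "frontier"],            # tool coordination, re-ranking
    3: ["frontier"],                        # long-horizon planning
}

def routed_call(
    call_model: Callable[..., str],         # your existing client
    validate: Callable[[dict], bool],       # your schema check for this call type
    prompt: str,
    tier: int,
) -> tuple[str, dict]:
    """Try each model in the tier's ladder; escalate on any contract violation."""
    last_error = None
    for model in TIER_MODELS[tier]:
        raw = call_model(model=model, prompt=prompt)
        try:
            parsed = json.loads(raw)        # contract 1: output must be valid JSON
        except json.JSONDecodeError as exc:
            last_error = f"{model}: invalid JSON ({exc})"
            continue                        # escalate to the next model in the ladder
        if not validate(parsed):            # contract 2: must pass schema validation
            last_error = f"{model}: failed schema validation"
            continue
        return model, parsed                # record which model served the call
    raise RuntimeError(f"all models in tier {tier} failed; last error: {last_error}")
```

Counting how often the loop gets past the first entry in a ladder gives you the per-tier escalation rate to watch.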
Benchmark on your tasks, not the leaderboard. Public benchmarks are useful directionally but they do not reflect your agent's actual prompt distribution. Take 200 real production prompts per tier, run them through three candidate models each, and grade the outputs against your own quality bar. Quality on your traffic is the only number that should drive routing decisions. AgentFloor's deterministic structure is a good template — fixed inputs, fixed grading rubric, repeatable runs.
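A skeleton for that kind of in-house evaluation, assuming you already have a grade function that encodes your quality bar; everything else below is an illustrative scaffold rather than a prescribed harness.

```python
import random
from collections import defaultdict

def evaluate_candidates(prompts_by_tier, candidate_models, call_model, grade,
                        sample_size=200, seed=7):
    """Run a fixed sample of real prompts through each candidate and report pass rates."""
    rng = random.Random(seed)               # fixed seed keeps runs repeatable
    results = defaultdict(dict)
    for tier, prompts in prompts_by_tier.items():
        sample = rng.sample(prompts, min(sample_size, len(prompts)))
        for model in candidate_models:
            passed = sum(
                1 for p in sample
                if grade(p, call_model(model=model, prompt=p))   # your own rubric
            )
            results[tier][model] = passed / len(sample)
    return results

# Hypothetical usage:
# scores = evaluate_candidates(production_prompts, ["small-7b", "mid-tier", "frontier"],
#                              call_model, grade)
# Route each tier to the cheapest model whose pass rate clears your quality bar.
```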
Two pitfalls worth flagging
Do not route based on prompt length alone. Length correlates weakly with task difficulty. A 200-token prompt asking for a multi-hop plan over five tools is harder than a 4,000-token prompt asking for a JSON reformat. Route on task type, established up-front by the agent framework.
Do not under-invest in the escalation path. The savings come from confidently sending most calls to small models. The quality comes from confidently catching the cases where they fail and escalating cleanly. A routing layer without a robust escalation contract is a regression waiting to happen — and it will surface in production as subtle quality drift rather than visible errors, which is the worst kind of failure to debug.
There is also a procurement dimension. Many enterprise contracts with frontier model providers include volume commitments. A routing layer that materially reduces frontier traffic may interact with those commitments. Worth a conversation with your finance partner before you ship the change, not after.
A note on momentum
The research direction here is consistent. The Tool-Use Tax paper (arXiv:2605.00136) showed that adding tools to a frontier model can degrade performance under semantic distractors. AgentFloor shows that removing the frontier model for routine calls often does not. Both findings point the same way: the era of treating LLM agents as a single monolithic call to the biggest available model is ending. Production-grade agents look more like distributed systems with heterogeneous compute, careful contracts between components, and instrumented escalation paths. They look, in other words, like every other piece of well-engineered backend software.
Anystack's AI integration practice helps engineering teams design and ship tiered routing layers, capture the right telemetry to drive routing decisions, and build the escalation contracts that keep quality stable as cost falls. Where the savings flow through to broader infrastructure spend, our cloud cost optimisation work ties model-level decisions to overall unit economics. The pattern is repeatable, and the data to justify the change is usually sitting in your logs already.
