25 May 2026


AI Engineering, Cost Optimisation, Observability

Why Per-Invocation Cost Metrics Lie About Your AI Agents

New research argues that measuring AI agent cost per LLM call hides the true price of goal completion — retries, tool calls, and recovery cycles. Here's how engineering leaders should rethink agent accounting.

Anystack Engineering

Most engineering organisations track AI cost the way they track database queries: per invocation. Tokens in, tokens out, dollars per million. That worked when LLMs were called once per user request. It does not work for agents.

A new paper, "Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems", whose authors propose A-LEMS (Agentic LLM Energy Measurement System), makes the case bluntly: when a single user goal triggers multi-step orchestration, tool calls, retries, and failure-recovery cycles, the invocation count is an implementation artifact rather than a property of the task. Measuring energy, cost, or latency at the invocation level systematically misrepresents what it actually takes to complete work.

The paper is framed around energy, but the argument generalises. Any per-call metric — cost, tokens, latency, carbon — becomes misleading the moment your system retries, branches, or recovers. And almost every production agent does all three.


What the research actually shows

Three findings stand out for anyone running agents in production.

First, invocation counts are not stable across runs of the same task. The same user goal, executed twice, can produce dramatically different numbers of LLM calls depending on whether tools succeed on the first try, whether the planner backtracks, or whether the orchestrator hits a transient rate limit and retries. Reporting average cost-per-call across a fleet smooths over the fact that *the unit itself is unstable*.

Second, failure-recovery cycles dominate the cost distribution. A small fraction of goals — the ones that fail repeatedly, get re-planned, or trigger fallback paths — consume a disproportionate share of total energy and spend. If you only see aggregate per-call metrics, these tail goals are invisible. They look like a slightly elevated p99 latency rather than the budget problem they actually are.
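
One way to make those tail goals visible is to ask what share of total spend the most expensive goals consume. The sketch below is a minimal illustration; the function name `tail_share` and the per-goal costs are hypothetical, not figures from the paper.

```python
def tail_share(goal_costs, top_fraction=0.05):
    """Return the fraction of total spend consumed by the costliest goals."""
    if not goal_costs:
        raise ValueError("goal_costs must not be empty")
    total = sum(goal_costs)
    if total <= 0:
        raise ValueError("total spend must be positive")
    ranked = sorted(goal_costs, reverse=True)
    # Always include at least one goal in the tail.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / total


# Hypothetical per-goal costs in dollars: 18 cheap goals and 2 that looped.
costs = [0.02] * 18 + [0.90, 1.10]
share = tail_share(costs, top_fraction=0.10)
print(f"Top 10% of goals account for {share:.0%} of spend")
```

With these made-up numbers, two goals out of twenty carry most of the bill, which an average cost-per-call would not reveal.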

Third, goal-level accounting changes which optimisations matter. When you measure energy per successful goal, the highest-leverage interventions are not "use a smaller model" but "reduce the retry rate," "fail faster on unrecoverable goals," and "cache tool outputs across attempts." The paper's measurements suggest these orchestration-level changes can move goal-level cost more than swapping model tiers.
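
As an illustration of "cache tool outputs across attempts," here is a minimal sketch of a per-goal cache. It assumes tools are deterministic for the same arguments and that arguments are JSON-serialisable; `ToolCache` and `lookup_order` are hypothetical names, not part of any framework or of the paper.

```python
import json


class ToolCache:
    """Per-goal cache so a retried attempt reuses earlier tool results."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def call(self, tool_name, tool_fn, **kwargs):
        # Key on the tool name plus a canonical form of its arguments.
        key = (tool_name, json.dumps(kwargs, sort_keys=True))
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = tool_fn(**kwargs)  # exceptions propagate and are not cached
        self._store[key] = result
        return result


# Hypothetical tool that records how often it is really invoked.
calls = []

def lookup_order(order_id):
    calls.append(order_id)
    return {"order_id": order_id, "status": "shipped"}


cache = ToolCache()
first = cache.call("lookup_order", lookup_order, order_id="A-1")
second = cache.call("lookup_order", lookup_order, order_id="A-1")  # retried attempt
print(len(calls), cache.hits)
```

Creating one cache per goal, rather than one global cache, keeps stale results from leaking between unrelated goals.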

This tracks with what teams running agents at scale have been quietly discovering. The cost line item that surprises the CFO is rarely the base inference rate. It is the long tail of goals where the agent looped, retried, and eventually either succeeded expensively or failed after burning through a budget.


Three actions for engineering leaders this week

Instrument cost-per-successful-goal, not cost-per-call. Most observability stacks emit per-request token counts. Few aggregate those by business outcome. Pick your top three agent workflows — a support triage agent, a code-review agent, a data-extraction pipeline — and define what "goal" means for each. Then plumb a correlation ID through every LLM call, tool invocation, and retry so you can attribute total spend to a single completed (or failed) goal. This is a one-sprint piece of work and it almost always changes the budget conversation.
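
The correlation-ID plumbing can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `GoalLedger`, `Event`, the event kinds, and the dollar figures are all hypothetical, and a real system would more likely attach the goal ID to its existing tracing or logging pipeline.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    goal_id: str
    kind: str        # e.g. "llm", "tool", "retry" (illustrative labels)
    tokens: int
    cost_usd: float


class GoalLedger:
    """Collects every billable event under the goal that caused it."""

    def __init__(self):
        self._events = defaultdict(list)

    def new_goal(self) -> str:
        # The correlation ID that is passed to every call, tool, and retry.
        return uuid.uuid4().hex

    def record(self, goal_id, kind, tokens, cost_usd):
        self._events[goal_id].append(Event(goal_id, kind, tokens, cost_usd))

    def totals(self, goal_id):
        events = self._events.get(goal_id, [])
        return {
            "calls": len(events),
            "tokens": sum(e.tokens for e in events),
            "cost_usd": sum(e.cost_usd for e in events),
        }


ledger = GoalLedger()
goal = ledger.new_goal()
ledger.record(goal, "llm", tokens=1200, cost_usd=0.012)
ledger.record(goal, "tool", tokens=0, cost_usd=0.001)
ledger.record(goal, "retry", tokens=1500, cost_usd=0.015)
totals = ledger.totals(goal)
print(totals)
```

The point of the design is that the unit of aggregation is the goal ID, so retries add to the same total instead of appearing as separate cheap requests.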

Separate succeeded-goal cost from failed-goal cost. The paper's framing implicitly assumes you know which goals completed successfully. Many teams do not. They have logs of LLM calls but no ground truth on whether the agent actually achieved what the user asked for. Without that signal, you cannot tell whether your 8x retry rate is producing 8x value or burning tokens on goals that were always going to fail. Build a lightweight success oracle — even a human-graded sample of 200 goals per week is enough to start — and tag every trace with the outcome.
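
Once traces carry an outcome tag, the split is a simple aggregation. The sketch below assumes each goal has been reduced to a `(cost_usd, succeeded)` pair, and it defines cost per successful goal as total spend (including failed goals) divided by the number of successes. That definition is our assumption for illustration, not necessarily the paper's exact formula, and the trace values are made up.

```python
def cost_by_outcome(traces):
    """traces: iterable of (cost_usd, succeeded) pairs, one per goal."""
    ok_cost = fail_cost = 0.0
    ok_n = fail_n = 0
    for cost, succeeded in traces:
        if succeeded:
            ok_cost += cost
            ok_n += 1
        else:
            fail_cost += cost
            fail_n += 1
    total = ok_cost + fail_cost
    return {
        "succeeded_goals": ok_n,
        "failed_goals": fail_n,
        "succeeded_cost": ok_cost,
        "failed_cost": fail_cost,
        # Failed spend is charged against the goals that did succeed.
        "cost_per_successful_goal": total / ok_n if ok_n else None,
    }


# Hypothetical weekly sample: three successes and one expensive failure.
traces = [(0.10, True), (0.12, True), (0.95, False), (0.08, True)]
report = cost_by_outcome(traces)
print(report)
```

In this made-up sample the single failed goal accounts for most of the spend, which is exactly the pattern a per-call average hides.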

Set retry budgets per goal, not per call. Most agent frameworks expose retry counts at the tool or LLM level. That is the wrong abstraction. A goal that has already cost 40,000 tokens across 15 tool calls should probably not get another five retries, regardless of what any individual component's retry policy says. Implement a goal-level budget that caps total spend per goal and fails loudly when the cap is exceeded. In our experience, this single change removes the worst of the long tail.
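
A goal-level budget can be a small object that every LLM call and tool call must charge before it proceeds. The sketch below is one possible shape, with hypothetical names (`GoalBudget`, `GoalBudgetExceeded`); the 40,000-token and 15-call limits simply reuse the example figures above and are not recommended values.

```python
class GoalBudgetExceeded(RuntimeError):
    """Raised when a goal has spent more than its allowance."""


class GoalBudget:
    def __init__(self, max_tokens, max_calls):
        self.max_tokens = max_tokens
        self.max_calls = max_calls
        self.tokens_used = 0
        self.calls_made = 0

    def charge(self, tokens):
        # Every LLM call, tool call, and retry charges the same shared budget.
        self.tokens_used += tokens
        self.calls_made += 1
        if self.tokens_used > self.max_tokens or self.calls_made > self.max_calls:
            raise GoalBudgetExceeded(
                f"goal used {self.tokens_used} tokens in {self.calls_made} calls"
            )


budget = GoalBudget(max_tokens=40_000, max_calls=15)
stopped_at = None
for attempt in range(1, 21):
    try:
        budget.charge(tokens=3_000)  # hypothetical cost of each step
    except GoalBudgetExceeded as exc:
        stopped_at = attempt
        print(f"Stopping goal at step {attempt}: {exc}")
        break
```

Because the budget lives at the goal level, component-level retry policies can stay as they are; the goal simply stops once the shared allowance runs out.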


Why this matters beyond cost

Goal-level accounting is not just a finance exercise. It is the foundation for every other decision you make about an agent system.

If you cannot measure cost per goal, you cannot compare two prompt strategies fairly — one might use fewer tokens per call but require more calls. You cannot decide whether a smaller model is actually cheaper — it may need more retries to reach the same success rate. You cannot set SLAs with internal customers, because you have no stable unit of work to price against. And you cannot make a credible carbon claim, because per-invocation energy understates the cost of completion by whatever your retry multiplier happens to be.
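
A short worked example shows how a cheaper model can lose on a per-goal basis. The numbers below are invented for illustration, and the formula assumes failed goals consume the same average number of calls as successful ones, which is a simplification.

```python
def cost_per_successful_goal(cost_per_call, calls_per_goal, success_rate):
    """Expected spend needed to obtain one successful goal."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call * calls_per_goal / success_rate


# Hypothetical figures: the small model is 3x cheaper per call,
# but needs more calls per goal and succeeds less often.
large = cost_per_successful_goal(cost_per_call=0.030, calls_per_goal=4, success_rate=0.95)
small = cost_per_successful_goal(cost_per_call=0.010, calls_per_goal=9, success_rate=0.60)
print(f"large model: ${large:.3f} per successful goal")
print(f"small model: ${small:.3f} per successful goal")
```

Under these assumptions the model that looks three times cheaper per call ends up more expensive per successful goal, so the comparison only becomes fair once retries and success rate are included.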

The organisations that are scaling agents in production are all quietly converging on the same insight: the meaningful unit is the goal, not the call. They have rebuilt their dashboards, their alerting, and their cost-allocation models around it. The organisations that have not are the ones still surprised by their monthly Anthropic and OpenAI bills.


How Anystack approaches this

When we run an agent cost review for a client, the first deliverable is almost never a model swap. It is a goal-level cost telemetry layer — correlation IDs, success oracles, retry budgets — bolted onto whatever orchestrator the team is already using. Only once that instrumentation is in place do we start tuning models, prompts, or routing logic, because before then any optimisation is measured against a unit that lies.

A 3-person AI-augmented pod typically takes four to six weeks to instrument goal-level accounting across a handful of production agent workflows, identify the failure modes driving long-tail cost, and ship the orchestration changes that flatten the distribution. The work sits at the intersection of AI integration and cloud cost optimisation — which is increasingly where the real money is in production AI.
