30 April 2026 · 5 min read

AI integration & copilot engineering · LLM agents · reliability

What 7.5 million agent invocations tell us about running LLMs in production

A 21-day deployment of 3,505 user-funded LLM agents trading real ETH offers the largest public dataset on agent reliability under real consequences. The lessons apply directly to any team putting copilots near production systems.

Anystack Engineering

Most enterprise discussions about LLM agents still rely on toy benchmarks or vendor demos. A new arXiv paper, Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital (https://arxiv.org/abs/2604.26091), provides something rarer: telemetry from 3,505 autonomous agents managing real user funds over 21 days, producing roughly 7.5 million agent invocations in a bounded onchain market.

The study is interesting precisely because the agents had skin in the game. Users configured vaults through a mix of structured controls (caps, allow-lists, risk parameters) and natural-language strategies. The agents made the actual buy/sell decisions. Failures cost real money, which means the dataset reflects the kinds of reliability problems engineering leaders will hit the moment a copilot stops being a suggestion engine and starts taking actions on production systems.

Three findings stand out for anyone planning to deploy agents inside an enterprise stack.


Finding 1: Structured controls outperform prompt-level guardrails

The authors separate the control surface into two layers. The operating layer consists of deterministic, code-enforced controls (vault caps, instrument allow-lists, rate limits, validated tool schemas). The strategy layer is the natural-language mandate the user gives the agent. Across millions of invocations, almost every catastrophic failure mode the paper describes was either prevented by the operating layer or caused by gaps in it. Strategy-layer instructions, by contrast, were routinely re-interpreted, ignored under ambiguity, or allowed to drift quietly across long sessions.

This matches what we have seen in customer support copilots and code-generation agents: telling a model in the system prompt to "never do X" is qualitatively different from making X impossible at the tool boundary.

Action: Treat the system prompt as documentation, not as a control. For any agent touching production, map every tool call to a deterministic policy check before execution: schema validation, scope check, rate limit, and a hard cap on blast radius. If a behaviour matters, encode it in code, not in English.
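
As a concrete illustration, here is a minimal Python sketch of such a pre-execution gate. The tool names, the `amount` argument, and the specific caps are hypothetical stand-ins, not details from the paper; the point is that every check is plain code the model cannot negotiate with.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatingPolicy:
    """Deterministic, code-enforced controls (the operating layer)."""
    allowed_tools: frozenset
    max_amount_per_call: float   # hard cap on blast radius for a single action
    max_calls_per_minute: int

class PolicyViolation(Exception):
    pass

def enforce(policy: OperatingPolicy, tool: str, args: dict, calls_this_minute: int) -> None:
    """Runs before every tool execution, regardless of what the prompt says."""
    if tool not in policy.allowed_tools:
        raise PolicyViolation(f"tool {tool!r} is not on the allow-list")
    if calls_this_minute >= policy.max_calls_per_minute:
        raise PolicyViolation("per-session rate limit exceeded")
    amount = args.get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        raise PolicyViolation("schema check failed: 'amount' must be a positive number")
    if amount > policy.max_amount_per_call:
        raise PolicyViolation(f"{amount} exceeds the per-call cap of {policy.max_amount_per_call}")

# A hallucinated tool or an oversized argument fails here, not in production:
policy = OperatingPolicy(frozenset({"place_order", "cancel_order"}), 50.0, 10)
enforce(policy, "place_order", {"amount": 25.0}, calls_this_minute=3)    # passes silently
# enforce(policy, "withdraw_all", {"amount": 25.0}, calls_this_minute=3)  # raises PolicyViolation
```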


Finding 2: Reliability is dominated by the long tail, not the median

With 7.5M invocations, the median agent looked competent. The interesting data lives in the tail: malformed tool calls, hallucinated arguments, retry storms, and pathological loops where an agent re-invoked the same failing action hundreds of times. The paper documents that a small fraction of agents accounted for a disproportionate share of both error volume and capital loss, and that these failures were often invisible to standard accuracy metrics because each individual call looked locally plausible.

This is a pattern familiar to SRE teams: p50 latency tells you nothing useful; p99.9 tells you what will wake you up. LLM agents have the same shape, except the tail is wider and the failure modes are semantic rather than numeric.

Action: Instrument agents the way you instrument distributed systems. Capture every invocation with input, output, tool calls, and outcome. Build dashboards keyed on p99 behaviour, repeat-action rates, and tool-call validation failures, not on average task success. Set automated circuit breakers that suspend an agent session when it exceeds thresholds for retries, identical actions, or rejected tool calls within a window.
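
Here is a sketch of the circuit-breaker half of that, under assumed thresholds; the window size and limits below are illustrative, since the paper does not prescribe specific values.

```python
import time
from collections import deque

class AgentCircuitBreaker:
    """Trips a session when tail behaviour crosses hard limits in a sliding window."""

    def __init__(self, window_s: float = 60.0, max_rejected: int = 5, max_identical: int = 3):
        self.window_s = window_s
        self.max_rejected = max_rejected    # rejected/invalid tool calls per window
        self.max_identical = max_identical  # identical repeated actions per window
        self.events = deque()               # (timestamp, action_fingerprint, rejected)
        self.tripped = False

    def record(self, fingerprint: str, rejected: bool) -> bool:
        """Call once per invocation; returns True if the session should be suspended."""
        now = time.monotonic()
        self.events.append((now, fingerprint, rejected))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        rejected_count = sum(1 for _, _, r in self.events if r)
        identical_count = sum(1 for _, f, _ in self.events if f == fingerprint)
        if rejected_count >= self.max_rejected or identical_count >= self.max_identical:
            self.tripped = True             # suspend; require an out-of-band reset
        return self.tripped
```

The fingerprint can be as simple as a hash of the tool name plus canonicalised arguments; that is what lets the breaker catch the pathological loops described above, where the same failing action is re-invoked hundreds of times.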


Finding 3: Natural-language strategies need a validation layer of their own

Users in the study expressed strategies in free text. The paper finds that ambiguous or internally contradictory mandates were a leading source of unintended behaviour. Crucially, the authors describe a translation step: parsing the natural-language strategy into a structured representation that could be checked for consistency, conflicts with the operating-layer controls, and feasibility before any capital was committed.

The lesson generalises. Whenever a human gives an agent a goal in prose, that prose is effectively a contract. Most enterprise agent deployments have no equivalent of a compiler for that contract. They go straight from English to action.

Action: Add a strategy-compilation step between user intent and agent execution. Have the model produce a structured plan (goals, constraints, success criteria, prohibited actions) that is reviewed - by code, by a second model, or by a human, depending on risk tier - before the agent begins acting. Reject or escalate when the structured plan conflicts with operating-layer policy. This is the agent equivalent of a CI pipeline: cheap to add, and it catches the failures that are most expensive to debug after the fact.
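
A minimal sketch of the consistency check, assuming the model has already emitted a structured plan. The field names and the conflict rules here are our own illustration, not the paper's schema.

```python
from dataclasses import dataclass

@dataclass
class CompiledStrategy:
    """Structured form of the user's free-text mandate, emitted by the model."""
    goals: list
    required_tools: set
    prohibited_tools: set
    max_position: float   # the user's own cap, in the same units as policy caps

def check_plan(plan: CompiledStrategy, allowed_tools: set, policy_cap: float) -> list:
    """Returns conflicts; a non-empty list means reject or escalate before acting."""
    conflicts = []
    for tool in sorted(plan.required_tools - allowed_tools):
        conflicts.append(f"plan requires {tool!r}, which the operating layer does not allow")
    for tool in sorted(plan.required_tools & plan.prohibited_tools):
        conflicts.append(f"plan both requires and prohibits {tool!r}")
    if plan.max_position > policy_cap:
        conflicts.append(f"plan cap {plan.max_position} exceeds operating-layer cap {policy_cap}")
    return conflicts

plan = CompiledStrategy(
    goals=["accumulate on dips"],
    required_tools={"place_order"},
    prohibited_tools={"withdraw"},
    max_position=500.0,
)
print(check_plan(plan, allowed_tools={"place_order", "cancel_order"}, policy_cap=50.0))
# ['plan cap 500.0 exceeds operating-layer cap 50.0']
```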


Why this matters beyond crypto

It would be easy to dismiss the study because the substrate is onchain trading. That would be a mistake. The architectural pattern - users issuing natural-language mandates, agents invoking validated tools against systems with real consequences - is the same pattern emerging in:

  • engineering copilots that open pull requests, run migrations, or change infrastructure
  • customer-operations agents that issue refunds, change entitlements, or update CRM records
  • internal data agents that execute queries, modify pipelines, or trigger workflows

In every case, the operating-layer / strategy-layer split is the right mental model, and the failure modes documented in the paper - tool-call malformations, retry loops, drift from mandate, blast-radius incidents - are the failure modes you should expect.

The data also reframes a common debate. Engineering leaders often ask whether the model is "good enough" for production. The paper suggests this is the wrong question. Across 7.5M invocations, the model's raw capability mattered far less than the surrounding controls. A weaker model with a strong operating layer was safer than a stronger model with a permissive one.


How Anystack helps

Anystack's AI integration practice works with engineering organisations putting copilots and agents next to production systems. Typical engagements include defining the operating-layer policy boundary for a given agent, building the tool-validation and circuit-breaker infrastructure around it, and instrumenting invocation telemetry so reliability can be measured on the tail rather than the average. We also help teams design the strategy-compilation step described above, so that natural-language intent is checked against policy before it becomes action. The goal in every case is the same: make the agent's blast radius a property of the system, not a property of the prompt.

Free engineering audit

Want a structured assessment of where this applies to your stack? Our 30-minute tech audit is free.

Request audit →

Start a conversation

Facing a version of this in your organisation? We scope engagements in a single call.

Book a 30-min call →

See the evidence

Read how we've delivered these outcomes for clients in fintech, healthcare, and telecom.

Browse case studies →