Monitoring Agents Need Patience, Not Persistence

New research on long-running AI agents shows that the default 'keep acting' loop wastes tokens and misses events. Engineering leaders deploying agents in production need to design for sustained attention, not continuous action.

Anystack Engineering

Most production AI agents are built around a simple loop: think, call a tool, observe, repeat. That works for short tasks. It falls apart the moment an agent is asked to watch a build pipeline, wait for a customer reply, or monitor a deployment for regressions over the next four hours.

A paper published this week, SentinelBench: A Benchmark for Long-Running Monitoring Agents, formalises something engineering teams have been discovering the expensive way. The default agent behaviour — continuous action — is the wrong model for any task that spans more than a few minutes. What's needed instead is sustained attention: the ability to wait quietly, notice when an external event makes progress possible, and respond promptly without burning tokens in the meantime.

This matters because the gap between demo-grade agents and production-grade agents is widening, and monitoring workloads are where the gap shows up first on the invoice.

What the research actually found

SentinelBench evaluates agents on tasks where the right action, most of the time, is no action. Watching a long-running job. Polling for an event that may take hours. Detecting drift in a metric stream. The benchmark separates two failure modes that conventional agent evals collapse together.

The first failure mode is false urgency: agents that keep issuing tool calls, refreshing state, and trying to force progress when the correct behaviour is to wait. This inflates token spend, exhausts rate limits on downstream APIs, and crowds out genuine signals when they do arrive.

The second failure mode is missed events: agents that go quiet but never wake up. Context decays, the model forgets what it was meant to be watching for, and the triggering event passes unnoticed.

The benchmark shows that current frontier models, when given a long-horizon monitoring task, lean heavily toward the first failure mode. They act when they should wait. A four-hour monitoring task can cost an order of magnitude more than necessary because the agent treats inactivity as a problem to be solved rather than as the correct state.

Why this hits enterprise deployments hardest

The agent use cases that enterprise teams are actually building — incident triage assistants, deployment watchers, contract monitoring, fraud review queues — are almost all long-running. They look superficially like the chat-style tasks agents are good at, but the underlying execution profile is fundamentally different.

Three consequences follow.

First, token cost forecasts based on demo workloads can understate production spend by 5–20x. The Cloudflare team made this point sharply in their recent post on AI Gateway spend limits, where they described runaway token bills as the primary support issue raised by customers running agents in production. Spend caps are a backstop, not a fix.

Second, observability tooling built for request/response services misses the failure mode entirely. A monitoring agent that sits silent for four hours and then misses its trigger event looks healthy in every dashboard. It has no errors, no high latency, no anomalous behaviour. It just quietly fails to do its job.
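One way to make a silent miss visible is to reconcile what the event source emitted against what the agent acknowledged. The following is a minimal sketch under stated assumptions: `TriggerEvent`, `find_missed_events`, and the 300-second response budget are hypothetical names and values chosen for illustration, not part of SentinelBench or any particular observability product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerEvent:
    event_id: str
    occurred_at: float  # Unix timestamp reported by the event source


def find_missed_events(
    events: list[TriggerEvent],
    acked_at: dict[str, float],  # event_id -> time the agent acknowledged it
    now: float,
    response_budget_s: float = 300.0,  # assumed response budget, tune per workload
) -> list[str]:
    """Return IDs of events the agent did not handle within the budget."""
    missed = []
    for event in events:
        deadline = event.occurred_at + response_budget_s
        ack = acked_at.get(event.event_id)
        if ack is None:
            # An unacknowledged event only counts as missed once its deadline passes.
            if now > deadline:
                missed.append(event.event_id)
        elif ack > deadline:
            # Acknowledged, but too late to be useful.
            missed.append(event.event_id)
    return missed
```

A check like this runs outside the agent, so it still fires when the agent itself has gone quiet and reports no errors.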

Third, the architectural fix is not 'better prompts'. It's a different control loop — one that includes external event sources, durable state, and explicit wake conditions. This is platform engineering, not prompt engineering.
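To make the three ingredients concrete, here is a small sketch of a control loop driven by an external event source, with state persisted outside the context window and an explicit wake condition. Everything in it is an assumption for illustration: the `WatchState` fields, the JSON file as the durable store, and the `on_event` entry point are hypothetical, and a production system would more likely use a database or a durable execution platform.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class WatchState:
    task_id: str
    wake_on: str              # event type that should wake the agent
    deadline: float           # Unix time after which the watch times out
    status: str = "waiting"   # waiting | woken | timed_out


def save_state(path: Path, state: WatchState) -> None:
    # Persist outside the context window so a restart does not lose the watch.
    path.write_text(json.dumps(asdict(state)))


def load_state(path: Path) -> WatchState:
    return WatchState(**json.loads(path.read_text()))


def on_event(path: Path, event_type: str, now: float) -> bool:
    """Called by the event source. Returns True only if the model should run."""
    state = load_state(path)
    if state.status != "waiting":
        return False  # already handled, so redelivered events are harmless
    if now > state.deadline:
        state.status = "timed_out"
    elif event_type == state.wake_on:
        state.status = "woken"
    else:
        return False  # irrelevant event: no state change, no tokens spent
    save_state(path, state)
    return state.status == "woken"
```

The point of the design is that the model is invoked only when `on_event` returns True. Between events nothing runs, so waiting costs no tokens.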

Three actions for engineering leaders this week

Audit your agent inventory by time horizon, not by use case. For every agent in or near production, classify it as short-horizon (under 60 seconds), medium (minutes), or long-horizon (hours or more). Long-horizon agents need a different runtime, typically event-driven with durable state, and almost certainly need to be rebuilt rather than tuned. If your team can't produce this classification quickly, that's the finding.
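The audit itself can start as a few lines of code over a list of agents and their expected task durations. This sketch assumes a one-hour boundary between medium and long-horizon, since the classification above only distinguishes minutes from hours; the function names and the example inventory are hypothetical.

```python
def classify_horizon(expected_runtime_s: float) -> str:
    """Bucket an agent by how long a single task is expected to run."""
    if expected_runtime_s < 60:
        return "short"
    if expected_runtime_s < 3600:  # assumed boundary between minutes and hours
        return "medium"
    return "long"


def audit(inventory: dict[str, float]) -> dict[str, list[str]]:
    """Group agent names by time horizon."""
    buckets: dict[str, list[str]] = {"short": [], "medium": [], "long": []}
    for name, runtime_s in inventory.items():
        buckets[classify_horizon(runtime_s)].append(name)
    return buckets


# Hypothetical inventory: agent name -> expected task duration in seconds.
inventory = {
    "chat-helper": 20,
    "pr-summariser": 300,
    "deploy-watcher": 4 * 3600,
}
report = audit(inventory)
```

Anything that lands in the `long` bucket is a candidate for an event-driven runtime with durable state.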

Instrument cost-per-completed-task, not cost-per-call. Token spend per API call tells you nothing about whether an agent is working efficiently. A monitoring agent that polls 400 times to detect one event has the same per-call cost as one that uses webhooks and wakes once. Build a dashboard that surfaces tokens consumed per business outcome delivered, and you'll find the runaway agents within a day.
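The difference between the two metrics is easy to see in code. This is a sketch with hypothetical names (`AgentRun`, `tokens_per_outcome`) and an assumed figure of 1,500 tokens per call; only the 400-polls-versus-one-wake comparison comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    agent: str
    calls: int     # model calls made during the run
    tokens: int    # total tokens consumed across those calls
    outcomes: int  # completed business outcomes, e.g. events detected


def tokens_per_call(run: AgentRun) -> float:
    return run.tokens / run.calls if run.calls else 0.0


def tokens_per_outcome(run: AgentRun) -> float:
    # A run that delivered nothing has unbounded cost per outcome,
    # so it sorts to the top of the ranking below.
    return run.tokens / run.outcomes if run.outcomes else float("inf")


def rank_by_cost_per_outcome(runs: list[AgentRun]) -> list[AgentRun]:
    return sorted(runs, key=tokens_per_outcome, reverse=True)


# Assumed 1,500 tokens per call for both agents.
poller = AgentRun("poller", calls=400, tokens=600_000, outcomes=1)
webhook = AgentRun("webhook", calls=1, tokens=1_500, outcomes=1)
```

Both agents show 1,500 tokens per call, so a per-call dashboard cannot tell them apart. Per outcome, the poller costs 600,000 tokens against 1,500 for the webhook-driven agent, a 400x difference.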

Add explicit wait primitives to your agent framework. Most agent frameworks in common use today (LangGraph, AutoGen, custom in-house variants) don't have first-class support for 'sleep until X'. They emulate waiting with polling, which is exactly the antipattern SentinelBench measures. Either adopt a framework with durable execution primitives (Temporal, Inngest, Restate) or build the wait/wake layer yourself before scaling further.
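For a sense of what 'sleep until X' means at the smallest scale, here is an in-process sketch using Python's standard `asyncio` library. The helper name `sleep_until` and the demo are hypothetical. Note the limitation: this wait lives in memory and does not survive a process restart, which is the gap that durable execution platforms are designed to close.

```python
import asyncio


async def sleep_until(event: asyncio.Event, timeout_s: float) -> bool:
    """Suspend until `event` is set or the timeout expires.

    Returns True if woken by the event, False on timeout.
    No polling happens while suspended.
    """
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False


async def demo() -> tuple[bool, bool]:
    # Case 1: an external source (simulated with a timer) sets the event.
    build_finished = asyncio.Event()
    asyncio.get_running_loop().call_later(0.05, build_finished.set)
    woken = await sleep_until(build_finished, timeout_s=1.0)

    # Case 2: the event never arrives, so the explicit timeout fires instead.
    never_set = asyncio.Event()
    timed_out = not await sleep_until(never_set, timeout_s=0.05)
    return woken, timed_out
```

The explicit timeout matters as much as the wake: it is what stops a quiet agent from waiting forever on an event that never comes.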

The deeper pattern

This is one specific instance of a broader pattern that is becoming clearer as agent deployments mature: the bottleneck is rarely model capability. It's the surrounding system: the control loop, the event sources, the state management, the cost observability. Treating agents as 'just a model call' produces systems that work in demos and bleed money in production.

The teams getting this right tend to share a profile. They have senior engineers who've built distributed systems before and who recognise that an LLM agent is, architecturally, just another long-running stateful service with the unusual property that its state is partly held in a context window. They apply the same rigour to it that they would to any other production service: durable state, idempotent operations, explicit timeouts, structured observability.

The teams getting it wrong tend to treat agent development as a separate discipline staffed by people new to production systems. The result is predictable: heroic prompt engineering papering over architectural gaps that will not survive contact with real traffic.

Anystack's 3-person AI-augmented pod is built for exactly this kind of problem — where the work sits at the seam between AI engineering and production platform work, and where the failure modes only show up in real workloads. A typical engagement starts with an audit of agent runtime behaviour and cost-per-outcome, then moves to refactoring long-horizon agents onto durable execution platforms. Teams who want to dig deeper into the platform side can read more about our approach to platform reliability and to AI integration. The pattern is consistent: small senior teams, working alongside your engineers, fixing the surrounding system so the agents can do their job.