Why Your Monitoring Agents Should Sleep More: Lessons from SentinelBench

New research on long-running AI agents shows that 'sustained attention' beats continuous polling for monitoring work. Here's what engineering leaders should change about how they design agent workloads.

Anystack Engineering

Most production AI agents are built to *act*. They poll, refresh, retry, search for alternatives, or otherwise force progress on every tick of the loop. That design is reasonable for short tasks. It is wasteful — and often wrong — for tasks that span hours or days.

*SentinelBench: A Benchmark for Long-Running Monitoring Agents* is the first systematic attempt to measure what the authors call sustained attention: an agent's ability to monitor an environment, stay quiet while nothing is happening, then respond promptly when an external event makes progress possible. The benchmark evaluates agents on long-horizon tasks where the right behaviour is mostly *not acting*, and the wrong behaviour is the default behaviour of almost every agent framework shipping today.

The findings matter because monitoring is where most enterprise agent pilots end up: watching a queue, a ticket system, a dashboard, an inbox, a build pipeline, a regulatory feed. If your agents handle those workloads the way they handle a single-turn chat — by trying to make something happen on every iteration — your token bill, latency, and reliability will all suffer in predictable ways.

What the research actually found

Three results from SentinelBench deserve attention from anyone running agents in production.

First, continuous-action agents waste enormous compute on tasks where nothing is happening. The default behaviour of most agent harnesses — keep calling tools, keep refreshing context, keep reasoning — produces high token consumption with near-zero progress when the environment is idle. The benchmark shows this is not a tuning problem; it is a design problem baked into the standard ReAct-style loop.

Second, agents struggle to detect when an external event has occurred. Even when given polling tools, they either poll too aggressively (cost) or miss the event window (reliability). The gap between "check every 10 seconds" and "check when something probably changed" is where most production monitoring agents fail silently.

Third, prompt-level instructions to 'wait' or 'be patient' are largely ineffective. Agents trained on action-oriented trajectories revert to action under uncertainty. Sustained attention has to be engineered into the harness — through event-driven triggers, sleep primitives, and explicit cost-aware planning — not asked for in the system prompt.

None of this is surprising to anyone who has put an agent on a long-running task and watched the bill. What is new is the structured evidence and a benchmark to measure against.

What this means for engineering leaders this quarter

If your organisation has agents in production or pilots that involve any form of waiting, watching, or reacting to external state, three actions are worth taking now.

Audit your agent loops for idle-state cost. Pick your three most expensive agent workloads from the last 30 days. For each, calculate the share of total tokens spent on iterations that produced no state change. If that idle-iteration share is more than 20% of the total, you have a sustained-attention problem, not a model-quality problem. Switching to a larger model will make it worse, not better, because a larger model typically bills every one of those idle iterations at a higher per-token price.
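
As a rough illustration of that audit, the calculation might look like the sketch below. The `Iteration` record, its field names, and the sample numbers are hypothetical; a real audit would read them from whatever tracing or billing export your harness produces.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    tokens: int          # tokens consumed by this loop iteration (hypothetical field)
    state_changed: bool  # True if the iteration changed task or external state


def idle_cost_share(iterations: list[Iteration]) -> float:
    """Return the fraction of total tokens spent on iterations that changed nothing."""
    total = sum(it.tokens for it in iterations)
    if total == 0:
        return 0.0
    idle = sum(it.tokens for it in iterations if not it.state_changed)
    return idle / total


# Made-up log: nine idle polls followed by one productive iteration.
log = [Iteration(tokens=1_000, state_changed=False) for _ in range(9)]
log.append(Iteration(tokens=3_000, state_changed=True))

share = idle_cost_share(log)   # 9,000 idle tokens out of 12,000 total
flagged = share > 0.20         # the 20% threshold suggested above
print(f"idle share: {share:.0%}, flagged: {flagged}")
```

In this made-up log, three quarters of the spend goes to iterations that changed nothing, so the workload would be flagged.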

Replace polling with event-driven triggers wherever the underlying system supports it. Most enterprise systems your agents are watching (Jira, GitHub, ServiceNow, Kafka, SQS, S3) emit webhooks or events. An agent that wakes on a webhook and runs for 30 seconds can cost orders of magnitude less than the same agent polling every minute for a week, which is 10,080 invocations whether or not anything changed. This is not novel infrastructure; it is the same pattern your platform team already uses for everything else. The friction is usually that the agent framework makes synchronous polling the path of least resistance.
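
A back-of-the-envelope comparison makes the gap concrete. The per-poll and per-wake token counts and the number of events below are invented for illustration, and the sketch assumes a webhook-triggered run costs several times more than a single poll; substitute your own measurements.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080


def polling_cost(interval_min: float, tokens_per_poll: int, weeks: float = 1.0) -> int:
    """Tokens spent when the agent runs once per interval, regardless of activity."""
    polls = int(MINUTES_PER_WEEK * weeks / interval_min)
    return polls * tokens_per_poll


def event_cost(events: int, tokens_per_wake: int) -> int:
    """Tokens spent when the agent runs only when an event wakes it."""
    return events * tokens_per_wake


# Hypothetical figures: a cheap poll every minute vs. three heavier event-driven runs.
poll_tokens = polling_cost(interval_min=1, tokens_per_poll=1_500)
wake_tokens = event_cost(events=3, tokens_per_wake=6_000)
ratio = poll_tokens / wake_tokens

print(f"polling: {poll_tokens:,} tokens, event-driven: {wake_tokens:,} tokens, ratio: {ratio:.0f}x")
```

The ratio is driven almost entirely by how rarely events occur relative to the polling interval, which is why quiet systems are where polling agents are most expensive.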

Add explicit sleep and budget primitives to your agent harness. SentinelBench's higher-performing configurations give the agent a structured way to say "there is nothing to do for N minutes, wake me on event X or at time T." If your current framework does not expose this, build it. The same primitive doubles as a hard cost-control mechanism — an agent that cannot sleep cannot stay within a budget on a long-horizon task. This pairs well with the kind of real-time spend limits that AI gateways are starting to expose; the agent-side discipline and the platform-side ceiling are complementary, not redundant.
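
One way such a primitive could look is sketched below using Python's standard `asyncio`. The names `sleep_until`, `Budget`, and `BudgetExceeded` are hypothetical and are not taken from SentinelBench or from any particular agent framework; the webhook is simulated with a timer.

```python
import asyncio
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the configured ceiling."""


@dataclass
class Budget:
    limit_tokens: int
    spent_tokens: int = 0

    def charge(self, tokens: int) -> None:
        # Refuse the charge rather than silently overspending.
        if self.spent_tokens + tokens > self.limit_tokens:
            raise BudgetExceeded(f"{self.spent_tokens + tokens} > {self.limit_tokens}")
        self.spent_tokens += tokens


async def sleep_until(event: asyncio.Event, timeout_s: float) -> str:
    """Suspend until `event` is set or `timeout_s` elapses; no model calls happen while waiting."""
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout_s)
        return "event"
    except asyncio.TimeoutError:
        return "timeout"


async def demo() -> tuple[list[str], int]:
    event = asyncio.Event()
    budget = Budget(limit_tokens=5_000)
    reasons = []

    # First wait: nothing happens, so the agent wakes at the deadline and spends nothing.
    reasons.append(await sleep_until(event, timeout_s=0.05))

    # Second wait: a simulated webhook fires after 10 ms and wakes the agent early.
    asyncio.get_running_loop().call_later(0.01, event.set)
    reasons.append(await sleep_until(event, timeout_s=1.0))
    budget.charge(2_000)  # only the productive wake-up is charged
    return reasons, budget.spent_tokens


reasons, spent = asyncio.run(demo())
print(reasons, spent)
```

The design choice worth noting is that the wait and the budget live in the harness, not in the prompt: the model is simply not invoked while `sleep_until` is pending, and `Budget.charge` fails loudly instead of relying on the agent to police itself.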

Why this shows up now

The shift from short-task agents (summarise this, draft that) to long-running agents (watch this queue, manage this incident, monitor this release) is the dominant production pattern of 2026. The frameworks have not caught up. Most popular agent libraries still default to a tight reason-act loop with no first-class concept of waiting, no cost-aware scheduling, and no native event integration. The result is that teams discover the sustained-attention problem the same way they discover most production AI problems: through a finance review.

There is also a reliability dimension. An agent that polls aggressively will hit rate limits, get throttled, and produce inconsistent behaviour under load. An agent that wakes on events behaves predictably and is far easier to reason about during incidents. The cost story is the loudest, but the operational story is arguably more important for platform and SRE teams.

How Anystack approaches this

When we engage on agent workloads, a 3-person AI-augmented pod typically spends the first two weeks instrumenting the existing agent loops — measuring idle-iteration cost, event detection latency, and budget adherence — before changing any code. The pattern we see most often is that 60–80% of agent spend goes to iterations that produce no useful state change, and the fix is structural rather than model-level. Sustained-attention design, event-driven triggers, and explicit budget primitives are now part of how we ship AI integration work for clients running agents in production. SentinelBench gives the industry a shared way to measure something most teams were already feeling on their invoices.