Your AI Agents Are Aging in Production. Day-One Benchmarks Don't Show It

New research argues that long-lived AI agents drift even when model weights are frozen. Engineering leaders need lifespan testing, not just launch-day benchmarks.

Anystack Engineering

A new paper on arXiv, "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems", makes a claim that should unsettle anyone running LLM agents in production: even when model weights are frozen, an agent's effective behaviour keeps changing after deployment. The agent compresses interaction history. It retrieves from a memory store that grows by the day. It revises facts after updates land. It undergoes routine maintenance. The agent you benchmarked on day one is not the agent serving your customers in month six.

This matters because most enterprise AI evaluation pipelines look like model evaluation pipelines: run a fixed test set, score the outputs, ship if the numbers clear a threshold. That worked when the system under test was stateless. It does not work for agents with persistent memory, tool registries, and retrieval pipelines that all evolve independently of the model itself.

Three findings worth your attention

The paper, alongside related work on agent memory systems such as "Is Agent Memory a Database?", surfaces three failure modes that are easy to dismiss in isolation but compound viciously in production.

Memory grows without governance. Agents accumulate context indefinitely. Without explicit policies for revision, deprecation, and forgetting, retrieval quality decays as the store fills with stale or contradictory facts. The paper calls this "unregulated growth" and "capacity-driven forgetting" — the agent eventually evicts useful memories to make space for noise.
Compression is silent corruption. To stay within context budgets, agents summarise prior interactions. Each summarisation step loses information. After dozens of cycles, the agent's working understanding of a user, a customer account, or a long-running task can drift far from ground truth — without any single step looking obviously wrong.
Memory is append-only by default. Most agent memory implementations let the system write and read, but never revise. When a fact changes (a customer's address, a policy, a regulation), the old version sits alongside the new one. The agent may surface either, depending on embedding similarity.
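
To make the third failure mode concrete, here is a minimal Python sketch contrasting a write that only appends with a write that supersedes earlier values. The Fact and MemoryStore classes, their fields, and the key format are hypothetical names invented for illustration; they are not taken from the paper or from any particular memory library.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    entity: str        # hypothetical key format, e.g. "customer:42"
    attribute: str     # e.g. "address"
    value: str
    written_at: int    # logical timestamp (assumed; any ordering would do)
    superseded: bool = False


@dataclass
class MemoryStore:
    facts: list[Fact] = field(default_factory=list)

    def append(self, fact: Fact) -> None:
        """Append-only write: earlier values for the same key stay live."""
        self.facts.append(fact)

    def revise(self, fact: Fact) -> None:
        """Write that marks earlier values for the same key as superseded."""
        for old in self.facts:
            if (old.entity, old.attribute) == (fact.entity, fact.attribute):
                old.superseded = True
        self.facts.append(fact)

    def live_values(self, entity: str, attribute: str) -> list[str]:
        """Values a retriever could still surface for this key."""
        return [
            f.value
            for f in self.facts
            if f.entity == entity
            and f.attribute == attribute
            and not f.superseded
        ]
```

With two append calls for the same customer address, live_values returns both the old and the new address, and which one the agent surfaces is left to the retriever. With append followed by revise, only the new address remains live, while the old record is kept for audit.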

None of these failures show up on a launch-day benchmark, because launch-day benchmarks run against a freshly initialised agent. They show up at month three, when the support agent starts confidently citing a discount programme you cancelled, or the coding agent recommends a deprecated internal library because three months ago it was the right answer.

What this means for engineering leaders

If you have agents in production — or you are about to ship them — three actions are worth doing this week.

Action one: define a lifespan SLO, not just a quality SLO. Most teams track agent quality as a point-in-time metric: accuracy, helpfulness, task completion. Add a second axis. Pick a representative cohort of agent sessions and replay them at fixed intervals: day one, week one, month one, month three. Track how task completion and factual accuracy change as the agent's memory store grows, and state the SLO as the largest drop from the day-one baseline you are willing to accept. If quality at month three falls below that line, you have a lifespan problem, and no amount of model upgrading will fix it on its own.
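
As a rough sketch of what a lifespan check could look like, the Python below compares task completion at each replay checkpoint against the day-one baseline. The function names, the checkpoint labels, and the five-point tolerance are assumptions chosen for illustration, not values from the paper; the replay itself (running the cohort against the aged agent) is assumed to happen elsewhere and to produce one pass/fail outcome per session.

```python
from statistics import mean


def completion_rate(outcomes: list[bool]) -> float:
    """Fraction of replayed sessions that completed their task."""
    return mean([1.0 if ok else 0.0 for ok in outcomes])


def lifespan_report(
    results: dict[str, list[bool]],
    baseline: str = "day_1",   # hypothetical checkpoint label
    max_drop: float = 0.05,    # assumed SLO: at most 5 points below baseline
) -> dict[str, dict]:
    """Compare each checkpoint's completion rate with the baseline."""
    base = completion_rate(results[baseline])
    report = {}
    for checkpoint, outcomes in results.items():
        rate = completion_rate(outcomes)
        drop = base - rate
        report[checkpoint] = {
            "rate": rate,
            "drop": drop,
            "breach": drop > max_drop,  # True when the lifespan SLO is violated
        }
    return report
```

Calling lifespan_report with a dictionary such as {"day_1": [...], "month_3": [...]} yields one entry per checkpoint, and any entry with breach set to True is a candidate for investigation. The same shape works for factual accuracy if the per-session outcome is swapped for an accuracy judgement.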

Action two: instrument the memory layer like a database, because it is one. Agent memory systems are routinely treated as opaque blobs. They should be treated as production data stores with explicit schemas, retention policies, revision semantics, and read/write audit trails. At minimum, log every memory write with provenance (which session, which tool, which user). Build a job that flags contradictions, meaning cases where the agent has stored two mutually exclusive facts about the same entity. Define a policy for what "forgetting" means: time-based eviction, supersession by newer facts, or explicit deletion on user request. The "Is Agent Memory a Database?" paper is a good primer on the failure modes here.
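
A minimal sketch of the provenance log and the contradiction job might look like the following. The MemoryWrite record and its field names are hypothetical, and the check makes a simplifying assumption: every attribute is single-valued, so any two different values stored for the same entity and attribute count as mutually exclusive. Multi-valued attributes (a list of phone numbers, say) would need a different rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryWrite:
    """One logged memory write with its provenance (hypothetical schema)."""
    entity: str
    attribute: str
    value: str
    session_id: str   # which session produced the write
    tool: str         # which tool produced the write
    user_id: str      # which user the session belonged to


def find_contradictions(
    writes: list[MemoryWrite],
) -> dict[tuple[str, str], list[MemoryWrite]]:
    """Group writes by (entity, attribute); keep groups with differing values."""
    by_key: dict[tuple[str, str], list[MemoryWrite]] = defaultdict(list)
    for w in writes:
        by_key[(w.entity, w.attribute)].append(w)
    return {
        key: group
        for key, group in by_key.items()
        if len({w.value for w in group}) > 1  # assumes single-valued attributes
    }
```

Because each flagged group carries the full write records, whoever triages a contradiction can see which session and tool introduced each conflicting value, which is the point of logging provenance in the first place.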

Action three: build a maintenance harness, not just a deployment pipeline. Agents need the equivalent of database maintenance windows. Schedule periodic memory compaction with verifiable diffs (what did we collapse, what did we drop). Schedule fact reconciliation against canonical sources of truth: your CRM, your product catalogue, your policy database. Schedule retrieval index rebuilds. Crucially, run your replay benchmark after each maintenance pass to catch regressions. If your current deployment process ends at "the agent is live," you are missing a large share of the operational surface area.
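
One way to picture a compaction pass with a verifiable diff is sketched below in Python. It assumes memories are plain dictionaries with id, entity, attribute, and written_at keys, and it uses a deliberately simple rule (keep the newest record per entity and attribute); both the record shape and the rule are illustrative assumptions, as is the two-point tolerance in the gate that compares replay results before and after the pass.

```python
def compact(memories: list[dict]) -> tuple[list[dict], dict]:
    """Keep the newest record per (entity, attribute) and report the diff.

    On a tie in written_at, the record seen first is kept.
    """
    newest: dict[tuple[str, str], dict] = {}
    for m in memories:
        key = (m["entity"], m["attribute"])
        if key not in newest or m["written_at"] > newest[key]["written_at"]:
            newest[key] = m
    kept = list(newest.values())
    kept_ids = {m["id"] for m in kept}
    diff = {
        "before": len(memories),
        "after": len(kept),
        # Exactly which records this pass removed, for audit and rollback.
        "dropped_ids": sorted(m["id"] for m in memories if m["id"] not in kept_ids),
    }
    return kept, diff


def accept_pass(rate_before: float, rate_after: float, tolerance: float = 0.02) -> bool:
    """Gate a maintenance pass on the replay benchmark (tolerance is assumed)."""
    return rate_after >= rate_before - tolerance
```

The diff is what makes the pass reviewable: it records how many memories went in, how many came out, and which ids were dropped. Feeding the replay completion rate from before and after the pass into accept_pass gives a simple keep-or-roll-back decision.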

Where this connects to platform reliability

The deeper point is that agentic systems are operational systems, not model artefacts. They share more DNA with stateful distributed systems than with classifiers. That means the disciplines that keep databases, queues, and caches healthy — observability, capacity planning, schema evolution, backup and restore, chaos testing — all need analogues in your agent stack. Most teams shipping agents today have none of these. The gap between "the agent works on the demo" and "the agent works in month six" is the gap between a prototype and a production system, and it is wider than the benchmark scores suggest.

This is also where the boundary between AI engineering and platform reliability gets blurry, which is the right outcome. Lifespan engineering is an SRE discipline applied to agents. Treat it as such.

How Anystack helps

Most enterprises do not need a year-long programme to fix this. They need a small team that has shipped long-lived agents before, can audit the memory and retrieval layers in weeks rather than quarters, and can install the observability and replay infrastructure that turns lifespan into a measurable property. A 3-person AI-augmented pod from Anystack typically spends its first fortnight building a lifespan benchmark against your production agent traffic, identifies the top two or three drift modes specific to your deployment, and ships the maintenance harness alongside your existing CI. The output is not a slide deck. It is a running system, owned by your team, that tells you when your agents are getting worse before your customers do.