29 April 2026 · 6 min read

AI integration & copilot engineering · LLM agents · QA modernisation

Treating Text-to-SQL Like Test Coverage: What PExA Tells Us About Production LLM Agents

New research reframes complex text-to-SQL generation as a test coverage problem, using parallel atomic queries to break the latency-accuracy trade-off. The implications go well beyond SQL.

Anystack Engineering

Most enterprise teams piloting LLM agents have hit the same wall: every accuracy improvement seems to cost latency, and every latency optimisation costs accuracy. A paper published this week on arXiv, PExA: Parallel Exploration Agent for Complex Text-to-SQL (https://arxiv.org/abs/2604.22934), proposes a reframing that is worth the attention of any engineering leader running copilots over structured data.

The authors argue that text-to-SQL agents have been optimised the wrong way. Rather than asking a single agent to draft, critique, and repair one monolithic query, they treat the original natural-language request the way a QA engineer treats a feature: as something that requires a suite of test cases providing semantic coverage. Each test case is a simpler, atomic SQL statement. The atomic queries are executed in parallel against the warehouse, and the agent iterates on coverage until the original request is satisfied. The framing is borrowed directly from software testing, and the results suggest the borrowing is overdue.
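Schematically, the execution side of that idea looks like the sketch below. The schema, table, and atomic queries are invented for illustration; in PExA the decomposition itself is produced by the model, whereas here it is hard-coded so that only the parallel-execution pattern is on display.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Toy warehouse standing in for the real database.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 5.0), ("US", 7.0)])

# Two atomic sub-queries whose combined results cover a request like
# "which region had the highest revenue, and how many orders did it place?"
ATOMIC_QUERIES = [
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region",
    "SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region",
]

def execute_in_parallel(connection, queries):
    # Read-only queries can share one serialized SQLite connection;
    # each runs in its own thread with its own cursor.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        return list(pool.map(lambda q: connection.execute(q).fetchall(), queries))

results = execute_in_parallel(conn, ATOMIC_QUERIES)
```

Each result set is concrete evidence about one facet of the original request; the agent's remaining job is aggregation and checking that the facets together cover the user's intent.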

Finding 1: The latency-accuracy trade-off is an artefact of sequential agent design

The dominant pattern in production LLM agents today is sequential: plan, draft, self-critique, repair, retry. Each step adds wall-clock time, and each retry compounds it. PExA's parallel decomposition shows that when you split a hard query into atomic sub-queries that can run concurrently, you get both higher semantic coverage and lower end-to-end latency than a single large reasoning chain. The trade-off curve people thought they were stuck on was a property of the architecture, not the problem.

The practical action here is to audit your agent topology before you tune your prompts. Most teams we see have spent months on prompt engineering and model upgrades while leaving an inherently sequential plan-act-reflect loop in place. Map out the dependency graph of your agent's reasoning steps. Anywhere two steps do not strictly depend on each other's output, they should run in parallel. For text-to-SQL, code search, document analysis, and most retrieval-heavy workloads, the parallelisable surface is much larger than teams assume.
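One concrete way to exploit that dependency graph is sketched below. The step functions are hypothetical stubs: `fetch_schema` and `fetch_similar_examples` stand in for real metadata and retrieval services, and `draft_sql` stands in for the LLM call. The point is structural: the two independent steps run concurrently, and only the dependent step waits.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub agent steps; a real system would call an LLM, a vector store,
# or a metadata service here.
def fetch_schema(request):
    return {"tables": ["orders", "customers"]}

def fetch_similar_examples(request):
    return ["SELECT ... FROM orders ..."]

def draft_sql(request, schema, examples):
    return f"-- draft for {request!r} over {schema['tables']}"

def handle(request):
    # fetch_schema and fetch_similar_examples do not depend on each
    # other's output, so they run in parallel; draft_sql needs both.
    with ThreadPoolExecutor(max_workers=2) as pool:
        schema_future = pool.submit(fetch_schema, request)
        examples_future = pool.submit(fetch_similar_examples, request)
        schema, examples = schema_future.result(), examples_future.result()
    return draft_sql(request, schema, examples)

out = handle("top regions by revenue")
```

In a sequential loop those two fetches would add their latencies; here the slower of the two sets the wall-clock cost.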

Finding 2: Test coverage is a better mental model for agent correctness than self-critique

The more interesting contribution is conceptual. Self-critique loops, where an LLM judges its own output, are known to be unreliable: the model that produced the error is rarely the best judge of it. PExA replaces self-critique with executable coverage. Each atomic SQL is run against the database; the results are concrete evidence, not opinion. Coverage of the original semantic intent is then measured against those results.

This matters because it gives engineering teams a verification primitive that does not depend on a second LLM call. For an enterprise leader, the action is to insist that every agent in production has a deterministic verifier. If your agent generates SQL, the verifier is the database. If it generates code, the verifier is a test runner or a type checker. If it generates a workflow, the verifier is a dry-run executor. Agents whose only check is another LLM call are not engineered systems; they are stochastic processes wearing a costume. This is the same discipline that test automation has enforced on application code for two decades, and it is the discipline that QA modernisation programmes need to extend to AI features.
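A minimal version of that primitive for SQL might use the database's own planner as the judge. The sketch below relies on SQLite's `EXPLAIN`, which parses and plans a statement against the real schema without executing it, so syntax errors and references to missing tables or columns surface immediately. The sandbox schema is invented; a production verifier would point at a schema-complete replica.

```python
import sqlite3

def verify_sql(sandbox, sql):
    """Deterministic verifier sketch: let the database judge the query."""
    try:
        sandbox.execute("EXPLAIN " + sql)  # compiles, does not run
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

# Sandbox with the production schema but no (or synthetic) data.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE orders (region TEXT, amount REAL)")

ok, _ = verify_sql(sandbox, "SELECT region, SUM(amount) FROM orders GROUP BY region")
bad, msg = verify_sql(sandbox, "SELECT revenue FROM orders")  # no such column
```

The verdict costs microseconds, involves no LLM call, and is exactly as trustworthy as the schema it checks against.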

Finding 3: Decomposition primitives are reusable across agent domains

PExA is framed as a text-to-SQL paper, but the decomposition strategy is domain-agnostic. The pattern is: take a complex request, generate a set of simpler sub-requests whose combined results provide semantic coverage of the original, execute in parallel, and aggregate. This applies to multi-hop code understanding (see also the SWE-QA benchmark released the same week, https://arxiv.org/abs/2604.24814, which shows current LLMs still struggle with reasoning across dispersed code segments), to multi-document retrieval, to financial analysis pipelines, and to incident triage.

The action for a Head of Engineering is to treat agent decomposition as a platform capability, not a per-feature concern. If three teams are independently building plan-and-execute agents for SQL, code search, and ticket triage, they are almost certainly reinventing the same parallel decomposition logic. Centralise it. A small platform team owning the agent runtime, the parallel executor, the coverage verifier, and the observability layer will produce better outcomes than five product teams each rolling their own. This is the same argument that drove the consolidation of CI/CD platforms a decade ago, and the economics are similar: shared infrastructure, shared reliability, shared cost controls.
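What such a shared primitive might look like is sketched below. The three callables are trivial stand-ins; a real platform runtime would add verification, budgets, retries, and tracing around the same skeleton.

```python
from concurrent.futures import ThreadPoolExecutor

def run_decomposed(request, decompose, execute, aggregate, max_workers=8):
    """Domain-agnostic decompose -> parallel execute -> aggregate.

    A SQL agent, a code-search agent, and a triage agent would differ
    only in the callables they plug in; pooling, and in a real system
    verification and observability, stay shared."""
    subs = decompose(request)
    with ThreadPoolExecutor(max_workers=min(max_workers, max(len(subs), 1))) as pool:
        results = list(pool.map(execute, subs))
    return aggregate(request, results)

# Stand-in callables for illustration only.
total = run_decomposed(
    "alpha beta gamma",
    decompose=str.split,                  # split request into sub-requests
    execute=len,                          # pretend this hits a backend
    aggregate=lambda req, res: sum(res),  # pretend this merges results
)
```

Five product teams writing this loop independently will produce five incompatible versions of the hard parts: error handling, partial failure, and cost accounting.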


A note on the cost dimension

Parallel execution sounds expensive, and naively it is: you are issuing more LLM calls and more database queries per user request. But the PExA results suggest the picture is more nuanced. Sequential agents with self-critique loops often issue 5–10 LLM calls per request when retries are counted; a parallel decomposition that hits coverage on the first pass can come in at fewer total tokens with lower wall-clock latency. The lesson for cloud cost optimisation is to measure tokens-per-successful-request, not tokens-per-call. A cheaper-per-call architecture that retries three times is more expensive than a more elaborate architecture that succeeds once.
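The metric itself is simple to compute; the discipline is in sourcing `succeeded` from a deterministic verifier rather than an LLM judge. The traces below are illustrative numbers, not benchmarks.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    tokens: int       # total LLM tokens across all calls, retries included
    succeeded: bool   # did a deterministic verifier accept the answer?

def tokens_per_successful_request(traces):
    # Correctness-weighted cost: total spend divided by verified successes.
    total = sum(t.tokens for t in traces)
    wins = sum(1 for t in traces if t.succeeded)
    return float("inf") if wins == 0 else total / wins

# The sequential agent is cheaper per call but fails half the time;
# the parallel agent spends less per verified success.
sequential = [RequestTrace(3_000, False), RequestTrace(3_000, True)]
parallel = [RequestTrace(2_500, True), RequestTrace(2_500, True)]
```

On a per-call dashboard the two architectures look close; on the correctness-weighted metric the sequential agent costs more than twice as much.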

We have seen this pattern repeatedly with clients. A naive cost dashboard shows the parallel architecture using more compute. A correctness-weighted dashboard shows it using less. Engineering leaders who only look at the first dashboard end up optimising for the wrong number and quietly degrading their product.

What this means for the next twelve months

Three shifts are worth planning for.

First, agent architecture reviews will become a standard engineering practice, the way code reviews and design reviews are today. The cost of a badly designed agent is now high enough, in both spend and user trust, that it warrants a formal review gate before production.

Second, deterministic verifiers will become a hiring and tooling priority. Teams that can wire an LLM agent into a database, a test runner, a type system, or a simulator will ship more reliable AI features than teams that cannot. The skill set sits at the intersection of QA engineering and AI engineering, and it is in short supply.

Third, the test-coverage metaphor will spread. Expect to see coverage-style metrics applied to agent outputs across domains: not line coverage, but semantic coverage of the user's intent, measured against executable verifiers. The teams that adopt this vocabulary early will have a clearer language for talking about AI quality with their stakeholders.
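As a rough illustration of what a coverage-style metric over agent outputs could look like: the facet representation and the string-containment matching rule below are entirely hypothetical, and a real evaluation harness would define both far more carefully.

```python
def semantic_coverage(facets, verified_results):
    """Fraction of the intent's facets that at least one verified
    sub-result accounts for. Facets and results are plain strings
    here purely for illustration."""
    covered = {f for f in facets if any(f in r for r in verified_results)}
    return len(covered) / len(facets) if facets else 1.0

# Two facets of the user's intent, one verified result covering one of them.
partial = semantic_coverage(["revenue", "orders"], ["revenue by region"])
full = semantic_coverage(["revenue"], ["revenue by region"])
```

However it is defined, the number only means something if `verified_results` came from an executable verifier, not from a model grading itself.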

How Anystack helps

Anystack works with enterprise engineering organisations on exactly these problems. Our AI integration practice runs agent architecture reviews that map sequential bottlenecks and identify parallelisable decomposition surfaces. Our QA modernisation practice builds the deterministic verifier layer, including database sandboxes, test-runner harnesses, and coverage-style evaluation suites that give your teams a non-LLM source of truth. And our platform reliability work centralises the agent runtime so that decomposition, parallel execution, and verification become shared services rather than per-team reinventions. If you are running production LLM agents and the latency-accuracy trade-off feels like a law of physics, it probably is not.

Free engineering audit

Want a structured assessment of where this applies to your stack? Our 30-minute tech audit is free.

Request audit →

Start a conversation

Facing a version of this in your organisation? We scope engagements in a single call.

Book a 30-min call →

See the evidence

Read how we've delivered these outcomes for clients in fintech, healthcare, and telecom.

Browse case studies →