13 May 2026 · 6 min read

AI Engineering · LLMOps · Compliance

Why Generic LLM Serving Stacks Fail Compliance Workloads

A new paper on LLMOps for fraud and AML shows that compliance prompts break the assumptions baked into generic LLM serving stacks. Here's what engineering leaders should change before scaling regulated AI workloads.

Anystack Engineering

Most enterprise LLM serving stacks were designed around a single mental model: a user types a short question, the model streams a long, free-form answer. That assumption is now leaking into regulated workloads where it does not hold — and the cost shows up as latency, spend, and audit risk.

A paper published this month, Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack, makes the case directly. The authors argue that fraud detection and anti-money-laundering prompts have an inverted shape compared to chat workloads: they are prefix-heavy, schema-constrained, and evidence-rich, producing short structured outputs such as JSON labels or risk factors. Treating them like generic chat traffic leaves significant performance, cost, and governance wins on the table — and introduces failure modes that compliance teams cannot accept.

This matters well beyond banks. Any enterprise using LLMs for KYC, sanctions screening, contract review, claims triage, healthcare coding, or insider-risk detection is running compliance-shaped workloads on chat-shaped infrastructure.


The shape of a compliance prompt

A typical fraud or AML prompt is not a question. It is a stack: a reusable policy preamble, a risk taxonomy, a structured representation of a transaction or document, a few-shot block of prior decisions, and a tight output schema. The policy and taxonomy rarely change. The evidence block changes every call. The output is short and machine-parseable, not prose.
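
Concretely, a minimal sketch of how such a prompt assembles. The block contents and field names here are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

# Static prefix: identical across calls, and therefore the part worth caching.
POLICY_PREAMBLE = "You are an AML analyst. Apply policy v4.2 when..."   # illustrative
RISK_TAXONOMY = "Valid risk factors: STRUCTURING, RAPID_MOVEMENT, ..."  # illustrative
FEW_SHOT_DECISIONS = "Example 1: ...\nExample 2: ..."                   # illustrative
OUTPUT_SCHEMA = '{"verdict": ..., "risk_factors": [...], "confidence": ...}'

@dataclass
class Evidence:
    """The only block that changes on every call."""
    transaction_json: str

def build_prompt(evidence: Evidence) -> str:
    # Static blocks first, in a fixed order, so every call shares the same
    # contiguous token prefix; the volatile evidence block goes last.
    return "\n\n".join([
        POLICY_PREAMBLE,
        RISK_TAXONOMY,
        FEW_SHOT_DECISIONS,
        "Evidence:\n" + evidence.transaction_json,
        "Respond only with JSON matching:\n" + OUTPUT_SCHEMA,
    ])
```

The ordering is the design decision that matters: keeping the volatile evidence last is what makes the shared prefix contiguous, and contiguous is what makes it cacheable.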

The paper identifies three properties that follow from this shape, and each has direct implications for how you serve the model.

Finding 1: Prefix reuse is the dominant cost lever, not model size

In chat workloads, prompts vary widely and KV-cache hit rates are low. In compliance workloads, 70–95% of the prompt tokens are identical across calls — the same policy, the same taxonomy, the same schema. A serving stack that does not aggressively reuse prefix KV cache is paying to recompute the same attention over and over.

The authors show that prefix-aware routing and persistent KV caches change the economics meaningfully: latency drops, throughput rises, and the temptation to "just use a smaller model" partly evaporates because you are no longer paying for the easy 90% of the prompt on every call.

Action for this week: instrument your LLM gateway to log prompt prefix hashes and measure prefix overlap across the last 100,000 production calls per use case. If overlap is above 60% and you are not using a serving runtime with explicit prefix caching (vLLM, SGLang, TensorRT-LLM with prefix reuse enabled, or a managed provider that exposes it), you have a quantifiable, near-term cost reduction sitting on the table. Most teams find this is the single largest unit-economics improvement available before any model swap.
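
As a rough sketch of that instrumentation, one way to estimate prefix overlap from logged prompts. The character-based bucketing and the log shape are assumptions; tokenise instead if you need exact numbers:

```python
import hashlib
from collections import Counter

def prefix_hash(prompt: str, prefix_chars: int = 2000) -> str:
    # Hash the first N characters as a cheap proxy for the shared token prefix.
    return hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()[:16]

def prefix_overlap(prompts: list[str]) -> float:
    """Fraction of calls whose prefix matches at least one other call."""
    counts = Counter(prefix_hash(p) for p in prompts)
    shared = sum(n for n in counts.values() if n > 1)
    return shared / len(prompts) if prompts else 0.0

# e.g. prompts = [row["prompt"] for row in gateway_log]  # hypothetical log shape
# prefix_overlap(prompts) > 0.6 is the threshold worth acting on
```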

Finding 2: Output validation is part of the model, not a downstream concern

Compliance outputs are structured: a JSON object with a verdict, a list of risk factors, a confidence score, citations to the evidence. Free-form generation followed by a JSON parser is the wrong architecture for this. The paper argues — and production experience confirms — that schema-constrained decoding (grammar-constrained sampling, structured output APIs, or constrained beam search) belongs inside the serving layer, not bolted on after.

The reason is not just developer convenience. Unconstrained models periodically emit outputs that parse as valid JSON but violate the schema in subtle ways: an enum value that does not exist in the taxonomy, a citation that points to evidence not in the prompt, a confidence outside [0,1]. These slip past unit tests and into the audit trail. By the time a regulator asks why your model flagged a particular transaction with risk factor WIRE_STRUCT_47, nobody can find that code in the taxonomy because the model invented it.
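
To make those failure modes concrete, here is a sketch of a schema that rejects all three at validation time, written Pydantic-style (v2); the taxonomy values are illustrative. Decode-time enforcement would generate its grammar from this same definition:

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationInfo, field_validator

class RiskFactor(str, Enum):
    # Closed enum: an invented code like WIRE_STRUCT_47 fails validation
    # here instead of entering the audit trail.
    STRUCTURING = "STRUCTURING"
    RAPID_MOVEMENT = "RAPID_MOVEMENT"
    SANCTIONS_PROXIMITY = "SANCTIONS_PROXIMITY"  # illustrative values

class Verdict(BaseModel):
    verdict: str
    risk_factors: list[RiskFactor]
    confidence: float = Field(ge=0.0, le=1.0)  # rejects confidence outside [0,1]
    citations: list[str]

    @field_validator("citations")
    @classmethod
    def citations_in_evidence(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # Rejects citations pointing at evidence that was never in the prompt.
        evidence_ids = (info.context or {}).get("evidence_ids", set())
        missing = [c for c in v if c not in evidence_ids]
        if missing:
            raise ValueError(f"citations not in prompt evidence: {missing}")
        return v

# Verdict.model_validate_json(raw, context={"evidence_ids": {"tx-001", "doc-7"}})
```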

Action for this week: audit every production LLM endpoint that returns structured output. For each, ask three questions. Is the schema enforced at decode time, at parse time, or only at human review? What percentage of outputs in the last 30 days failed schema validation on the first attempt? When validation fails, what happens — silent retry, fallback to a default, or escalation? If the answers are "parse time", "we don't know", and "silent retry", you have an auditability gap that will become a regulatory finding the first time it is examined.
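
The second question is the only one you can answer mechanically. A sketch, assuming you log raw model outputs before any retry or repair; the validator is whatever schema check you run in production, such as the Verdict model above:

```python
from typing import Callable

def first_attempt_failure_rate(
    raw_outputs: list[str],
    validate: Callable[[str], object],  # raises on bad output
) -> float:
    """Share of logged outputs failing schema validation on the first attempt."""
    failures = 0
    for raw in raw_outputs:
        try:
            validate(raw)
        except Exception:  # counts parse errors and schema violations alike
            failures += 1
    return failures / len(raw_outputs) if raw_outputs else 0.0
```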

Finding 3: Model orchestration beats model selection

The paper's third contribution is a model-routing argument. Compliance workloads have a long tail: 80% of cases are unambiguous, and a 7B-parameter model handles them correctly when given the right prefix and schema. The remaining 20% need a frontier model. A single-model stack overpays on the easy cases and underperforms on the hard ones.

The authors propose tiered routing with explicit handoff criteria — confidence thresholds, evidence completeness checks, taxonomy coverage — rather than letting one model handle everything. This is the same pattern that has emerged from production telemetry across many enterprise deployments over the last 18 months: heterogeneous model fleets, with routing logic that is itself observable and testable.
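
A sketch of what explicit handoff criteria can look like in routing code; the signals, threshold, and tier names are illustrative rather than the paper's:

```python
from dataclasses import dataclass

@dataclass
class CaseSignals:
    small_model_confidence: float  # from the cheap tier's structured output
    evidence_complete: bool        # all expected evidence fields present
    factors_in_taxonomy: bool      # no out-of-taxonomy labels emitted

def route(signals: CaseSignals) -> str:
    """Return the model tier that should own the decision."""
    if not signals.evidence_complete:
        return "frontier"  # missing context: do not trust the small model
    if not signals.factors_in_taxonomy:
        return "frontier"  # schema escape: escalate rather than repair
    if signals.small_model_confidence >= 0.9:  # illustrative threshold
        return "small"
    return "frontier"
```

Every branch is explicit so the routing policy itself stays observable and testable: you can count, per rule, how much traffic each handoff criterion sends to the expensive tier.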

Action for this week: for your highest-volume LLM use case, sample 500 production calls and have a domain expert label them by difficulty (clear / ambiguous / requires expert judgement). Run the clear-case subset through a model 2–3 tiers smaller than your current default. If accuracy holds, you have evidence for a routing tier. If accuracy collapses, you have evidence that your prompt is doing more work than you realised — usually because the frontier model is silently compensating for missing context that a smaller model exposes.
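
A minimal harness for that comparison; the label values, call signature, and exact-match agreement metric are all assumptions about your setup:

```python
from typing import Callable

def clear_case_accuracy(
    labelled_calls: list[dict],         # each: {"prompt", "difficulty", "expert_verdict"}
    small_model: Callable[[str], str],  # returns the candidate model's verdict
) -> float:
    """Agreement with expert labels on the clear-case subset only."""
    clear = [c for c in labelled_calls if c["difficulty"] == "clear"]
    if not clear:
        return 0.0
    hits = sum(1 for c in clear if small_model(c["prompt"]) == c["expert_verdict"])
    return hits / len(clear)
```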


What this means for engineering leaders

The broader pattern in the paper is one that applies to every regulated LLM workload, not just fraud and AML. Generic serving infrastructure optimises for the wrong shape. The wins available — often 3–10x on cost, 2–5x on latency, and meaningful reductions in audit risk — come from treating prefix structure, output schema, and model routing as first-class concerns of the serving stack, not as application-layer afterthoughts.

Three organisational implications follow.

First, LLMOps is not MLOps with a different model artifact. The serving runtime, the prompt registry, the schema registry, and the routing policy are all production systems that need versioning, observability, and rollback. If your platform team treats the LLM gateway as a thin proxy to a vendor API, you do not have an LLMOps capability — you have a billing relationship.

Second, compliance and engineering need a shared schema language. The taxonomy your model uses to label risk is the same taxonomy your compliance officers use in their procedures. When these diverge — and they always diverge — the model becomes ungovernable. The fix is dull but effective: a single source of truth for taxonomies and schemas, generated into both the model's structured output grammar and the compliance team's documentation.
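
A sketch of what that single source of truth can look like; the entries and the two generation targets are illustrative:

```python
# taxonomy.py: the one place risk factors are defined.
TAXONOMY = {
    "STRUCTURING": "Transactions split to stay under reporting thresholds.",
    "RAPID_MOVEMENT": "Funds moved through accounts with minimal dwell time.",
}  # illustrative entries

def to_output_grammar_enum() -> dict:
    """Feeds the model's structured-output schema."""
    return {"type": "string", "enum": sorted(TAXONOMY)}

def to_procedure_docs() -> str:
    """Feeds the compliance team's procedure documentation."""
    return "\n".join(f"- {code}: {desc}" for code, desc in sorted(TAXONOMY.items()))
```

Both artifacts regenerate from the same definition, so a taxonomy change cannot reach the model without reaching the documentation, or the reverse.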

Third, evaluation has to be continuous, prefix-aware, and schema-aware. A static test set evaluated quarterly will not catch the drift that matters: a new evidence type that the prefix wasn't designed for, a taxonomy update that hasn't propagated to the grammar, a routing rule that quietly sends 30% more traffic to the expensive tier after a prompt change.
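
The routing-drift case is the simplest to watch mechanically; a sketch, assuming you log the chosen tier per call:

```python
def expensive_tier_share(tier_log: list[str], window: int = 10_000) -> float:
    """Rolling share of calls routed to the expensive tier; alert on a
    jump after any prompt, taxonomy, or routing-rule change."""
    recent = tier_log[-window:]
    if not recent:
        return 0.0
    return sum(1 for t in recent if t == "frontier") / len(recent)
```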


Anystack helps engineering organisations move from chat-shaped LLM infrastructure to workload-shaped infrastructure. Our AI integration and copilot engineering work focuses on the unglamorous middle layer — prefix caching, schema-constrained decoding, tiered routing, evaluation harnesses — that determines whether a regulated LLM workload survives its first audit. Where these workloads sit on top of platform infrastructure that also needs to be production-grade, our platform reliability practice covers the observability and rollback patterns that turn an LLM gateway into a system you can defend on a Friday afternoon.
