When Optimal Plans Break on Contact: The Post-Solve Robustness Gap

A new position paper argues that MILP decision engines hand engineering teams nominally optimal plans that quietly fail under tiny real-world perturbations. Here's what enterprise leaders should do about it.

Anystack Engineering

A June 2026 position paper, Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations, makes an uncomfortable argument: the Mixed-Integer Linear Programming (MILP) solvers that run logistics, workforce scheduling, energy dispatch, ad allocation and supply chain planning across the enterprise produce plans that are *nominally* optimal but operationally fragile. Small drifts in costs, demands or capacities can either make the plan infeasible outright or trigger discontinuous jumps to qualitatively different solutions. The authors call this the post-solve robustness gap, and argue it is a missing evaluation dimension in both classical optimisation and the new wave of learning-enabled decision systems.

For CTOs, this matters because MILP-style decision engines have quietly become load-bearing in places leadership rarely audits. The solve runs nightly, the plan goes out, and when reality diverges by 3% on a single input, the on-call team is paged at 04:00 to manually patch a schedule. That's not an optimisation problem. That's a production reliability problem dressed up as mathematics.

What the paper actually says

Three findings deserve attention from engineering leaders, even if you've never personally written a constraint.

First, optimal is not the same as stable. A MILP solver returns a single point in a feasible region. Two solutions with near-identical objective values can have wildly different structures — different vehicles assigned, different shifts allocated, different SKUs sourced from different suppliers. The solver has no incentive to prefer the structurally stable one. So a 0.5% improvement in objective can come at the cost of a solution that flips entirely when tomorrow's demand forecast moves by one standard deviation.
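To make the near-tie concrete, here is a minimal sketch in Python. The cost matrix, the task-and-vehicle framing and the helper names are invented for illustration and do not come from the paper; a brute-force search stands in for a real MILP solver.

```python
from itertools import permutations

# Hypothetical cost matrix: COSTS[task][vehicle]. All numbers are invented.
COSTS = [
    [10.00, 10.03, 50.00],
    [50.00, 10.00, 10.03],
    [10.03, 50.00, 10.00],
]


def plan_cost(costs, plan):
    """Total cost when task t is served by vehicle plan[t]."""
    return sum(costs[task][vehicle] for task, vehicle in enumerate(plan))


def solve(costs):
    """Brute-force stand-in for a MILP solver: cheapest one-to-one assignment."""
    return min(permutations(range(len(costs))),
               key=lambda plan: plan_cost(costs, plan))


def hamming(plan_a, plan_b):
    """Number of tasks whose vehicle differs between two plans."""
    return sum(a != b for a, b in zip(plan_a, plan_b))


nominal = solve(COSTS)

# Raise a single cost entry by 1% and solve again.
shocked = [row[:] for row in COSTS]
shocked[0][0] *= 1.01
perturbed = solve(shocked)

# Relative cost gap between the two plans, measured on the nominal data.
gap = plan_cost(COSTS, perturbed) / plan_cost(COSTS, nominal) - 1
print(nominal, perturbed, hamming(nominal, perturbed), f"{gap:.2%}")
```

With these numbers the two plans differ by about 0.3% in cost on the nominal data, yet a 1% rise in one cost entry reassigns all three tasks. The solver is behaving correctly; it simply has no reason to prefer the plan that survives the shock.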

Second, feasibility is brittle at the boundary. Industrial MILP optima typically sit tight against capacity constraints, because that is where the value is. But a tight constraint has no slack, so a small perturbation pushes the plan outside the feasible region. The plan isn't merely suboptimal under perturbation; it's *infeasible*. Teams then resort to ad-hoc relaxations, manual overrides, or rerunning the solver under time pressure with no guarantee of consistency.
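The arithmetic behind that brittleness is simple enough to sketch. The example below assumes a single capacity constraint and uniform growth in every load; the loads, the capacity and the function names are hypothetical.

```python
# Hypothetical plan: three loads packed onto one resource with capacity 100.
LOADS = [40.0, 35.0, 23.0]
CAPACITY = 100.0


def headroom(loads, capacity):
    """Largest uniform relative growth in loads before capacity is exceeded."""
    return capacity / sum(loads) - 1.0


def is_feasible(loads, capacity, growth=0.0, tol=1e-9):
    """Check the capacity constraint after scaling every load by (1 + growth)."""
    return sum(load * (1.0 + growth) for load in loads) <= capacity + tol


print(f"headroom: {headroom(LOADS, CAPACITY):.2%}")
for growth in (0.0, 0.02, 0.03):
    print(f"growth {growth:.0%}: feasible={is_feasible(LOADS, CAPACITY, growth)}")
```

A plan that uses 98 of 100 units of capacity has roughly 2% headroom, so it survives 2% growth and breaks at 3%. Reporting that headroom next to the objective value is one inexpensive way to show how close a plan sits to the edge.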

Third, learning-enabled decision systems inherit the same gap, and hide it better. The recent trend of wrapping solvers in ML — predict-then-optimise, end-to-end learned surrogates, LLM-driven decision agents — produces systems that look smooth and modern but rest on the same discontinuous foundation. The paper argues current benchmarks for these systems measure solve-time accuracy, not deployment-time robustness, so the gap is invisible until it hits production.

What to do this week

Three concrete actions, none of which require rewriting your optimiser.

First, instrument the gap between plan and execution. For every decision engine in production, log the input assumptions at solve time and the actual realised values at execution time. Compute the drift distribution. If you can't answer the question "how often does reality diverge from solver inputs by more than X%?" you're flying blind. This is a one-sprint exercise for a competent platform team and it surfaces the problem in concrete numbers a CFO understands.
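As a starting point, the drift computation can be sketched in a few lines of Python. The record layout (`input`, `assumed`, `realised`), the sample values and the 2% threshold are assumptions made for illustration; in practice the records would come from your own solve-time and execution-time logs.

```python
from statistics import median


def relative_drift(assumed, realised):
    """Absolute relative difference between a solver input and what happened."""
    if assumed == 0:
        return 0.0 if realised == 0 else float("inf")
    return abs(realised - assumed) / abs(assumed)


def drift_summary(records, threshold):
    """Summarise drift and the share of inputs that moved more than threshold."""
    drifts = [relative_drift(r["assumed"], r["realised"]) for r in records]
    over = sum(d > threshold for d in drifts)
    return {
        "n": len(drifts),
        "median": median(drifts),
        "max": max(drifts),
        "share_over_threshold": over / len(drifts),
    }


# Hypothetical log records: one row per solver input per solve.
RECORDS = [
    {"input": "demand_A", "assumed": 100.0, "realised": 103.0},
    {"input": "demand_B", "assumed": 200.0, "realised": 198.0},
    {"input": "capacity_X", "assumed": 50.0, "realised": 50.0},
    {"input": "cost_Y", "assumed": 10.0, "realised": 11.2},
]

summary = drift_summary(RECORDS, threshold=0.02)
print(summary)
```

Here two of the four inputs drift by more than 2%, so `share_over_threshold` is 0.5. Running the same summary per input family (demand, capacity, cost) usually shows which assumptions deserve a robustness budget first.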

Second, add a perturbation test suite to your optimisation pipeline. Treat the solver like any other production component: it deserves a test harness. Pick the top 20 historical solves, perturb each input by ±1%, ±5% and ±10%, and measure two things: whether the original solution remains feasible, and whether the re-solved plan changes structure materially (Hamming distance on assignments, meaning the number of assignments that differ between two plans, is a reasonable proxy). Plans that flip structure under 5% perturbation are not production-ready; they're statistical artefacts. This is essentially QA modernisation applied to decision systems, and almost no one is doing it.
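A harness of this kind can be small. The sketch below shocks one input at a time and records both measurements. The toy problem (three jobs, two machines), its capacities and cost rates, and the brute-force `solve` are placeholders for your own model and solver, not anything prescribed by the paper.

```python
from itertools import product

CAPS = [7.0, 5.0]    # hypothetical machine capacities
RATES = [1.0, 1.2]   # hypothetical cost per unit of load on each machine
TOL = 1e-9


def is_feasible(plan, loads):
    """plan[j] is the machine for job j; check every machine's capacity."""
    used = [0.0] * len(CAPS)
    for job, machine in enumerate(plan):
        used[machine] += loads[job]
    return all(u <= cap + TOL for u, cap in zip(used, CAPS))


def solve(loads):
    """Brute-force stand-in for the real solver: cheapest feasible plan."""
    best, best_cost = None, float("inf")
    for plan in product(range(len(CAPS)), repeat=len(loads)):
        if not is_feasible(plan, loads):
            continue
        cost = sum(loads[j] * RATES[m] for j, m in enumerate(plan))
        if cost < best_cost:
            best, best_cost = plan, cost
    return best


def hamming(plan_a, plan_b):
    return sum(a != b for a, b in zip(plan_a, plan_b))


def perturbation_report(loads, levels=(0.01, 0.05, 0.10)):
    """Shock each input by +/- each level and compare against the baseline."""
    baseline = solve(loads)
    rows = []
    for i in range(len(loads)):
        for level in levels:
            for sign in (1, -1):
                shocked = list(loads)
                shocked[i] *= 1 + sign * level
                resolved = solve(shocked)
                rows.append({
                    "input": i,
                    "shock": sign * level,
                    # Does yesterday's plan still fit today's inputs?
                    "baseline_feasible": is_feasible(baseline, shocked),
                    # How many assignments move if we re-solve?
                    "hamming": (hamming(baseline, resolved)
                                if resolved is not None else len(baseline)),
                })
    return baseline, rows


baseline, rows = perturbation_report([4.0, 3.0, 2.0])
broken = sum(not r["baseline_feasible"] for r in rows)
flipped = sum(r["hamming"] > 0 for r in rows)
print(baseline, f"{broken}/{len(rows)} infeasible", f"{flipped}/{len(rows)} restructured")
```

In this toy case the baseline plan fills the first machine exactly, so every upward shock to either job on that machine breaks feasibility and moves two of the three assignments on re-solve. Shocking one input at a time is a design choice that keeps results attributable; joint or randomised shocks are a natural next step.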

Third, add a stability term to the objective instead of optimising cost alone. Most MILPs optimise pure cost or pure throughput. A small penalty for deviation from the previous plan, or for proximity to constraint boundaries, often costs less than 1% of objective value and dramatically reduces operational thrash. The business value of a plan that doesn't change overnight when nothing important happened is enormous and almost never measured.
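One way to picture the deviation penalty is the sketch below, which adds a per-change charge to a brute-force assignment solver. The cost matrix, yesterday's plan and the penalty weight of 0.05 are invented for illustration; choosing the weight for a real system is a tuning exercise.

```python
from itertools import permutations


def plan_cost(costs, plan):
    """Total cost when task t is served by vehicle plan[t]."""
    return sum(costs[task][vehicle] for task, vehicle in enumerate(plan))


def hamming(plan_a, plan_b):
    return sum(a != b for a, b in zip(plan_a, plan_b))


def solve(costs, previous=None, penalty=0.0):
    """Brute-force stand-in for a solver; objective = cost + penalty * churn."""
    def objective(plan):
        churn = hamming(plan, previous) if previous is not None else 0
        return plan_cost(costs, plan) + penalty * churn
    return min(permutations(range(len(costs))), key=objective)


YESTERDAY = (0, 1, 2)   # hypothetical plan already in execution

# Hypothetical costs for today: one entry has drifted up by 1%.
TODAY_COSTS = [
    [10.10, 10.03, 50.00],
    [50.00, 10.00, 10.03],
    [10.03, 50.00, 10.00],
]

cost_only = solve(TODAY_COSTS)
stable = solve(TODAY_COSTS, previous=YESTERDAY, penalty=0.05)

# Extra cost paid for keeping yesterday's structure.
premium = plan_cost(TODAY_COSTS, stable) / plan_cost(TODAY_COSTS, cost_only) - 1
print(cost_only, stable, f"{premium:.3%}")
```

With these numbers the cost-only solve reassigns every task to save about 0.03% of cost, while the penalised solve keeps yesterday's plan. In a MILP with binary assignment variables the same churn term stays linear: for each variable that was 1 yesterday, charge the penalty times (1 - x), and for each that was 0, charge the penalty times x.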

Why this is a platform problem, not a data science problem

The instinct in most enterprises is to send this to the OR team or the ML team. That's a mistake. Post-solve robustness is fundamentally about the contract between a decision system and the systems that consume its output — schedulers, dispatchers, ERPs, downstream services. It's a platform reliability concern with optimisation flavour. Treating it as a modelling problem produces better models that still fail in production. Treating it as a platform reliability problem produces decision engines that degrade gracefully, expose their uncertainty, and integrate with the rest of your observability stack.

Concretely, that means decision engines should emit not just a plan but a stability score, a feasibility margin, and a structured set of fallbacks. They should be deployable through the same CI/CD pipeline as application code, with the same canary discipline. Output should be versioned and diffable. None of this is exotic — it's table stakes for any production system — but optimisation engines, often built by specialist teams outside the platform group, rarely get the treatment.
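What such an output contract might look like is sketched below. The `PlanReport` fields, the definitions chosen for the stability score (share of assignments unchanged) and the feasibility margin (smallest relative slack across resources), and all sample data are assumptions for illustration rather than a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class PlanReport:
    version: str
    assignments: dict          # e.g. {"task-1": "van-A"}
    stability_score: float     # assumed: share of assignments unchanged
    feasibility_margin: float  # assumed: smallest relative slack
    fallbacks: list = field(default_factory=list)

    def to_json(self):
        # Sorted keys keep the serialised plan stable for text diffs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


def diff_plans(old, new):
    """Map each changed key to its (old, new) assignment."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def build_report(version, assignments, previous, usage, capacity, fallbacks=()):
    keys = set(previous) | set(assignments)
    changed = diff_plans(previous, assignments)
    stability = 1.0 - len(changed) / len(keys) if keys else 1.0
    margin = min((capacity[r] - usage.get(r, 0.0)) / capacity[r] for r in capacity)
    return PlanReport(version, dict(assignments), stability, margin, list(fallbacks))


# Hypothetical data for two consecutive nightly plans.
previous = {"task-1": "van-A", "task-2": "van-B", "task-3": "van-A"}
current = {"task-1": "van-A", "task-2": "van-A", "task-3": "van-A"}

report = build_report(
    version="2026-06-02",
    assignments=current,
    previous=previous,
    usage={"van-A": 95.0, "van-B": 0.0},
    capacity={"van-A": 100.0, "van-B": 80.0},
    fallbacks=["move task-2 back to van-B"],
)
print(report.to_json())
print(diff_plans(previous, current))
```

Because the report is plain, sorted JSON, it can be stored per version and compared with ordinary diff tooling, and a canary check can refuse to promote a plan whose margin or stability score falls below an agreed threshold.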

The wider pattern

This is not the first time this year that a high-profile paper has pointed at the same underlying issue: systems that are accurate on benchmark distributions and fragile under deployment-time drift. We saw it with LLM agents losing reliability as context grew. We saw it with predict-then-optimise pipelines, where small forecast errors compounded into large decision errors. The post-solve robustness gap is the same pattern in classical OR clothing.

The through-line for engineering leaders is that evaluation regimes have not caught up with deployment realities. Whatever the underlying technology — MILP, LLM, gradient-boosted forecaster, RL policy — the question "how does this behave when inputs drift?" is rarely asked at procurement time, rarely measured in CI, and rarely surfaced in dashboards. Teams that close this loop will outperform those that don't, regardless of which specific algorithms they choose.

How Anystack helps

A 3-person AI-augmented pod typically takes 60–90 days to audit an existing decision engine, instrument the plan-versus-execution gap, build the perturbation test harness, and ship a stability-aware deployment pipeline. The pod pairs a senior optimisation engineer with a platform/SRE specialist and an AI engineer — because the problem sits at the seam between those disciplines, and that seam is exactly where most enterprise decision systems break. No bench, no juniors, no quarter-long discovery phase.

If you operate a MILP, scheduling, dispatch or allocation engine in production and you can't currently answer the perturbation question on it, that's the signal. The fix is mechanical, the cost is bounded, and the payoff shows up in fewer 04:00 pages within the first month.