17 May 2026 · 5 min read
Delivery & CI/CD · ClickHouse · Observability · Platform Reliability
When the Query Planner Becomes the Bottleneck: Lessons from Cloudflare's ClickHouse Billing Stall
A partitioning change broke Cloudflare's petabyte-scale billing pipeline — but the smoking gun wasn't IO or CPU. It was lock contention inside ClickHouse's query planner. Three takeaways for engineering leaders running data-intensive platforms.
In April 2026, Cloudflare's engineering team published a forensic write-up of a billing pipeline incident that should be required reading for anyone running analytical databases at scale. A routine partitioning change to their petabyte-scale ClickHouse cluster caused critical billing jobs to stall — and every dashboard they had said the system was healthy. CPU was fine. IO was fine. Query errors were absent. Yet jobs that normally completed in minutes were now timing out.
The culprit, eventually traced and patched upstream, was severe lock contention inside ClickHouse's query planner — a layer most operators never instrument and most observability stacks never surface. The full post-mortem is at blog.cloudflare.com/clickhouse-query-plan-contention, and it's worth reading in full. But the broader pattern it exposes is one we see repeatedly with enterprise clients: the metrics that look green are not the metrics that matter.
What actually happened
The team made what looked like a low-risk schema change: adjusting partitioning on a high-volume table to improve query pruning. Throughput on read-heavy paths improved as expected. But a subset of jobs — specifically those issuing many concurrent queries against the same table — began to degrade. Latency rose. Then it cliff-edged.
Standard ClickHouse metrics (query duration histograms, merge queue depth, replication lag, disk IO) showed nothing alarming. The team had to drop into stack traces and flame graphs on running ClickHouse processes to find that worker threads were spending the majority of wall-clock time waiting on a single mutex deep inside the query planner. The new partitioning scheme had inflated the cost of plan construction, and under concurrent load that cost serialised behind a global lock.
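To see why a modest rise in planning cost can turn into a cliff under concurrency, consider a toy model. This is an illustration only, not Cloudflare's or ClickHouse's actual code: it assumes every query arrives at the same moment, holds one global lock for `plan_ms` while its plan is built, and then executes fully in parallel for `exec_ms`. The function names and all numbers are made up for the example.

```python
def completion_times_ms(n_queries: int, plan_ms: float, exec_ms: float) -> list[float]:
    """Completion time of each query when all of them arrive at t=0.

    Planning is serialised behind a single global lock, so the i-th query
    (0-based) cannot start executing until i + 1 plans have been built.
    Execution itself is assumed to be fully parallel.
    """
    return [(i + 1) * plan_ms + exec_ms for i in range(n_queries)]


def worst_latency_ms(n_queries: int, plan_ms: float, exec_ms: float) -> float:
    """Latency seen by the last query through the lock."""
    return completion_times_ms(n_queries, plan_ms, exec_ms)[-1]


if __name__ == "__main__":
    # Cheap plan vs. inflated plan (hypothetical figures), alone and under load.
    for plan_ms in (1.0, 20.0):
        for n in (1, 64):
            worst = worst_latency_ms(n, plan_ms, exec_ms=100.0)
            print(f"plan={plan_ms}ms concurrency={n}: worst latency={worst}ms")
```

In this model a single query barely notices the change (101 ms becomes 120 ms), while the slowest of 64 concurrent queries goes from 164 ms to 1,380 ms. The extra time is spent blocked on the lock, which is why it would not register as CPU or IO load.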
The fix required patches to ClickHouse itself, contributed back upstream. Time from first symptom to root cause: several days of senior engineering effort.
Three findings worth internalising
1. Green dashboards are a lagging indicator of architectural fitness.
Most observability stacks instrument the layers operators chose to instrument: typically request rate, error rate, and duration (the RED method), plus resource utilisation, saturation, and errors (the USE method). These are necessary but not sufficient. When the bottleneck moves into a layer you didn't instrument (a planner, a scheduler, a connection pool, a lock), your dashboards will continue to show health right up to the point of failure. The Cloudflare incident is a textbook example: every metric they had was within bounds, and the system was still unusable.
2. Schema and configuration changes to analytical databases are not low-risk.
There is a cultural assumption — particularly in teams that have shifted analytical workloads to columnar stores like ClickHouse, Snowflake, or BigQuery — that schema-level changes are reversible and cheap. They are not. Partitioning, sort keys, materialised view definitions, and replication topology interact with the query planner in ways that are difficult to predict and even harder to test in pre-production. A change that improves one workload by 30% can regress another by 10x under concurrency.
3. Upstream contribution is sometimes the only path forward.
When your bottleneck is in open-source code your team doesn't own, you have two options: wait for the maintainers, or patch it yourself. Cloudflare chose the latter. This requires engineers comfortable reading C++ in unfamiliar codebases, building and testing patches against production-scale workloads, and navigating upstream review processes. It's a capability most enterprise engineering organisations have quietly lost as they moved up the abstraction stack.
What to do this week
Audit your observability for planner-layer blindness. Pick your three highest-traffic data systems (databases, queues, search indexes). For each, ask: do we have visibility into time spent in query planning, scheduling, or coordination layers, not just IO and CPU? If the answer is no, and it usually is, run continuous profiling (eBPF-based tools like Parca, Polar Signals, or Pyroscope work well) on at least one replica per cluster. Make sure you can see wall-clock or off-CPU time as well as on-CPU samples, because threads blocked on a lock consume almost no CPU and are easy to miss in a CPU-only profile. The marginal cost is low. The information you gain when something goes wrong is enormous.
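The signal worth capturing is time spent waiting rather than time spent working. As a minimal in-process illustration, the sketch below wraps a lock so that it records how long callers queue for it. The `TimedLock` class and `run_workers` helper are hypothetical names invented for this example, the sleep stands in for plan construction, and none of this reflects how the profilers named above are implemented.

```python
import threading
import time


class TimedLock:
    """A lock wrapper that records how long callers wait to acquire it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.total_wait_s = 0.0

    def __enter__(self) -> "TimedLock":
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # Counters are updated while the lock is held, so only one thread
        # touches them at a time and no extra synchronisation is needed.
        self.acquisitions += 1
        self.total_wait_s += waited
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self._lock.release()


def run_workers(lock: TimedLock, n_threads: int, hold_s: float) -> None:
    """Run n_threads workers that each hold the lock for hold_s seconds."""

    def worker() -> None:
        with lock:
            time.sleep(hold_s)  # stands in for work done under the lock

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    planner_lock = TimedLock()
    run_workers(planner_lock, n_threads=8, hold_s=0.01)
    print(f"acquisitions={planner_lock.acquisitions} "
          f"total_wait={planner_lock.total_wait_s:.3f}s")
```

Exported as a metric, a total wait time that grows faster than the number of acquisitions is the kind of early warning that request-rate and CPU dashboards do not provide.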
Treat schema changes to analytical stores as production deployments. This means: a written change description, a hypothesis about expected performance impact, a shadow-traffic or replay-based validation step, and a rollback plan that does not assume the change is reversible in place. If your current process for a ClickHouse partitioning change is a pull request that gets merged after one review, you are one Friday afternoon away from Cloudflare's bad week.
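The replay-based validation step can be reduced to a simple gate: replay the same captured queries against the current and the candidate schema at production-like concurrency, then compare tail latency per workload. The sketch below assumes you already have the two lists of latencies; `replay_gate`, the nearest-rank percentile, and the 1.2x threshold are illustrative choices, not a prescribed standard.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def replay_gate(
    baseline_ms: list[float],
    candidate_ms: list[float],
    pct: float = 99.0,
    max_ratio: float = 1.2,
) -> tuple[bool, float]:
    """Compare tail latency of a replayed workload before and after a change.

    Returns (passed, ratio), where ratio is the candidate percentile divided
    by the baseline percentile. The default threshold is an assumption.
    """
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    if base <= 0:
        raise ValueError("baseline percentile must be positive")
    ratio = cand / base
    return ratio <= max_ratio, ratio


if __name__ == "__main__":
    # Hypothetical replay results for one highly concurrent workload.
    baseline = [100.0, 110.0, 120.0, 130.0]
    candidate = [90.0, 95.0, 100.0, 1300.0]
    passed, ratio = replay_gate(baseline, candidate)
    print(f"passed={passed} p99 ratio={ratio:.1f}x")
```

Running the gate separately for each workload matters: an average across workloads would hide exactly the pattern where one path improves and another regresses badly under concurrency.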
Inventory your open-source dependencies by risk, not by popularity. A database, message broker, or runtime your team cannot debug or patch under pressure is a latent risk regardless of how mainstream it is. For each critical dependency, name the engineer who could realistically patch it. If you can't name one, that's a gap worth closing — through hiring, training, or a partner relationship.
How a small senior team approaches this
These incidents — where the failure mode is two layers below the dashboards — are exactly what a 3-person AI-augmented pod is designed for. The work is not voluminous; it is dense. It requires engineers who have seen mutex contention in production planners before, who are fluent enough in C++ or Rust to read the offending code path, and who know which flame graph signature to look for. Throwing twenty mid-level engineers at this problem makes it worse, not better, because the bottleneck is judgement, not throughput.
When we engage with clients on platform reliability work, the first 30 days are usually spent extending observability into exactly these blind spots — planner-layer profiling, scheduler instrumentation, lock contention tracing — before any code change is proposed. We also use that window to map dependencies by debuggability, not popularity, so that when the next Cloudflare-style incident hits, the team knows whether it has the muscle to patch upstream or needs to route around the problem. The CI/CD and delivery practices that follow — shadow-traffic validation for schema changes, rollback playbooks for analytical stores — are the cheap insurance that prevents a Tuesday partitioning change from becoming a Friday billing outage.
The Cloudflare team did the hard, unglamorous work: they instrumented deeper than their dashboards, they read code they didn't write, and they shipped a fix back upstream. Most enterprise engineering organisations would not have reached root cause in the same timeframe — not because their engineers are less capable, but because the operating model isn't set up to support that kind of focused, senior-only investigation. That gap is closeable. It starts with deciding which problems deserve density over headcount.
