16 May 2026 · 5 min read
Delivery & CI/CD · ClickHouse · Observability · Platform Reliability
When the Bottleneck Isn't in Your Code: Cloudflare's ClickHouse Billing Stall
A partitioning change at Cloudflare turned a healthy ClickHouse cluster into a billing-pipeline stall. The root cause wasn't query logic — it was lock contention in the query planner itself. Here's what engineering leaders should take from it.
In April 2026, Cloudflare published a deep post-mortem on a billing pipeline regression that nearly delayed customer invoicing across their global edge. The cluster was healthy. CPU was fine. Disk I/O was fine. Query errors were nil. And yet jobs that previously completed in minutes were stalling for hours. The culprit, eventually, was severe lock contention inside ClickHouse's query planner — triggered by a seemingly innocuous partitioning change on a petabyte-scale table. Cloudflare's engineers ended up patching ClickHouse upstream to fix it.
The full write-up is worth reading: "Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse." For engineering leaders, though, the technical fix is less interesting than the failure mode. This is a class of incident that standard observability stacks are structurally bad at catching, and it shows up more often than most CTOs realise as data platforms scale.
What actually happened
The Cloudflare team changed the partitioning key on a large analytics table — a routine optimisation intended to improve pruning and reduce scan volumes. After the change, billing jobs that aggregated usage data began to slow down dramatically. Dashboards showed nothing alarming: queries weren't failing, the cluster wasn't saturated, and per-query latency on smaller workloads looked normal.
The team eventually traced the issue to a global mutex inside ClickHouse's query planning phase. The new partitioning scheme caused the planner to evaluate a much larger set of candidate partitions per query. Under concurrent load — exactly what billing batch jobs produce — threads serialised on this mutex. From the outside, the cluster looked idle. Internally, it was thrashing on a lock the standard metrics didn't surface.
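To make the mechanism concrete, here is a minimal, hypothetical sketch in Python — nothing like ClickHouse's actual C++ planner, just an illustration of the pattern — showing how a single global lock around a planning step behaves once partition counts grow: throughput collapses under concurrency while every resource metric stays quiet.

```python
import threading
import time

# Hypothetical stand-in for the planner mutex; the real critical section
# lives in ClickHouse's C++ query-planning code, not in anything like this.
planner_lock = threading.Lock()

def plan_query(candidate_partitions: int) -> None:
    # Simulate the planning step: cost grows with the number of candidate
    # partitions, and the whole step runs under one global lock.
    with planner_lock:
        time.sleep(candidate_partitions * 0.001)

def run_batch(concurrency: int, partitions: int) -> float:
    threads = [threading.Thread(target=plan_query, args=(partitions,))
               for _ in range(concurrency)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# 16 concurrent "billing jobs". With 10 candidate partitions per query the
# batch plans in ~0.16s; with 400 it takes ~6.4s -- and CPU stays near zero
# throughout, because the threads are blocked, not running.
print(f"few partitions:  {run_batch(16, 10):.2f}s")
print(f"many partitions: {run_batch(16, 400):.2f}s")
```

The numbers are arbitrary, but the shape is the point: nothing errors, nothing saturates, and the batch still gets an order of magnitude slower.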
The fix was a targeted upstream patch reducing the scope of the critical section, plus a workload-side change to reduce planner pressure. Total time from regression to root cause: several engineer-weeks.
Three findings engineering leaders should internalise
The specific bug matters less than the pattern. There are three takeaways here that apply to any team running large-scale data infrastructure.
First, dashboard-shaped observability misses contention bugs. Most production observability is built around rates, errors, durations, and saturation — the RED and USE methods. Lock contention inside a query planner produces none of these signals cleanly. The query eventually completes. The CPU is not saturated; threads are blocked, not running. Saturation metrics measure resources, not coordination primitives. If your platform team can't pull flame graphs, off-CPU profiles, or lock-wait traces from production databases in under an hour, you have the same blind spot Cloudflare had.
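ClickHouse itself can be made to show where its threads are parked. The sketch below is one way to pull that signal, assuming a reasonably recent ClickHouse build with its standard HTTP interface on localhost:8123 and introspection functions enabled (allow_introspection_functions=1); it groups every server thread by the top of its current stack via the system.stack_trace table, so threads queued on a mutex pile up in one bucket even while CPU graphs look flat.

```python
import urllib.parse
import urllib.request

# Count server threads by the top frames of their current stack trace.
# Threads blocked on a lock cluster under the same frames -- visible here
# even when CPU and I/O dashboards look healthy.
QUERY = """
SELECT
    arrayStringConcat(
        arrayMap(addr -> demangle(addressToSymbol(addr)),
                 arraySlice(trace, 1, 6)),
        ' <- ') AS top_frames,
    count() AS threads
FROM system.stack_trace
GROUP BY top_frames
ORDER BY threads DESC
LIMIT 10
"""

# Assumed endpoint: a ClickHouse server on the default HTTP port; adjust
# host, port, and credentials for your cluster. The setting is passed as
# a URL parameter alongside the query.
url = "http://localhost:8123/?" + urllib.parse.urlencode({
    "query": QUERY,
    "allow_introspection_functions": 1,
})

with urllib.request.urlopen(url) as resp:
    print(resp.read().decode())
```

If one stack dominates that output during a stall, you have found your lock in minutes, not engineer-weeks.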
Second, schema and partitioning changes are platform-level changes, not application-level changes. The Cloudflare regression was triggered by what most teams would treat as a routine DBA task. The blast radius — billing pipeline stalls — was enormous, but the change-management process almost certainly treated it as low-risk. In large engineering organisations, partitioning changes, index rebuilds, and storage-engine tuning routinely bypass the same review rigour applied to application deploys, even though they have higher production impact.
Third, upstream patching is now a baseline platform skill, not an exotic capability. Cloudflare didn't escalate to a vendor. They read the ClickHouse source, identified the contention, and shipped a patch. Ten years ago this was rare. Today, with most data infrastructure being open source and AI-assisted code comprehension being genuinely useful at this kind of task, it is a reasonable expectation for any platform team running infrastructure at scale. Teams that still treat their database as a black box will be slower to recover from every incident of this shape.
What to do this week
Three concrete actions, in priority order.
- Audit your top three data platforms for lock-wait visibility. For each, document how an on-call engineer would, today, identify a contention bug in under 30 minutes. If the answer involves SSHing into a node and running perf ad hoc, that's your gap. Most managed database services now expose wait-event telemetry; turn it on and put it on a dashboard (see the sketch after this list for what that query can look like).
- Reclassify schema, partitioning, and storage-engine changes as Tier 1 production changes. Require the same canary, rollback plan, and reviewer rigour you apply to application deploys. The Cloudflare incident is one of dozens of public post-mortems where a "DBA task" caused a customer-facing outage.
- Identify one piece of open-source infrastructure in your critical path that no one on your team has ever read the source of. Assign someone to spend two days reading it. The goal isn't to become a maintainer — it's to remove the cognitive barrier that prevents your team from debugging into it during an incident.
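As a reference point for the first item, here is roughly what pulling wait-event telemetry yourself looks like. The sketch uses Postgres as the example (pg_stat_activity has exposed wait-event columns since 9.6) with the psycopg2 driver; the DSN is a placeholder, and the equivalent view exists under a different name on most other platforms.

```python
import psycopg2  # assumed driver; any Postgres client works

# Snapshot of what active sessions are waiting on right now. A contention
# problem shows up as many sessions stacked on the same wait event
# (e.g. Lock or LWLock) while CPU stays low.
WAIT_EVENTS_SQL = """
SELECT wait_event_type, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC
"""

# Placeholder DSN -- point this at the primary you are auditing.
conn = psycopg2.connect("postgresql://user:pass@db-host:5432/app")
with conn, conn.cursor() as cur:
    cur.execute(WAIT_EVENTS_SQL)
    for event_type, event, sessions in cur.fetchall():
        print(f"{event_type:10} {event:30} {sessions}")
```

Poll something like this once a minute into your metrics system and a Cloudflare-shaped incident becomes a visible spike rather than an invisible stall.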
None of these are large investments. All three would have shortened Cloudflare's incident materially.
The deeper pattern
The ClickHouse incident is part of a broader trend that engineering leaders should be tracking. As organisations consolidate onto a smaller number of very large data platforms — ClickHouse, Snowflake, BigQuery, Iceberg-on-X — the failure modes shift from "the query is wrong" to "the platform itself has emergent behaviour under our specific load". The old playbook of throwing more hardware at the problem rarely works because the bottleneck is in coordination, not resources.
This is also why generic platform expertise is becoming less useful and workload-specific expertise more valuable. Knowing ClickHouse is not the same as knowing how ClickHouse behaves under billing-style batch concurrency on tables with high partition counts. The latter is what actually fixes incidents. It's also what most internal teams don't have time to build, because the people who could build it are firefighting.
How Anystack helps
This is the shape of problem a 3-person AI-augmented pod is built for. Three senior engineers, paired with AI tooling that accelerates source-code comprehension and trace analysis, can move from "unknown regression in a managed platform" to "upstream patch and workload fix" in days rather than the engineer-weeks an internal team typically spends. We've run this play on Postgres, Kafka, and ClickHouse workloads across several enterprise clients in the last year.
When the work is more systemic — observability gaps, change-management weaknesses, or platform reliability debt accumulating across multiple data systems — the same pod model applies through our platform reliability engagement. The output is the same: a smaller team, moving faster, with the depth to read the source when the dashboards lie.
The Cloudflare post-mortem is a good example of what world-class platform engineering looks like in 2026: read the source, patch upstream, write it up honestly. Most enterprise engineering organisations are not yet operating at that level — not because their engineers are less capable, but because the structure around them rewards firefighting over depth. Closing that gap is a leadership decision, not an engineering one.
