Cloudflare outage on November 18, 2025 post mortem
Mood: heated
Sentiment: mixed
Category: tech
Key topics: Cloudflare outage, post-mortem analysis, software reliability
Cloudflare's post-mortem analysis of their November 18, 2025 outage reveals a configuration error caused by a database permissions change, sparking discussion on software reliability, error handling, and change management.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 13m
Peak period: 66 comments in Hour 1
Avg / period: 12.3
Based on 160 loaded comments
Key moments
- 01 Story posted: 11/18/2025, 11:31:22 PM (19h ago)
- 02 First comment: 11/18/2025, 11:44:32 PM (13m after posting)
- 03 Peak activity: 66 comments in Hour 1 (hottest window of the conversation)
- 04 Latest activity: 11/19/2025, 3:14:34 PM (4h ago)
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
I think it's quite rare for any company to have the exact same scale and size of storage in staging as in prod.
We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.
Mostly to catch performance regressions, but it would work to catch these issues too.
I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.
In this case the database table in question seems modest in size (the ML features), so naively one would think they could at least have kept staging features always in sync with prod; but it could be they didn't consider that 55 rows vs. 60 rows, or similar, could be a breaking point given one specific bug.
It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.
Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.
I also found the "remediation and follow up" section a bit lacking, as it doesn't mention how, in general, regressions in query results caused by DB changes could be caught in the future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.
In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
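A minimal sketch of just the golden-output comparison step described above (hypothetical; how rows are fetched from the disposable test database is left out, and rows are modelled as (name, type) pairs like the system.columns query quoted further down):

fn assert_matches_golden(mut actual: Vec<(String, String)>, mut golden: Vec<(String, String)>) {
    // Normalise row order first, so the check only fails on real content changes
    // (duplicated, missing, or altered rows), not on incidental ordering.
    actual.sort();
    golden.sort();
    assert_eq!(
        actual, golden,
        "query output regressed: a supposedly non-functional DB change altered the results"
    );
}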
No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.
> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.
> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...
Lots of things have the sky in common. Maybe comet-induced ice ages...
AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.
In the pre-digital era, the East India Company dwarfs every other company by considerable margins in any metric: commerce controlled, global shipping, communication traffic, private army size, % of GDP, % of workforce employed.
The default throughout history was the large consolidated organization, like say Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.
[1] Although I suspect either AWS's or MS/Azure's recent downtimes in the last couple of years are likely higher
- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective: they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits, programming in neither a test for whether an input would hit a limit nor some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
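A rough sketch of the "test for whether an input would hit a limit" idea, applied to the generated feature file before it is propagated; the 200-feature limit is the one discussed elsewhere in the thread, and the function shape is invented for illustration rather than taken from Cloudflare's pipeline:

const MAX_FEATURES: usize = 200;

fn validate_feature_file(feature_names: &[String]) -> Result<(), String> {
    if feature_names.len() > MAX_FEATURES {
        // Refuse to publish a file the consuming software is known to reject,
        // and surface the offending size instead of letting the edge discover it.
        return Err(format!(
            "feature file has {} features, limit is {}; refusing to publish",
            feature_names.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}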
This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.
This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.
> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:
SELECT
name,
type
FROM system.columns
WHERE
table = 'http_requests_features'
order by name;
> Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

And here is the query they used** (OK, so it's not exactly):
SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id
someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less their queries (but I do have an idea).
more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.
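A hedged sketch of the mitigation this subthread circles around: filter the metadata rows to the expected database and deduplicate by column name before building the feature list, so an extra grant cannot double it. ColumnMeta and the "default" filter are illustrative assumptions, not Cloudflare's code:

use std::collections::HashSet;

struct ColumnMeta {
    database: String,
    name: String,
    column_type: String,
}

fn unique_feature_columns(rows: Vec<ColumnMeta>) -> Vec<ColumnMeta> {
    let mut seen = HashSet::new();
    rows.into_iter()
        // Keep only the primary database's view of the table...
        .filter(|c| c.database == "default")
        // ...and drop any remaining duplicates by column name.
        .filter(|c| seen.insert(c.name.clone()))
        .collect()
}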
Any other large-ish company, there would be layers of "stakeholders" that will slow this process down. They will almost always never allow code to be published.
Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.
Thanks for the insight.
As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.
Or you do have something like this, but the specific DB permission change in this context only failed in production.
"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."
https://how.complexsystems.fail/#18
It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".
For London customers this made the impact more severe temporarily.
- A product depends on frequent configuration updates to defend against attackers.
- A bad data file is pushed into production.
- The system is unable to easily/automatically recover from bad data files.
(The CrowdStrike outages were quite a bit worse though, since they took down entire machines and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours.)
Side thought as we're working on 100% onchain systems (for digital assets security, different goals):
Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.
That could have blocked propagation of the oversized file long before it reached the edge :)
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
Edit: Similar to CrowdStrike, the bot detector should have fallen back to its last-known-good signature database after panicking, instead of just continuing to panic.
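A minimal sketch of that last-known-good fallback, assuming a hypothetical reload hook rather than Cloudflare's actual FL2 code: if the new file fails to parse or validate, alarm and keep serving with the previous config instead of panicking:

struct FeatureConfig; // stand-in for the parsed feature definitions

fn reload_features(
    current: FeatureConfig,
    new_file: &[u8],
    parse_and_validate: impl Fn(&[u8]) -> Result<FeatureConfig, String>,
) -> FeatureConfig {
    match parse_and_validate(new_file) {
        Ok(fresh) => fresh,
        Err(e) => {
            // Alarm loudly, but keep the last-known-good config so bot scoring
            // degrades gracefully instead of the proxy panicking on every request.
            eprintln!("new feature file rejected, keeping previous config: {e}");
            current
        }
    }
}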
It has somewhat regularly saved us from disaster in the past.
For something so critical, why aren't you using lints to identify and ideally deny panic-inducing code? This is one of the biggest strengths of using Rust in the first place for this problem domain.
Wonder why these old grey beards chose to go with that.
Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.
unwrap() implicitly panicked, right?
I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.
1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...
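For reference, the lint approach suggested a few comments up can be expressed as crate-level clippy denies (these are real clippy lint names; how strictly to apply them is a policy choice, not something the post-mortem prescribes):

// At the crate root (lib.rs / main.rs):
#![deny(clippy::unwrap_used)]
#![deny(clippy::panic)]
#![deny(clippy::indexing_slicing)]
// Some teams also deny clippy::expect_used; others allow expect() with a
// documented justification, as discussed further down the thread.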
I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure that an errant `unwrap` wouldn't produce stable failure modes.
But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).
This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.
Surely an unwrap_or_default() would have been a much better fit--if fetching features fails, continue processing with an empty set of rules vs. stopping the world.
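A hedged, side-by-side illustration of the three options being debated; FeatureSet, FetchError and fetch_features are stand-ins, not Cloudflare's API:

#[derive(Debug)]
struct FetchError(String);

#[derive(Default)]
struct FeatureSet(Vec<String>);

fn fetch_features() -> Result<FeatureSet, FetchError> {
    // Stand-in for reading and validating the published feature file.
    Err(FetchError("feature file exceeds the configured limit".into()))
}

fn main() {
    // 1. `.unwrap()`: panics with no context on any error (the failure mode above).
    // let features = fetch_features().unwrap();

    // 2. `.expect(...)`: still panics, but records the invariant being relied on.
    // let features = fetch_features()
    //     .expect("feature file must parse and stay under the feature limit");

    // 3. `.unwrap_or_default()`: degrades instead of crashing -- continues with an
    //    empty rule set, at the cost of silently running without bot scores.
    let features = fetch_features().unwrap_or_default();
    println!("loaded {} features", features.0.len());
}

None of the three fixes the underlying problem of shipping an invalid file; they only change how loudly and how locally it fails, which is the point several replies below make.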
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
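For what it's worth, a per-feature kill switch does not have to carry that risk if its safe default is "enabled"; a hedged sketch (names illustrative) where a missing or unreadable switch entry leaves the feature on, so a bug in the switch mechanism cannot itself disable an entire system:

use std::collections::HashMap;

struct KillSwitches {
    disabled: HashMap<String, bool>,
}

impl KillSwitches {
    fn is_enabled(&self, feature: &str) -> bool {
        // A feature counts as disabled only if a switch explicitly says so;
        // anything missing or malformed defaults to leaving the feature on.
        !self.disabled.get(feature).copied().unwrap_or(false)
    }
}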
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also have easily led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.
[0] https://clickhouse.com/docs/guides/developer/deduplication
Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.
It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.
Also I've been seeing an uptrend in sqlite dogma on HN lately. I love sqlite for some things, but it's not The One True Database Solution.
People think of configuration updates (or state updates, call it what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will happily yeet such changes into production -- even ones who wouldn't dare let a single line of code go live without putting it through the staged deployment and testing process.
The point here remains: consider every change to involve risk, and architect defensively.
And I hope fly.io has these mechanisms as well :-)
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
IDK Atlassian Statuspage's clientele, but it's possible Cloudflare is much larger than their usual customer.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have made it even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
The problem is that they didn't surface a failure case, which means they couldn't handle rollouts of invalid configurations correctly.
The use of `.unwrap()` isn't superficial at all -- it hid an invariant that should have been handled above this code. The failure to correctly account for and handle those true invariants is exactly what caused this failure mode.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
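A small illustration of that convention, assuming clippy::unwrap_used is denied crate-wide (the function and its invariant are invented for the example):

fn first_byte(payload: &[u8]) -> u8 {
    // INFALLIBILITY: ingress validation guarantees a non-empty payload, so
    // this unwrap cannot fire; the allow scopes and documents the exception.
    #[allow(clippy::unwrap_used)]
    let byte = payload.first().copied().unwrap();
    byte
}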
How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.
For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.
You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
If you could share some of your Rust code or snippets I would be very interested. I found it quite hard to avoid indexing for graph-heavy code, so I'm always on the lookout for interesting ways to enforce access safety.
Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.
Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected (EOF, broken socket, file not found), you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."
In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.
A few ideas:
- It should not compile in production Rust code
- It should only be usable within unsafe blocks
- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.
- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.
Not unlike people having a blind spot for Rust in general, no?
We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.
It's not about whether you should ban unwrap() in production. You shouldn't. Some errors are logic bugs beyond which a program can't reasonably continue. The problem is that the language makes it too easy for junior developers (and AI!) to ignore non-logic-bug problems with unwrap().
Programmers early in their careers will do practically anything to avoid having to think about errors and they get angry when you tell them about it.
An unwrap should never make it to production IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to grep for `unwrap` in your code, replace those that can happen with proper error handling, and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.
Best post mortem I've read in a while, this thing will be studied for years.
A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust and should never have made it to production.
667 more comments available on Hacker News