Questions for Cloudflare
Mood: skeptical
Sentiment: mixed
Category: tech
Key topics: Cloudflare, post-mortem analysis, incident response, CDN
The author questions Cloudflare's handling of a recent incident, sparking a discussion on the thoroughness of their post-mortem analysis and the challenges of incident response.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 14m after posting
Peak period: 33 comments in Hour 1
Avg / period: 18.5
Based on 37 loaded comments
Key moments
- Story posted: 11/19/2025, 4:49:08 PM (2h ago)
- First comment: 11/19/2025, 5:03:30 PM (14m after posting)
- Peak activity: 33 comments in Hour 1 (hottest window of the conversation)
- Latest activity: 11/19/2025, 5:59:05 PM (1h ago)
Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
Reading Cloudflare's description of the problem, this is something I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
The reason it's so hard to test for is that all tests would show there's no problem. This wasn't a code update; it was a config update. Without really extensive performance tests (which, when done well, take a long time!) there really wasn't a way to know that a change that appeared safe wasn't.
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage you look at the Crowdstrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
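As a concrete illustration of the kind of cheap pre-rollout sanity check that can sometimes catch an oversized generated config before it is distributed: the sketch below is hypothetical (the file name, limits, and format are invented), and nothing in the discussion says Cloudflare lacked such a check.

```rust
// Hypothetical sketch: validate a generated config file before distributing it.
// The size and entry-count limits here are made up for illustration.

use std::fs;

const MAX_BYTES: u64 = 1_000_000;   // reject configs over ~1 MB
const MAX_ENTRIES: usize = 10_000;  // reject configs with too many entries

fn validate_config(path: &str) -> Result<(), String> {
    let meta = fs::metadata(path).map_err(|e| format!("cannot stat {path}: {e}"))?;
    if meta.len() > MAX_BYTES {
        return Err(format!(
            "{path} is {} bytes, over the {MAX_BYTES}-byte limit",
            meta.len()
        ));
    }

    let contents = fs::read_to_string(path).map_err(|e| format!("cannot read {path}: {e}"))?;
    let entries = contents.lines().filter(|l| !l.trim().is_empty()).count();
    if entries > MAX_ENTRIES {
        return Err(format!(
            "{path} has {entries} entries, over the {MAX_ENTRIES}-entry limit"
        ));
    }
    Ok(())
}

fn main() {
    match validate_config("features.conf") {
        Ok(()) => println!("config looks sane, safe to distribute"),
        Err(reason) => eprintln!("refusing to roll out: {reason}"),
    }
}
```

A check like this only helps when the failure mode is literally "file too big"; it says nothing about subtler regressions of the kind the comment above describes, which is why extensive performance testing remains the slower, harder backstop.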
Anubis is a bot firewall, not a CDN.
Yet it is an available alternative to Cloudflare that is not beholden to Wall Street (i.e., not a public company).
If you want to do this 100% yourself there is Apache Traffic Control.
https://github.com/apache/trafficcontrol
> Anubis is a bot firewall, not a CDN.
For now. If we support alternatives they can grow into an open source CDN.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened by it happening again, so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe an account of how these ideas were actively used by the author at, e.g., Tradera or Loop54.
This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.
We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the other equally rightful goals of the organizations (such as cost-efficiency, product experience, performance, etc.). There's little evidence that suggests Cloudflare isn't doing that, and their track record is definitely good for their scale.
Some never get out of this phase though.
Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare et al. publish SLAs and compensation schedules for cases where those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare et al. signed a contract promising a certain SLA (with penalties) and then chose not to pay out those penalties, there would be reason to ask questions, but nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than just slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. It’s just more expensive, and the truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system; the ones that don’t run the other way already run such a system and were unaffected by the recent outages).
> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them
Could you not say this about any supplier relationship? No - in this case, we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.
Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.
What we can blame Cloudflare for is having so many customers that a Cloudflare outage has outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site's downtime on Cloudflare.
In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].
[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...
[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
If this is documented then fair enough - airlines don’t have to buy airplanes that need rebooting every 51 days, they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.
Either way, the uptime of avionics (and its redundancies - including the unreliable-airspeed checklists) is much higher than anything conventional software “engineering” has put out over the past decade.
So during ingress there isn’t an async call to a bot management service that intercepts the request before it’s sent outbound to the origin - it’s literally a Lua script (or a Rust module in fl2) that runs inline on ingress as part of handling the request. Thus there are no timeouts or other concerns about the management service failing to assign a bot score.
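For readers less familiar with the distinction being drawn, here is a minimal sketch of inline bot scoring as one step of request handling, as opposed to an async call out to a separate scoring service. The types, function names, and threshold are invented for illustration; this is not Cloudflare's actual Lua or fl2 Rust code.

```rust
// Hypothetical sketch: inline bot scoring as part of handling a request.
// There is no network call to a separate scoring service, so there is no
// RPC timeout to reason about; a failure here is a failure of the handler itself.

use std::collections::HashMap;

/// A parsed feature configuration, refreshed out-of-band (e.g., from a
/// periodically distributed file). Names are illustrative only.
struct BotFeatureConfig {
    weights: HashMap<String, f64>,
}

struct Request {
    headers: HashMap<String, String>,
}

/// Compute a bot score inline, purely from the request and the loaded config.
fn bot_score(cfg: &BotFeatureConfig, req: &Request) -> f64 {
    let mut score = 0.0;
    for (feature, weight) in &cfg.weights {
        // Toy feature extraction: does the request carry this header at all?
        if req.headers.contains_key(feature) {
            score += *weight;
        }
    }
    score
}

/// The ingress path: scoring happens inline, before the request is
/// forwarded to the origin.
fn handle_ingress(cfg: &BotFeatureConfig, req: &Request) -> Result<String, String> {
    let score = bot_score(cfg, req);
    if score > 5.0 {
        return Err(format!("blocked: bot score {score:.1}"));
    }
    Ok(format!("forwarded to origin with bot score {score:.1}"))
}

fn main() {
    let cfg = BotFeatureConfig {
        weights: HashMap::from([("x-automation".to_string(), 10.0)]),
    };
    let req = Request {
        headers: HashMap::from([("user-agent".to_string(), "curl/8.0".to_string())]),
    };
    println!("{:?}", handle_ingress(&cfg, &req));
}
```

The trade-off the comment points at: with inline evaluation there is no timeout or fallback path to a remote scoring service, but a failure in the scoring step is a failure of request handling itself.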
There are better questions but to me the ones posed don’t seem particularly interesting.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.
I don't love piling on, but it still shocks me that people write without first reading.