Questions for Cloudflare
Mood: skeptical
Sentiment: mixed
Category: tech
Key topics: Cloudflare, post-mortem analysis, incident response, CDN
The author questions Cloudflare's handling of a recent incident, sparking a discussion on the thoroughness of their post-mortem analysis and the challenges of incident response.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 14m after posting
Peak period: 33 comments in Hour 1
Avg / period: 18.5
Based on 37 loaded comments
Key moments
- Story posted: 11/19/2025, 4:49:08 PM (2h ago)
- First comment: 11/19/2025, 5:03:30 PM (14m after posting)
- Peak activity: 33 comments in Hour 1 (hottest window of the conversation)
- Latest activity: 11/19/2025, 5:59:05 PM (1h ago)
Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
Reading Cloudflare's description of the problem, this is something I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
The reason it's so hard to test for is that all tests would show there's no problem. This wasn't a code update; it was a config update. Without really extensive performance tests (which, when done well, take a long time!) there really wasn't a way to know that a change that appeared safe wasn't.
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage you look at the Crowdstrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
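As a concrete illustration of the kind of cheap pre-rollout sanity check that can sometimes catch an oversized generated config before it is distributed: the sketch below is hypothetical (the file name, limits, and format are invented), and nothing in the discussion says Cloudflare lacked such a check.

```rust
// Hypothetical sketch: validate a generated config file before distributing it.
// The size and entry-count limits here are made up for illustration.

use std::fs;

const MAX_BYTES: u64 = 1_000_000;   // reject configs over ~1 MB
const MAX_ENTRIES: usize = 10_000;  // reject configs with too many entries

fn validate_config(path: &str) -> Result<(), String> {
    let meta = fs::metadata(path).map_err(|e| format!("cannot stat {path}: {e}"))?;
    if meta.len() > MAX_BYTES {
        return Err(format!(
            "{path} is {} bytes, over the {MAX_BYTES}-byte limit",
            meta.len()
        ));
    }

    let contents = fs::read_to_string(path).map_err(|e| format!("cannot read {path}: {e}"))?;
    let entries = contents.lines().filter(|l| !l.trim().is_empty()).count();
    if entries > MAX_ENTRIES {
        return Err(format!(
            "{path} has {entries} entries, over the {MAX_ENTRIES}-entry limit"
        ));
    }
    Ok(())
}

fn main() {
    match validate_config("features.conf") {
        Ok(()) => println!("config looks sane, safe to distribute"),
        Err(reason) => eprintln!("refusing to roll out: {reason}"),
    }
}
```

A check like this only helps when the failure mode is literally "file too big"; it says nothing about subtler regressions of the kind the comment above describes, which is why extensive performance testing remains the slower, harder backstop.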
Anubis is a bot firewall, not a CDN.
Yet it is an available alternative to Cloudflare that is not beholden to Wall Street (i.e., not a public company).
If you want to do this 100% yourself there is Apache Traffic Control.
https://github.com/apache/trafficcontrol
> Anubis is a bot firewall, not a CDN.
For now. If we support alternatives they can grow into an open source CDN.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened by it happening again, so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe an account of how these ideas were actively used by the author at, e.g., Tradera or Loop54.
This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.
We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the other equally rightful goals of the organizations (such as cost-efficiency, product experience, performance, etc.). There's little evidence that suggests Cloudflare isn't doing that, and their track record is definitely good for their scale.
Some never get out of this phase though.
Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare et al. publish SLAs and compensation schedules for cases where those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare et al. signed a contract promising a certain SLA (with penalties) and then chose not to pay out those penalties, there would be reason to ask questions, but nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than just slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. It’s just more expensive, and the truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system; the ones that don’t run the other way already run such a system and were unaffected by the recent outages).
> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them
Could you not say this about any supplier relationship? No - in this case, we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.
Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.
What we can blame Cloudflare for is having so many customers that a Cloudflare outage has outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site's downtime on Cloudflare.
In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].
[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...
[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
If this is documented then fair enough - airlines don’t have to buy airplanes that need rebooting every 51 days, they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.
Either way, the uptime of avionics (and its redundancies - including the unreliable-airspeed checklists) is much higher than anything conventional software “engineering” has put out over the past decade.
So during ingress there isn’t an async call to a bot management service that intercepts the request before it’s sent outbound to the origin - it’s literally a Lua script (or a Rust module in fl2) that runs inline on ingress as part of handling the request. Thus there are no timeouts or other concerns about the management service failing to assign a bot score.
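For readers less familiar with the distinction being drawn, here is a minimal sketch of inline bot scoring as one step of request handling, as opposed to an async call out to a separate scoring service. The types, function names, and threshold are invented for illustration; this is not Cloudflare's actual Lua or fl2 Rust code.

```rust
// Hypothetical sketch: inline bot scoring as part of handling a request.
// There is no network call to a separate scoring service, so there is no
// RPC timeout to reason about; a failure here is a failure of the handler itself.

use std::collections::HashMap;

/// A parsed feature configuration, refreshed out-of-band (e.g., from a
/// periodically distributed file). Names are illustrative only.
struct BotFeatureConfig {
    weights: HashMap<String, f64>,
}

struct Request {
    headers: HashMap<String, String>,
}

/// Compute a bot score inline, purely from the request and the loaded config.
fn bot_score(cfg: &BotFeatureConfig, req: &Request) -> f64 {
    let mut score = 0.0;
    for (feature, weight) in &cfg.weights {
        // Toy feature extraction: does the request carry this header at all?
        if req.headers.contains_key(feature) {
            score += *weight;
        }
    }
    score
}

/// The ingress path: scoring happens inline, before the request is
/// forwarded to the origin.
fn handle_ingress(cfg: &BotFeatureConfig, req: &Request) -> Result<String, String> {
    let score = bot_score(cfg, req);
    if score > 5.0 {
        return Err(format!("blocked: bot score {score:.1}"));
    }
    Ok(format!("forwarded to origin with bot score {score:.1}"))
}

fn main() {
    let cfg = BotFeatureConfig {
        weights: HashMap::from([("x-automation".to_string(), 10.0)]),
    };
    let req = Request {
        headers: HashMap::from([("user-agent".to_string(), "curl/8.0".to_string())]),
    };
    println!("{:?}", handle_ingress(&cfg, &req));
}
```

The trade-off the comment points at: with inline evaluation there is no timeout or fallback path to a remote scoring service, but a failure in the scoring step is a failure of request handling itself.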
There are better questions but to me the ones posed don’t seem particularly interesting.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.
I don't love piling on, but it still shocks me that people write without first reading.