Cloudflare Outage on December 5, 2025
Posted 29 days ago · Active 27 days ago
Source: blog.cloudflare.com · News story · Informative, neutral · Debate score: 20/100
Key topics: Website Management, Outage, Networking

Discussion activity: very active. First comment 4 minutes after posting; peak of 85 comments in the 0-2h window; average of 12.3 comments per period.

Key moments
- Story posted: Dec 5, 2025 at 10:35 AM EST (29 days ago)
- First comment: Dec 5, 2025 at 10:39 AM EST (4 minutes after posting)
- Peak activity: 85 comments in the 0-2h window, the hottest window of the conversation
- Latest activity: Dec 6, 2025 at 3:09 PM EST (27 days ago)
ID: 46162656 · Type: story · Last synced: 12/5/2025, 3:45:10 PM
I hope cloudflare is far more resilient than local power.
Lua is a garbage language, used by garbage products, and promoted by engineers that could be replaced by garbage LLMs
> This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.
Relying on language features instead of writing code well will always eventually backfire.
https://security.googleblog.com/2025/11/rust-in-android-move...
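To make the quoted claim concrete, here is a minimal sketch (types and field names invented, not Cloudflare's FL1/FL2 code): in Lua, indexing into a field that turns out to be nil only fails at runtime with exactly this kind of `attempt to index field 'execute'` error, whereas the Rust version won't compile until the missing case is handled.

```rust
// Hypothetical sketch - field and type names are invented for
// illustration; this is not Cloudflare's FL1/FL2 code.

struct Rule {
    name: String,
    // The action may legitimately be absent, e.g. for a disabled rule.
    execute: Option<fn() -> Result<(), String>>,
}

fn run_rule(rule: &Rule) -> Result<(), String> {
    // The compiler forces a decision about what a missing `execute` means.
    // The Lua equivalent (indexing into a nil `execute` field) only blows
    // up at runtime, as in the `init.lua:314` exception from the report.
    match rule.execute {
        Some(action) => action(),
        None => Ok(()), // disabled rule: nothing to run
    }
}

fn main() {
    let disabled = Rule {
        name: "hypothetical-disabled-waf-rule".into(),
        execute: None,
    };
    assert!(run_rule(&disabled).is_ok());
    println!("rule {} handled without a runtime exception", disabled.name);
}
```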
Interesting.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Yes - as they explain, it was the rollback triggered by seeing those errors that broke things.
This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.
Their timeline:
> 08:47: Configuration change deployed and propagated to the network
> 08:48: Change fully propagated
> 08:50: Automated alerts
> 09:11: Configuration change reverted and propagation start
> 09:12: Revert fully propagated, all traffic restored
2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.
Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.
How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?
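As a sketch of what that could look like in practice (the metric source, thresholds, and rollback hook below are all invented, not Cloudflare's tooling): a guard that watches the global 5xx rate for the first seconds after propagation and reverts automatically on a spike.

```rust
use std::{thread, time::Duration};

// All names and numbers are invented for illustration; this is a sketch
// of an automated post-deploy rollback guard, not Cloudflare's tooling.

/// Stand-in for querying monitoring for the current global 5xx rate (req/s).
fn current_5xx_rate() -> f64 {
    // A real guard would hit the metrics pipeline; here we fake a spike.
    250_000.0
}

/// Stand-in for invoking the deployment system's revert path.
fn trigger_rollback(change_id: &str) {
    println!("rolling back change {change_id}");
}

fn watch_deploy(change_id: &str, baseline_5xx_rate: f64) {
    // Watch for the first seconds after propagation instead of waiting
    // minutes for a scrape-based alert to fire.
    for _ in 0..30 {
        if current_5xx_rate() > baseline_5xx_rate * 10.0 {
            trigger_rollback(change_id);
            return;
        }
        thread::sleep(Duration::from_secs(1));
    }
    println!("change {change_id} looks healthy");
}

fn main() {
    watch_deploy("hypothetical-waf-config-change", 1_000.0);
}
```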
Yes, there are lots of mission critical systems that use cloudflare and lives and huge amounts of money are at stake.
This reads like sarcasm. But I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on the traffic teams of social networks that considered themselves that mission critical. CF is absolutely that critical, and there are definitely lives at stake.
This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.
I don’t think you can make such an assertion with any degree of credibility.
30 minutes of unplanned downtime for infrastructure is unacceptable; but we’re tending to accept it. AWS or Cloudflare have positioned themselves as The Internet so they need to be held to a higher standard.
I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.
The vast majority of current monitoring systems are built on Prometheus. (Well, okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high"-frequency online service monitoring system does a scrape every 30 seconds. Well-known reliability engineering practice says you need a minimum of two consecutive telemetry points to detect any given event, because we're talking about a distributed system and the network is not a reliable transport. That in turn means that, with near-perfect reliability, the maximum window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will only show up after a delay of just-a-hair-shy-of-three scrape cycles.
At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant the first alert for a failure could fire effectively 30 seconds after the failures manifested.
Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
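For concreteness, the arithmetic above works out like this (a toy sketch; the intervals are the ones mentioned in the comment, not anything Cloudflare has published):

```rust
// Worst-case detection delay for scrape-based monitoring: a failure can
// start just after a scrape, and the alert needs `consecutive_bad`
// consecutive bad samples before it can fire.
fn worst_case_detection_secs(scrape_interval_secs: u64, consecutive_bad: u64) -> u64 {
    // Almost one full interval can pass before the first bad sample,
    // then `consecutive_bad` further scrapes are needed to confirm.
    scrape_interval_secs * (consecutive_bad + 1)
}

fn main() {
    // 30s scrapes, two consecutive bad points: up to ~90s before an alert fires.
    println!("30s scrapes: up to {}s", worst_case_detection_secs(30, 2));
    // The 10s tuning mentioned above: still up to ~30s.
    println!("10s scrapes: up to {}s", worst_case_detection_secs(10, 2));
}
```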
Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.
At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes times (at scale, going through store and forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: a non-zero possibility of system collapse. Bad: faults can progress very quickly toward global outages. Good: faults are detected quickly, response decision-making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.
While we're here, any other Prometheus or Grafana advice is welcome.
Prometheus has an unaddressed flaw [0], where rate() windows must span at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.
[0] - https://github.com/prometheus/prometheus/issues/3746
Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.
High frequency events are often collected into distributions and the individual timestamps are thrown away.
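A toy sketch of what "collected into distributions" means in practice (bucket boundaries invented): individual events are folded into bucket counts and the timestamps are discarded.

```rust
use std::collections::BTreeMap;

// Toy sketch of folding high-frequency events into a distribution:
// individual timestamps are dropped, only bucket counts survive.
// Bucket bounds are invented, Prometheus-histogram style.
fn bucket_latencies(latencies_ms: &[u64]) -> BTreeMap<u64, u64> {
    let bounds = [5, 10, 25, 50, 100, 250, 500, u64::MAX];
    let mut histogram = BTreeMap::new();
    for &latency in latencies_ms {
        // Find the first bucket whose upper bound covers this sample.
        let bucket = *bounds.iter().find(|&&b| latency <= b).unwrap_or(&u64::MAX);
        *histogram.entry(bucket).or_insert(0) += 1;
    }
    histogram
}

fn main() {
    let samples = [3, 7, 8, 42, 480, 9001];
    for (le, count) in bucket_latencies(&samples) {
        if le == u64::MAX {
            println!("le=+Inf count={count}");
        } else {
            println!("le={le} count={count}");
        }
    }
}
```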
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)
It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.
"Damned if they do, damned if they don't" kind of situation.
There are even lints for the usage of the `unwrap` and `expect` functions.
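For reference, those are the `clippy::unwrap_used` and `clippy::expect_used` lints; a minimal sketch of what denying them looks like (the function below is invented purely for illustration):

```rust
#![deny(clippy::unwrap_used, clippy::expect_used)]
// With the lints above denied, `.unwrap()` / `.expect()` become clippy
// errors instead of latent panics.

fn parse_buffer_limit(raw: &str) -> usize {
    // `raw.parse().unwrap()` would be rejected by the lint; handle the
    // failure explicitly and fall back to a default instead.
    raw.parse().unwrap_or(1_048_576) // hypothetical 1 MB default
}

fn main() {
    println!("{}", parse_buffer_limit("not a number"));
}
```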
As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.
Though that really depends. In companies where k8s is used the app will be brought back up immediately anyway.
Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.
They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story; this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in a "rollback-able", "graceful-fail", "fail-open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.
This can be architected in such a way that if a rules engine crashes, other systems are not impacted and cached rules and heuristics continue to function.
If some rules from one microservice fail, other systems will presumably have other rules still available and functioning, so it doesn't even seem like it would be a totally open firewall.
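A rough sketch of that shape - degrade to a cached, last-known-good rule set instead of failing the whole proxy - with all types and names invented for illustration:

```rust
// Hypothetical types; a sketch of degrading to cached rules rather than
// failing the whole proxy when the live rules engine errors out.

struct RuleSet {
    version: u64,
}

enum Verdict {
    Allow,
    Block,
}

fn evaluate_live(_request: &str) -> Result<Verdict, String> {
    // Stand-in for the rules engine hitting a bad configuration.
    Err("attempt to index field 'execute'".into())
}

fn evaluate_cached(request: &str, _cached: &RuleSet) -> Verdict {
    // Stand-in for evaluating the last known-good rule set.
    if request.contains("hypothetical-exploit-probe") {
        Verdict::Block
    } else {
        Verdict::Allow
    }
}

fn evaluate_with_fallback(request: &str, cached: &RuleSet) -> Verdict {
    match evaluate_live(request) {
        Ok(verdict) => verdict,
        Err(e) => {
            // Log loudly, but keep filtering with the cached rules instead
            // of returning 5xx for all traffic.
            eprintln!("rules engine failed ({e}); falling back to cached rule set v{}", cached.version);
            evaluate_cached(request, cached)
        }
    }
}

fn main() {
    let cached = RuleSet { version: 41 };
    match evaluate_with_fallback("GET /", &cached) {
        Verdict::Allow => println!("request served via cached rules"),
        Verdict::Block => println!("request blocked via cached rules"),
    }
}
```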
I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged since the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?
Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.
The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to rollout everywhere rather than waiting for progressive rollout.
There shouldn't be much chance of bringing the system into a novel state, because configuration management will largely put things into the correct state. (Where that doesn't work: if CM previously created files, it won't delete them unless explicitly told to do so.)
But who knows what issues reverting other teams' stuff might bring?
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
Privately disclosed: Nov 29 · Fix pushed: Dec 1 · Publicly disclosed: Dec 3
Disclosure: I work at Cloudflare, but not on the WAF
However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".
There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t yet finished implementing progressive rollouts.
I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.
Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
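For illustration, a staged rollout with a baking period looks roughly like this (stages, bake times, and the health check are invented; real bake periods would be minutes to hours, not seconds):

```rust
use std::{thread, time::Duration};

// Invented stages and bake times; a sketch of a progressive rollout with
// a health gate between stages, not Cloudflare's actual release process.

fn stage_is_healthy(percent: u8) -> bool {
    // Stand-in for checking error rates for the cohort at this percentage.
    println!("checking health at {percent}% of the fleet");
    true
}

fn progressive_rollout(change_id: &str) -> Result<(), String> {
    // (% of fleet, bake seconds) - shortened here so the sketch runs quickly.
    let stages = [(1u8, 10u64), (5, 5), (25, 5), (100, 0)];
    for (percent, bake_secs) in stages {
        println!("deploying {change_id} to {percent}%");
        thread::sleep(Duration::from_secs(bake_secs)); // bake before widening blast radius
        if !stage_is_healthy(percent) {
            return Err(format!("unhealthy at {percent}%, halting rollout"));
        }
    }
    Ok(())
}

fn main() {
    if let Err(e) = progressive_rollout("hypothetical-waf-rules-update") {
        eprintln!("{e}");
    }
}
```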
Particularly if we're asking them to be careful & deliberate about deployments, it's hard to ask them to fast-track this.
This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?
Also specious reasoning #2, not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.
As a recovering devops/infra person from a lifetime ago, perhaps that is where my grace in this regard comes from. Things break, systems are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, Github, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).
(i have seen a proof of concept tested and known good against React2Shell and understood the severity of the vuln and level of risk of not broadly attempting to mitigate this with WAF rules across their customer base)
I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support business. The call to switch a provider is ultimately a business and strategic call, and is a subject that has extremely high inertia. You hired people specialized in technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.
Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.
Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate vendors every year for potential replacement. Maybe you do things differently? You are a customer; your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
There is another name for rolling forward: it's called tripping up.
In this case they got unlucky with an incident before they finished work on planned changes from the last incident.
And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in house folks who might not know better"
Cloudflare made it less of an expedite.
Note that the two deployments were of different components.
Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.
There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.
Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
Ouch. Harsh, given that Cloudflare's being over-honest (down to disabling the internal tool) and the outage's relatively limited impact (time-wise and in number of customers affected). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.
Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...
cf TFA:
[0] https://news.ycombinator.com/item?id=44159166
Also there seems to be insufficient testing before deployment, with very junior-level mistakes.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
I guess those at Cloudflare are not learning anything from the previous disaster.
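A sketch of the kind of pre-propagation gate the comment above is asking for - run the candidate configuration through the same evaluation path production uses and refuse to ship on any error. Everything here is hypothetical, not Cloudflare's deployment tooling.

```rust
// Sketch of a pre-propagation gate: run the candidate configuration
// through the same evaluation path production uses and refuse to ship
// it if anything errors. All names are hypothetical.

struct CandidateConfig {
    disabled_rules: Vec<String>,
}

fn evaluate_sample_traffic(config: &CandidateConfig) -> Result<(), String> {
    // Stand-in for replaying representative requests through the rules
    // module with `config` applied (the path that raised the Lua exception).
    if config.disabled_rules.iter().any(|rule| rule.is_empty()) {
        return Err("rules module raised an exception".into());
    }
    Ok(())
}

fn gate_deploy(config: &CandidateConfig) -> Result<(), String> {
    evaluate_sample_traffic(config)
        .map_err(|e| format!("validation failed, refusing to propagate: {e}"))
}

fn main() {
    let candidate = CandidateConfig { disabled_rules: vec![String::new()] };
    match gate_deploy(&candidate) {
        Ok(()) => println!("safe to propagate"),
        Err(e) => eprintln!("{e}"),
    }
}
```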
From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)
Come on.
This PM raises more questions than it answers, such as why exactly China would have been immune.
I have seen similar bugs in cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
The feature is only available to enterprise plans; it should not even allow external verification.
https://www.cloudflare.com/careers/jobs/?department=Engineer...
https://www.answeroverflow.com/m/1234405297787764816
> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request-body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.
So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.
For DDOS protection you can't really rely on multiple-hours rollouts.
Super-procedural code in particular is too complex for humans to follow, much less AI.
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.
Say what you want, but I'd prefer to trust CloudFlare, who admit and act upon their fuckups, rather than trying to cover them up or downplay them like some other major cloud providers.
@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems
You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.
I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.
https://www.csoonline.com/article/3814810/backdoor-in-chines...
Most hospital and healthcare IT teams are extremely under funded, undertrained, overworked, and the software, configurations and platforms are normally not the most resilient things.
I have a friend at one in the North East going through a hell of a security breach for multiple months now, and I'm flabbergasted no one is dead yet.
When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.
(and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)
This childish nonsense needs to end.
Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.
I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
That's why their salaries are so high.
In this particular case, they seem to be doing two things:
- Phasing out the old proxy (Lua-based), which is being replaced by FL2 (Rust-based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules

And they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.
That's not deserving of sympathy.
what if we broke the internet, just a little bit, to even the score with unwrap (kidding!!)
Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.