Cloudflare Outage on December 5, 2025
Posted 29 days ago · Active 27 days ago
Source: blog.cloudflare.com · News story · Informative, neutral · Debate score: 20/100
Key topics: Website Management, Outage, Networking

Discussion activity: very active. First comment 4 minutes after posting; peak of 85 comments in the 0-2h window; average of 12.3 comments per period.

Key moments
- Story posted: Dec 5, 2025 at 10:35 AM EST (29 days ago)
- First comment: Dec 5, 2025 at 10:39 AM EST (4 minutes after posting)
- Peak activity: 85 comments in the 0-2h window, the hottest window of the conversation
- Latest activity: Dec 6, 2025 at 3:09 PM EST (27 days ago)
ID: 46162656 · Type: story · Last synced: 12/5/2025, 3:45:10 PM
I hope cloudflare is far more resilient than local power.
Lua is a garbage language, used by garbage products, and promoted by engineers that could be replaced by garbage LLMs
> This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.
Relying on language features instead of writing code well will always eventually backfire.
https://security.googleblog.com/2025/11/rust-in-android-move...
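To make the quoted claim concrete, here is a minimal sketch (types and field names invented, not Cloudflare's FL1/FL2 code): in Lua, indexing into a field that turns out to be nil only fails at runtime with exactly this kind of `attempt to index field 'execute'` error, whereas the Rust version won't compile until the missing case is handled.

```rust
// Hypothetical sketch - field and type names are invented for
// illustration; this is not Cloudflare's FL1/FL2 code.

struct Rule {
    name: String,
    // The action may legitimately be absent, e.g. for a disabled rule.
    execute: Option<fn() -> Result<(), String>>,
}

fn run_rule(rule: &Rule) -> Result<(), String> {
    // The compiler forces a decision about what a missing `execute` means.
    // The Lua equivalent (indexing into a nil `execute` field) only blows
    // up at runtime, as in the `init.lua:314` exception from the report.
    match rule.execute {
        Some(action) => action(),
        None => Ok(()), // disabled rule: nothing to run
    }
}

fn main() {
    let disabled = Rule {
        name: "hypothetical-disabled-waf-rule".into(),
        execute: None,
    };
    assert!(run_rule(&disabled).is_ok());
    println!("rule {} handled without a runtime exception", disabled.name);
}
```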
Interesting.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Yes - as they explain, it was the rollback triggered by seeing those errors that broke things.
This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.
Their timeline:
> 08:47: Configuration change deployed and propagated to the network
> 08:48: Change fully propagated
> 08:50: Automated alerts
> 09:11: Configuration change reverted and propagation start
> 09:12: Revert fully propagated, all traffic restored
2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.
Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.
How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?
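As a sketch of what that could look like in practice (the metric source, thresholds, and rollback hook below are all invented, not Cloudflare's tooling): a guard that watches the global 5xx rate for the first seconds after propagation and reverts automatically on a spike.

```rust
use std::{thread, time::Duration};

// All names and numbers are invented for illustration; this is a sketch
// of an automated post-deploy rollback guard, not Cloudflare's tooling.

/// Stand-in for querying monitoring for the current global 5xx rate (req/s).
fn current_5xx_rate() -> f64 {
    // A real guard would hit the metrics pipeline; here we fake a spike.
    250_000.0
}

/// Stand-in for invoking the deployment system's revert path.
fn trigger_rollback(change_id: &str) {
    println!("rolling back change {change_id}");
}

fn watch_deploy(change_id: &str, baseline_5xx_rate: f64) {
    // Watch for the first seconds after propagation instead of waiting
    // minutes for a scrape-based alert to fire.
    for _ in 0..30 {
        if current_5xx_rate() > baseline_5xx_rate * 10.0 {
            trigger_rollback(change_id);
            return;
        }
        thread::sleep(Duration::from_secs(1));
    }
    println!("change {change_id} looks healthy");
}

fn main() {
    watch_deploy("hypothetical-waf-config-change", 1_000.0);
}
```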
Yes, there are lots of mission critical systems that use cloudflare and lives and huge amounts of money are at stake.
This reads like sarcasm. But I guess it is not. Yes, you are a CDN, a major one at that. 30 minutes of downtime or "whatever" is not acceptable. I worked on the traffic teams of social networks that considered themselves that mission critical. CF is absolutely that critical, and there are definitely lives at stake.
This is far too dismissive of how disruptive the downtime can be and it sets the bar way too low for a company so deeply entangled in global internet infrastructure.
I don’t think you can make such an assertion with any degree of credibility.
30 minutes of unplanned downtime for infrastructure is unacceptable; but we’re tending to accept it. AWS or Cloudflare have positioned themselves as The Internet so they need to be held to a higher standard.
I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.
The vast majority of current monitoring systems are built on Prometheus. (Well, okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high"-frequency online service monitoring system does a scrape every 30 seconds. Well-known reliability engineering practice says you need a minimum of two consecutive telemetry points to detect any given event, because we're talking about a distributed system and the network is not a reliable transport. That in turn means that, with near-perfect reliability, the maximum window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will only show up after a delay of just-a-hair-shy-of-three scrape cycles.
At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant the first alert for a failure could fire effectively 30 seconds after the failures manifested.
Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
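For concreteness, the arithmetic above works out like this (a toy sketch; the intervals are the ones mentioned in the comment, not anything Cloudflare has published):

```rust
// Worst-case detection delay for scrape-based monitoring: a failure can
// start just after a scrape, and the alert needs `consecutive_bad`
// consecutive bad samples before it can fire.
fn worst_case_detection_secs(scrape_interval_secs: u64, consecutive_bad: u64) -> u64 {
    // Almost one full interval can pass before the first bad sample,
    // then `consecutive_bad` further scrapes are needed to confirm.
    scrape_interval_secs * (consecutive_bad + 1)
}

fn main() {
    // 30s scrapes, two consecutive bad points: up to ~90s before an alert fires.
    println!("30s scrapes: up to {}s", worst_case_detection_secs(30, 2));
    // The 10s tuning mentioned above: still up to ~30s.
    println!("10s scrapes: up to {}s", worst_case_detection_secs(10, 2));
}
```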
Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.
At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes times (at scale, going through store and forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: a non-zero possibility of system collapse. Bad: faults can progress very quickly toward global outages. Good: faults are detected quickly, response decision-making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.
While we're here, any other Prometheus or Grafana advice is welcome.
Prometheus has an unaddressed flaw [0], where rate() windows must span at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.
[0] - https://github.com/prometheus/prometheus/issues/3746
Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.
High frequency events are often collected into distributions and the individual timestamps are thrown away.
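A toy sketch of what "collected into distributions" means in practice (bucket boundaries invented): individual events are folded into bucket counts and the timestamps are discarded.

```rust
use std::collections::BTreeMap;

// Toy sketch of folding high-frequency events into a distribution:
// individual timestamps are dropped, only bucket counts survive.
// Bucket bounds are invented, Prometheus-histogram style.
fn bucket_latencies(latencies_ms: &[u64]) -> BTreeMap<u64, u64> {
    let bounds = [5, 10, 25, 50, 100, 250, 500, u64::MAX];
    let mut histogram = BTreeMap::new();
    for &latency in latencies_ms {
        // Find the first bucket whose upper bound covers this sample.
        let bucket = *bounds.iter().find(|&&b| latency <= b).unwrap_or(&u64::MAX);
        *histogram.entry(bucket).or_insert(0) += 1;
    }
    histogram
}

fn main() {
    let samples = [3, 7, 8, 42, 480, 9001];
    for (le, count) in bucket_latencies(&samples) {
        if le == u64::MAX {
            println!("le=+Inf count={count}");
        } else {
            println!("le={le} count={count}");
        }
    }
}
```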
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)
It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.
"Damned if they do, damned if they don't" kind of situation.
There are even lints for the usage of the `unwrap` and `expect` functions.
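For reference, those are the `clippy::unwrap_used` and `clippy::expect_used` lints; a minimal sketch of what denying them looks like (the function below is invented purely for illustration):

```rust
#![deny(clippy::unwrap_used, clippy::expect_used)]
// With the lints above denied, `.unwrap()` / `.expect()` become clippy
// errors instead of latent panics.

fn parse_buffer_limit(raw: &str) -> usize {
    // `raw.parse().unwrap()` would be rejected by the lint; handle the
    // failure explicitly and fall back to a default instead.
    raw.parse().unwrap_or(1_048_576) // hypothetical 1 MB default
}

fn main() {
    println!("{}", parse_buffer_limit("not a number"));
}
```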
As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.
Though that really depends. In companies where k8s is used the app will be brought back up immediately anyway.
Perhaps it's the similar way of not testing the possible error path, which is an organizational problem.
They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me that sounds like there's more to the story; this sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in a "rollback-able", "graceful-fail", "fail-open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.
This can be architected in such a way that if a rules engine crashes, other systems are not impacted and cached rules and heuristics continue to function.
If some rules from one microservice fail, other systems will presumably have other rules still available and functioning, so it doesn't even seem like it would be a totally open firewall.
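A rough sketch of that shape - degrade to a cached, last-known-good rule set instead of failing the whole proxy - with all types and names invented for illustration:

```rust
// Hypothetical types; a sketch of degrading to cached rules rather than
// failing the whole proxy when the live rules engine errors out.

struct RuleSet {
    version: u64,
}

enum Verdict {
    Allow,
    Block,
}

fn evaluate_live(_request: &str) -> Result<Verdict, String> {
    // Stand-in for the rules engine hitting a bad configuration.
    Err("attempt to index field 'execute'".into())
}

fn evaluate_cached(request: &str, _cached: &RuleSet) -> Verdict {
    // Stand-in for evaluating the last known-good rule set.
    if request.contains("hypothetical-exploit-probe") {
        Verdict::Block
    } else {
        Verdict::Allow
    }
}

fn evaluate_with_fallback(request: &str, cached: &RuleSet) -> Verdict {
    match evaluate_live(request) {
        Ok(verdict) => verdict,
        Err(e) => {
            // Log loudly, but keep filtering with the cached rules instead
            // of returning 5xx for all traffic.
            eprintln!("rules engine failed ({e}); falling back to cached rule set v{}", cached.version);
            evaluate_cached(request, cached)
        }
    }
}

fn main() {
    let cached = RuleSet { version: 41 };
    match evaluate_with_fallback("GET /", &cached) {
        Verdict::Allow => println!("request served via cached rules"),
        Verdict::Block => println!("request blocked via cached rules"),
    }
}
```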
I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough you've got enough developers that half a dozen PRs will have been merged since the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issues?
Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.
The short answer is "yes" due to the way the configuration management works. Other infrastructure changes or service upgrades might get undone, but it's possible. Or otherwise revert the commit that introduced the package bump with the new code and force that to rollout everywhere rather than waiting for progressive rollout.
There shouldn't be much chance of bringing the system into a novel state, because configuration management will largely put things into the correct state. (Where that doesn't work: if CM previously created files, it won't delete them unless explicitly told to do so.)
But who knows what issues reverting other teams' stuff might bring?
In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.
That isn't to say it didn't work out badly this time, just that the calculation is a bit different.
Privately disclosed: Nov 29 · Fix pushed: Dec 1 · Publicly disclosed: Dec 3
Disclosure: I work at Cloudflare, but not on the WAF
However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems", it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".
There’s no other deployment system available. There’s a single system for config deployment, and it’s all that was available, as they haven’t yet finished implementing progressive rollouts.
I’m happy to see they’re changing their systems to fail open which is one of the things I mentioned in the conversation about their last outage.
Hindsight is always 20/20, but I don't know how that sort of oversight could happen in an organization whose business model rides on reliability. Small shops understand the importance of safeguards such as progressive deployments or one-box-style deployments with a baking period, so why not the likes of Cloudflare? Don't they have anyone on their payroll who warns about the risks of global deployments without safeguards?
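For illustration, a staged rollout with a baking period looks roughly like this (stages, bake times, and the health check are invented; real bake periods would be minutes to hours, not seconds):

```rust
use std::{thread, time::Duration};

// Invented stages and bake times; a sketch of a progressive rollout with
// a health gate between stages, not Cloudflare's actual release process.

fn stage_is_healthy(percent: u8) -> bool {
    // Stand-in for checking error rates for the cohort at this percentage.
    println!("checking health at {percent}% of the fleet");
    true
}

fn progressive_rollout(change_id: &str) -> Result<(), String> {
    // (% of fleet, bake seconds) - shortened here so the sketch runs quickly.
    let stages = [(1u8, 10u64), (5, 5), (25, 5), (100, 0)];
    for (percent, bake_secs) in stages {
        println!("deploying {change_id} to {percent}%");
        thread::sleep(Duration::from_secs(bake_secs)); // bake before widening blast radius
        if !stage_is_healthy(percent) {
            return Err(format!("unhealthy at {percent}%, halting rollout"));
        }
    }
    Ok(())
}

fn main() {
    if let Err(e) = progressive_rollout("hypothetical-waf-rules-update") {
        eprintln!("{e}");
    }
}
```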
Particularly if we're asking them to be careful & deliberate about deployments, it's hard to ask them to fast-track this.
This is specious reasoning. How come I had to endure a total outage due to the rollout of a mitigation of a Nextjs vulnerability when my organization doesn't even own any React app, let alone a Nextjs one?
Also specious reasoning #2, not wanting to maintain a service does not justify blindly rolling out config changes globally without any safeguards.
As a recovering devops/infra person from a lifetime ago, perhaps that is where my grace in this regard comes from. Things break, systems are imperfect, and urgency can lead to unexpected failure. Sometimes it's Cloudflare, other times it's Azure, GCP, Github, etc. You can always use something else, but most of us continue to pick the happy path of "it works most of the time, and sometimes it does not." Hopefully the post mortem has action items to improve the safeguards you mention. If there are no process and technical improvements from the outage, certainly, that is where the failure lies (imho).
(i have seen a proof of concept tested and known good against React2Shell and understood the severity of the vuln and level of risk of not broadly attempting to mitigate this with WAF rules across their customer base)
I think your take is terribly simplistic. In a professional setting, virtually all engineers have no say on whether the company switches platforms or providers. Their responsibility is to maintain and develop services that support business. The call to switch a provider is ultimately a business and strategic call, and is a subject that has extremely high inertia. You hired people specialized in technologies, and now you're just dumping all that investment? Not to mention contracts. Think about the problem this creates.
Some of you sound like amateurs toying with pet projects, where today it's framework A on cloud provider X whereas tomorrow it's framework B on cloud provider Y. Come the next day, rinse and repeat. This is unthinkable in any remotely professional setting.
Vendor contracts have 1-3 year terms. We (a financial services firm) re-evaluate vendors every year for potential replacement. Maybe you do things differently? You are a customer; your choices are to remain a customer or to leave and find another vendor. These are not feelings, these are facts. If you are unhappy but choose not to leave a vendor, that is a choice, but it is your choice to make, and unless you are a large enough customer that you have leverage over the vendor, these are your only options.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
There is another name for rolling forward: it's called tripping up.
In this case they got unlucky with an incident before they finished work on planned changes from the last incident.
And on top of that, Cloudflare's value proposition is "we're smart enough to know that instantaneous global deployments are a bad idea, so trust us to manage services for you so you don't have to rely on in house folks who might not know better"
Cloudflare made it less of an expedite.
Note that the two deployments were of different components.
Basically, imagine the following scenario: A patch for a critical vulnerability gets released, during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer instead, you roll out a GPO to use a blank screensaver instead of the intended corporate branding, a crash in a script parsing the GPOs on this new value prevents users from logging in.
There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.
Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.
“We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.
“We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”
Ouch. Harsh, given that Cloudflare's being over-honest (down to disabling the internal tool) and the outage's relatively limited impact (time-wise and in number of customers affected). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap, Dec 5 it's Lua's turn with its dynamic typing.
Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...
cf TFA:
[0] https://news.ycombinator.com/item?id=44159166
Also there seems to be insufficient testing before deployment, with very junior-level mistakes.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".
I guess those at Cloudflare are not learning anything from the previous disaster.
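A sketch of the kind of pre-propagation gate the comment above is asking for - run the candidate configuration through the same evaluation path production uses and refuse to ship on any error. Everything here is hypothetical, not Cloudflare's deployment tooling.

```rust
// Sketch of a pre-propagation gate: run the candidate configuration
// through the same evaluation path production uses and refuse to ship
// it if anything errors. All names are hypothetical.

struct CandidateConfig {
    disabled_rules: Vec<String>,
}

fn evaluate_sample_traffic(config: &CandidateConfig) -> Result<(), String> {
    // Stand-in for replaying representative requests through the rules
    // module with `config` applied (the path that raised the Lua exception).
    if config.disabled_rules.iter().any(|rule| rule.is_empty()) {
        return Err("rules module raised an exception".into());
    }
    Ok(())
}

fn gate_deploy(config: &CandidateConfig) -> Result<(), String> {
    evaluate_sample_traffic(config)
        .map_err(|e| format!("validation failed, refusing to propagate: {e}"))
}

fn main() {
    let candidate = CandidateConfig { disabled_rules: vec![String::new()] };
    match gate_deploy(&candidate) {
        Ok(()) => println!("safe to propagate"),
        Err(e) => eprintln!("{e}"),
    }
}
```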
From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)
Come on.
This PM raises more questions than it answers, such as why exactly China would have been immune.
I have seen similar bugs in cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
The feature is only available to enterprise plans; it should not even allow external verification.
https://www.cloudflare.com/careers/jobs/?department=Engineer...
https://www.answeroverflow.com/m/1234405297787764816
> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), but a 1 MB request body for even multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request-body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.
So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.
For DDOS protection you can't really rely on multiple-hours rollouts.
Super-procedural code in particular is too complex for humans to follow, much less AI.
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.
Say what you want, but I'd prefer to trust CloudFlare, who admit and act upon their fuckups, rather than trying to cover them up or downplay them like some other major cloud providers.
@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems
You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.
I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.
https://www.csoonline.com/article/3814810/backdoor-in-chines...
Most hospital and healthcare IT teams are extremely under funded, undertrained, overworked, and the software, configurations and platforms are normally not the most resilient things.
I have a friend at one in the North East going through a hell of a security breach for multiple months now, and I'm flabbergasted no one is dead yet.
When it comes to tech, I get the impression most organizations are not very "healthy" in the durability of systems.
(and also, rolling your own version of WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical devices company will beat CF at this game.)
This childish nonsense needs to end.
Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.
I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
That's why their salaries are so high.
In this particular case, they seem to be doing two things:
- Phasing out the old proxy (Lua-based), which is being replaced by FL2 (Rust-based, the one that caused the previous incident)
- Reacting to an actively exploited vulnerability in React by deploying WAF rules

And they're doing them in a relatively careful way (test rules) to avoid fuckups, which caused this unknown state, which triggered the issue.
That's not deserving of sympathy.
what if we broke the internet, just a little bit, to even the score with unwrap (kidding!!)
Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when written, or even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.