
Cloudflare outage on November 18, 2025 post mortem

1391 points
827 comments

Mood

heated

Sentiment

mixed

Category

tech

Key topics

Cloudflare outage

post-mortem analysis

software reliability

Debate intensity: 80/100
Related: Cloudflare Global Network experiencing issues - https://news.ycombinator.com/item?id=45963780 - Nov 2025 (1580 comments)

Cloudflare's post-mortem analysis of their November 18, 2025 outage reveals a configuration error caused by a database permissions change, sparking discussion on software reliability, error handling, and change management.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

13m

Peak period

66 comments in Hour 1

Avg / period

12.3

Comment distribution: 160 data points

Based on 160 loaded comments

Key moments

  1. Story posted

    11/18/2025, 11:31:22 PM

    19h ago

  2. First comment

    11/18/2025, 11:44:32 PM

    13m after posting

  3. Peak activity

    66 comments in Hour 1

    Hottest window of the conversation

  4. Latest activity

    11/19/2025, 3:14:34 PM

    4h ago


Discussion (827 comments)
Showing 160 comments of 827
nawgz
19h ago
4 replies
> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause internet-scale outages. What an era we live in

Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
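
For context on the unwrap() being discussed: a minimal sketch (invented names, not Cloudflare's actual code) of what propagating the error instead of panicking might look like, using the 200-feature limit mentioned in the post mortem:

    // Sketch only: invented names, not Cloudflare's code. The point is that an
    // oversized feature file becomes an Err the caller must handle, not a panic.
    #[derive(Debug)]
    enum FeatureError {
        TooManyFeatures { got: usize, limit: usize },
    }

    const FEATURE_LIMIT: usize = 200; // the post mortem mentions a 200-feature cap

    fn load_features(names: Vec<String>) -> Result<Vec<String>, FeatureError> {
        if names.len() > FEATURE_LIMIT {
            return Err(FeatureError::TooManyFeatures { got: names.len(), limit: FEATURE_LIMIT });
        }
        Ok(names)
    }

    fn main() {
        let names: Vec<String> = (0..300).map(|i| format!("feature_{i}")).collect();
        match load_features(names) {
            Ok(features) => println!("loaded {} features", features.len()),
            // A real proxy might keep serving with the previous feature set here.
            Err(e) => eprintln!("rejected feature file: {e:?}"),
        }
    }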

jmclnx
19h ago
1 reply
I have to wonder if AI was involved with the change.
norskeld
19h ago
I don't think this is the case with CloudFlare, but for every recent GitHub outage or performance issue... oh boy, I blame the clankers!
mewpmewp2
19h ago
1 reply
It would have been caught in stage only if there was a similar amount of data in the database. If stage had 2x less data, it would never have occurred there. It's not super clear how easy it would have been to keep the stage database exactly like the production database in terms of quantity and similarity of data, etc.

I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

Aeolun
19h ago
1 reply
> I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.

Mostly to catch performance regressions, but it would work to catch these issues too.

I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.

mewpmewp2
19h ago
1 reply
But now consider how much extra data Cloudflare, at its size, would have to keep just for staging, doubling or more their costs to have stage exactly match production. They would have to simulate a similar amount of requests on top of that, etc.

In this case the database table in question seems modest in size (the features for ML), so naively they could at the very least have kept stage features always in sync with prod, but it could be they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.

It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.

Aeolun
18h ago
That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?

Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.

norskeld
19h ago
1 reply
This wild `unwrap()` kinda took me aback as well. Someone really believed in themselves writing this. :)
Jach
19h ago
1 reply
They only recently rewrote their core in Rust (https://blog.cloudflare.com/20-percent-internet-upgrade/) -- given the newness of the system and things like "Over 100 engineers have worked on FL2, and we have over 130 modules" I won't be surprised for further similar incidents.
gishh
16h ago
The irony of a rust rewrite taking down the internet is not lost on me.
shoo
18h ago
The speed and transparency of Cloudflare publishing this post mortem is excellent.

I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.

Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.

In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.

Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain, and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
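
A tiny sketch of that golden-output idea, with an invented stand-in for the metadata query and made-up column names; rows are sorted before comparison so duplicates or missing columns surface as a test failure:

    #[cfg(test)]
    mod golden_output_tests {
        // Illustrative only: the "query" is hard-coded; in CI it would run against
        // a toy database populated with a curated dataset.
        fn run_metadata_query() -> Vec<(String, String)> {
            vec![
                ("bot_score".to_string(), "Float64".to_string()),
                ("ja3_hash".to_string(), "String".to_string()),
            ]
        }

        fn normalise(mut rows: Vec<(String, String)>) -> Vec<(String, String)> {
            rows.sort(); // make the comparison independent of row order
            rows
        }

        #[test]
        fn metadata_query_matches_golden_output() {
            // The golden rows would normally live in a checked-in file, updated
            // deliberately when a functional change is intended.
            let golden = vec![
                ("bot_score".to_string(), "Float64".to_string()),
                ("ja3_hash".to_string(), "String".to_string()),
            ];
            // Duplicate rows from an extra schema would fail this assertion.
            assert_eq!(normalise(run_metadata_query()), normalise(golden));
        }
    }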

binarymax
19h ago
5 replies
28M 500 errors/sec for several hours from a single provider. Must be a new record.

No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.

captainkrtek
19h ago
Something like a major telco going out, for example the AT&T 1990 outage of long distance calling:

> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.

> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.

> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...

alhirzel
17h ago
> I wonder what some outage analogs to the pre-internet ages would be.

Lots of things have the sky in common. Maybe comet-induced ice ages...

adventured
19h ago
> No other time in history has one single company been responsible for so much commerce and traffic.

AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.

manquer
19h ago
Absolute volume, maybe[1]; as a relative % of global digital communication traffic, the era of the early telegraph probably has it beat.

In the pre-digital era, the East India Company dwarfs every other company by considerable margins in any metric you like: commerce controlled, global shipping, communication traffic, private army size, % of GDP, % of workforce employed.

The default throughout history was the large consolidated organization, like say Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.

[1] Although I suspect either AWS or MS/Azure recent down-times in the last couple of years are likely higher

nullbyte808
19h ago
Yes, all (most) eggs should not be in one basket. Perfect opportunity to set up a service that checks Cloudflare and then switches a site's DNS to Akamai as a backup.
0xbadcafebee
19h ago
2 replies
So, to recap:

  - Their database permissions changed unexpectedly (??)
  - This caused a 'feature file' to be changed in an unusual way (?!)
     - Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
  - Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
     - They hit an internal application memory limit and that just... crashed the app
  - The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
  - After fixing it, they were vulnerable to a thundering herd problem
  - Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
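
A rough sketch of the progressive-deployment-with-circuit-breaker shape described above, with invented stage names and a stubbed health check standing in for real error-rate metrics:

    // Sketch only: `healthy_after_push` is a stub; a real version would watch
    // error rates for a soak period after the new file lands on each stage.
    fn healthy_after_push(_stage: &str) -> bool {
        true
    }

    fn rollout(stages: &[&str]) -> Result<(), String> {
        for stage in stages {
            println!("pushing feature file to {stage}");
            if !healthy_after_push(stage) {
                // Circuit breaker: halt and roll back instead of continuing globally.
                return Err(format!("rollout halted at {stage}; rolling back"));
            }
        }
        Ok(())
    }

    fn main() {
        let stages = ["canary (handful of machines)", "one region", "global"];
        match rollout(&stages) {
            Ok(()) => println!("rollout complete"),
            Err(e) => eprintln!("{e}"),
        }
    }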

tptacek
19h ago
1 reply
People jump to say things like "where's the rollback" and, like, probably yeah, but keep in mind that speculative rollback features (that is: rollbacks built before you've experienced the real error modes of the system) are themselves sources of sometimes-metastable distributed system failures. None of this is easy.
0xbadcafebee
14h ago
How about where's the most basic test to check if your config file will actually run at all in your application? It was a hard-coded memory limit; a git-hook test suite run on a MacBook would have caught this. But nooo, let's not run the app for 0.01 seconds with this config before sending it out to determine the fate of the internet?

This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.

This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.

paulddraper
18h ago
1 reply
Looks like you have the perfect window to disrupt them with a superior product.
mercnz
18h ago
just before this outage i was exploring bunnycdn as the idea of cloudflare taking over dns still irks me slightly. there are competitors. but there's a certain amount of scale that cloudflare offers which i think can help performance in general. that said in the past i found cloudflare performance terrible when i was doing lots of testing. they are predominantly a pull based system not a push, so if content isn't current the cache miss performance can be kind of blah. i think their general backhaul paths have improved, but at least from new zealand they used to seem to do worse than hitting a los angeles proxy that then hits origin. (although google was in a similar position before, where both 8.8.8.8 and www.google.co.nz/.com were both faster via los angeles than via normal paths - i think google were doing asia parent, like if testing 8.8.8.8 misses it was super far away). i think now that we have http/3 etc though that performance is a bit simpler to achieve, and that ddos, bot protection is kind of the differentiator, and i think that cloudflare's bot protection may work reasonably well in general?
rawgabbit
19h ago
1 reply

     > The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

  SELECT
    name,
    type
  FROM system.columns
  WHERE
    table = 'http_requests_features'
  order by name;

    Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.
rawgabbit
17h ago
Here is a bit more context in addition to the quote above. A ClickHouse permissions change made a metadata query start returning duplicate column metadata from an extra schema, which more than doubled the size and feature count of a Bot Management configuration file. When this oversized feature file was deployed to edge proxies, it exceeded a 200-feature limit in the bot module, causing that module to panic and the core proxy to return 5xx errors globally
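
As a sketch of how deduplicating before building the file could have absorbed the duplicate metadata (invented struct and field names, not Cloudflare's schema):

    use std::collections::BTreeMap;

    // Illustrative only: duplicate metadata for the same column (e.g. from both
    // the default and r0 databases) collapses to a single entry.
    struct ColumnMeta {
        name: String,
        database: String,
        col_type: String,
    }

    fn dedupe_columns(rows: Vec<ColumnMeta>) -> Vec<ColumnMeta> {
        let mut by_name: BTreeMap<String, ColumnMeta> = BTreeMap::new();
        for row in rows {
            // Keep the first entry per column name; a stricter version could also
            // check that duplicates agree on the type.
            by_name.entry(row.name.clone()).or_insert(row);
        }
        by_name.into_values().collect()
    }

    fn main() {
        let rows = vec![
            ColumnMeta { name: "bot_score".into(), database: "default".into(), col_type: "Float64".into() },
            ColumnMeta { name: "bot_score".into(), database: "r0".into(), col_type: "Float64".into() },
        ];
        let deduped = dedupe_columns(rows);
        assert_eq!(deduped.len(), 1);
        for c in &deduped {
            println!("{} ({}, from {})", c.name, c.col_type, c.database);
        }
    }
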
zzzeek
19h ago
1 reply
> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.

And here is the query they used ** (OK, so it's not exactly):

     SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id
someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using much less queries (but i do have an idea).

more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.

captainkrtek
19h ago
I believe they mentioned ClickHouse
SerCe
19h ago
5 replies
As always, kudos for releasing a post mortem in less than 24 hours after the outage, very few tech organisations are capable of doing this.
bayesnet
19h ago
1 reply
And a well-written one at that. Compared to the AWS post-mortem this could be literature.
philipwhiuk
16h ago
Except it fails to document anything about the actions they made to Warp in London during the resolution.
yen223
19h ago
5 replies
I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.

Any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.

eastdakota
18h ago
4 replies
Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too. The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi but from a place with limited delivery selection options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.
philipgross
18h ago
1 reply
You call this transparency, but fail to answer the most important questions: what was in the burrito? Was it good? Would you recommend?
eastdakota
17h ago
Chicken burrito from Coyo Taco in Lisbon. I am not proud of this. It’s worse than ordering from Chipotle. But there are no Chipotle’s in Lisbon… yet.
jofzar
17h ago
> I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did

Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.

anurag
18h ago
Appreciate the extra transparency on the process.
ynx0
16h ago
How do you guys handle redaction? I'm sure even when trusted individuals are in charge of authoring, there's still a potential of accidental leakage which would probably be best mitigated by a team specifically looking for any slip ups.

Thanks for the insight.

thesh4d0w
19h ago
The person who posted both this blog article and the Hacker News post is Matthew Prince, one of the highly technical billionaire founders of Cloudflare. I'm sure if he wants something to happen, it happens.
madeofpalk
18h ago
From what I've observed, it depends on whether you're an "engineering company" or not.
tom1337
19h ago
I mean, the CEO posted the post-mortem, so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team is running the blog and that he wouldn't even know how to veto even if he wanted to [0]

[0] https://news.ycombinator.com/item?id=45588305

BrtByte
7h ago
Cloudflare seems to have baked this level of transparency into their culture and incident response process
andrewinardeer
15h ago
Plenty are capable. Most don't bother.
eastdakota
7h ago
* published less than 12 hours from when the incident began. Proud of the team for pulling together everything so quickly and clearly.
BrtByte
7h ago
It's not just a PR-friendly summary either... they included real technical detail, timestamps, even code snippets
gucci-on-fleek
19h ago
2 replies
> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.

> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions

Also appreciate the honesty here.

> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]

> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)

Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.

eastdakota
19h ago
7 replies
Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot) of machines worldwide to get them to flush their bad files.
dbetteridge
18h ago
1 reply
Question from a casual bystander, why not have a virtual/staging mini node that receives these feature file changes first and catches errors to veto full production push?

Or you do have something like this but the specific db permission change in this context only failed in production

forsalebypwner
18h ago
I think the reasoning behind this is because of the nature of the file being pushed - from the post mortem:

"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."

tptacek
19h ago
Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".

tetec1
19h ago
Yeah, I can imagine that this insertion was some high-pressure job.
philipwhiuk
16h ago
Why was Warp in London disabled temporarily? That change wasn't discussed in the RCA despite being called out in an update.

For London customers this made the impact more severe temporarily.

prawn
16h ago
Just asking out of curiosity, but roughly how many staff would've been involved in some way in sorting out the issue? Either outside regular hours or redirected from their planned work?
gucci-on-fleek
19h ago
Thanks for the explanation! This definitely reminds me of CrowdStrike outages last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)

hbbio
16h ago
Thx for the explanation!

Side thought as we're working on 100% onchain systems (for digital assets security, different goals):

Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.

That could have blocked propagation of the oversized file long before it reached the edge :)

chrismorgan
10h ago
1 reply
> much better than their completely false “checking the security of your connection” message

The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):

> example.com needs to review the security of your connection before proceeding.

It bothers me how this bald-faced lie of a wording has persisted.

(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)

eastdakota
7h ago
Next time open your dev console in your window and look at how much is going on in the background.
EvanAnderson
19h ago
4 replies
It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.

The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.

The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.

tptacek
19h ago
3 replies
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
eastdakota
19h ago
1 reply
That’s correct.
tptacek
19h ago
2 replies
Is it actually consul-template? (I have post-consul-template stress disorder).
threatofrain
16h ago
1 reply
I'd love to hear any commentary on Consul if anyone else has it.
tptacek
16h ago
I think Consul is great, for what it's worth; we were just abusing it.

https://fly.io/blog/a-foolish-consistency/

https://fly.io/blog/corrosion/

mh-
16h ago
Did you know: PCTSD affects more than 2 in 5 engineers.
EvanAnderson
19h ago
That's why I likened it to Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I thought better of and removed.)

Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.
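
A compressed sketch of that last-known-good fallback, using hypothetical types and an invented line-per-feature format; a bad update is logged and discarded while the module keeps serving the previous configuration:

    // Sketch only, not Cloudflare's design.
    struct FeatureConfig {
        features: Vec<String>,
    }

    fn parse_config(raw: &str) -> Result<FeatureConfig, String> {
        let features: Vec<String> = raw.lines().map(str::to_string).collect();
        if features.len() > 200 {
            return Err(format!("{} features exceeds the limit of 200", features.len()));
        }
        Ok(FeatureConfig { features })
    }

    struct BotModule {
        current: FeatureConfig,
    }

    impl BotModule {
        // On a bad update: log, alert, and keep the last known good config.
        fn apply_update(&mut self, raw: &str) {
            match parse_config(raw) {
                Ok(cfg) => self.current = cfg,
                Err(e) => eprintln!("rejecting feature file, keeping last known good: {e}"),
            }
        }
    }

    fn main() {
        let mut module = BotModule {
            current: FeatureConfig { features: vec!["known_good_feature".to_string()] },
        };
        let oversized: String = (0..300).map(|i| format!("f{i}\n")).collect();
        module.apply_update(&oversized);
        println!("still serving with {} feature(s)", module.current.features.len());
    }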

JB_Dev
12h ago
Code and Config should be treated similarly. If you would use a ring based rollout, canaries, etc for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.
Aeolun
19h ago
I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want to improve, or are incapable of improving.
perlgeek
10h ago
At my employer, we have a small script that automatically checks such generated config files. It does a diff between the old and the new version, and if the diff size exceeds a threshold (either total or relative to the size of the old file), it refuses to do the update, and opens a ticket for a human to look over it.

It has somewhat regularly saved us from disaster in the past.
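
A rough Rust rendering of that guard, assuming line count is a good-enough proxy for diff size; the thresholds here are invented:

    // Sketch: refuse a generated-config update that changes too much at once.
    fn update_looks_safe(old: &str, new: &str) -> bool {
        let old_lines = old.lines().count().max(1);
        let new_lines = new.lines().count();
        let growth = new_lines as f64 / old_lines as f64;
        // Reject if the file more than doubles or shrinks by more than half;
        // a rejected update would open a ticket for a human instead of shipping.
        (0.5..=2.0).contains(&growth)
    }

    fn main() {
        let old = "f1\nf2\nf3\n";
        let doubled = "f1\nf2\nf3\nf1\nf2\nf3\nf4\n";
        assert!(update_looks_safe(old, old));
        assert!(!update_looks_safe(old, doubled));
        println!("doubled config would have been held for review");
    }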

navigate8310
19h ago
I'm amazed that they are not using a simulator of some sort and are pushing changes directly to production.
tristan-morris
19h ago
3 replies
Why call .unwrap() in a function which returns Result<_,_>?

For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.

sayrer
19h ago
2 replies
Yes, can't have .unwrap() in production code (it's ok in tests)
orphea
19h ago
3 replies
Like goto, unwrap is just a tool that has its use cases. No need to make a boogeyman out of it.
gishh
19h ago
To be fair, if you’re not “this tall” you really shouldn’t consider using goto in a c program. Most people aren’t that tall.
metaltyphoon
18h ago
Yes, it's meant to be used in test code. If you're sure it can't fail, then use .expect() - that way it shows you made a choice and it wasn't just a dev oversight.
fwjafwasd
18h ago
Panics should be using .expect() in production
keyle
19h ago
unwrap itself isn't the problem...
tptacek
19h ago
5 replies
Probably because this case was something more akin to an assert than an error check.
stefan_
19h ago
2 replies
You are saying this would not have happened in a C release build where asserts define to nothing?

Wonder why these old grey beards chose to go with that.

ashishb
19h ago
1 reply
> You are saying this would not have happened in a C release build where asserts define to nothing?

Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.

tristan-morris
19h ago
1 reply
And rust, but they chose to panic on the error condition. Wild.
ashishb
17h ago
> And rust, but they chose to panic on the error condition. Wild.

unwrap() implicitly panic-ed, right?

tptacek
19h ago
I am one of those old grey beards (or at least, I got started shipping C code in the 1990s), and I'd leave asserts in prod serverside code given the choice; better that than a totally unpredictable error path.
thundergolfer
19h ago
1 reply
Fly writes a lot of Rust, do you allow `unwrap()` in your production environment? At Modal we only allow `expect("...")` and the message should follow the recommended message style[1].

I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.

1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...

tptacek
18h ago
1 reply
After The Great If-Let Outage Of 2024, we audited all our code for that if-let/rwlock problem, changed a bunch of code, and immediately added a watchdog for deadlocks. The audit had ~no payoff; the watchdog very definitely did.

I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure that an errant `unwrap` wouldn't produce stable failure modes.

thundergolfer
18h ago
Fair. I agree that saying "it's the unwrap" and calling it a day is wrong. Recently actually we've done an exercise on our Worker which is "assume the worst kind of panic happens. make the Worker be ok with it".

But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).

marcusb
19h ago
Rust has debug asserts for that. Using expect with a comment about why the condition should not/can't ever happen is idiomatic for cases where you never expect an Err.

This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.

tristan-morris
19h ago
Oh absolutely, that's how it would have been treated.

Surely an unwrap_or_default() would have been a much better fit--if fetching features fails, continue processing with an empty set of rules vs stopping the world.

piperswe
17h ago
In that case, it should probably be expect rather than unwrap, to document why the assertion should never fail.
koakuma-chan
19h ago
2 replies
Why is there a 200 limit on appending names?
nickmonad
18h ago
Limits in systems like these are generally good. They mention the reasoning around it explicitly. It just seems like the handling of that limit is what failed and was missed in review.
zmj
18h ago
Everything has a limit. You can define it, or be surprised when you find out what it is.
otterley
19h ago
5 replies
> work has already begun on how we will harden them against failures like this in the future. In particular we are:

> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input

> Enabling more global kill switches for features

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

> Reviewing failure modes for error conditions across all core proxy modules

Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?

This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.

Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.

nikcub
19h ago
1 reply
They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a single instance first would have seen the panic and stopped the deploy.

I wonder why clickhouse is used to store the feature flags here, as it has its own duplication footguns[0] which could have also easily led to a query blowing up 2/3x in size. oltp/sqlite seems more suited, but i'm sure they have their reasons

[0] https://clickhouse.com/docs/guides/developer/deduplication

HumanOstrich
18h ago
1 reply
I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.

Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.

hedora
16h ago
1 reply
I think the idea is to ship the sqlite database around.

It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.

HumanOstrich
16h ago
I think you're oversimplifying the problem they had, and I would encourage you to dive in to the details in the article. There wasn't a problem with the database, it was with the query used to generate the configs. So if an analogous issue arose with a query against a sqlite database, it wouldn't fix anything.

Also I've been seeing an uptrend in sqlite dogma on HN lately. I love sqlite for some things, but it's not The One True Database Solution.

mewpmewp2
19h ago
1 reply
It seems they had this continuous rollout for the config service, but the services consuming it were affected even by a small percentage of these config providers being faulty, since they were auto-updating their configs every few minutes. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.
otterley
19h ago
1 reply
It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.
tptacek
18h ago
1 reply
This isn't really "configuration" so much as it is "durable state" within the context of this system.
otterley
18h ago
2 replies
In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.

People think of configuration updates (or state updates, call it what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will happily yeet such changes into production -- even ones who wouldn't dare let a single line of code go live without putting it through the staged deployment and testing process.

HumanOstrich
18h ago
1 reply
They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it.
otterley
18h ago
1 reply
A dormant bug in the code is usually a condition precedent to incidents like these. Later, when a bad input is given, the bug then surfaces. The bug could have laid dormant for years or decades, if it ever surfaced at all.

The point here remains: consider every change to involve risk, and architect defensively.

tptacek
18h ago
1 reply
They made the classic distributed systems mistake and actually did something. Never leap to thing-doing!
otterley
18h ago
1 reply
If they're going to yeet configs into production, they ought to at least have plenty of mitigation mechanisms, including canary deployments and fault isolation boundaries. This was my primary point at the root of this thread.

And I hope fly.io has these mechanisms as well :-)

tptacek
18h ago
We've written at long, tedious length about how hard this problem is.
tptacek
18h ago
1 reply
Reframe this problem: instead of bot rules being propagated, it's the enrollment of a new customer or a service at an existing customer --- something that must happen at Cloudflare several times a second. Does it still make sense to you to think about that in terms of "pushing new configuration to prod"?
otterley
18h ago
1 reply
Those aren't the facts before us. Also, CRUD operations relating to a specific customer or user tend not to cause the sort of widespread incidents we saw today.
tptacek
18h ago
They're not, they're a response to your claim that "state" and "configuration" are indifferentiable.
Buttons840
10h ago
When a failsafe system fails, it fails by failing to fail safely.
ants_everywhere
17h ago
it's always a config push. people roll out code slowly but don't have the same mechanisms for configs. But configs are code, and this is a blind spot that causes an outsized percentage of these big outages.
Scaevolus
18h ago
Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.

In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.

Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.

lukan
19h ago
4 replies
"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."

Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)

Aeolun
19h ago
1 reply
I mean, that would require a postmortem from statuspage.io right? Is that a service operated by cloudflare?
edoceo
17h ago
Atlassian
eastdakota
19h ago
4 replies
We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.
francisofascii
6h ago
This situation reminds me of risk assessment, where you sometimes assume two rare events are independent, but later learn they are actually highly correlated.
reassess_blind
18h ago
The status page is hosted on AWS Cloudfront, right? It sure looks like Cloudfront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.
jbmsf
17h ago
It looks a lot like a CloudFront error we randomly saw today from one of our engineers in South America. I suspect there was a small outage in AWS but can't prove it.
dnw
19h ago
Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.
notatoad
19h ago
1 reply
it seems like a good chance that despite thinking their status page was completely independent of cloudfront, enough of the internet is dependent on cloudfront now that they're simply wrong about the status page's independence.
verletzen
18h ago
1 reply
i think you've got cloudflare and cloudfront mixed up.
notatoad
17h ago
ahah oops. yeah, it's a problem. i've got two projects ongoing that each rely on one of them, and i can never keep it straight.
paulddraper
18h ago
Quite possibly it was due to high traffic.

IDK Atlassian Statuspage's clientele, but it's possible Cloudflare is much larger than their usual customer.

vsgherzi
19h ago
3 replies
Why does cloudflare allow unwraps in their code? I would've assumed they'd have clippy lints stopping that sort of thing. Why not just match with { Ok(value) => {}, Err(error) => {} }? The function already has a Result type.

At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").

The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.

If anyone at cloudflare is here please let me in that codebase :)

waterTanuki
19h ago
2 replies
Not a cloudflare employee but I do write a lot of Rust. The amount of things that can go wrong with any code that needs to make a network call is staggeringly high. unwrap() is normal during development phase but there are a number of times I leave an expect() for production because sometimes there's no way to move forward.
SchemaLoad
18h ago
Yeah it seems likely that even if there wasn't an unwrap, there would have been some error handling that wouldn't have panicked the process, but would have still left it inoperable if every request was instead going through an error path.
vsgherzi
18h ago
I'm in a similar boat; at the very least an expect can give hints as to what happened. However this can also be problematic if you're a library developer. Sometimes Rust is expected to never panic, especially in situations like WASM. This is a major problem for companies like Amazon Prime Video since they run in a WASM context for their TV app. Any panic crashes everything. Personally I usually just either create a custom error type (preferred) or erase it away with Box<dyn Error> (no other option). Random unwraps and expects haunt my dreams.
ozgrakkurt
12h ago
1 reply
And the error magically disappears when the function returns it?
csomar
11h ago
It doesn’t disappear, it forces you to handle it.
pornel
6h ago
1 reply
unwrap() is only the most superficial part of the problem. Merely replacing `unwrap()` with `return Err(code)` wouldn't have changed the behavior. Instead of "error 500 due to panic" the proxy would fail with "error 500 due to $code".

Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have been even harder to diagnose.

`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.

The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.

frumplestlatz
4h ago
We don't know what the surrounding code looks like, but I'd expect it handles the error case that's expressed in the type signature (unless they `.unwrap()` there too).

The problem is that they didn't surface a failure case, which means they couldn't handle rollouts of invalid configurations correctly.

The use of `.unwrap()` isn't superficial at all -- it hid an invariant that should have been handled above this code. The failure to correctly account for and handle those true invariants is exactly what caused this failure mode.

ed_mercer
19h ago
1 reply
Wow. 26M/s 5xx error HTTP status codes over a span of roughly two hours. That's roughly 187 billion HTTP errors that interrupted people (and systems)!
watchful_moose
9h ago
Some of these would be retries that wouldn't have happened if not for earlier errors.
sigmar
19h ago
1 reply
Wow. What a post mortem. Rather than Monday morning quarterbacking how many ways this could have been prevented, I'd love to hear people sound-off on things that unexpectedly broke. I, for one, did not realize logging in to porkbun to edit DNS settings would become impossible with a cloudflare meltdown
brandon272
16h ago
That's unfortunate. I'll need to investigate whether Porkbun plans on decoupling its auth from being reliant on CloudFlare, otherwise I will need to migrate a few domains off of that registrar.
ojosilva
19h ago
2 replies
This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is the textbook "parse, don't validate" anti-pattern.

I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.

wrs
18h ago
8 replies
It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.

It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
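
Concretely, something like the following makes a bare unwrap() a lint error across a crate under cargo clippy while still allowing a justified one with the comment described above (the lint name is a real Clippy lint; the example itself is invented):

    // Crate root (e.g. main.rs): make unchecked unwraps a hard error under `cargo clippy`.
    #![deny(clippy::unwrap_used)]

    fn main() {
        let raw = "42";
        // INFALLIBILITY: `raw` is a compile-time literal known to parse as u32.
        #[allow(clippy::unwrap_used)]
        let n: u32 = raw.parse().unwrap();
        println!("{n}");
    }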

dist1ll
18h ago
3 replies
> every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.

How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.

tux3
18h ago
2 replies
Usually you'd want to write almost all your slice or other container iterations with iterators, in a functional style.

For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.

You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.

wrs
17h ago
1 reply
[delayed]
kibwen
16h ago
Indexing is comparatively rare given the existence of iterators, IMO. If your goal is to avoid any potential for panicking, I think you'd have a harder time with arithmetic overflow.
dist1ll
17h ago
For iteration, yes. But there's other cases, like any time you have to deal with lots of linked data structures. If you need high performance, chances are that you'll have to use an index+arena strategy. They're also common in mathematical codebases.
10000truths
17h ago
1 reply
Yes? Funnily enough, I don't often use indexed access in Rust. Either I'm looping over elements of a data structure (in which case I use iterators), or I'm using an untrusted index value (in which case I explicitly handle the error case). In the rare case where I'm using an index value that I can guarantee is never invalid (e.g. graph traversal where the indices are never exposed outside the scope of the traversal), then I create a safe wrapper around the unsafe access and document the invariant.
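
A small sketch of that kind of wrapper: an arena whose ids are only minted by its own alloc, so lookups stay in bounds as long as ids aren't mixed between arenas (the documented invariant):

    // Illustrative only.
    #[derive(Clone, Copy)]
    struct NodeId(usize);

    struct Arena<T> {
        nodes: Vec<T>,
    }

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { nodes: Vec::new() }
        }

        fn alloc(&mut self, value: T) -> NodeId {
            self.nodes.push(value);
            NodeId(self.nodes.len() - 1)
        }

        // INVARIANT: `id` came from `alloc` on this arena and nodes are never
        // removed, so the index is always in bounds.
        fn get(&self, id: NodeId) -> &T {
            &self.nodes[id.0]
        }
    }

    fn main() {
        let mut graph = Arena::new();
        let entry = graph.alloc("entry");
        let exit = graph.alloc("exit");
        println!("{} -> {}", graph.get(entry), graph.get(exit));
    }
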
dist1ll
17h ago
1 reply
If that's the case then hats off. What you're describing is definitely not what I've seen in practice. In fact, I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access. Even security-critical cryptography crates that passed audits don't do that.

If you could share some of your Rust code or snippets I would be very interested. I found it quite hard to avoid indexing for graph-heavy code, so I'm always on the lookout for interesting ways to enforce access safety.

hansvm
16h ago
1 reply
> graph-heavy code

Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.

dist1ll
16h ago
Sure, these days I'm mostly working on a few compilers. Let's say I want to make a fixed-size SSA IR. Each instruction has an opcode and two operands (which are essentially pointers to other instructions). The IR is populated in one phase, and then lowered in the next. During lowering I run a few peephole and code motion optimizations on the IR, and then do regalloc + asm codegen. During that pass the IR is mutated and indices are invalidated/updated. The important thing is that this phase is extremely performance-critical.
danielheath
18h ago
I mean... yeah, in general. That's what iterators are for.
dehrmann
17h ago
2 replies
It's the same blind spot people have with Java's checked exceptions. People commonly resort to Pokemon exception handling and either blindly ignore or rethrow as a runtime exception. When Rust got popular, I was a bit confused by people talking about how great Result is, when it's essentially a checked exception without a stack trace.
Terr_
16h ago
1 reply
[delayed]
bigstrat2003
16h ago
I'm with you! Checked exceptions are actually good and the hate for them is super short sighted. The exact same criticisms levied at checked exceptions apply to static typing in general, but people acknowledge the great value static types have for preventing errors at compile time. Checked exceptions have that same value, but are dunked on for some reason.
gwbas1c
16h ago
It's a lot lighter: a stack trace takes a lot of overhead to generate; a result has no overhead for a failure. The overhead (panic) only comes once the failure can't be handled. (Most books on Java/C# don't explain that throwing exceptions has high performance overhead.)

Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."

In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.

speed_spread
16h ago
1 reply
Pet peeve: unwrap() should be deprecated and renamed or_panic(). More consistent with the rest of stdlib methods and appropriately scarier.
echelon
16h ago
A lot of stuff should be done about the awful unwrap family of methods.

A few ideas:

- It should not compile in production Rust code

- It should only be usable within unsafe blocks

- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.

- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.

brabel
11h ago
Yes, I always thought it was wrong to use unwrap in examples. I know, people want to keep examples simple, but it trains developers to use unwrap() as they see that everywhere. Yes, there are places where it's ok as that blog post explains so well: https://burntsushi.net/unwrap/ But most devs IMHO don't have the time to make the call correctly most of the time... so it's just better to do something better, like handle the error and try to recover, or if impossible, at least do `expect("damn it, how did this happen")`.
anonnon
16h ago
> It seems people have a blind spot for unwrap

Not unlike people having a blind spot for Rust in general, no?

bombela
11h ago
This thread warms my heart. Rust has set a new baseline that many and myself now take for granted.

We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.

quotemstr
6h ago
> people have a blind spot for unwrap

It's not about whether you should ban unwrap() in production. You shouldn't. Some errors are logic bugs beyond which a program can't reasonably continue. The problem is that the language makes it too easy for junior developers (and AI!) to ignore non-logic-bug problems with unwrap().

Programmers early in their careers will do practically anything to avoid having to think about errors and they get angry when you tell them about it.

littlestymaar
14h ago
> In production code an unwrap or expect should be reviewed exactly like a panic.

An unwrap should never make it to production IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to just grep `unwrap` in your code and replace those that can happen with proper error management and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.

arccy
18h ago
1 reply
if you make it easy to be lazy and panic vs properly handling the error, you've designed a poor language
otterley
18h ago
1 reply
nine_k
18h ago
Works when you have the Erlang system that does graceful handling for you: reporting, restarting.
moralestapia
19h ago
No publicity is bad publicity.

Best post mortem I've read in a while, this thing will be studied for years.

A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust, should've never made it to production.

667 more comments available on Hacker News

ID: 45973709 · Type: story · Last synced: 11/19/2025, 7:26:53 PM
