Cloudflare Outage on November 18, 2025 Post Mortem

Postedabout 2 months agoActiveabout 2 months ago

eastdakota

1,465 points

916 comments

blog.cloudflare.comTechstoryHigh profile

heatedmixed

Debate

80/100

Cloudflare OutagePost-Mortem AnalysisSoftware Reliability

Key topics

Cloudflare Outage

Post-Mortem Analysis

Software Reliability

Related: Cloudflare Global Network experiencing issues - https://news.ycombinator.com/item?id=45963780 - Nov 2025 (1580 comments)

Cloudflare's post-mortem analysis of their November 18, 2025 outage reveals a configuration error caused by a database permissions change, sparking discussion on software reliability, error handling, and change management.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

13m

Peak period

149

Day 1

Avg / period

26.7

Comment distribution160 data points

Loading chart...

Based on 160 loaded comments

Key moments

01Story posted
Nov 18, 2025 at 6:31 PM EST
about 2 months ago
Step 01
02First comment
Nov 18, 2025 at 6:44 PM EST
13m after posting
Step 02
03Peak activity
149 comments in Day 1
Hottest window of the conversation
Step 03
04Latest activity
Nov 27, 2025 at 10:09 AM EST
about 2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (916 comments)

Showing 160 comments of 916

nawgz

about 2 months ago

4 replies

> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause internet-scale outages. What an era we live in

Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!

jmclnx

about 2 months ago

1 reply

I have to wonder if AI was involved with the change.

norskeld

about 2 months ago

I don't think this is the case with CloudFlare, but for every recent GitHub outage or performance issue... oh boy, I blame the clankers!

mewpmewp2

about 2 months ago

1 reply

It would have been caught only in stage if there was similar amount of data in the database. If stage has 2x less data it would have never occurred there. Not super clear how easy it would have been to keep stage database exactly as production database in terms of quantity and similarity of data etc.

I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

Aeolun

about 2 months ago

1 reply

> I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.

Mostly to catch performance regressions, but it would work to catch these issues too.

I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.

mewpmewp2

about 2 months ago

1 reply

But now consider how much extra data Cloudflare at its size would have to have just for staging, doubling or more their costs to have stage exactly as production. They would have to simulate similar amount of requests on top of themselves constantly since presumably they have 100s or 1000s of deployments per day.

In this case it seems the database table in question seemed modest in size (the features for ML) so naively thinking they could have kept stage features always in sync with prod at the very least, but could be they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.

It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.

Aeolun

about 2 months ago

2 replies

That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?

Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.

tatersolid

about 2 months ago

1 reply

> maybe briefly booting up another 100k isn’t going to be the end of the world

Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?

Aeolun

about 2 months ago

1 reply

From their desire to representatively test before they deploy to production?

Doing stuff at scale doesn’t suddenly mean you skip testing.

And just because they host stuff themselves doesn’t mean they couldn’t run on the cloud if they needed to.

mewpmewp2

about 2 months ago

Cloudflare infra costs are probably 300 mil+ usd. Their gaap profit is negative, their non gaap income is less than their infra expenses. Can you imagine how much they would have to charge more or spend more if they had to duplicate or simulate their production environment in staging and for each of the 100s deployments they probably do a day?

Their main cost of revenue is these infra costs.

mewpmewp2

about 2 months ago

But they are probably doing hundreds of deployments a day, so that would make their pipelines extremely long? Not to mention costs.

norskeld

about 2 months ago

1 reply

This wild `unwrap()` kinda took me aback as well. Someone really believed in themselves writing this. :)

Jach

about 2 months ago

1 reply

They only recently rewrote their core in Rust (https://blog.cloudflare.com/20-percent-internet-upgrade/) -- given the newness of the system and things like "Over 100 engineers have worked on FL2, and we have over 130 modules" I won't be surprised for further similar incidents.

gishh

about 2 months ago

1 reply

The irony of a rust rewrite taking down the internet is not lost on me.

NetMageSCW

about 2 months ago

20% seems to grow every time someone write about this.

shoo

about 2 months ago

The speed and transparency of Cloudflare publishing this port mortem is excellent.

I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.

Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.

In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.

Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what then "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.

binarymax

about 2 months ago

5 replies

28M 500 errors/sec for several hours from a single provider. Must be a new record.

No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.

captainkrtek

about 2 months ago

Something like a major telco going out, for example the AT&T 1990 outage of long distance calling:

> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.

> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.

> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...

alhirzel

about 2 months ago

> I wonder what some outage analogs to the pre-internet ages would be.

Lots of things have the sky in common. Maybe comet-induced ice ages...

adventured

about 2 months ago

> No other time in history has one single company been responsible for so much commerce and traffic.

AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.

manquer

about 2 months ago

Absolute volume maybe[1], as relative % of global digital communication traffic, the era of early telegraph probably has it beat.

In the pre digital era, East India Company dwarfs every other company in any metric like commerce controlled, global shipping, communication traffic, private army size, %GDP , % of workforce employed by considerable margins.

The default was large consolidated organization throughout history, like say Bell Labs, or Standard Oil before that and so on, only for a brief periods we have enjoyed benefits of true capitalism.

[1] Although I suspect either AWS or MS/Azure recent down-times in the last couple of years are likely higher

nullbyte808

about 2 months ago

Yes, all(most) eggs should not be in one basket. Perfect opportunity to setup a service that checks cloudflare then switches a site's DNS to akami as a backup.

0xbadcafebee

about 2 months ago

2 replies

So, to recap:

  - Their database permissions changed unexpectedly (??)
  - This caused a 'feature file' to be changed in an unusual way (?!)
     - Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
  - Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
     - They hit an internal application memory limit and that just... crashed the app
  - The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
  - After fixing it, they were vulnerable to a thundering herd problem
  - Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots

In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicing production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll-back when a newly deployed app started crashing.

tptacek

about 2 months ago

1 reply

People jump to say things like "where's the rollback" and, like, probably yeah, but keep in mind that speculative rollback features (that is: rollbacks built before you've experienced the real error modes of the system) are themselves sources of sometimes-metastable distributed system failures. None of this is easy.

0xbadcafebee

about 2 months ago

How about where's the most basic test to check if your config file will actually run at all in your application? It was a hard-coded memory limit; a git-hook test suite run a MacBook would have caught this. But nooo, let's not run the app for 0.01 seconds with this config before sending it out to determine the fate of the internet?

This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.

This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.

paulddraper

about 2 months ago

1 reply

Looks like you have the perfect window to disrupt them with a superior product.

mercnz

about 2 months ago

just before this outage i was exploring bunnycdn as the idea of cloudflare taking over dns still irks me slightly. there are competitors. but there's a certain amount of scale that cloudflare offers which i think can help performance in general. that said in the past i found cloudflare performance terrible when i was doings lots of testing. they are predominantly a pull based system not a push, so if content isn't current the cache miss performance can be kind of blah. i think their general backhaul paths have improved, but at least from new zealand they used to seem to do worse than hitting a los angeles proxy that then hits origin. (although google was in a similar position before, where both 8.8.8.8 and www.google.co.nz/.com were both faster via los angeles than via normal paths - i think google were doing asia parent, like if testing 8.8.8.8 misses it was super far away). i think now that we have http/3 etc though that performance is a bit simpler to achieve, and that ddos, bot protection is kind of the differentiator, and i think that cloudflare's bot protection may work reasonably well in general?

rawgabbit

about 2 months ago

1 reply

     > The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

  SELECT
  name,
  type
  FROM system.columns
  WHERE
  table =        'http_requests_features'
  order by name;

    Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

rawgabbit

about 2 months ago

Here is a bit more context in addition to the quote above. A ClickHouse permissions change made a metadata query start returning duplicate column metadata from an extra schema, which more than doubled the size and feature count of a Bot Management configuration file. When this oversized feature file was deployed to edge proxies, it exceeded a 200-feature limit in the bot module, causing that module to panic and the core proxy to return 5xx errors globally

zzzeek

about 2 months ago

1 reply

> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.

And here is the query they used ** (OK, so it's not exactly):

     SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id

someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using much less queries (but i do have an idea).

more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.

captainkrtek

about 2 months ago

I believe they mentioned ClickHouse

SerCe

about 2 months ago

5 replies

As always, kudos for releasing a post mortem in less than 24 hours after the outage, very few tech organisations are capable of doing this.

bayesnet

about 2 months ago

1 reply

And a well-written one at that. Compared to the AWS port-mortem this could be literature.

philipwhiuk

about 2 months ago

1 reply

Except it fails to document anything about the actions they made to Warp in London during the resolution.

eastdakotaAuthor

about 2 months ago

1 reply

There’s lots of things we did while we were trying to track down and debug the root cause that didn’t make it into the post. Sorry the WARP takedown impacted you. As I said in a comment above, it was the result of us (wrongly) believing that this was an attack targeting WARP endpoints in our UK data centers. That turned out to be wrong but based on where errors initially spiked it was a reasonable hypothesis we wanted to rule out.

philipwhiuk

about 2 months ago

Thanks!

yen223

about 2 months ago

5 replies

I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.

Any other large-ish company, there would be layers of "stakeholders" that will slow this process down. They will almost always never allow code to be published.

thesh4d0w

about 2 months ago

1 reply

The person who posted both this blog article and the hacker news post, is Matthew Prince, one of highly technical billionaire founders of cloudflare. I'm sure if he wants something to happen, it happens.

NewJazz

about 2 months ago

I'm sure he wanted traffic to flow for those few hours but it didn't :p

eastdakotaAuthor

about 2 months ago

10 replies

Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too. The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi but from a place with limited delivery selection options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.

philipgross

about 2 months ago

2 replies

You call this transparency, but fail to answer the most important questions: what was in the burrito? Was it good? Would you recommend?

eastdakotaAuthor

about 2 months ago

1 reply

Chicken burrito from Coyo Taco in Lisbon. I am not proud of this. It’s worse than ordering from Chipotle. But there are no Chipotle’s in Lisbon… yet.

luckys

about 2 months ago

There's a lot of good food places in Lisbon that you might not be familiar with yet. Enjoy your stay

dentemple

about 2 months ago

I DON'T see this as transparency. There is ZERO mention of the burrito in the post-mortem document itself.

0/10, get it right the first time, folks. (/s)

jofzar

about 2 months ago

1 reply

> I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did

Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.

dentemple

about 2 months ago

We're in a Live Fast Die Young karma world. If you can't get a TikTok ready with 2 minutes of the post modem drop, you might as well quit and become a barista instead.

ynx0

about 2 months ago

1 reply

How do you guys handle redaction? I'm sure even when trusted individuals are in charge of authoring, there's still a potential of accidental leakage which would probably be best mitigated by a team specifically looking for any slip ups.

Thanks for the insight.

eastdakotaAuthor

about 2 months ago

1 reply

Team has a good sense, typically. In this case, the names of the columns in the Bot Management feature table seemed sensitive. The person who included that in the master document we were working from added a comment: “Should redact column names.” John and I usually catch anything the rest of the team may have missed. For me, pays to have gone to law school, but also pays to have studied Computer Science in college and be technical enough to still understand both the SQL and Rust code here.

noname120

about 2 months ago

1 reply

Could you elaborate a bit on how going to law school helped? Was it because it made it easier for you to communicate and align with your CLO?

tinfoil3843

about 2 months ago

Probably because he could check legalities of a release himself without council. It is probably equivalent to educating yourself on your rights and laws so if you get pulled over by a cop who may try to do things that you can legally refuse, you can say no.

richardlblair

about 2 months ago

1 reply

A very human and authentic response. Love to see it.

Fantastic for recruiting, too.

anitil

about 2 months ago

> He didn’t reply after a few minutes so I did

I'd consider applying based on this alone

jeffrallen

about 2 months ago

A postmortem postmortem, I love it. Transparency to the power of 2.

anurag

about 2 months ago

Appreciate the extra transparency on the process.

tuetuopay

about 2 months ago

> I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate.

I'm so jealous. I've written postmortems for major incidents at a previous job: a few hours to write, a week of bikeshedding by marketing and communication and tech writers and ... over any single detail in my writing. Sanitizing (hide a part), simplifying (our customers are too dumb to understand), etc, so that the final writing was "true" in the sense that it "was not false", but definitely not what I would call "true and accurate" as an engineer.

auto

about 2 months ago

I'm not sure I've ever read something from someone so high up in a company that gave me such a strong feeling for "I'd like to work for these people". If job posts could be so informal and open ended, this post could serve as one in the form of a personality fit litmus test.

tinfoil3843

about 2 months ago

I really appreciate this level of transparency. Thank you for being a good person in such a powerful position in the world.

realaaa

about 2 months ago

that's very cool, thanks

madeofpalk

about 2 months ago

From what I've observed, it depends on whether you're an "engineering company" or not.

tom1337

about 2 months ago

I mean the CEO posted the post-mortem so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team is running the blog and that he wouldn't event know how to veto even if he wanted [0]

[0] https://news.ycombinator.com/item?id=45588305

BrtByte

about 2 months ago

Cloudflare seems to have baked this level of transparency into their culture and incident response process

eastdakotaAuthor

about 2 months ago

1 reply

* published less than 12 hours from when the incident began. Proud of the team for pulling together everything so quickly and clearly.

ignoramous

about 2 months ago

That's all well & good, but I'm curious...

> Spent some time after we got things under control talking to customers. Then went home.

What did sama / Fidji say? ;) Turnstile couldn't have been worth that.

andrewinardeer

about 2 months ago

Plenty are capable. Most don't bother.

BrtByte

about 2 months ago

It's not just a PR-friendly summary either... they included real technical detail, timestamps, even code snippets

gucci-on-fleek

about 2 months ago

2 replies

> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.

> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions

Also appreciate the honesty here.

> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]

> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)

Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.

eastdakotaAuthor

about 2 months ago

8 replies

Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot) of machines worldwide to get them to flush their bad files.

tptacek

about 2 months ago

1 reply

Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".

asenchi

about 2 months ago

This document by Dr. Cook remains _the standard_ for systems failure. Thank you for bringing it into the discussion.

tetec1

about 2 months ago

1 reply

Yeah, I can imagine that this insertion was some high-pressure job.

globalise83

about 2 months ago

1 reply

The computer science equivalent of choosing between the red, green and blue wires when disarming a nuke with 15 seconds left on the clock

dylan604

about 2 months ago

Is it though? Or is it, oh, this is such a simple change that we really don't need to test it attitude? I'm not saying this applies to TFA, but some people are so confident that no pressure is felt.

However, you forgot that the lighting conditions are where only red lights from the klaxons are showing so you really can't differentiate the colors of the wires

gucci-on-fleek

about 2 months ago

1 reply

Thanks for the explanation! This definitely reminds me of CrowdStrike outages last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)

cocoa19

about 2 months ago

1 reply

It might remind you of Crowdstrike because of the scale.

Outages are in a large majority of cases caused by change, either deployments of new versions or configuration changes.

harivyom

about 2 months ago

zone your deployments first -blue/green. Have a small blue zone, and test it out. If it works, then expand to green deployments.

A configuration file should not grow! design failure here, I want to understand

dbetteridge

about 2 months ago

1 reply

Question from a casual bystander, why not have a virtual/staging mini node that receives these feature file changes first and catches errors to veto full production push?

Or you do have something like this but the specific db permission change in this context only failed in production

forsalebypwner

about 2 months ago

1 reply

I think the reasoning behind this is because of the nature of the file being pushed - from the post mortem:

"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."

gizmo686

about 2 months ago

In this case, the file fails quickly. A pretest that consists of just attempting to load the file would have caught it. Minutes is more than enough time to perform such a check.

philipwhiuk

about 2 months ago

2 replies

Why was Warp in London disabled temporarily. No mention of that change was discussed in the RCA despite it being called out in an update.

For London customers this made the impact more severe temporarily.

aaronmdjones

about 2 months ago

1 reply

Probably because it was the London team that was actively investigating the incident and initially came to the conclusion that it may be a DDoS while being unable to authenticate to their own systems.

noir_lord

about 2 months ago

Given the time of the outage that makes sense, they’d mostly be within their work day time (if such a thing apples to us anymore).

eastdakotaAuthor

about 2 months ago

We incorrectly thought at the time it was attack traffic coming in via WARP into LHR. In reality it was just that the failures started showing up there first because of how the bad file propagated and where it was working hours in the world.

mindentropy

about 2 months ago

Is there some way to check the sanity of the configuration change, monitor it and then revert back to an earlier working configuration if things don't work out?

prawn

about 2 months ago

Just asking out of curiosity, but roughly how many staff would've been involved in some way in sorting out the issue? Either outside regular hours or redirected from their planned work?

hbbio

about 2 months ago

Thx for the explanation!

Side thought as we're working on 100% onchain systems (for digital assets security, different goals):

Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.

That could have blocked propagation of the oversized file long before it reached the edge :)

chrismorgan

about 2 months ago

2 replies

> much better than their completely false “checking the security of your connection” message

The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):

> example.com needs to review the security of your connection before proceeding.

It bothers me how this bald-faced lie of a wording has persisted.

(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)

eastdakotaAuthor

about 2 months ago

Next time open your dev console in your window and look at how much is going on in the background.

jpadkins

about 2 months ago

checking the checkbox does verify you are human, for the most part.

EvanAnderson

about 2 months ago

4 replies

It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.

The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.

The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.

tptacek

about 2 months ago

3 replies

I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).

eastdakotaAuthor

about 2 months ago

1 reply

That’s correct.

tptacek

about 2 months ago

2 replies

Is it actually consul-template? (I have post-consul-template stress disorder).

threatofrain

about 2 months ago

1 reply

I'd love to hear any commentary on Consul if anyone else has it.

tptacek

about 2 months ago

I think Consul is great, for what it's worth; we were just abusing it.

https://fly.io/blog/a-foolish-consistency/

https://fly.io/blog/corrosion/

mh-

about 2 months ago

Did you know: PCTSD affects more than 2 in 5 engineers.

JB_Dev

about 2 months ago

1 reply

Code and Config should be treated similarly. If you would use a ring based rollout, canaries, etc for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.

tptacek

about 2 months ago

1 reply

You're the nth person on this thread to say that and it doesn't make sense. Events that happen multiple times per second change data that you would call "configuration" in systems like these. This isn't `sendmail.cf`.

If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.

But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.

EvanAnderson

about 2 months ago

1 reply

I'm sorry to belabor this but I'm genuinely not understanding what you're saying in this reply. I haven't operated large scale systems. I'm just an IT generalist and casual coder. I acknowledge I'm too inexperienced to even know what I don't know re: running large systems.

I read the parent poster as broadly suggesting configuration updates should have fitness tests applied and be deployed to minimize the blast radius when an update causes a malfunction. That makes intuitive sense to me. It seems like software should be subject to health checks after configuration updates, even if it's just to stop a deployment before it's widely distributed (let alone rolling-back to last-working configurations, etc).

Am I being thick-headed in thinking defensive strategies like those are a good idea? I'm reading your reply as arguing against those types of strategies. I'm also not understanding what you're suggesting as an alternative.

Again, I'm sorry to belabor this. I've replied once, deleted it, tried writing this a couple more times and given up, and now I'm finally pulling the trigger. It's really eating at me. I feel as though I must be deep down the Dunning-Kruger rabbit hole and really thinking "outside my lane".

tptacek

about 2 months ago

1 reply

The things you do to safeguard the rollout of a configuration file change are not the same as the things you do to reliably propagate changes that might happen many times per second.

What's irritating to me are the claims that there's nothing distinguishing real time control plane state changes and config files. Most of us have an intuition for how they'd do a careful rollout of a config file change. That intuition doesn't hold for control plane state; it's like saying, for instance, that OSPF should have canaries and staged rollouts every time a link state changes.

I'm not saying there aren't things you to do make real-time control plane state propagation safer, or that Cloudflare did all those things (I have no idea, I'm not familiar with their system at all, which is another thing irritating me about this thread --- the confident diagnostics and recommendations). I'm saying that people trying to do the "this is just like CrowdStrike" thing are telling on themselves.

EvanAnderson

about 2 months ago

1 reply

Thanks for the reply.

I took the "this sounds like Crowdstrike" tack for two reasons. The write-up characterized this update as an every five minutes process. The update, being a file of rules, felt analogous in format to the Crowdstrike signature database.

I appreciate the OSPF analogy. I recognize there are portions of these large systems that operate more like a routing protocol (with updates being unpredictable in velocity or time of occurrence). The write-up didn't make this seem like one of those. This seemed a lot more like a traditional daemon process receiving regular configuration updates and crashing on a bad configuration file.

tptacek

about 2 months ago

1 reply

It is possible that any number of things people on this thread have called out are, in fact, the right move for the system Cloudflare built (it's hard to know without knowing more about the system, and my intuition for their system is also faulty because I irrationally hate periodic batch systems like these).

Most of what I'm saying is:

(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].

(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indifferentiable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!

[1] https://howcomplexsystems.fail/

Yokohiii

about 2 months ago

I just want to point out a few things you may overlooked. First, the bot config gets updated every 5 minutes, not in seconds. Second, they have config checks in other places already ("Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input"). They could probably even align everything in CI/CD if they'd run the config verifier where the configs are generated. This is of course all hindsight blind guessing, but you make it sound a bit arcane and impossible to do anything.

EvanAnderson

about 2 months ago

That's why I likened it Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I felt better of and removed.)

Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.

Aeolun

about 2 months ago

I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want, or are incapable of improving?

perlgeek

about 2 months ago

At my employer, we have a small script that automatically checks such generated config files. It does a diff between the old and the new version, and if the diff size exceeds a threshold (either total or relative to the size of the old file), it refuses to do the update, and opens a ticket for a human to look over it.

It has somewhat regularly saved us from disaster in the past.

navigate8310

about 2 months ago

I'm amazed that they are not using any simulator of some sort and pushing changes directly to production.

tristan-morris

about 2 months ago

3 replies

Why call .unwrap() in a function which returns Result<_,_>?

For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.

sayrer

about 2 months ago

2 replies

Yes, can't have .unwrap() in production code (it's ok in tests)

orphea

about 2 months ago

3 replies

Like goto, unwrap is just a tool that has its use cases. No need to make a boogeyman out of it.

gishh

about 2 months ago

1 reply

To be fair, if you’re not “this tall” you really shouldn’t consider using goto in a c program. Most people aren’t that tall.

tucnak

about 2 months ago

1 reply

Nonsense. Linux kernel for one example, uses goto everywhere for error handling.

gishh

about 2 months ago

1 reply

How many codebases besides the Linux kernel had you seen an intentional amount of use of goto?

orphea

about 2 months ago

The answer is one search away on GitHub.

I will give you one example though: various .NET repositories (runtime, aspnetcore, orleans).

metaltyphoon

about 2 months ago

Yes it's meant to be used in test code. If you're sure it can't fail do then use .expect() that way it shows you made a choice and it wasn't just a dev oversight.

fwjafwasd

about 2 months ago

panicans should be using .expect() in production

keyle

about 2 months ago

unwrap itself isn't the problem...

tptacek

about 2 months ago

5 replies

Probably because this case was something more akin to an assert than an error check.

stefan_

about 2 months ago

2 replies

You are saying this would not have happened in a C release build where asserts define to nothing?

Wonder why these old grey beards chose to go with that.

ashishb

about 2 months ago

1 reply

> You are saying this would not have happened in a C release build where asserts define to nothing?

Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.

tristan-morris

about 2 months ago

1 reply

And rust, but they chose to panic on the error condition. Wild.

ashishb

about 2 months ago

1 reply

> And rust, but they chose to panic on the error condition. Wild.

unwrap() implicitly panic-ed, right?

aw1621107

about 2 months ago

I don't think "implicitly panicked" is an accurate description since unwrap()'s entire reason for existing is to panic if you unwrap an error condition. If you use unwrap(), you're explicitly opting into the panicking behavior.

I suppose another way to think about it is that Result<T, E> is somewhat analogous to Java's checked exceptions - you can't get the T out unless you say what to do in the case of the E/checked exception. unwrap() in this context is equivalent to wrapping the checked exception in a RuntimeException and throwing that.

tptacek

about 2 months ago

I am one of those old grey beards (or at least, I got started shipping C code in the 1990s), and I'd leave asserts in prod serverside code given the choice; better that than a totally unpredictable error path.

thundergolfer

about 2 months ago

1 reply

Fly writes a lot of Rust, do you allow `unwrap()` in your production environment? At Modal we only allow `expect("...")` and the message should follow the recommended message style[1].

I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.

1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...

tptacek

about 2 months ago

2 replies

After The Great If-Let Outage Of 2024, we audited all our code for that if-let/rwlock problem, changed a bunch of code, and immediately added a watchdog for deadlocks. The audit had ~no payoff; the watchdog very definitely did.

I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure than an errant `unwrap` wouldn't produce stable failure modes.

But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).

A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.

perching_aix

about 2 months ago

Sounds like if nothing else, additional attention around (their?) use of unwrap() is still warranted from where you're sitting then though, no? I don't think there's anything wrong with flagging that.

It's one thing to not want to be the one to armchair it, but that doesn't mean that one has to suppress their normal and obvious reactions. You're allowed to think things even if they're kitsch, you too are human, and what's kitsch depends and changes. Applies to everyone else here by extension too.

thundergolfer

about 2 months ago

Fair. I agree that saying "it's the unwrap" and calling it a day is wrong. Recently actually we've done an exercise on our Worker which is "assume the worst kind of panic happens. make the Worker be ok with it".

But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).

marcusb

about 2 months ago

Rust has debug asserts for that. Using expect with a comment about why the condition should not/can't ever happen is idiomatic for cases where you never expect an Err.

This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.

tristan-morris

about 2 months ago

Oh absolutely, that's how it would have been treated.

Surely a unwrap_or_default() would have been a much better fit--if fetching features fails, continue processing with an empty set of rules vs stop world.

piperswe

about 2 months ago

In that case, it should probably be expect rather than unwrap, to document why the assertion should never fail.

koakuma-chan

about 2 months ago

2 replies

Why is there a 200 limit on appending names?

nickmonad

about 2 months ago

Limits in systems like these are generally good. They mention the reasoning around it explicitly. It just seems like the handling of that limit is what failed and was missed in review.

zmj

about 2 months ago

Everything has a limit. You can define it, or be surprised when you find out what it is.

otterley

about 2 months ago

6 replies

> work has already begun on how we will harden them against failures like this in the future. In particular we are:

> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input

> Enabling more global kill switches for features

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

> Reviewing failure modes for error conditions across all core proxy modules

Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?

This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.

Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.

nikcub

about 2 months ago

1 reply

They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a since instance first would have seen the panic and stopped the deploy.

I wonder why clickhouse is used to store the feature flags here, as it has it's own duplication footguns[0] which could have also easily lead to a query blowing up 2/3x in size. oltp/sqlite seems more suited, but i'm sure they have their reasons

[0] https://clickhouse.com/docs/guides/developer/deduplication

HumanOstrich

about 2 months ago

1 reply

I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.

Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.

hedora

about 2 months ago

1 reply

I think the idea is to ship the sqlite database around.

It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.

HumanOstrich

about 2 months ago

I think you're oversimplifying the problem they had, and I would encourage you to dive in to the details in the article. There wasn't a problem with the database, it was with the query used to generate the configs. So if an analogous issue arose with a query against one of many ad-hoc replicated sqlite databases, you'd still have the failure.

I love sqlite for some things, but it's not The One True Database Solution.

mewpmewp2

about 2 months ago

1 reply

It seems they had this continous rollout for the config service, but the services consuming this were affected even by small percentage of these config providers being faulty, since they were auto updating every few minutes their configs. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.

otterley

about 2 months ago

1 reply

It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.

tptacek

about 2 months ago

1 reply

This isn't really "configuration" so much as it is "durable state" within the context of this system.

otterley

about 2 months ago

2 replies

In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.

People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.

HumanOstrich

about 2 months ago

2 replies

They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it.

otterley

about 2 months ago

1 reply

A dormant bug in the code is usually a condition precedent to incidents like these. Later, when a bad input is given, the bug then surfaces. The bug could have laid dormant for years or decades, if it ever surfaced at all.

The point here remains: consider every change to involve risk, and architect defensively.

tptacek

about 2 months ago

1 reply

They made the classic distributed systems mistake and actually did something. Never leap to thing-doing!

otterley

about 2 months ago

1 reply

If they're going to yeet configs into production, they ought to at least have plenty of mitigation mechanisms, including canary deployments and fault isolation boundaries. This was my primary point at the root of this thread.

And I hope fly.io has these mechanisms as well :-)

tptacek

about 2 months ago

We've written at long, tedious length about how hard this problem is.

Ekaros

about 2 months ago

Sounds like lack of good testing. Too many items in any input should be a boundary case you will get to eventually.

tptacek

about 2 months ago

1 reply

Reframe this problem: instead of bot rules being propagated, it's the enrollment of a new customer or a service at an existing customer --- something that must happen at Cloudflare several times a second. Does it still make sense to you to think about that in terms of "pushing new configuration to prod"?

otterley

about 2 months ago

1 reply

Those aren't the facts before us. Also, CRUD operations relating to a specific customer or user tend not to cause the sort of widespread incidents we saw today.

tptacek

about 2 months ago

They're not, they're a response to your claim that "state" and "configuration" are indifferentiable.

Buttons840

about 2 months ago

When a failsafe system fails, it fails by failing to fail safely.

Yokohiii

about 2 months ago

I think global kill switches are just an last resort machanism, to bypass identified faulty subsystems. Even if there is a risk with it, in this instance the risk was zero, because CF was dead already. This wont change the blast radius, but it's duration and proliferation.

In reference to fault isolation boundaries: I am not familiar with their CI/CD, in theory the error could have been caught/prevented there, but that comes with a lot of depends or it's tricky. But it looks like they didn't go the extra mile to care about safety sensitive areas. So euphemistic speaking, they are now recalibrating balance of safety measures.

ants_everywhere

about 2 months ago

it's always a config push. people rollout code slowly but don't have the same mechanisms for configs. But configs are code, and this is a blind spot that causes an outsized percentage of these big outages.

Scaevolus

about 2 months ago

Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.

In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.

Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.

lukan

about 2 months ago

2 replies

"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."

Unfortunately they do not share, what caused the status page to went down as well. (Does this happen often? Otherwise a big coincidence it seems)

Aeolun

about 2 months ago

1 reply

I mean, that would require a postmortem from statuspage.io right? Is that a service operated by cloudflare?

edoceo

about 2 months ago

Atlasaian

eastdakotaAuthor

about 2 months ago

1 reply

We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.

dnw

about 2 months ago

Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.

756 more comments available on Hacker News

View full discussion on Hacker News

ID: 45973709Type: storyLast synced: 11/27/2025, 3:36:12 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN