AWS Outage Shows Internet Users 'at Mercy' of Too Few Providers, Experts Say
Posted 2 months ago · Active 2 months ago
theguardian.com · Tech · story · High profile
heated · negative · Debate · 80/100
Key topics
Cloud Computing
AWS Outage
Vendor Lock-in
The recent AWS outage highlights the risks of relying on a few large cloud providers, sparking debate about vendor lock-in and the need for redundancy and diversification.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 22m after posting
Peak period: 140 comments in 0-12h
Avg / period: 22.9
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
1. Story posted: Oct 20, 2025 at 1:32 PM EDT (2 months ago)
2. First comment: Oct 20, 2025 at 1:54 PM EDT (22m after posting)
3. Peak activity: 140 comments in 0-12h (hottest window of the conversation)
4. Latest activity: Oct 27, 2025 at 4:06 AM EDT (2 months ago)
ID: 45646649 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."
No. No, it's not. But tech enthusiasts on HN and Reddit love it.
(Another 30% runs through Cloudflare.)
But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.
And with that, the likes of ECS, Dynamo, Redshift, etc. are a somewhat reasonable answer. It's much easier to offer a distinct proposition around that state of affairs than, say, a market based solely on EC2-esque VMs.
What I did not like, but absolutely expected, was this lurch towards near enough standardising on one specific vendor's model. We're in quite a strange place atm, where AWS-specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.
Felt like this all happened both at the speed of light, and in slow motion, at the same time.
Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.
Exactly.
And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.
It's a very common refrain. That's why I believe Docker is strongly linked to the development of these proprietary, cloud-based models of computing, which place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.
The existence of S3 is one good result of this. IAM, on the other hand, can die in dumpster fire. Though it won't...
An easy API? Easy replication / failover / backups? I would absolutely use S3 even with EC2.
> IAM, on the other hand, can die in dumpster fire.
I’m no great fan of AWS’s approach to IAM, but much of the pain is just the nature of fine-grained / least-privilege permissioning. On EC2 it’s more common to just grant broader permissions; IAM makes you think about least privilege, but you absolutely can grant admin for everything. And as far as a permissioning API goes, IAM is much cleaner/saner than Linux permissions.
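For what it's worth, a minimal sketch of what that least-privilege shape looks like in practice, assuming boto3; the bucket, prefix, and policy names are placeholders, not anything from the thread:

```python
import json

import boto3

# Hypothetical least-privilege policy: read-only access to one prefix of one
# bucket. Bucket name, prefix, and policy name are placeholders.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/reports/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-app-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["reports/*"]}},
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="example-reports-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```

The same document could of course just be `"Action": "*"` on `"Resource": "*"`, which is the broad-grant habit described above.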
It's also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.
This does lead to laziness by programmers, accelerated by myopic management. "It works", except when it doesn't. It's easier to say you just need to restart the container than to figure out the actual issue.
But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.
I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.
Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.
ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?
Like I truly don't understand your argument here.
Even if that is not a problem, you avoid having to install the kitchen sink on your host and make sure everything is configured properly. Just get it working in a container, build an image and spin it up when you need it. Leaves the host machine fairly clean.
You can run a bunch of services as containers within a single host. No cloud or k8s needed. docker-compose is sufficient for testing or smallish projects.
Also, there is a security benefit: if the container is compromised, the problem is limited to that container, not the entire host.
I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have budget maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.
Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?
What is the realistic antidote here?
To fix it, test your failback procedures. For everything else, there's nothing to fix, it's working by design.
My CI was down for 2 hours this morning, despite not even being on AWS. We have a set of credentials on that host that we call assumeRole with and push to an S3 bucket, which has a lambda that duplicates to buckets in other regions. All our IAM calls were failing due to this outage, and we have 0 items deployed in us-east-1 (we're european)
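One commonly cited contributor to that failure mode is the legacy global STS endpoint (sts.amazonaws.com), which is served out of us-east-1; whether that was the culprit here is only an assumption, but a minimal boto3 sketch of pinning AssumeRole to a regional endpoint looks something like this (role ARN, region, and session name are placeholders):

```python
import boto3

# Sketch only: call AssumeRole against a regional STS endpoint instead of the
# legacy global endpoint, so the call does not depend on us-east-1.
# Role ARN, region, and session name are placeholders.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-ci-deploy",
    RoleSessionName="example-ci",
)["Credentials"]

print(creds["AccessKeyId"])  # temporary credentials scoped to the role
```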
One thing that AWS should do is provide an easier way to detect these hidden dependencies. You can do that with CloudTrail if you know how to do it (filter operations by region and check that none are in us-east-1), but a more explicit service would be nice.
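A rough sketch of that CloudTrail check, assuming boto3; the one-day window and the premise that you run nothing in us-east-1 are assumptions:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

# If you believe you run nothing in us-east-1, any recent API activity that
# CloudTrail records there points at a hidden dependency worth investigating.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

calls = Counter()
for page in cloudtrail.get_paginator("lookup_events").paginate(
    StartTime=start, EndTime=end
):
    for event in page["Events"]:
        calls[(event.get("EventSource"), event.get("EventName"))] += 1

# Most frequent API calls recorded in us-east-1 over the window.
for (source, name), count in calls.most_common(20):
    print(f"{count:6d}  {source}  {name}")
```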
The problem was we couldn’t log into cloud trail, or the console at all, to identify that, because IAM identity center is single region. This was a decision recommended by AWS, and blessed by our (army of) SRE teams.
The hindsight is 20/20, of course, it's a good practice to audit CloudTrail periodically for unexpected regional dependencies.
(1) offer void for services that run on AWS.
Either way, we would have only made it one step farther in our CI, as the next step is to build a container with a base image from Docker Hub, and that was down too. The idea of running a multi-region Nexus repository to avoid Docker Hub outages for my 14-person engineering team seems slightly overkill!
It's actually an interesting exercise to enumerate _all_ the external dependencies. But yeah, avoiding them all seems to be less than helpful for the vast majority of users.
For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.
I don't think there is a "most" organizations. Either they're looking for big cloud or they're not, and least-cost is usually the last consideration when looking at any cloud, because you're trying to pay a premium to get particular advantages.
The realistic antidote? Move to a less-shitty region. Or architect your systems to be failure-resilient.
(Most people seem to think the entire region was offline? That's wrong. It was just particular services which wouldn't process control plane requests, and then a failure cascade caused more problems. But things that were already started running before, stayed running. A region is multiple datacenters. Even AZs are often multiple datacenters. It's virtually impossible for a whole region to stop working.)
And colo and datacenters aren't immune to going down
Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?
The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...
What happened?
If there's an issue with relying only on AWS it has not been expressed in this outage.
It is quite bizarre that such paltry amounts of data and such tiny scale seem to pose challenging problems when done in the cloud.
Btw, I'd love to have a link to where I could buy an SD card the size of a pinky nail that holds terabytes of data.
If you got resiliency and uptime for an extra hundred dollars a year, that would be a no-brainer for any commercial operation. The byzantine, kafkaesque horror of the cloud results in trivial problems and costs ballooning into nearly insurmountable and cost-ineffective obstacles.
These are not hard or costly problems or difficult scales. They have been made hard and costly and difficult.
I work with buckets where single files are >1 terabyte. There's more than one of these files, hence terabytes. I'm not going to do a human-readable summary listing of an entire bucket to get the full size. The point of the actual size is irrelevant. When people are spending 5-6 digits on cloud storage per month, they are not going to do it in multiple places. Period. Maybe the new storage unit should just be monthly cloud spend, but then your pedantry will say nonsense like which cloud server, which storage solution type, blah blah blah.
[1] https://shop.sandisk.com/products/memory-cards/microsd-cards...
Should we similarly cap, say, front-end frameworks on market penetration / growth? Is React too big to fail? Do we need to force some of its users to use something else?
This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.
Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have a redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things aren't as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.
Simplicity is linked to uptime, and having a single cloud solution is a simpler solution.
For large companies, it's mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.
Besides that no-one ever got fired for picking AWS ;)
You have to build it in. That takes time money and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?
The unfortunate reality is this planning happens many times too late.
Multi-cloud redundancy is like Java being a solution to platform independence.
Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.
If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.
But on the other hand, maybe I hang around too many tech people to not empathically understand the other point of view.
Also there are some outages that affect real life like airlines, but tech news overstates some like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.
If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.
The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.
If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.
This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.
Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they understood it, they would have done more ahead of time to mitigate that much risk.
(today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it)
Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open them up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it was a lack of understanding the risk of not investing in security.
(there's tons of examples of these in the news. presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! almost always, the people involved in making these products don't know enough about security to understand the risks. but they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc)
You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.
Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.
The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.
Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, CloudFlare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.
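On the false alarm: a billing alarm can be told how to treat missing data so that "no data yet" does not read as "over budget". A minimal sketch assuming boto3, with placeholder alarm name, threshold, and SNS topic (billing metrics live only in us-east-1 and require billing alerts to be enabled on the account):

```python
import boto3

# Sketch of a billing alarm that does not fire just because data is missing,
# which is the false-alarm mode described above. Alarm name, threshold, and
# SNS topic ARN are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="example-monthly-billing-limit",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,               # billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=1000.0,                 # placeholder monthly limit in USD
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # missing data is not treated as over budget
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-billing-alerts"],
)
```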
(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
Companies are using the higher-level "PaaS" suite of services from AWS, such as DynamoDB, Redshift, etc., and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. It's the same "lock-in" situation with the higher-level services from MS Azure and Google Cloud.
For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.
Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.
> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.
So, kinda? Some global services depend on us-east-1...
> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.
Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.
Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.
Which was effectively the only region
Not that I disagree with you, but maybe not for the reasons you say (:
As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (i.e. things like spot instance pools).
Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.
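A small sketch of the mitigation implied here: make the region an explicit, configurable choice in your own code rather than whatever the tooling falls back to. Assuming boto3; the environment variable name and fallback region are placeholders:

```python
import os

import boto3

# Sketch only: pin the region explicitly so nothing silently inherits a
# tool default. The env var name and fallback region are placeholders.
REGION = os.environ.get("APP_AWS_REGION", "eu-west-1")

session = boto3.session.Session(region_name=REGION)
s3 = session.client("s3")
dynamodb = session.resource("dynamodb")

print(f"All clients in this process are pinned to {REGION}")
```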
Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...
Wish I'd already had this link in my back pocket. Our industry needs to take its job, as a whole, much more seriously.
Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.
https://mastodon.social/@kjhealy/115407725852594322
You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.
First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers but would rent a rack in a data centre, and certain things - like network administration - were entirely outsourced to the data centre / hosting company.
You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.
There's also still virtual servers. Which is basically a VM running on a server that hosts multiple clients.
All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.
When that company went under during the dot-com crash, I started my first business shortly after. It was 2003, I was 21 years old, and this business allowed me to work from home and feed my family until I re-entered the job market in 2018. For 15 years, I was a one-person organization, and because my business operated "free" adult-entertainment websites, bandwidth was my most significant expense. For that reason, even when Cloud became a thing (which it wasn't in 2003), I never migrated, because of the bandwidth costs alone. Cloudflare was a major game changer, but even it didn't exist when I first started out. There were CDNs like Akamai, but they were crazy expensive and out of my league. So at its peak, I had about 12 bare metal servers around the world (all rented from the same hosting company - originally called Server Matrix, it then became SoftLayer, was bought by IBM, went to shit, and is now IBM Cloud). I admin'd those on top of writing and maintaining all of the code and running the business independently, with occasional help from my wife.
I am obviously very technically competent. I'm a Principal Software Engineer today. But technically sophisticated? There wasn't much sophistication about it. I did bare metal servers because it was the only cost-effective way to run my business. It was attainable and it worked. And it worked in a way that Cloud couldn't when Cloud came on the scene - so I never went Cloud with that operation just due to cost alone.
Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.
100% of our failures with this machine over 2 years have been "carpet cleaners unplugged the machine". Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.
But apparently that’s too hard for the average.
In a surprise to literally no one that happening on the last friday before xmas break got my "We need to secure the main comms cabinet" (which had the backup server and main ingress for WAN and was in a separate building on other side of site) item that I'd been asking about for months to the top of the list.
Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
Meanwhile, everyone that spends actual time in these areas:
- Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.
- Understands that the cost of actually accounting for this kind of scenario is incredibly high relative to the benefit in most cases
- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and every other 'serious' issue such as 'I can't log in to Fortnite' just shows what the price and effort of actually making that work is versus how much it costs affected companies when it happens
- Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy: zero, until the one day it happens, and then it's old news
- Is just curious as to just what exactly happened from a technical perspective
This isn't to say that a good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual follow-up? All noise, no signal.
I couldn't even name another provider except maybe Hetzner
You could also argue that YT is on GCP (to some level) and that would probably bump that number up much higher.
The vast majority of people hosting things on the internet are on these providers. But you get downvoted for pointing that out now.
It’s not a monopoly but it’s close.
The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.
For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).
NO. From their own reports, AWS is clearly too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay with this centralized architecture? Cloud services need much higher standards than the average corporation. Just look at how they took down 2000+ services for many hours.
[1] https://health.aws.amazon.com/health/status
We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.
Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.
And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.
> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19
Dr. Cath-Speth has a PhD in cultural anthropology
> Cori Crider, the executive director of the Future of Technology Institute
A lawyer
> Madeline Carr, professor of global politics and cybersecurity at University College London
A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations
So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.
(which is not true in reality if you have ordinary customers).
https://health.aws.amazon.com/health/status?path=service-his...
If not, I look forward to the next single-point-of-failure outage. And the next. And the next.
There is no panacea. The reason many people use these is because it’s easy and hard to find people that know other clouds and their quirks.
55 more comments available on Hacker News