AWS Outage Shows Internet Users 'at Mercy' of Too Few Providers, Experts Say
Posted 2 months ago · Active 2 months ago
theguardian.com · Tech · story · High profile
heated · negative · Debate · 80/100
Key topics
Cloud Computing
AWS Outage
Vendor Lock-in
The recent AWS outage highlights the risks of relying on a few large cloud providers, sparking debate about vendor lock-in and the need for redundancy and diversification.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 22m after posting
Peak period: 140 comments in 0-12h
Avg / period: 22.9
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
1. Story posted: Oct 20, 2025 at 1:32 PM EDT (2 months ago)
2. First comment: Oct 20, 2025 at 1:54 PM EDT (22m after posting)
3. Peak activity: 140 comments in 0-12h (hottest window of the conversation)
4. Latest activity: Oct 27, 2025 at 4:06 AM EDT (2 months ago)
ID: 45646649 · Type: story · Last synced: 11/20/2025, 8:32:40 PM
"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."
No. No, it's not. But tech enthusiasts on HN and Reddit love it.
(Another 30% runs through Cloudflare.)
But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.
And with that, the likes of ECS, Dynamo, Redshift, etc. are a somewhat reasonable answer. It's much easier to offer a distinct proposition around that state of affairs than, say, a market based solely on EC2-esque VMs.
What I did not like, but absolutely expected, was this lurch towards near enough standardising on one specific vendor's model. We're in quite a strange place atm, where AWS-specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.
Felt like this all happened both at the speed of light, and in slow motion, at the same time.
Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.
Exactly.
And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.
It's a very common refrain. That's why I believe Docker is strongly linked to the development of these proprietary, cloud-based models of computing, which place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.
The existence of S3 is one good result of this. IAM, on the other hand, can die in dumpster fire. Though it won't...
An easy API? Easy replication / failover / backups? I would absolutely use S3 even with EC2.
> IAM, on the other hand, can die in dumpster fire.
I’m no great fan of AWS’s approach to IAM, but much of the pain is just the nature of fine-grained / least-privilege permissioning. On EC2 it’s more common to just grant broader permissions; IAM makes you think about least privilege, but you absolutely can grant admin for everything. And as far as a permissioning API goes, IAM is much cleaner/saner than Linux permissions.
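For what it's worth, a minimal sketch of what that least-privilege shape looks like in practice, assuming boto3; the bucket, prefix, and policy names are placeholders, not anything from the thread:

```python
import json

import boto3

# Hypothetical least-privilege policy: read-only access to one prefix of one
# bucket. Bucket name, prefix, and policy name are placeholders.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/reports/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-app-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["reports/*"]}},
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="example-reports-read-only",
    PolicyDocument=json.dumps(policy_document),
)
```

The same document could of course just be `"Action": "*"` on `"Resource": "*"`, which is the broad-grant habit described above.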
It's also more resilient because I can trash a container and load up a new one with low overhead. I can't really do that with a full machine. It also gives some more security by sandboxing.
This does lead to laziness by programmers, accelerated by myopic management. "It works", except when it doesn't. It's easier to say you just need to restart the container than to figure out the actual issue.
But I'm not sure what that has to do with cloud. You'd do the same thing self hosting. Probably save money too. Though I'm frequently confused why people don't do both. Self host and host in the cloud. That's how you create resilience. Though you also need to fix problems rather than restart to be resilient too.
I feel like our industry wants to move fast but without direction. It's like we know velocity matters but since it's easier to read the speedometer we pretend they're the same thing. So fast and slow makes sense. Fast by magnitude of the vector. Slow if you're measuring how fast we make progress in the intended direction.
Before Docker you had things like Heroku and Amazon Elastic Beanstalk with a much greater degree of lock in than Docker.
ECS and its analogues on the other cloud providers have very little lock in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2s with no other services you'd still have to figure out how to move your data somewhere else?
Like I truly don't understand your argument here.
Even if that is not a problem, you avoid having to install the kitchen sink on your host and make sure everything is configured properly. Just get it working in a container, build an image and spin it up when you need it. Leaves the host machine fairly clean.
You can run a bunch of services as containers within a single host. No cloud or k8s needed. docker-compose is sufficient for testing or smallish projects.
Also, there is a security benefit: if the container is compromised, the problem is limited to that container, not the entire host.
I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have budget maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.
Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?
What is the realistic antidote here?
To fix it, test your failback procedures. For everything else, there's nothing to fix, it's working by design.
My CI was down for 2 hours this morning, despite not even being on AWS. We have a set of credentials on that host that we call assumeRole with and push to an S3 bucket, which has a lambda that duplicates to buckets in other regions. All our IAM calls were failing due to this outage, and we have 0 items deployed in us-east-1 (we're european)
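One commonly cited contributor to that failure mode is the legacy global STS endpoint (sts.amazonaws.com), which is served out of us-east-1; whether that was the culprit here is only an assumption, but a minimal boto3 sketch of pinning AssumeRole to a regional endpoint looks something like this (role ARN, region, and session name are placeholders):

```python
import boto3

# Sketch only: call AssumeRole against a regional STS endpoint instead of the
# legacy global endpoint, so the call does not depend on us-east-1.
# Role ARN, region, and session name are placeholders.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-ci-deploy",
    RoleSessionName="example-ci",
)["Credentials"]

print(creds["AccessKeyId"])  # temporary credentials scoped to the role
```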
One thing that AWS should do is provide an easier way to detect these hidden dependencies. You can do that with CloudTrail if you know how to do it (filter operations by region and check that none are in us-east-1), but a more explicit service would be nice.
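A rough sketch of that CloudTrail check, assuming boto3; the one-day window and the premise that you run nothing in us-east-1 are assumptions:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

# If you believe you run nothing in us-east-1, any recent API activity that
# CloudTrail records there points at a hidden dependency worth investigating.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

calls = Counter()
for page in cloudtrail.get_paginator("lookup_events").paginate(
    StartTime=start, EndTime=end
):
    for event in page["Events"]:
        calls[(event.get("EventSource"), event.get("EventName"))] += 1

# Most frequent API calls recorded in us-east-1 over the window.
for (source, name), count in calls.most_common(20):
    print(f"{count:6d}  {source}  {name}")
```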
The problem was we couldn’t log into cloud trail, or the console at all, to identify that, because IAM identity center is single region. This was a decision recommended by AWS, and blessed by our (army of) SRE teams.
The hindsight is 20/20, of course, it's a good practice to audit CloudTrail periodically for unexpected regional dependencies.
(1) offer void for services that run on AWS.
Either way, we would have only made it one step farther in our CI, as the next step is to build a container with a base image from Docker Hub, and that was down too. The idea of running a multi-region Nexus repository to avoid Docker Hub outages for my 14-person engineering team seems slightly overkill!
It's actually an interesting exercise to enumerate _all_ the external dependencies. But yeah, avoiding them all seems to be less than helpful for the vast majority of users.
For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.
I don't think there is a "most" organizations. Either they're looking for big cloud or they're not, and least-cost is usually the last consideration when looking at any cloud, because you're trying to pay a premium to get particular advantages.
The realistic antidote? Move to a less-shitty region. Or architect your systems to be failure-resilient.
(Most people seem to think the entire region was offline? That's wrong. It was just particular services which wouldn't process control plane requests, and then a failure cascade caused more problems. But things that were already started running before, stayed running. A region is multiple datacenters. Even AZs are often multiple datacenters. It's virtually impossible for a whole region to stop working.)
And colo and datacenters aren't immune to going down
Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?
The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...
What happened?
If there's an issue with relying only on AWS it has not been expressed in this outage.
It is quite bizarre that such paltry amounts of data and such tiny scale seem to pose challenging problems when done in the cloud.
Btw, I'd love to have a link to where I could buy an SD card the size of a pinky nail that holds terabytes of data.
If you got resiliency and uptime for an extra hundred dollars a year, that would be a no-brainer for any commercial operation. The byzantine, kafkaesque horror of the cloud results in trivial problems and costs ballooning into nearly insurmountable and cost-ineffective obstacles.
These are not hard or costly problems or difficult scales. They have been made hard and costly and difficult.
I work with buckets where single files are >1 terabyte. There's more than one of these files, hence terabytes. I'm not going to do a human-readable summary listing of an entire bucket to get the full size. The point of the actual size is irrelevant. When people are spending 5-6 digits on cloud storage per month, they are not going to do it in multiple places. Period. Maybe the new storage unit should just be monthly cloud spend, but then your pedantry will say nonsense like which cloud server, which storage solution type, blah blah blah.
[1] https://shop.sandisk.com/products/memory-cards/microsd-cards...
Should we similarly cap, say, front-end frameworks on market penetration / growth? Is React too big to fail? Do we need to force some of its users to use something else?
This is the situation for my company that started with the intent of being platform agnostic, but it quickly became much less complex as all of the potential client pool was using the same cloud. People with buckets with large amounts of data are not going to be able to convince the bean counters that it would be worth it to have that storage bill from multiple vendors.
Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to have a redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things aren't as mission critical as we make them out to be. The press coverage of an AWS outage makes it so easy to shrug it off and point fingers.
Simplicity is linked to uptime, and having a single cloud solution is a simpler solution.
For large companies, it's mostly cost savings. Easier to negotiate a good discount at N million versus N/2 million.
Besides that no-one ever got fired for picking AWS ;)
You have to build it in. That takes time money and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?
The unfortunate reality is this planning happens many times too late.
Multi-cloud redundancy is like Java being a solution to platform independence.
Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.
If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.
But on the other hand, maybe I hang around too many tech people to not empathically understand the other point of view.
Also there are some outages that affect real life like airlines, but tech news overstates some like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.
If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.
The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.
If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.
This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.
Example 1: A company's infra goes down, but it doesn't come back up correctly. People run around trying to get it working again. It takes much longer than they hoped/expected, and they lose a lot more money than they expected. This is because they never really understood the risk they were exposed to. If they understood it, they would have done more ahead of time to mitigate that much risk.
(today's outage is this case. A lot of companies are going to lose money after today, because their customers are not happy with these "acceptable risks". Presumably, losing this much money due to one outage will not be an acceptable risk in hindsight. So the company either didn't understand its risk, or it did but was too stupid to prevent it)
Example 2: A company gets hacked, and its data is either exposed or wiped. This is a much worse result; they can lose tons of money, chase off customers, damage their brand, open them up to lawsuits and fines, even tank the whole company. It's clear that this risk is pretty unacceptable. But it keeps happening. And the reason usually isn't "some genius hacker"; it was a lack of understanding the risk of not investing in security.
(there's tons of examples of these in the news. presumably, not investing in security was not an acceptable risk in hindsight when it ended their business! almost always, the people involved in making these products don't know enough about security to understand the risks. but they also don't invest in security training, mandatory security controls, checklists, processes, quality gates, etc)
You don't need multi-region or multi-cloud to mitigate reliability risks. Just like you don't need to hire a big security team or invest tons of cash to mitigate security risks. You can use your existing infra and tools, and mitigate both issues. You just have to use them wisely. It takes some effort and time, but you do it once and it pays dividends indefinitely.
Building something without identifying its security/reliability risks, and then not calculating those risks' impact, is not acceptable risk; it's ignored risk. Is tanking your company and shedding customers an acceptable risk? Well, there's one way to find out.
The outage wasn't like "all our servers stopped running". It was dynamic, new, specific operations that failed. If you just had a Fargate container that was started a week ago, and you have no need to restart the container today, it just kept chugging along.
Our architecture is stuff that just keeps chugging along. Fargate, S3, RDS, CloudFront, CloudFlare, etc. From our perspective, there was no outage in us-east-1. Literally the only alert we got the entire time was "billing limit exceeded" - and that was a false alarm, because it was set to alarm if there is zero billing data.
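On the false alarm: a billing alarm can be told how to treat missing data so that "no data yet" does not read as "over budget". A minimal sketch assuming boto3, with placeholder alarm name, threshold, and SNS topic (billing metrics live only in us-east-1 and require billing alerts to be enabled on the account):

```python
import boto3

# Sketch of a billing alarm that does not fire just because data is missing,
# which is the false-alarm mode described above. Alarm name, threshold, and
# SNS topic ARN are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="example-monthly-billing-limit",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=6 * 60 * 60,               # billing data updates a few times a day
    EvaluationPeriods=1,
    Threshold=1000.0,                 # placeholder monthly limit in USD
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # missing data is not treated as over budget
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-billing-alerts"],
)
```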
(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
Companies are using the higher-level "PaaS" suite of services from AWS, such as DynamoDB, Redshift, etc., and not just the lower-level "IaaS" such as basic EC2 instances or pure containers. It's the same "lock-in" situation with the higher-level services from MS Azure and Google Cloud.
For those dependent on high-level services, migrating to a VPS like Hetzner or self-hosting is not possible unless they re-invent the AWS stack by installing/babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL db instance on a VPS.
Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.
> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.
So, kinda? Some global services depend on us-east-1...
> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.
Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.
Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.
Which was effectively the only region
Not that I disagree with you, but maybe not for the reasons you say (:
As someone who used to work on the inside, us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (i.e. things like spot instance pools).
Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.
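A small sketch of the mitigation implied here: make the region an explicit, configurable choice in your own code rather than whatever the tooling falls back to. Assuming boto3; the environment variable name and fallback region are placeholders:

```python
import os

import boto3

# Sketch only: pin the region explicitly so nothing silently inherits a
# tool default. The env var name and fallback region are placeholders.
REGION = os.environ.get("APP_AWS_REGION", "eu-west-1")

session = boto3.session.Session(region_name=REGION)
s3 = session.client("s3")
dynamodb = session.resource("dynamodb")

print(f"All clients in this process are pinned to {REGION}")
```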
Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...
Wish I'd already had this link in my back pocket. Our industry needs to take its job, as a whole, much more seriously.
Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.
https://mastodon.social/@kjhealy/115407725852594322
You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.
First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers but would rent a rack in a data centre, and certain things - like network administration - were entirely outsourced to the data centre / hosting company.
You could also rent managed bare metal servers (you still can). This means that you can pretty much outsource your entire IT department, but you're still not doing cloud services. Meaning you've got bare metal servers, someone you're paying at the hosting company is handling security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable either.
There's also still virtual servers. Which is basically a VM running on a server that hosts multiple clients.
All of this is to say that the alternative is not "cloud" or "box in a closet." The alternative is "cloud" and a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed v un-managed (outsource IT vs admin your own) and the list goes on and on.
When that company went under during the dot-com crash, I started my first business shortly after. It was 2003, I was 21 years old, and this business allowed me to work from home and feed my family until I re-entered the job market in 2018. For 15 years, I was a one-person organization, and because my business operated "free" adult-entertainment websites, bandwidth was my most significant expense. For that reason, even when Cloud became a thing (which it wasn't in 2003), I never migrated, because of the bandwidth costs alone. Cloudflare was a major game changer, but even it didn't exist when I first started out. There were CDNs like Akamai, but they were crazy expensive and out of my league. So at its peak, I had about 12 bare metal servers around the world (all rented from the same hosting company - originally called Server Matrix, it then became SoftLayer, was bought by IBM, went to shit, and is now IBM Cloud). I admin'd those on top of writing and maintaining all of the code and running the business independently, with occasional help from my wife.
I am obviously very technically competent. I'm a Principal Software Engineer today. But technically sophisticated? There wasn't much sophistication about it. I did bare metal servers because it was the only cost-effective way to run my business. It was attainable and it worked. And it worked in a way that Cloud couldn't when Cloud came on the scene - so I never went Cloud with that operation just due to cost alone.
Not saying as an industry we shouldn't diversify a little but it doesn't fundamentally change the relationship each company has to their hosting provider.
100% of our failures with this machine over 2 years have been "carpet cleaners unplugged the machine". Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.
But apparently that’s too hard for the average.
In a surprise to literally no one that happening on the last friday before xmas break got my "We need to secure the main comms cabinet" (which had the backup server and main ingress for WAN and was in a separate building on other side of site) item that I'd been asking about for months to the top of the list.
Still one of my favourite "outages" because I got to my desk, turned PC on, no network, walked across the landing into the main office, opened comms cabinet, plugged it back in and was "resolved" before the MD got to my desk.
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
Meanwhile, everyone that spends actual time in these areas:
- Knows that running an operation at AWS scale is difficult and any armchair criticism from 'experts' is exactly that. Actions speak louder than words.
- Understands that the cost of actually accounting for this kind of scenario is incredibly high relative to the benefit in most cases
- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and every other 'serious' issue such as 'I can't log in to Fortnite' just shows what the price and effort of actually making that work is versus how much it costs affected companies when it happens
- Knows how much time national newspapers spend actually talking about the importance of multi-region/multi-cloud redundancy: zero, until the one day it happens, and then it's old news
- Is just curious as to just what exactly happened from a technical perspective
This isn't to say that a good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual follow-up? All noise, no signal.
I couldn't even name another provider except maybe Hetzner
You could also argue that YT is on GCP (to some level) and that would probably bump that number up much higher.
The vast majority of people hosting things on the internet are on these providers. But you get downvoted for pointing that out now.
It’s not a monopoly but it’s close.
The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.
For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).
NO. From their own reports, AWS is clearly too centralized and dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay with this centralized architecture? Cloud services need much higher standards than the average corporation. Just look at how they took down 2000+ services for many hours.
[1] https://health.aws.amazon.com/health/status
We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.
Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.
And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.
> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19
Dr. Cath-Speth has a PhD in cultural anthropology
> Cori Crider, the executive director of the Future of Technology Institute
A lawyer
> Madeline Carr, professor of global politics and cybersecurity at University College London
A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations
So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.
(which is not true in reality if you have ordinary customers).
https://health.aws.amazon.com/health/status?path=service-his...
If not, I look forward to the next single-point-of-failure outage. And the next. And the next.
There is no panacea. The reason many people use these is because it’s easy and hard to find people that know other clouds and their quirks.
55 more comments available on Hacker News