AWS Multiple Services Outage in us-east-1
Key topics
AWS experienced a widespread outage in the us-east-1 region, affecting multiple services and causing issues for many companies and users. The incident sparked discussion about cloud reliability and the risks of relying on a single region.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 847s before posting. Peak period: 149 comments on Day 1. Average per period: 40 comments. Based on 160 loaded comments.
Key moments
- Story posted: Oct 20, 2025 at 3:22 AM EDT (3 months ago)
- First comment: Oct 20, 2025 at 3:08 AM EDT (847s before posting)
- Peak activity: 149 comments in Day 1, the hottest window of the conversation
- Latest activity: Oct 29, 2025 at 3:23 PM EDT (2 months ago)
Resolves to nothing.
Alternatively, perhaps their DNS service stopped responding to queries or even removed itself from BGP. It's possible for us mere mortals to tell which of these is the case.
Appears to have happened within the last 10-15 minutes.
We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.
As a degraded-state fallback, email is what we're using now (our clients are configured to encrypt with PGP by default; we use it for all internal email and also when the customer has PGP, so everyone knows how to use it).
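For anyone wiring up a similar fallback, a minimal sketch of encrypting a message with GnuPG before mailing it; the recipient address and filename are placeholders, and it assumes the recipient's public key is already in your keyring:

# Encrypt a status update for a colleague; address and filename are placeholders
gpg --encrypt --armor --recipient colleague@example.com status-update.txt
# Produces status-update.txt.asc, which can be pasted into or attached to a plain email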
If you seriously have no external low-dependency fallback, please at least document this fact now for the Big Postmortem.
Including fabricating new RAM?
My experience with self-hosting has been that, at least when you keep the services independent, downtime is no more common than in hosted environments, and you always know what's going on. Customising solutions, or working around trouble, is a benefit you don't get when the service provider is significantly bigger than you are. It has pros and cons and also depends on the product (e.g. email delivery is harder than Mattermost message delivery, or you may need a certain service only once a year or so), but if you have the personnel capacity and a continuous need, I find hosting things oneself to be the best solution in general.
Unwarranted tip: next time, if you use macOS, just open the terminal and run `caffeinate -imdsu`.
I assume Linux/Windows have something similar built-in (and if not built-in, something that's easily available). For Windows, I know that PowerToys suite of nifty tools (officially provided by Microsoft) has Awake util, but that's just one of many similar options.
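For what it's worth, a rough sketch of equivalents; the Linux line assumes a systemd-based distro, and Windows users can reach for the PowerToys Awake utility mentioned above:

# macOS: prevent idle, display, disk and system sleep while this runs
caffeinate -imdsu
# Linux (systemd): hold a sleep/idle inhibitor until the wrapped command exits
systemd-inhibit --what=sleep:idle --why="long-running job" sleep infinity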
Except Google Spanner, I’m told, but AWS doesn’t have an answer for that yet AFAIK.
The cost of re-designing and re-implementing applications to synchronize data shipping to remote regions and only spinning up remote region resources as needed is even larger for these organizations.
And this is how we end up with these massive cloud footprints not much different than running fleets of VM’s. Just about the most expensive way to use the cloud hyperscalers.
Most non-tech-industry organizations cannot face the brutal reality that properly, really leveraging hyperscalers involves a period of time, often counted in decades for Fortune-scale footprints, during which they spend 3-5 times more on selected areas than peers doing those areas the old way, in order to migrate to mostly spot-instance-resident, scale-to-zero, elastic, containerized services with excellent developer and operational troubleshooting ergonomics.
Admitting to that here?
In civilised jurisdictions that should be criminal.
Using cryptography to avoid accountability is wrong. Drug dealing and sex work, OK, but in other businesses? Sounds very crooked to me
When Slack was down we used... google... google mail? chat. When you go to gmail there is actually a chat app on the left.
I.e. some bottle-necks in new code appearing only _after_ you've deployed there, which is of course too late.
It didn't help that some services had their deploy trains (pipelines in amazon lingo) of ~3 weeks, with us-east-1 being the last one.
I bet the situation hasn't changed much since.
oof, so you're saying this outage could be caused by a change merged 3 weeks ago?
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
> Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.
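If you want to check resolution or pin it beyond a single curl call, a hedged sketch; the IP is the one quoted above and may stop working at any time:

# See whether the endpoint currently resolves at all
dig +short dynamodb.us-east-1.amazonaws.com
# Temporarily pin it system-wide (remove the entry once DNS recovers)
echo "3.218.182.212 dynamodb.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts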
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
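If the bootstrap keeps failing on that image pull, one possible workaround (untested here) is to point the builder at a BuildKit image hosted somewhere you can still reach; the registry below is a placeholder for a mirror you control:

# Create a builder whose BuildKit image comes from a reachable mirror;
# registry.example.internal is a placeholder, not a real registry
docker buildx create --name fallback-builder --driver docker-container \
  --driver-opt image=registry.example.internal/moby/buildkit:buildx-stable-1
docker buildx inspect --bootstrap --builder fallback-builder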
A `;)` is normally understood to mean the author isn't entirely serious, and is making light of something or other.
Perhaps you American downvoters were on call and woke up with a fright, and perhaps too much time to browse Hacker News. ;)
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
It still baffles me how we ended up in this situation where you can almost hear people's disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
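Rough math on the comparison above: 26 hours is about 26/730 ≈ 3.6% of a month, so a 50% credit comes to roughly 14 times the pro-rata downtime.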
IIRC it takes WAY too many managers to approve the dashboard being anything other than green.
It's not a reflection of reality nor is it automated.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
But there is only so much a cloud provider can guarantee within a region or whatever unit of isolation they offer.
you might be thinking of durability for s3 which is 11 nines, and i've never heard of anyone losing an object yet
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forego some of their insane profit margin for a couple hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.
Because I don't see the business pressure to do? If problems happen they can 1) lie on the status page and hope nothing happens and 2) if they can't get away with lying, their downside is limited to a few hours of profit margin.
(which is not really a dig at AWS because no hosting provider will put their business on the line for you... it's more of a dig at people who claim AWS is some uptime unicorn while in reality they're nowhere near better than your usual hosting provider to justify their 1000x markup)
It's great if they're doing their best anyway, but I don't see it as anything more than "best effort", because nothing bad would happen even if they didn't do a good job at it.
2. Trusted brand
Turns out the default URL was hardcoded to use the us-east interface, and just going to WorkSpaces and editing the URL to point at the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one and know about in advance, making it extra effective
* Quickly map out which of the things you use have a dependency on a single AWS region with no capability to change or re-route (a rough starting sketch follows below)
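For that last point, a rough starting sketch, assuming dig is available; the hostnames are placeholders for your actual third-party dependencies, and reverse-DNS names containing "compute-1" have historically pointed at us-east-1 EC2, so treat them as a hint rather than proof:

# Resolve each dependency and look at its reverse DNS for AWS hints;
# hostnames are hypothetical examples, replace with your real vendors
for host in api.example-vendor.com auth.example-vendor.net; do
  ip=$(dig +short "$host" | head -n1)
  ptr=$(dig +short -x "$ip")
  echo "$host -> ${ip:-unresolved} -> ${ptr:-no PTR record}"
done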
> https://en.wikipedia.org/wiki/Right_to_disconnect
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.
If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.
Yes.
What is important is having a contractual SLA that is defensible. Acts of God are defensible. And now major cloud infrastructure outages are too.
After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.
AWS outages: almost never happens, you should have been more prepared for when it does
If you say it’s Microsoft then it’s just unavoidable.
Still, it would make a bit of sense if you can find a place in your code where crossing a region hurts less, to move some of your services to a different region.
While your business partners will understand that you’re down while they’re down, will your customers? You called yesterday to say their order was ready, and now they can’t pick it up?
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
But I think what wasn't well considered was the async effect: if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
It is kinda cool that the worst aws outages are still within a single region and not global.
surely you mean:
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time, it's not very surprising that many of the outages that become visible to you involve multi-system failures - lots of other ones don't become visible!
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Interesting point that banks actually tolerate a lot more eventual consistency than most software that merely uses a billing backend ever does.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check limits: that check could absolutely be served from a local cache, and eventual consistency would hurt very little. Unless your cost is quite high, I would much rather keep the API up and deal with the over-usage later.
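As a sketch of that idea, assuming a hypothetical billing endpoint and cache path, a limit check that prefers a stale local copy over failing the request:

# Refresh the cached limits if the billing API answers quickly;
# URL and cache path are placeholders
CACHE=/var/tmp/limits-cache.json
if curl -fsS --max-time 2 https://billing.example.internal/v1/limits -o "$CACHE.new"; then
  mv "$CACHE.new" "$CACHE"
fi
# Otherwise serve whatever we already have, even if stale, instead of 503-ing;
# the empty default covers the case where no cache exists yet
cat "$CACHE" 2>/dev/null || echo '{}'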
Back before AWS provided transparency into AZ assignments, it was pretty common to use latency measurements to try and infer relative locality and mappings of AZs available to an account.
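A crude sketch of that kind of measurement, run from an instance you control; the peer IPs are placeholders, and consistently sub-millisecond averages usually hint at the same AZ while higher intra-region averages hint at a different one:

# Measure average RTT to a few peer instances; IPs are placeholders
for peer in 10.0.1.12 10.0.2.34 10.0.3.56; do
  avg=$(ping -c 10 -q "$peer" | awk -F'/' '/rtt|round-trip/ {print $5}')
  echo "$peer avg_rtt_ms=${avg:-unreachable}"
done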
No landing page explaining services are down, just scary error pages. I thought my account had been compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly, but payments are a no-go.
Btw, most parts of amazon.de are working fine, but I can't load profiles and can't log in.
https://status.perplexity.ai
That means Cursor is down, can't login.