AWS Multiple Services Outage in us-east-1
Key topics
AWS experienced a widespread outage in the us-east-1 region, affecting multiple services and causing issues for many companies and users. The incident sparked discussion about cloud reliability and the risks of relying on a single region.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 847s before posting. Peak period: 149 comments on Day 1. Average per period: 40 comments. Based on 160 loaded comments.
Key moments
- Story posted: Oct 20, 2025 at 3:22 AM EDT (3 months ago)
- First comment: Oct 20, 2025 at 3:08 AM EDT (847s before posting)
- Peak activity: 149 comments in Day 1, the hottest window of the conversation
- Latest activity: Oct 29, 2025 at 3:23 PM EDT (2 months ago)
Resolves to nothing.
Alternatively, perhaps their DNS service stopped responding to queries or even removed itself from BGP. It's possible for us mere mortals to tell which of these is the case.
Appears to have happened within the last 10-15 minutes.
We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.
As a degraded-state fallback, email is what we're using now (our clients are configured to encrypt with PGP by default; we use it for all internal email and also when the customer has PGP, so everyone knows how to use it).
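For anyone wiring up a similar fallback, a minimal sketch of encrypting a message with GnuPG before mailing it; the recipient address and filename are placeholders, and it assumes the recipient's public key is already in your keyring:

# Encrypt a status update for a colleague; address and filename are placeholders
gpg --encrypt --armor --recipient colleague@example.com status-update.txt
# Produces status-update.txt.asc, which can be pasted into or attached to a plain email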
If you seriously have no external low-dependency fallback, please at least document this fact now for the Big Postmortem.
Including fabricating new RAM?
My experience with self-hosting has been that, at least when you keep the services independent, downtime is no more common than in hosted environments, and you always know what's going on. Customising solutions, or working around trouble, is a benefit you don't get when the service provider is significantly bigger than you are. It has pros and cons and also depends on the product (e.g. email delivery is harder than Mattermost message delivery, or you may need a certain service only once a year or so), but if you have the personnel capacity and a continuous need, I find hosting things oneself to be the best solution in general.
Unwarranted tip: next time, if you use macOS, just open the terminal and run `caffeinate -imdsu`.
I assume Linux/Windows have something similar built-in (and if not built-in, something that's easily available). For Windows, I know that PowerToys suite of nifty tools (officially provided by Microsoft) has Awake util, but that's just one of many similar options.
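For what it's worth, a rough sketch of equivalents; the Linux line assumes a systemd-based distro, and Windows users can reach for the PowerToys Awake utility mentioned above:

# macOS: prevent idle, display, disk and system sleep while this runs
caffeinate -imdsu
# Linux (systemd): hold a sleep/idle inhibitor until the wrapped command exits
systemd-inhibit --what=sleep:idle --why="long-running job" sleep infinity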
Except Google Spanner, I’m told, but AWS doesn’t have an answer for that yet AFAIK.
The cost of re-designing and re-implementing applications to synchronize data shipping to remote regions and only spinning up remote region resources as needed is even larger for these organizations.
And this is how we end up with these massive cloud footprints not much different than running fleets of VM’s. Just about the most expensive way to use the cloud hyperscalers.
Most non-tech-industry organizations cannot face the brutal reality that properly, really leveraging hyperscalers involves a period of time, often counted in decades for Fortune-scale footprints, during which they spend 3-5 times more on selected areas than peers doing those areas the old way, in order to migrate to mostly spot-instance-resident, scale-to-zero, elastic, containerized services with excellent developer and operational troubleshooting ergonomics.
Admitting to that here?
In civilised jurisdictions that should be criminal.
Using cryptography to avoid accountability is wrong. Drug dealing and sex work, OK, but in other businesses? Sounds very crooked to me
When Slack was down we used... google... google mail? chat. When you go to gmail there is actually a chat app on the left.
I.e. some bottle-necks in new code appearing only _after_ you've deployed there, which is of course too late.
It didn't help that some services had their deploy trains (pipelines in amazon lingo) of ~3 weeks, with us-east-1 being the last one.
I bet the situation hasn't changed much since.
oof, so you're saying this outage could be caused by a change merged 3 weeks ago?
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
> Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.
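If you want to check resolution or pin it beyond a single curl call, a hedged sketch; the IP is the one quoted above and may stop working at any time:

# See whether the endpoint currently resolves at all
dig +short dynamodb.us-east-1.amazonaws.com
# Temporarily pin it system-wide (remove the entry once DNS recovers)
echo "3.218.182.212 dynamodb.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts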
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
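If the bootstrap keeps failing on that image pull, one possible workaround (untested here) is to point the builder at a BuildKit image hosted somewhere you can still reach; the registry below is a placeholder for a mirror you control:

# Create a builder whose BuildKit image comes from a reachable mirror;
# registry.example.internal is a placeholder, not a real registry
docker buildx create --name fallback-builder --driver docker-container \
  --driver-opt image=registry.example.internal/moby/buildkit:buildx-stable-1
docker buildx inspect --bootstrap --builder fallback-builder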
A `;)` is normally understood to mean the author isn't entirely serious, and is making light of something or other.
Perhaps you American downvoters were on call and woke up with a fright, and perhaps too much time to browse Hacker News. ;)
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
It still baffles me how we ended up in this situation where you can almost hear people's disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
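Rough math on the comparison above: 26 hours is about 26/730 ≈ 3.6% of a month, so a 50% credit comes to roughly 14 times the pro-rata downtime.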
IIRC it takes WAY too many managers to approve the dashboard being anything other than green.
It's not a reflection of reality nor is it automated.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
But there is only so much a cloud provider can guarantee within a region or whatever unit of isolation they offer.
you might be thinking of durability for s3 which is 11 nines, and i've never heard of anyone losing an object yet
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forego some of their insane profit margin for a couple hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.
Because I don't see the business pressure to do? If problems happen they can 1) lie on the status page and hope nothing happens and 2) if they can't get away with lying, their downside is limited to a few hours of profit margin.
(which is not really a dig at AWS because no hosting provider will put their business on the line for you... it's more of a dig at people who claim AWS is some uptime unicorn while in reality they're nowhere near better than your usual hosting provider to justify their 1000x markup)
It's great if they're doing their best anyway, but I don't see it as anything more than "best effort", because nothing bad would happen even if they didn't do a good job at it.
2. Trusted brand
Turns out the default URL was hardcoded to use the us-east interface, and just going to WorkSpaces and editing the URL to point at the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one and know about in advance, making it extra effective
* Quickly map out which of the things you use have a dependency on a single AWS region with no capability to change or re-route (a rough starting sketch follows below)
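For that last point, a rough starting sketch, assuming dig is available; the hostnames are placeholders for your actual third-party dependencies, and reverse-DNS names containing "compute-1" have historically pointed at us-east-1 EC2, so treat them as a hint rather than proof:

# Resolve each dependency and look at its reverse DNS for AWS hints;
# hostnames are hypothetical examples, replace with your real vendors
for host in api.example-vendor.com auth.example-vendor.net; do
  ip=$(dig +short "$host" | head -n1)
  ptr=$(dig +short -x "$ip")
  echo "$host -> ${ip:-unresolved} -> ${ptr:-no PTR record}"
done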
> https://en.wikipedia.org/wiki/Right_to_disconnect
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.
If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.
Yes.
What is important is having a contractual SLA that is defensible. Acts of God are defensible. And now major cloud infrastructure outages are too.
After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.
AWS outages: almost never happens, you should have been more prepared for when it does
If you say it’s Microsoft then it’s just unavoidable.
Still, it would make a bit of sense if you can find a place in your code where crossing a region hurts less, to move some of your services to a different region.
While your business partners will understand that you’re down while they’re down, will your customers? You called yesterday to say their order was ready, and now they can’t pick it up?
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
But I think what wasn't well considered was the async effect: if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
It is kinda cool that the worst aws outages are still within a single region and not global.
surely you mean:
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time, it's not very surprising that many of the outages that become visible to you involve multi-system failures - lots of other ones don't become visible!
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Interesting point that banks actually tolerate a lot more eventual consistency than most software that merely uses a billing backend ever does.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check limits: that check could absolutely be served from a local cache, and eventual consistency would hurt very little. Unless your cost is quite high, I would much rather keep the API up and deal with the over-usage later.
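As a sketch of that idea, assuming a hypothetical billing endpoint and cache path, a limit check that prefers a stale local copy over failing the request:

# Refresh the cached limits if the billing API answers quickly;
# URL and cache path are placeholders
CACHE=/var/tmp/limits-cache.json
if curl -fsS --max-time 2 https://billing.example.internal/v1/limits -o "$CACHE.new"; then
  mv "$CACHE.new" "$CACHE"
fi
# Otherwise serve whatever we already have, even if stale, instead of 503-ing;
# the empty default covers the case where no cache exists yet
cat "$CACHE" 2>/dev/null || echo '{}'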
Back before AWS provided transparency into AZ assignments, it was pretty common to use latency measurements to try and infer relative locality and mappings of AZs available to an account.
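A crude sketch of that kind of measurement, run from an instance you control; the peer IPs are placeholders, and consistently sub-millisecond averages usually hint at the same AZ while higher intra-region averages hint at a different one:

# Measure average RTT to a few peer instances; IPs are placeholders
for peer in 10.0.1.12 10.0.2.34 10.0.3.56; do
  avg=$(ping -c 10 -q "$peer" | awk -F'/' '/rtt|round-trip/ {print $5}')
  echo "$peer avg_rtt_ms=${avg:-unreachable}"
done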
No landing page explaining services are down, just scary error pages. I thought my account had been compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly, but payments are a no-go.
Btw, most parts of amazon.de are working fine, but I can't load profiles and can't log in.
https://status.perplexity.ai
That means Cursor is down, can't login.