Today Is When the Amazon Brain Drain Sent AWS Down the Spout
Posted 2 months ago · Active 2 months ago
theregister.com · Tech · story · High profile
heated · negative
Debate
85/100
Key topics
AWS Outage
Amazon Brain Drain
Tech Industry Layoffs
The article discusses a recent AWS outage and attributes it to Amazon's brain drain due to layoffs and poor management practices, sparking a heated debate among commenters about the causes and implications of the outage.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 21m
Peak period: 127 comments (0-12h)
Avg / period: 22.9
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Oct 20, 2025 at 4:50 PM EDT (2 months ago)
- 02 First comment: Oct 20, 2025 at 5:11 PM EDT (21m after posting)
- 03 Peak activity: 127 comments in 0-12h (hottest window of the conversation)
- 04 Latest activity: Oct 26, 2025 at 8:08 AM EDT (2 months ago)
ID: 45649178 · Type: story · Last synced: 11/26/2025, 1:00:33 PM
[1] https://pages.cs.wisc.edu/~remzi/Naur.pdf
[2] https://x.com/elonmusk/status/1980221072512635117
I don't get this. Why is Musk tweeting a fake quote and why are you posting it? What does it signify?
“The companies forget how to make great products. The product sensibility and product genius that brought them to this monopolistic position gets rotted out by people running these companies who have no conception of a good product vs. a bad product. They have no conception of the craftsmanship that’s required to take a good idea and turn it into a good product. And they really have no feeling in their hearts about wanting to help the customers.”
- Steve Jobs - https://en.wikipedia.org/wiki/Steve_Jobs:_The_Lost_Interview
I also had a GSM iPhone 4.
Compare that to how quickly they ran away from the shitty Intel modems when they were selling some made by Intel and some made by Samsung (?)
In any interaction you have with a company post-Covid, you can feel it. Nothing works anymore, and you can’t even tell anyone about it or why.
Again, it was going to happen eventually. Boomers should have done better to mentor the youth and pass the torch. Some did, most didn't. That time has now passed. Millennials and Gen-Z will have to pick up the pieces after things fall apart, and it will be painful...many things will just not be the same. Things we took for granted may well just disappear given a smaller population size. But at least they can finally define how they want the world to work. It's been too long. Millennials are in their 40s now. They should have taken the reins years ago if the prior generation actually cared to pass the torch.
One of the things that I recall doing on Election Day was cursing out Trump when he was re-elected.
This shit stain is a representation of the Boomers' last act: to write the final chapters of the Millennials' adult years.
The mess that Trump is making with the tariffs and destruction of institutional knowledge will eventually get fixed, but it will take 15-20+ years. Essentially the remaining years of Millennials' working lives. At this point I am transitioning to acceptance. Acceptance that so many processes will have to be rebuilt by Millennials and Gen-Z. At least there will be an opportunity to reinvent old ways of thinking despite all the pain.
Company cultures are not built to last, they are designed to generate profit. The culture is incidental, it will be whatever is most profitable at any given moment. At best, a company's culture is just a branding and marketing strategy to attract employees and to appear cool. Therefore they are fickle and prone to complete collapse when just a few people are replaced.
So the title is all speculation. The author put 2 and 2 together and concluded that 10 is greater than 9.
Worthless article.
Or you could just say "there is no way the thing that constantly happens over and over again has happened once again, just no way".
Staff cuts constantly happen in the name of maximising profits. They always yield poor results for a company's performance. Every time. Especially on the consumer-facing side (not the company's finances, of course).
Every time.
But maybe this time it's different. That one time.
That said, my suspicion is they're likely on to something here regarding layoffs and quality degradation.
But I know very few people in the industry who know about Amazon’s reputation who have a lifelong dream of working there, given a choice.
I was 46 when I was hired there for a “permanently remote [sic] field by design role” in ProServe and it was my 8th job out of college. I went in with my eyes wide open. I had a plan, stay for four years, sell my RSUs as soon as they vested, pay off debt, save some money, put it on my resume to open doors and make connections and leave.
I was never expecting to make more when I left. I used the time to downsize and reduce my expenses - including moving to state tax free Florida.
When I saw the writing on the wall, I played the game while I was on focus to get my next vest and wait for the “get 40k+ severance and leave immediately or try to work through the PIP”.
I took the latter and had three offers within 3 weeks. This was late 2023.
I left debt free, sold my old home for exactly twice what I had built it for 8 years earlier, downsized to a condo half the price I sold it for (and 1/3 the size), and I was debt free with savings.
I’m now a staff consultant working full time at a 3rd party AWS consulting firm with a lot less stress and still remote. They were the last to fall. But AWS made their ProServe department return to office at the beginning of this year.
https://blog.stackademic.com/aws-just-fired-40-of-its-devops...
https://amazon.jobs/en/jobs/3080348/devops-engineer-linux-re...
https://amazon.jobs/en/jobs/3082914/devops-systems-engineer-...
This one mentions terraform by name (though that doesn't necessarily imply it's in use; having worked in large companies, I would argue that sweeping statements about a popular technology not being used are likely to be wrong)
https://amazon.jobs/en/jobs/3042892/delivery-consultant-devo...
The last one is a ProServe role, which is a consulting role that spends their time working in customer environments, which is where they may encounter terraform. It does not mean anything about internal use of terraform.
I already showed you that AWS has (or hires) DevOps people with publicly available information, maybe the article is incorrect but you’re clearly not better informed, so maybe cut it with the rude commentary.
https://amazon.jobs/en/jobs/3080348/devops-engineer-linux-re...
Just do a quick google search for that “40% of devops laid off” and you’ll see that it’s actually an old article from months ago that multiple people, including AWS employees, are saying is bullshit and unsourced.
edit: found another source that says this 40% number came from an AWS consultant that worked with customers to help them be better at DevOps, and it was 40% of their specific team that was laid off. Even if it were true, it has nothing to do with the internal operations of AWS services. This is why it’s important to understand the information you’re sharing before making judgements off of it.
https://www.theregister.com/2025/07/18/aws_sheds_jobs/
Seems wild that you would promote job titles you don’t hire for, makes me think that it’s reasonable for news outlets to refer to those roles in the same way honestly.
Notice the job description:
As part of the AWS Managed Operations team, you will play a pivotal role in building and leading operations and development teams dedicated to delivering high-availability AWS services, including EC2, S3, Dynamo, Lambda, and Bedrock, exclusively for EU customers.
They aren’t looking for DevOps engineers to work alongside the “service teams” - the teams that build and support internal AWS services. They are working with AWS customers who may already be using Terraform. AWS has a large internal consulting division staffed with full-time employees. When they work with customers they will use Terraform if needed.
The previous commenter is correct: there is no NOC or DevOps team, I’ve not encountered a DevOps job family, and I’ve never seen Terraform internally. Within AWS, the service teams that work these outages are the same ones that design the service, fix bugs, deploy the pipelines, carry the on-call pager, etc. The roles that fill these teams are pretty much one of three types: NDE, SDE, SysDE. They typically use CDK if they’re doing AWS things; otherwise they’ll use internal tooling.
The job you posted is a customer-facing, consultant-like role - customers use Terraform, so having a customer-facing consultant type who knows how customer-y things work is a good decision.
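For readers unfamiliar with the tooling split being described: CDK lets a service team define its own infrastructure as ordinary code and deploy it through its own pipeline. A minimal, generic sketch in Python (purely illustrative; the resource names and thresholds are made up and this is not Amazon-internal code):

    from aws_cdk import App, Stack, Duration
    from aws_cdk import aws_sqs as sqs
    from aws_cdk import aws_cloudwatch as cloudwatch
    from constructs import Construct

    class ExampleServiceStack(Stack):
        """One team's stack: the same people who write the service own this too."""
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # A queue the service consumes from.
            queue = sqs.Queue(self, "WorkQueue",
                              visibility_timeout=Duration.seconds(60))
            # The team also defines the alarm that will page its own on-call.
            cloudwatch.Alarm(self, "QueueBacklogAlarm",
                             metric=queue.metric_approximate_number_of_messages_visible(),
                             threshold=1000,
                             evaluation_periods=3)

    app = App()
    ExampleServiceStack(app, "ExampleServiceStack")
    app.synth()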
It's one of the few parts of the internet which could potentially be replaced over time with very little disruption.
The hierarchy of resolvers could be replaced with a far simpler, flat blockchain where people could buy and permanently own their domains directly on-chain... No recurring fees. People could host websites on the blockchain from beyond the grave... This is kind of a dream of mine. Not possible to achieve in our current system.
All the arguments I'm hearing against a Blockchain DNS system are rooted in petty crony-capitalist thinking.
This kind of thinking seems to permeate most other parts of society... It's gotta stop.
"Oh but what if someone steals it"
This ain't gonna be much of a problem in a functioning society where the top 20 domain names don't hoard like 95% of the traffic.
"Oh but we don't want people to own domains permanently or else they will take all the good domains"
Um hello?? Have you checked this thing called reality? It's already the case! So happy billionaires have to pay their $20 per month to maintain their market monopolies.
I actually don't mind other people having more stuff than me, but I'm tired of petty people ruining good ideas and stalling progress to make a few bucks.
This is precisely why something like this isn't a popular solution lots of people are working towards. Domains broadly speaking aren't a finite resource, but usable domains using common words definitely are. As time marches on, human-readable/typeable "permanent identifiers" are going to have to go away. Email addresses, usernames and the like are all going to get recycled, just like phone numbers are. Domains are currently recycled and most people probably think that's a good thing (assuming they think about it at all)
I bought tombert.com in 2014 and forgot to renew it in 2015, and it was auctioned off by GoDaddy. For like six years, it was owned by squatters, and they wanted thousands of dollars for the domain [1]. I called offering $100 for it, and they claimed that they couldn't go below $1400 because this domain was in "extremely high demand". I finally was able to buy it back in 2021, presumably because the squatter purged out domains that hadn't been purchased for N years and they wanted to save money.
Now, you could argue "see! You wouldn't have had to worry about it expiring if it were permanent on the blockchain", and that's true, but if someone else had gotten to that domain first, then I would also never get it. I think the only thing that keeps the internet even remotely fair in this regard is that domain names cost some amount of money to keep.
[1] https://web.archive.org/web/20160219161720/http://www.hugedo...
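A toy sketch (plain Python, not an actual blockchain) of the flat, first-come-first-served "permanent" registry imagined upthread makes the objection concrete: nothing expires and nothing can be reclaimed, so whoever registers first holds the name forever. All names and owners below are made up for illustration.

    class PermanentRegistry:
        def __init__(self):
            self._owners = {}  # name -> owner, forever

        def register(self, name, owner):
            if name in self._owners:
                return False          # no renewal fees, but also no expiry
            self._owners[name] = owner
            return True

        def resolve(self, name):
            return self._owners.get(name)

    registry = PermanentRegistry()
    registry.register("tombert.com", "squatter")          # first to arrive owns it
    print(registry.register("tombert.com", "tombert"))    # False: locked out for good
    print(registry.resolve("tombert.com"))                # "squatter"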
> This is precisely why something like this isn't a popular solution lots of people are working towards. Domains broadly speaking aren't a finite resource, but usable domains using common words definitely are. As time marches on, human-readable/typeable "permanent identifiers" are going to have to go away. Email addresses, usernames and the like are all going to get recycled, just like phone numbers are. Domains are currently recycled and most people probably think that's a good thing.
DNS is:
- simple
- battle hardened
- distributed
- affordable
blockchains are:
- esoteric, backwards, and not easily implemented
- new and unproven, frequently hacked
- effectively a ploy to centralize / redo Web 1.0 but owned by one blockchain
- ...waaaaaaay more about money and "owning something" than DNS is.
At least currently death dissolves bonds.
https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
Jesus, if even an ounce of that is true... Yes, everyone on the internet is a cat clawing on a keyboard... but if a ton of people legitimately confirmed to be ex-AWS point to similar culture issues... it's probably AWS that's rotting.
RTO, in combination with Amazon being in last place in AI innovation, has led to anyone who can leave, leaving.
> At the end of 2023, Justin Garrison left AWS and roasted them on his way out the door. He stated that AWS had seen an increase in Large Scale Events (or LSEs), and predicted significant outages in 2024. It would seem that he discounted the power of inertia
Your comment is relying on that referenced inertia. Things will continue to function for a period of time, but there exists an inflection point at which they no longer function as they did previously.
Inertia is a hell of a force.
All that he seems to be doing these days at Twitter is messing around with the recommendation algorithm, overriding the decisions of what's left of moderation for his far-right friends, and that's it. Oh, and of course Grok/xAI or whatever it's called these days, but IIRC that's a separate corporate entity that just got shoehorned onto Twitter.
Yes, development tools are better every day. Yes, you can downsize. No it won’t be felt immediately. Yes, it mortgages the future and at a painfully high interest rate.
Suspending disbelief won’t make downsizing work better.
See: General Electric, RCA, Xerox, GM
But Bezos will still have his billions.
Just a guess but I think this bubble will stretch a bit more before it pops.
Are they?
[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services...
[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues...
[12:11 AM PDT] <declared outage>
They claim not to have known the root cause for ~8hr
Or an NLB could also be load balancing by managing DNS records--it's not really clear what an NLB means in this context
Or there was an overload condition because of the NLB malfunctioning that caused UDP traffic to get dropped
Obviously a lot of reading between the lines is required without a detailed RCA--hopefully they release more info
The initial cause appears to be a bad DNS entry that they rolled back at 2:22am PDT. They started seeing recovery across services, but as reports of EC2 failures kept rolling in they found a network issue with a load balancer that was causing the issue at 8:43am.
Their 14 updates did not bring my stuff back up.
My nines are not their nines. https://rachelbythebay.com/w/2019/07/15/giant/
P.S. I’m not an Amazon hater, replace the company name with any other big one of your choice and the article will have the same meaning ;-)
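The "my nines are not their nines" point is easy to put numbers on: if an application has several hard dependencies and their failures are roughly independent, its best-case availability is about the product of its dependencies' availabilities, not the minimum. A back-of-the-envelope sketch (the figures below are illustrative, not anyone's published SLA):

    # Rough availability math: hard dependencies multiply, they don't average.
    dependencies = {
        "dynamodb": 0.9999,
        "s3": 0.9999,
        "load_balancer": 0.9999,
        "own_code": 0.999,
    }

    availability = 1.0
    for name, a in dependencies.items():
        availability *= a

    downtime_min_per_year = (1 - availability) * 365 * 24 * 60
    print(f"combined availability: {availability:.5f}")
    print(f"expected downtime: ~{downtime_min_per_year:.0f} minutes/year")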
https://www.reddit.com/r/SeattleWA/comments/1ncm25p/amazon_m...
I’m confused how they can have such a failure, they are employing the best and brightest top tier talent from India.
Hopefully they can increase their H1B allotment even more next year to help prevent these types of failures.
However, talent is a very small part of shipping a project. How that talent is resourced is far more important.
They’ll get acquired and top people leave as their stock vests or get pushed out because the megacorp wants someone different in the seat.
The people who knew the tech are gone and you’re left with an unmaintainable mess that becomes unreliable and no one knows how to fix it.
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to narrow it down to a single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
It's good enough, but there's no real evidence it's the best, simply the largest.
From my experience in setting up and running support services, not really. It's actually pretty darn quick.
First, the issue is reported to level 1 support, which is a bunch of juniors/drones on call, often offshore (depending on time of the day), who'll run through their scripts and, having determined that it's not in there, escalate to level 2.
Level 2 would be a more experienced developer/support tech, who's seen a thing or two and dealt with serious issues. It will take time to get them online as they're on call but not online at 3am EST; they have to get their cup of joe, turn on the laptop, etc. It would take them a bit to realize that the fecal matter made contact with the rotating blades and escalate to level 3.
Which involves setting up the bridge, waking up the decision makers (in my case it was director and VP level), and finally waking up the guy who either a) wrote all this or b) is one of 5 or 6 people on the planet capable of understanding and troubleshooting the tangled mess.
I do realize that AWS support might be structured quite a bit differently, but still... 75 minutes is pretty good.
Edit: That is not to say that AWS doesn't have a problem with turnover. I'm well aware of their policies and tendency to get rid of people in 2/3 years, partially due to compensation structures where there's a significant bump in compensation - and vesting - once you reach that timeframe.
But in this particular case I don't think support should take much of a blame. The overall architecture on the other hand...
Because if so, this seems like about the most damning thing I could learn from this incident.
Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.
End user tickets can not page engineers but fellow internal teams can. Generally escalation and paging additional help in the event that one can not handle the situation is encouraged and many tenured/senior engineers are very keen to help, even at weird hours.
What are business hours for a global provider of critical tech services?
Alerts and monitoring will result in automatic pages to engineers. There is no human support before it gets escalated.
If an engineer hasn't taken a look within a few minutes, it escalates to their manager, and so on.
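A minimal sketch of the escalation behaviour being described, as one might model it; the chain, roles, and timeouts below are hypothetical, not Amazon's actual paging configuration:

    # Hypothetical paging escalation: if nobody acknowledges in time, the page
    # moves up the chain automatically. Roles and timeouts are made up.
    ESCALATION_CHAIN = [
        ("primary on-call engineer", 5),    # minutes allowed to acknowledge
        ("secondary on-call engineer", 5),
        ("engineering manager", 10),
        ("senior manager / director", 15),
    ]

    def page(acknowledge):
        """Walk the chain until someone acknowledges; return who took the page."""
        for role, minutes in ESCALATION_CHAIN:
            if acknowledge(role, minutes):
                return role
        return "unacknowledged (keep escalating)"

    # Example: the primary sleeps through it, the secondary picks it up.
    print(page(lambda role, minutes: role == "secondary on-call engineer"))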
I have 10 years of experience at Amazon as an L6/L7 SDM, across 4 teams (Games, logistics, Alexa, Prime video). I have also been on a team that caused a sev 1 in the past.
Just capitalised for emphasis, right?
> COE
Center of Excellence? Council of Europe? Still wondering even after Googling.
> SLA
Service Level Agreement. This I knew beforehand.
> SDM
Service Delivery Manager?
I guessed this was an internal Amazon thing so I searched “Amazon COE”
Correction of Error
https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2...
> SDM
Software Development Manager (from searching Amazon SDM)
https://amazon.jobs/content/en/how-we-hire/sdm-interview-pre...
I tell the juniors it stands for Correction of Employment. Keeps them on their toes.
It is possible with professionals, institutional knowledge, drills, and good tools.
Quite a few of AWS's more mature customers (including my company) were aware within 15 minutes of the incident that Dynamo was failing and hypothesized that it'd taken out other services. Hopefully AWS engineers were at least as fast.
75 minutes to make a decision about how to message that outage is not particularly slow though, and my guess is that this is where most of the latency actually came from.
With that being said, the problem here isn't that it took 75 minutes to find the root cause, but rather that the fix took hours to propagate through the us-east-1 data center network, which is completely unacceptable for industries like healthcare where even small disruptions are a matter of life and death.
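For context on how customers could tell so quickly: a simple external probe against the public service endpoints is enough to distinguish "the name no longer resolves" from "it resolves but requests fail". A rough sketch of that kind of check (the hostnames are real public AWS endpoints, but the probe itself is illustrative, not how any particular customer monitors):

    import socket
    import urllib.request

    # Public regional endpoints to spot-check; handling below is illustrative.
    ENDPOINTS = [
        "dynamodb.us-east-1.amazonaws.com",
        "sqs.us-east-1.amazonaws.com",
    ]

    def probe(host, timeout=3.0):
        try:
            addr = socket.gethostbyname(host)      # does the name still resolve?
        except socket.gaierror as exc:
            return f"DNS failure: {exc}"
        try:
            urllib.request.urlopen(f"https://{host}/", timeout=timeout)
            return f"ok ({addr})"
        except Exception as exc:                   # an HTTP error still proves reachability
            return f"resolved to {addr}, but request failed: {exc}"

    for host in ENDPOINTS:
        print(f"{host}: {probe(host)}")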
One might spend weeks diagnosing a problem if the problem only happens 0.01% of the time, correlated with nothing, goes away when retried, and nobody can reproduce it in a test environment.
But 0.01%-and-it-goes-away-when-retried does not make a high priority incident. High priority incidents tend to be repeatable problems that weren't there an hour ago.
Generally a well designed, properly resourced business critical system will be simple enough and well enough monitored that problems can be diagnosed in a good deal less than 75 minutes - even if rolling out a full fix takes longer.
Of course, I don't know how common well designed, properly resourced business critical systems are.
Even with candidate pools of hundreds of thousands of H1-B engineers and tens of millions of illegal immigrant warehouse workers, there still comes a point where such a big company firing so many people so quickly exhausts all their options.
It reminds me of the Robot Chicken Sketch where Imperial Officers aboard the Death Star all pretend to be force choked to death by Darth Vader so they can avoid getting killed by lightsaber, then come back in under different names in different jobs. It's worse though for Amazon: nobody wants to come back.
https://www.youtube.com/watch?v=fFihTRIxCkg
I asked if I could schedule the interview after my final exams, and his arrogance really showed when not only did he refuse, but he then insisted my exams don't even register on the same scale of importance as the opportunity to work for Amazon.
Somewhat related: a recruiter at Google cold-called me a couple months into my first job out of undergrad back in 2016 and was similarly condescending about "the chance" to work for Google compared to everything else. I already had a low opinion of them when they gave my then-girlfriend an introductory O'Reilly book on Java after she failed their interview.
I regret being born too late to work somewhere like Bell Labs, SGI, or Sun. I had a ton of graybeard wizard coworkers from these places, and they were all a pleasure to learn from and even better friends. For the first 2 years of my first job, every day of work was like walking into the Shire and talking magic spells with 20 Gandalfs.
That job was great until I got put on a team with a guy who was a former middle manager at some IBM-like company and went from being surrounded by people lightyears ahead of me to being surrounded by Dilbert characters. The messed-up part was that it wasn't even punishment. I was rewarded after completing a project with my choice of which team I joined next, and I joined the wrong one. I assumed that joining a new team to utilize this newfangled "cloud computing" thing would be trailblazing, and I didn't do any diligence on who I would work with.
To this day, I still regret not rejoining the first team I worked for, otherwise I would still be at that company and happy about it. Then again, the boredom and discontent while being on that sucky team is the reason I started investing, and now I can buy a house in cash and fund myself to do whatever I want for at least a decade. Hard to complain about the way things turned out.
As a real life Wally I appreciate this comment.
I was Wally for the last 2 1/2 years of that previous job, until I started to realize I was becoming more and more like a Dilbert character myself. Something in my brain just told me it wasn't sustainable, call it fear of God or paranoia, but letting my skills atrophy in a place like that for 20 years didn't seem like it would end well for me.
The only problem was that I stayed so long, and it made me hate software engineering so much that I didn't even want to be a software engineer anymore.
I put up with it just long enough so I could avoid selling stock and drawing cash out of my portfolio, and now I'm back at square one as a post-bacc student getting my applications in order for MD and PhD programs where I'll most certainly wind up drawing hundreds of thousands out of my portfolio to pay rent and eat dinner for about a decade.
It's sad, I really enjoyed systems programming, but it seems like finding interesting systems programming and distributed computing projects that have significant economic value is like squeezing blood out of a stone. Maybe LLMs or future progress in bioinformatics will change that, now that finding ways to shovel a lot of data into and out of GPUs is valuable, but I'm so far into physiology, genetics/proteomics, and cell biology that I'm not sure I would even want to go back.
>I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
That information is under NDA, so it's only natural you aren't privy to it.
[1] https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
494 more comments available on Hacker News