Today Is When the Amazon Brain Drain Sent AWS Down the Spout
Posted 2 months ago · Active 2 months ago
theregister.com · Tech · story · High profile
heated · negative
Debate
85/100
Key topics
AWS Outage
Amazon Brain Drain
Tech Industry Layoffs
The article discusses a recent AWS outage and attributes it to Amazon's brain drain due to layoffs and poor management practices, sparking a heated debate among commenters about the causes and implications of the outage.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 21m
Peak period: 127 comments (0-12h)
Avg / period: 22.9
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
- 01 Story posted: Oct 20, 2025 at 4:50 PM EDT (2 months ago)
- 02 First comment: Oct 20, 2025 at 5:11 PM EDT (21m after posting)
- 03 Peak activity: 127 comments in 0-12h (hottest window of the conversation)
- 04 Latest activity: Oct 26, 2025 at 8:08 AM EDT (2 months ago)
ID: 45649178 · Type: story · Last synced: 11/26/2025, 1:00:33 PM
[1] https://pages.cs.wisc.edu/~remzi/Naur.pdf
[2] https://x.com/elonmusk/status/1980221072512635117
I don't get this. Why is Musk tweeting a fake quote and why are you posting it? What does it signify?
“The companies forget how to make great products. The product sensibility and product genius that brought them to this monopolistic position gets rotted out by people running these companies who have no conception of a good product vs. a bad product. They have no conception of the craftsmanship that’s required to take a good idea and turn it into a good product. And they really have no feeling in their hearts about wanting to help the customers.”
- Steve Jobs - https://en.wikipedia.org/wiki/Steve_Jobs:_The_Lost_Interview
I also had a GSM iPhone 4.
Compare that to how quickly they ran away from the shitty Intel modems when they were selling some made by Intel and some made by Samsung (?)
In any interaction you have with a company post-Covid, you can feel it. Nothing works anymore, and you can’t even tell anyone about it or why.
Again, it was going to happen eventually. Boomers should have done better to mentor the youth and pass the torch. Some did, most didn't. That time has now passed. Millennials and Gen-Z will have to pick up the pieces after things fall apart, and it will be painful...many things will just not be the same. Things we took for granted may well just disappear given a smaller population size. But at least they can finally define how they want the world to work. It's been too long. Millennials are in their 40s now. They should have taken the reins years ago if the prior generation actually cared to pass the torch.
One of the things that I recall doing on Election Day was cursing out Trump when he was re-elected.
This shit stain is a representation of the Boomers' last act: to write the final chapters of the Millennials' adult years.
The mess that Trump is making with the tariffs and destruction of institutional knowledge will eventually get fixed, but it will take 15-20+ years. Essentially the remaining years of Millennials' working lives. At this point I am transitioning to acceptance. Acceptance that so many processes will have to be rebuilt by Millennials and Gen-Z. At least there will be an opportunity to reinvent old ways of thinking despite all the pain.
Company cultures are not built to last, they are designed to generate profit. The culture is incidental, it will be whatever is most profitable at any given moment. At best, a company's culture is just a branding and marketing strategy to attract employees and to appear cool. Therefore they are fickle and prone to complete collapse when just a few people are replaced.
So the title is all speculation. The author put 2 and 2 together and concluded that 10 is greater than 9.
Worthless article.
Or you could just say "there is no way the thing that constantly happens over and over again has happened once again, just no way".
Staff cuts constantly happen in the name of maximising profits. They always yield poor results for a company's performance. Every time. Especially on the consumer-facing side (not the company's finances, of course).
Every time.
But maybe this time it's different. That one time.
That said, my suspicion is they're likely on to something here regarding layoffs and quality degradation.
But I know very few people in the industry who know about Amazon’s reputation who have a lifelong dream of working there, given a choice.
I was 46 when I was hired there for a “permanently remote [sic] field by design role” in ProServe and it was my 8th job out of college. I went in with my eyes wide open. I had a plan, stay for four years, sell my RSUs as soon as they vested, pay off debt, save some money, put it on my resume to open doors and make connections and leave.
I was never expecting to make more when I left. I used the time to downsize and reduce my expenses - including moving to state tax free Florida.
When I saw the writing on the wall, I played the game while I was on focus to get my next vest and wait for the “get 40k+ severance and leave immediately or try to work through the PIP”.
I took the latter and had three offers within 3 weeks. This was late 2023.
I left debt free, sold my old home for exactly twice what I had built it for 8 years earlier, downsized to a condo half the price I sold it for (and 1/3 the size), and I was debt free with savings.
I’m now a staff consultant working full time at a 3rd party AWS consulting firm with a lot less stress and still remote. They were the last to fall. But AWS made their ProServe department return to office at the beginning of this year.
https://blog.stackademic.com/aws-just-fired-40-of-its-devops...
https://amazon.jobs/en/jobs/3080348/devops-engineer-linux-re...
https://amazon.jobs/en/jobs/3082914/devops-systems-engineer-...
This one mentions terraform by name (though that doesn't necessarily imply it's in use; having worked in large companies, I would argue that sweeping statements about a popular technology not being used are likely to be wrong)
https://amazon.jobs/en/jobs/3042892/delivery-consultant-devo...
The last one is a ProServe role, which is a consulting role that spends their time working in customer environments, which is where they may encounter terraform. It does not mean anything about internal use of terraform.
I already showed you that AWS has (or hires) DevOps people with publicly available information, maybe the article is incorrect but you’re clearly not better informed, so maybe cut it with the rude commentary.
https://amazon.jobs/en/jobs/3080348/devops-engineer-linux-re...
Just do a quick google search for that “40% of devops laid off” and you’ll see that it’s actually an old article from months ago that multiple people, including AWS employees, are saying is bullshit and unsourced.
edit: found another source that says this 40% number came from an AWS consultant that worked with customers to help them be better at DevOps, and it was 40% of their specific team that was laid off. Even if it were true, it has nothing to do with the internal operations of AWS services. This is why it’s important to understand the information you’re sharing before making judgements off of it.
https://www.theregister.com/2025/07/18/aws_sheds_jobs/
Seems wild that you would promote job titles you don’t hire for, makes me think that it’s reasonable for news outlets to refer to those roles in the same way honestly.
Notice the job description:
As part of the AWS Managed Operations team, you will play a pivotal role in building and leading operations and development teams dedicated to delivering high-availability AWS services, including EC2, S3, Dynamo, Lambda, and Bedrock, exclusively for EU customers.
They aren’t looking for DevOps engineers to work alongside the “service teams” - the teams that build and support internal AWS services. They are working with AWS customers who may already be using Terraform. AWS has a large internal consulting division staffed with full-time employees. When they work with customers they will use Terraform if needed.
The previous commenter is correct: there is no NOC or DevOps team, I’ve not encountered a DevOps job family, and I’ve never seen Terraform internally. Within AWS, the service teams that work these outages are the same ones that design the service, fix bugs, deploy the pipelines, carry the on-call pager, etc. The roles that fill these teams are pretty much one of three types: NDE, SDE, SysDE. They typically use CDK if they’re doing AWS things; otherwise they’ll use internal tooling.
The job you posted is a customer-facing, consultant-like role - customers use Terraform, so having a customer-facing consultant type who knows how customer-y things work is a good decision.
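For readers unfamiliar with the tooling split being described: CDK lets a service team define its own infrastructure as ordinary code and deploy it through its own pipeline. A minimal, generic sketch in Python (purely illustrative; the resource names and thresholds are made up and this is not Amazon-internal code):

    from aws_cdk import App, Stack, Duration
    from aws_cdk import aws_sqs as sqs
    from aws_cdk import aws_cloudwatch as cloudwatch
    from constructs import Construct

    class ExampleServiceStack(Stack):
        """One team's stack: the same people who write the service own this too."""
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # A queue the service consumes from.
            queue = sqs.Queue(self, "WorkQueue",
                              visibility_timeout=Duration.seconds(60))
            # The team also defines the alarm that will page its own on-call.
            cloudwatch.Alarm(self, "QueueBacklogAlarm",
                             metric=queue.metric_approximate_number_of_messages_visible(),
                             threshold=1000,
                             evaluation_periods=3)

    app = App()
    ExampleServiceStack(app, "ExampleServiceStack")
    app.synth()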
It's one of the few parts of the internet which could potentially be replaced over time with very little disruption.
The hierarchy of resolvers could be replaced with a far simpler, flat blockchain where people could buy and permanently own their domains directly on-chain... No recurring fees. People could host websites on the blockchain from beyond the grave... This is kind of a dream of mine. Not possible to achieve in our current system.
All the arguments I'm hearing against a Blockchain DNS system are rooted in petty crony-capitalist thinking.
This kind of thinking seems to permeate most other parts of society... It's gotta stop.
"Oh but what if someone steals it"
This ain't gonna be much of a problem in a functioning society where the top 20 domain names don't hoard like 95% of the traffic.
"Oh but we don't want people to own domains permanently or else they will take all the good domains"
Um hello?? Have you checked this thing called reality? It's already the case! So happy billionaires have to pay their $20 per month to maintain their market monopolies.
I actually don't mind other people having more stuff than me, but I'm tired of petty people ruining good ideas and stalling progress to make a few bucks.
This is precisely why something like this isn't a popular solution lots of people are working towards. Domains broadly speaking aren't a finite resource, but usable domains using common words definitely are. As time marches on, human-readable/typeable "permanent identifiers" are going to have to go away. Email addresses, usernames and the like are all going to get recycled, just like phone numbers are. Domains are currently recycled and most people probably think that's a good thing (assuming they think about it at all)
I bought tombert.com in 2014 and forgot to renew it in 2015, and it was auctioned off by GoDaddy. For like six years, it was owned by squatters, and they wanted thousands of dollars for the domain [1]. I called offering $100 for it, and they claimed that they couldn't go below $1400 because this domain was in "extremely high demand". I finally was able to buy it back in 2021, presumably because the squatter purged out domains that hadn't been purchased for N years and they wanted to save money.
Now, you could argue "see! You wouldn't have had to worry about it expiring if it were permanent on the blockchain", and that's true, but if someone else had gotten to that domain first, then I would also never get it. I think the only thing that keeps the internet even remotely fair in this regard is that domain names cost some amount of money to keep.
[1] https://web.archive.org/web/20160219161720/http://www.hugedo...
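A toy sketch (plain Python, not an actual blockchain) of the flat, first-come-first-served "permanent" registry imagined upthread makes the objection concrete: nothing expires and nothing can be reclaimed, so whoever registers first holds the name forever. All names and owners below are made up for illustration.

    class PermanentRegistry:
        def __init__(self):
            self._owners = {}  # name -> owner, forever

        def register(self, name, owner):
            if name in self._owners:
                return False          # no renewal fees, but also no expiry
            self._owners[name] = owner
            return True

        def resolve(self, name):
            return self._owners.get(name)

    registry = PermanentRegistry()
    registry.register("tombert.com", "squatter")          # first to arrive owns it
    print(registry.register("tombert.com", "tombert"))    # False: locked out for good
    print(registry.resolve("tombert.com"))                # "squatter"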
> This is precisely why something like this isn't a popular solution lots of people are working towards. Domains broadly speaking aren't a finite resource, but usable domains using common words definitely are. As time marches on, human-readable/typeable "permanent identifiers" are going to have to go away. Email addresses, usernames and the like are all going to get recycled, just like phone numbers are. Domains are currently recycled and most people probably think that's a good thing.
DNS is:
- simple
- battle hardened
- distributed
- affordable
blockchains are:
- esoteric, backwards, and not easily implemented
- new and unproven, frequently hacked
- effectively a ploy to centralize / redo Web 1.0 but owned by one blockchain
- ...waaaaaaay more about money and "owning something" than DNS is.
At least currently death dissolves bonds.
https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
Jesus, if even an ounce of that is true... Yes, everyone on the internet is a cat clawing on a keyboard... but if a ton of people legitimately confirmed to be ex-AWS point to similar culture issues... it's probably AWS that's rotting.
RTO, in combination with Amazon being in last place in AI innovation, has led to anyone who can leave, leaving.
> At the end of 2023, Justin Garrison left AWS and roasted them on his way out the door. He stated that AWS had seen an increase in Large Scale Events (or LSEs), and predicted significant outages in 2024. It would seem that he discounted the power of inertia
Your comment is relying on that referenced inertia. Things will continue to function for a period of time, but there exists an inflection point at which they no longer function as they did previously.
Inertia is a hell of a force.
All that he seems to be doing these days at Twitter is messing around with the recommendation algorithm, overriding the decisions of what's left of moderation for his far-right friends, and that's it. Oh, and of course Grok/xAI or whatever it's called these days, but IIRC that's a separate corporate entity that just got shoehorned onto Twitter.
Yes, development tools are better every day. Yes, you can downsize. No it won’t be felt immediately. Yes, it mortgages the future and at a painfully high interest rate.
Suspending disbelief won’t make downsizing work better.
See: General Electric, RCA, Xerox, GM
But Bezos will still have his billions.
Just a guess but I think this bubble will stretch a bit more before it pops.
Are they?
[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services...
[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues...
[12:11 AM PDT] <declared outage>
They claim not to have known the root cause for ~8hr
Or an NLB could also be load balancing by managing DNS records--it's not really clear what an NLB means in this context
Or there was an overload condition because of the NLB malfunctioning that caused UDP traffic to get dropped
Obviously a lot of reading between the lines is required without a detailed RCA--hopefully they release more info
The initial cause appears to be a bad DNS entry that they rolled back at 2:22am PDT. They started seeing recovery across services, but as reports of EC2 failures kept rolling in they found a network issue with a load balancer that was causing the issue at 8:43am.
Their 14 updates did not bring my stuff back up.
My nines are not their nines. https://rachelbythebay.com/w/2019/07/15/giant/
P.S. I’m not an Amazon hater, replace the company name with any other big one of your choice and the article will have the same meaning ;-)
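The "my nines are not their nines" point is easy to put numbers on: if an application has several hard dependencies and their failures are roughly independent, its best-case availability is about the product of its dependencies' availabilities, not the minimum. A back-of-the-envelope sketch (the figures below are illustrative, not anyone's published SLA):

    # Rough availability math: hard dependencies multiply, they don't average.
    dependencies = {
        "dynamodb": 0.9999,
        "s3": 0.9999,
        "load_balancer": 0.9999,
        "own_code": 0.999,
    }

    availability = 1.0
    for name, a in dependencies.items():
        availability *= a

    downtime_min_per_year = (1 - availability) * 365 * 24 * 60
    print(f"combined availability: {availability:.5f}")
    print(f"expected downtime: ~{downtime_min_per_year:.0f} minutes/year")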
https://www.reddit.com/r/SeattleWA/comments/1ncm25p/amazon_m...
I’m confused how they can have such a failure, they are employing the best and brightest top tier talent from India.
Hopefully they can increase their H1B allotment even more next year to help prevent these types of failures.
However, talent is a very small part of shipping a project. How that talent is resourced is far more important.
They’ll get acquired and top people leave as their stock vests or get pushed out because the megacorp wants someone different in the seat.
The people who knew the tech are gone and you’re left with an unmaintainable mess that becomes unreliable and no one knows how to fix it.
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to narrow it down to a single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
It's good enough, but there's no real evidence it's the best, simply the largest.
From my experience in setting up and running support services, not really. It's actually pretty darn quick.
First, the issue is reported to level 1 support, which is a bunch of juniors/drones on call, often offshore (depending on time of the day), who'll run through their scripts and, having determined that it's not in there, escalate to level 2.
Level 2 would be a more experienced developer/support tech, who's seen a thing or two and dealt with serious issues. It will take time to get them online as they're on call but not online at 3am EST; they have to get their cup of joe, turn on the laptop, etc. It would take them a bit to realize that the fecal matter made contact with the rotating blades and escalate to level 3.
Which involves setting up the bridge, waking up the decision makers (in my case it was director and VP level), and finally waking up the guy who either a) wrote all this or b) is one of 5 or 6 people on the planet capable of understanding and troubleshooting the tangled mess.
I do realize that AWS support might be structured quite a bit differently, but still... 75 minutes is pretty good.
Edit: That is not to say that AWS doesn't have a problem with turnover. I'm well aware of their policies and tendency to get rid of people in 2/3 years, partially due to compensation structures where there's a significant bump in compensation - and vesting - once you reach that timeframe.
But in this particular case I don't think support should take much of a blame. The overall architecture on the other hand...
Because if so, this seems like about the most damning thing I could learn from this incident.
Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.
End user tickets can not page engineers but fellow internal teams can. Generally escalation and paging additional help in the event that one can not handle the situation is encouraged and many tenured/senior engineers are very keen to help, even at weird hours.
What are business hours for a global provider of critical tech services?
Alerts and monitoring will result in automatic pages to engineers. There is no human support before it gets escalated.
If an engineer hasn't taken a look within a few minutes, it escalates to their manager, and so on.
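A minimal sketch of the escalation behaviour being described, as one might model it; the chain, roles, and timeouts below are hypothetical, not Amazon's actual paging configuration:

    # Hypothetical paging escalation: if nobody acknowledges in time, the page
    # moves up the chain automatically. Roles and timeouts are made up.
    ESCALATION_CHAIN = [
        ("primary on-call engineer", 5),    # minutes allowed to acknowledge
        ("secondary on-call engineer", 5),
        ("engineering manager", 10),
        ("senior manager / director", 15),
    ]

    def page(acknowledge):
        """Walk the chain until someone acknowledges; return who took the page."""
        for role, minutes in ESCALATION_CHAIN:
            if acknowledge(role, minutes):
                return role
        return "unacknowledged (keep escalating)"

    # Example: the primary sleeps through it, the secondary picks it up.
    print(page(lambda role, minutes: role == "secondary on-call engineer"))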
I have 10 years of experience at Amazon as an L6/L7 SDM, across 4 teams (Games, logistics, Alexa, Prime video). I have also been on a team that caused a sev 1 in the past.
Just capitalised for emphasis, right?
> COE
Center of Excellence? Council of Europe? Still wondering even after Googling.
> SLA
Service Level Agreement. This I knew beforehand.
> SDM
Service Delivery Manager?
I guessed this was an internal Amazon thing so I searched “Amazon COE”
Correction of Error
https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-2...
> SDM
Software Development Manager (from searching Amazon SDM)
https://amazon.jobs/content/en/how-we-hire/sdm-interview-pre...
I tell the juniors it stands for Correction of Employment. Keeps them on their toes.
It is possible with professionals, institutional knowledge, drills, and good tools.
Quite a few of AWS's more mature customers (including my company) were aware within 15 minutes of the incident that Dynamo was failing and hypothesized that it'd taken out other services. Hopefully AWS engineers were at least as fast.
75 minutes to make a decision about how to message that outage is not particularly slow though, and my guess is that this is where most of the latency actually came from.
With that being said, the problem here isn't that it took 75 minutes to find the root cause, but rather that the fix took hours to propagate through the us-east-1 data center network, which is completely unacceptable for industries like healthcare where even small disruptions are a matter of life and death.
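For context on how customers could tell so quickly: a simple external probe against the public service endpoints is enough to distinguish "the name no longer resolves" from "it resolves but requests fail". A rough sketch of that kind of check (the hostnames are real public AWS endpoints, but the probe itself is illustrative, not how any particular customer monitors):

    import socket
    import urllib.request

    # Public regional endpoints to spot-check; handling below is illustrative.
    ENDPOINTS = [
        "dynamodb.us-east-1.amazonaws.com",
        "sqs.us-east-1.amazonaws.com",
    ]

    def probe(host, timeout=3.0):
        try:
            addr = socket.gethostbyname(host)      # does the name still resolve?
        except socket.gaierror as exc:
            return f"DNS failure: {exc}"
        try:
            urllib.request.urlopen(f"https://{host}/", timeout=timeout)
            return f"ok ({addr})"
        except Exception as exc:                   # an HTTP error still proves reachability
            return f"resolved to {addr}, but request failed: {exc}"

    for host in ENDPOINTS:
        print(f"{host}: {probe(host)}")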
One might spend weeks diagnosing a problem if the problem only happens 0.01% of the time, correlated with nothing, goes away when retried, and nobody can reproduce it in a test environment.
But 0.01%-and-it-goes-away-when-retried does not make a high priority incident. High priority incidents tend to be repeatable problems that weren't there an hour ago.
Generally a well designed, properly resourced business critical system will be simple enough and well enough monitored that problems can be diagnosed in a good deal less than 75 minutes - even if rolling out a full fix takes longer.
Of course, I don't know how common well designed, properly resourced business critical systems are.
Even with candidate pools of hundreds of thousands of H1-B engineers and tens of millions of illegal immigrant warehouse workers, there still comes a point where such a big company firing so many people so quickly exhausts all their options.
It reminds me of the Robot Chicken Sketch where Imperial Officers aboard the Death Star all pretend to be force choked to death by Darth Vader so they can avoid getting killed by lightsaber, then come back in under different names in different jobs. It's worse though for Amazon: nobody wants to come back.
https://www.youtube.com/watch?v=fFihTRIxCkg
I asked if I could schedule the interview after my final exams, and his arrogance really showed when not only did he refuse, but he then insisted my exams don't even register on the same scale of importance as the opportunity to work for Amazon.
Somewhat related: a recruiter at Google cold-called me a couple months into my first job out of undergrad back in 2016 and was similarly condescending about "the chance" to work for Google compared to everything else. I already had a low opinion of them when they gave my then-girlfriend an introductory O'Reilly book on Java after she failed their interview.
I regret being born too late to work somewhere like Bell Labs, SGI, or Sun. I had a ton of graybeard wizard coworkers from these places, and they were all a pleasure to learn from and even better friends. For the first 2 years of my first job, every day of work was like walking into the Shire and talking magic spells with 20 Gandalfs.
That job was great until I got put on a team with a guy who was a former middle manager at some IBM-like company and went from being surrounded by people lightyears ahead of me to being surrounded by Dilbert characters. The messed-up part was that it wasn't even punishment. I was rewarded after completing a project with my choice of which team I joined next, and I joined the wrong one. I assumed that joining a new team to utilize this newfangled "cloud computing" thing would be trailblazing, and I didn't do any diligence on who I would work with.
To this day, I still regret not rejoining the first team I worked for, otherwise I would still be at that company and happy about it. Then again, the boredom and discontent while being on that sucky team is the reason I started investing, and now I can buy a house in cash and fund myself to do whatever I want for at least a decade. Hard to complain about the way things turned out.
As a real life Wally I appreciate this comment.
I was Wally for the last 2 1/2 years of that previous job, until I started to realize I was becoming more and more like a Dilbert character myself. Something in my brain just told me it wasn't sustainable, call it fear of God or paranoia, but letting my skills atrophy in a place like that for 20 years didn't seem like it would end well for me.
The only problem was that I stayed so long, and it made me hate software engineering so much that I didn't even want to be a software engineer anymore.
I put up with it just long enough so I could avoid selling stock and drawing cash out of my portfolio, and now I'm back at square one as a post-bacc student getting my applications in order for MD and PhD programs where I'll most certainly wind up drawing hundreds of thousands out of my portfolio to pay rent and eat dinner for about a decade.
It's sad, I really enjoyed systems programming, but it seems like finding interesting systems programming and distributed computing projects that have significant economic value is like squeezing blood out of a stone. Maybe LLMs or future progress in bioinformatics will change that, now that finding ways to shovel a lot of data into and out of GPUs is valuable, but I'm so far into physiology, genetics/proteomics, and cell biology that I'm not sure I would even want to go back.
>I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
That information is under NDA, so it's only natural you aren't privy to it.
[1] https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
494 more comments available on Hacker News