More Than DNS: Learnings From the 14-Hour AWS Outage
Key topics
The post analyzes the 14-hour AWS outage on Oct 20, highlighting the complexity of the failure and the difficulty of troubleshooting it. The discussion centers on the implications of the outage, AWS's reliability, and the trade-offs between speed and reliability in software development.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 9h after posting
- Peak period: 45 comments in Day 3
- Avg / period: 10
Based on 60 loaded comments
Key moments
- Story posted: Oct 27, 2025 at 11:56 AM EDT
- First comment: Oct 27, 2025 at 9:04 PM EDT (9h after posting)
- Peak activity: 45 comments in Day 3 (hottest window of the conversation)
- Latest activity: Nov 10, 2025 at 1:54 PM EST
Not dogfooding is the software and hardware equivalent of the electricity network's "black start": you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine turning until the steam takes up the load and the real generator is able to get volts onto the wire.
Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity and there's less to go wrong, but if we are in 'line up the holes in the cheese' territory, you can always have 'for want of a nail' issues with any mechanism. The Honda generator can have a hole in its petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.
Having such a gobsmackingly massive singular region seems to be working against AWS.
The failures on AWS's side are usually multi-zone failures anyway, so maybe it wouldn't help.
I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway, a meteor could just as easily strike us-secret-1).
To be clear, the above list is of control-plane dependencies on us-east-1. During the incident the service itself may have been fine but could not be (re)configured.
The big one is really Route 53 though. DNS having issues caused a lot of downstream effects since it's used by ~everything to talk to everything else.
Other services include Slack. No, it's not an AWS service, but for companies reliant on it or something like it, it doesn't matter how much you're not in EC2 if an outside service that you rely on goes down. Full list: https://www.reddit.com/r/DigitalMarketing/comments/1oc2jtd/a...
It's a bit of a circular problem, since it's usually useful to colocate in the same region for latency or security (PrivateLink).
Are you saying it's different on land-based steam power plants? Why?
If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.
As for how the generators' fields were started, now that you mention it I'm not sure. We did have emergency diesel generators (and of course shore power when we were pier-side), so maybe those supplied electricity to jump-start the generators. But they were 750 kW generators (upgraded in 1974 from 500 kW generators), so I don't imagine batteries would have sufficed.
But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of their API responses, so if shit's going on, the downstream ones will become more patient and slow down. (And if things are getting cleaned up, the downstream services can speed back up.)
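To make that concrete, here is a minimal sketch of load-signal backpressure, assuming a toy upstream that reports its queue depth in every response and a downstream caller that paces itself accordingly. All names are illustrative, not a real AWS or SDK API.

    import random
    import time

    class UpstreamService:
        """Toy upstream that attaches its current queue depth to every response."""
        def __init__(self):
            self.queue_depth = 0

        def handle(self, request):
            # Simulate fluctuating load on the upstream.
            self.queue_depth = max(0, self.queue_depth + random.randint(-5, 6))
            return {"result": "ok", "load": {"queue_depth": self.queue_depth}}

    class DownstreamClient:
        """Toy downstream caller that slows down when the upstream reports pressure."""
        def __init__(self, service, healthy_depth=10):
            self.service = service
            self.healthy_depth = healthy_depth
            self.delay = 0.0  # seconds to wait between calls

        def call(self, request):
            response = self.service.handle(request)
            depth = response["load"]["queue_depth"]
            if depth > self.healthy_depth:
                # Upstream is struggling: back off multiplicatively, capped at 5s.
                self.delay = min(5.0, self.delay * 2 if self.delay else 0.1)
            else:
                # Upstream looks healthy again: speed back up gradually.
                self.delay *= 0.5
            time.sleep(self.delay)
            return response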
I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.
My point is there's no need to try (and fail) to define some universal backpressure semantics between coupling points; after all, this can be done locally, and even after the fact (every time there's an outage, or better yet every time there's a "near miss") the signal to listen to will show up.
And if not, then not, which means (as you said) that link likely doesn't have this kind of simple semantics, maybe because the nature of the integration is not request-response, or not otherwise structured to provide this apparent legibility, even if it's causally important for downstream.
Simply thinking about this during post-mortems, having metrics available (which is anyway a given in these complex high-availability systems), and having the option in the SDK seems like the way forward.
(yes, I know this is basically the circuit breaker and other Netflix-evangelized ideas with extra steps :))
Also, circuit breakers have issues of their own:
“Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted.” https://aws.amazon.com/builders-library/timeouts-retries-and...
Consider a situation in which all the clients have circuit breakers. All of them enter the open state once the trigger condition is met, which drops request load on the service to zero. Your autoscaler reduces capacity to the minimum level in response. Then, all the circuit breakers are reset to the closed state. Your service then experiences a sudden rush of normal- or above-normal traffic, causing it to immediately exhaust available capacity. It’s a special case of bimodal behavior, which we try to avoid as a matter of sound operational practice.
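A minimal sketch of the locally rate-limited retry idea quoted above, assuming a simple token bucket: retries are free while tokens remain, then capped at the refill rate. The names are illustrative, not an actual AWS SDK interface.

    import time

    class RetryTokenBucket:
        """Local budget for retries: free while tokens remain, then limited to the refill rate."""
        def __init__(self, capacity=10, refill_per_second=0.5):
            self.capacity = capacity
            self.tokens = float(capacity)
            self.refill_per_second = refill_per_second
            self.last_refill = time.monotonic()

        def try_acquire(self):
            # Refill based on elapsed time, then spend one token if available.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_per_second)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def call_with_retries(call, bucket, max_attempts=3):
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                # Retry only if the local bucket permits it; otherwise fail fast
                # instead of amplifying load on an already struggling dependency.
                if attempt == max_attempts - 1 or not bucket.try_acquire():
                    raise

The property of interest is that the worst-case retry rate is bounded locally, without the open/closed mode switch (and the recovery-time rush described above) that a circuit breaker introduces.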
Tangent, but players of Satisfactory might recognize this condition. If your main power infrastructure somehow goes down, you might not have enough power to start the pumps/downstream machines to power up your main power generators. Thus it's common to have some Tier 0 generators stashed away somewhere to kick start the rest of the system (at least before giant building-sized batteries were introduced a few updates ago).
The article brings up a fairly important point about impact reduction for bugs. Critical systems need sanity checks for states and values that should never occur during normal operation, with some corresponding action in case they happen. Endpoints could have had sanity checks for invalid DNS, such as zero IP addresses or broken resolution, and either reverted to the last valid state or fallen back to a predefined emergency system. Either would have reduced the impact.
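A minimal sketch of that kind of sanity check, assuming a resolver wrapper that treats an empty answer as suspect and reverts to the last known good addresses; the names are illustrative.

    import socket

    class GuardedResolver:
        """Refuses to accept an obviously broken DNS answer; falls back to the last good one."""
        def __init__(self):
            self.last_good = {}  # hostname -> list of IP addresses

        def resolve(self, hostname):
            try:
                addresses = [info[4][0] for info in socket.getaddrinfo(hostname, None)]
            except socket.gaierror:
                addresses = []

            if addresses:
                # Answer looks sane: remember it and use it.
                self.last_good[hostname] = addresses
                return addresses

            if hostname in self.last_good:
                # Empty or broken answer: revert to the last known valid state rather
                # than treating "no addresses" as the new truth.
                return self.last_good[hostname]

            raise RuntimeError("no usable addresses for %s and no cached fallback" % hostname)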
The fact that there was an outage is not unexpected… it happens… but all the stumbling and the time it took to get things under control was concerning.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
1. https://aws.amazon.com/message/41926/
The more you have, the faster the backlog grows in case of an outage, so you need longer to process it all once the system comes back online.
Tech departments running around with their hair on fire / always looking busy isn't a look that builds trust.
Some good information in the comments as well.
That's assuming these are actual coworkers, not say a dozen escalating levels of micromanagers, which I agree would be hell. The where isn't really that important in my experience, as long as you've got reliable Internet access.
A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
I think when there is an extended outage it exposes the shortcuts. If you have 100 systems, and one or two can't start fast from zero but are required to get back to running smoothly, you're going to have a longer outage. How would you deal with that? You'd uniformly subject your teams to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, so scaling issues (how do we handle 10x usage growth in the next year and a half? which are the soft spots that will break?) trump cold-start testing. Then you get a cold-start event, with the last one being 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started again.
You'd be unpleasantly surprised, on that point the COE points to the public write-up for details.
us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.
The cost of a workday's worth of downtime is rarely enough to justify the expense of trying to deploy across multiple regions or clouds.
Especially if you are public-facing and not internal. You just go 'well, everyone else was down too because of AWS' and your customers just go 'ah, okay, fair enough'.
The "cold start" analogy resonates. As an indie builder, I'm constantly making decisions that optimize for "ship fast" vs "what happens when this breaks." The reality is: you can't architect for every failure mode when you're a team of 1-2 people.
But here's what's fascinating about this analysis: the recommendation for control theory and circuit breakers isn't just for AWS-scale systems. Even small products benefit from graceful degradation patterns. The difference is—at AWS scale, a metastable failure can cascade for 14 hours. At indie scale, you just... lose customers and never know why.
The talent exodus point is also concerning. If even AWS—with their resources and institutional knowledge—can struggle when senior folks leave, what chance do startups have? This is why I'm increasingly convinced that boring, well-documented tech stacks matter more than cutting-edge ones. When you're solo, you need to be able to debug at 2am without digging through undocumented microservice chains.
Question: For those running prod systems, how do you balance "dogfooding" (running on your own infra) vs "use boring, reliable stuff" (like AWS for your control plane)? Is there a middle ground?
My personal crusade is trying to convince anyone who will listen to me that the value of a senior employee in any role isn't in their years of experience. It's in their years of experience with your system. More companies should do whatever it takes to retain employees long term. No matter how good your documentation procedures are, no matter how thorough your knowledge base, no matter how many knowledge transfer sessions you have, your long-tenured employees know more than they will ever be able to document, and they've forgotten how much of it they know.
I have never seen a team lose a (productive) long-standing member that wasn't still suffering from that loss years later. We like to pretend that software, and especially reusable libraries, components, and frameworks, makes developers interchangeable widgets: substitutable cogs that can simply be swapped out by having enough documentation. But writing software is converting a mental state into a physical one. And reading code is trying to read someone else's mind (incidentally, this is part of why two excellent developers can find themselves unable to work together; they just don't think enough alike). When you lose those senior people, you lose the person who was around for that one outage 8 years ago, and who, if they were around for the current outage, would have a memory float up from the forgotten depths that causes them to check on the thing no one (yet) suspects.
This isn't to say you shouldn't document things, and that you shouldn't work to spread knowledge around. The "bus factor" is real. People will retire. People will leave for reasons you have no control over. So document everything you can, build good knowledge transfer systems. But always remember that in the end, they will only cover a portion of the knowledge your employees have. And the longer your employees are around, the more knowledge they will have. And unless you are willing to pay your people to document everything they know full time (and I assure you, neither you nor your customers are willing to do that), they will always know more than is documented.
Nor, despite my analogies, does this apply only to software developers. One team I worked on lost a long-term "technical writer" (it wasn't their full-time work, but it was a role they filled for that team). This person was the best person I've ever known at documenting things, and still the team reeled at losing them. Even with all the documentation they left behind, knowing what was documented, where, and how recently, along with all sorts of metadata that wasn't captured in the documentation itself, went with them when they left. Years later their documentation was still useful, but no one was anywhere near their level of encyclopedic knowledge of it all.
To use an analogy, the New York Yankees perform the best when they focus on building up their roster from their farm system. Long tenured players mentor the new ones, and a cohesive team forms that operates as more than the sum of its parts. They perform the worst when success and money take control and they try to hire "all-stars" away from other teams. The intangibles just aren't there, or aren't as strong. A good company is the same. A long term team built from the ground up is worth far more than a similarly "experienced" team made up of outside hires.
Talent exodus is concerning, so don't give your people a reason to leave. You can't stop it all, but you can do a lot to avoid self-inflicted wounds.
Anyone using AWS isn’t concerned with the physical :)
At the end of the day, AWS is still physically made up of physical computers that still have to obey the laws of physics. If you max out the bandwidth of the connection going in or out of, say, an RDS system (which is available via NDA if you are spending enough money with them), suddenly the physical is very important.
The only exception is if you are doing something where your tech stack is integral to your product (e.g. you need crazily high performance or scale or something from the get go).
Split out the front ends into separate services but leave the back end as a monolith. Just try not to get logically separate parts too entangled, so you have a decent chance of separating them later, if and when "later" arrives.
Yeah, right. I'm surprised that anyone involved with software engineering can be surprised by this. I would argue that there are many, if not infinitely many, similar bugs out there. It's just that the right conditions for them to show up haven't been met yet.
Root cause analysis is just like any other tool. Failure to precisely define the nature of the problem is what usually turns RCA into a wild goose chase. Consider the following problem statements:
"The system feels yucky to use. I don't like it >:("
"POST /reports?option=A is slow around 3pm on Tuesdays"
One of these is more likely to provide a useful RCA that proceeds and halts in a reasonable way.
"AWS went down"
Is not a good starting point for a useful RCA session. "AWS" and "down" being the most volatile terms in this context. What parts of AWS? To what extent were they down? Is the outage intrinsic to each service or due to external factors like DNS?
"EC2 instances in region X became inaccessible to public internet users between Y & Z"
This is the level of granularity I would structure my PPTX around if I were working at AWS. You can determine that there was a common thread after the fact; put it in your conclusion / next-steps slide. Starting hyper-specific means that you are less likely to get distracted and can arrive at a good answer much faster. Aggregating the conclusions of many reports, you could then prioritize the strategy for preventing this in the future.
It sounds like fixing broken leases takes a lot more work than renewing functional leases? Funnily enough, there is already an AWS blog post about specifically preventing these types of issues: https://aws.amazon.com/builders-library/reliability-and-cons...
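A minimal sketch of the constant-work idea from that post, assuming a lease manager that re-asserts every lease on every cycle instead of doing extra, unpredictable repair work only when leases break; names are illustrative.

    import time

    def constant_work_loop(leases, renew, interval_seconds=5):
        """Re-assert every lease on every pass, healthy or not.

        The cost per cycle is the same whether zero or all leases are broken, so an
        outage does not create a sudden spike of repair work for the lease manager.
        """
        while True:
            for lease in leases:
                renew(lease)  # idempotent: unconditionally pushes the desired state
            time.sleep(interval_seconds)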