More Than DNS: Learnings From the 14-Hour AWS Outage
Key topics
The post analyzes the 14-hour AWS outage on Oct 20, highlighting the complexity of the failure and the difficulty of troubleshooting it. The discussion centers on the implications of the outage, AWS's reliability, and the trade-offs between speed and reliability in software development.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 9h after posting
- Peak period: 45 comments in Day 3
- Avg / period: 10
Based on 60 loaded comments
Key moments
- Story posted: Oct 27, 2025 at 11:56 AM EDT
- First comment: Oct 27, 2025 at 9:04 PM EDT (9h after posting)
- Peak activity: 45 comments in Day 3 (hottest window of the conversation)
- Latest activity: Nov 10, 2025 at 1:54 PM EST
Not dogfooding is the software and hardware equivalent of the electricity network's "black start": you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine turning until the steam takes up the load and the real generator is able to get volts onto the wire.
Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity and there's less to go wrong, but if we are in 'line up the holes in the cheese' territory, you can always have 'for want of a nail' issues with any mechanism. The Honda generator can have a hole in its petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.
Having such a gobsmackingly massive singular region seems to be working against AWS.
The failures on AWS's side are usually multi-zone failures anyway, so maybe it wouldn't help.
I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway, a meteor could just as easily strike us-secret-1).
To be clear, the above list is of control-plane dependencies on us-east-1. During the incident the service itself may have been fine but could not be (re)configured.
The big one is really Route 53 though. DNS having issues caused a lot of downstream effects since it's used by ~everything to talk to everything else.
Other services include Slack. No, it's not an AWS service, but for companies reliant on it or something like it, it doesn't matter how much you're not in EC2 if an outside service that you rely on goes down. Full list: https://www.reddit.com/r/DigitalMarketing/comments/1oc2jtd/a...
It's a bit of a circular problem, since it's usually useful to colocate in the same region for latency or security (PrivateLink).
Are you saying it's different on land-based steam power plants? Why?
If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.
As for how the generators' fields were started, now that you mention it I'm not sure. We did have emergency diesel generators (and of course shore power when we were pier-side), so maybe those supplied electricity to jump-start the generators. But they were 750 kW generators (upgraded in 1974 from 500 kW generators), so I don't imagine batteries would have sufficed.
But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of their API responses, so if shit's going on, the downstream ones will become more patient and slow down. (And if things are getting cleaned up, the downstream services can speed back up.)
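To make that concrete, here is a minimal sketch of load-signal backpressure, assuming a toy upstream that reports its queue depth in every response and a downstream caller that paces itself accordingly. All names are illustrative, not a real AWS or SDK API.

    import random
    import time

    class UpstreamService:
        """Toy upstream that attaches its current queue depth to every response."""
        def __init__(self):
            self.queue_depth = 0

        def handle(self, request):
            # Simulate fluctuating load on the upstream.
            self.queue_depth = max(0, self.queue_depth + random.randint(-5, 6))
            return {"result": "ok", "load": {"queue_depth": self.queue_depth}}

    class DownstreamClient:
        """Toy downstream caller that slows down when the upstream reports pressure."""
        def __init__(self, service, healthy_depth=10):
            self.service = service
            self.healthy_depth = healthy_depth
            self.delay = 0.0  # seconds to wait between calls

        def call(self, request):
            response = self.service.handle(request)
            depth = response["load"]["queue_depth"]
            if depth > self.healthy_depth:
                # Upstream is struggling: back off multiplicatively, capped at 5s.
                self.delay = min(5.0, self.delay * 2 if self.delay else 0.1)
            else:
                # Upstream looks healthy again: speed back up gradually.
                self.delay *= 0.5
            time.sleep(self.delay)
            return response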
I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.
My point is there's no need to try (and fail) to define some universal backpressure semantics between coupling points; after all, this can be done locally, and even after the fact (every time there's an outage, or better yet every time there's a "near miss") the signal to listen to will show up.
And if not, then not, which means (as you said) that link likely doesn't have this kind of simple semantics, maybe because the nature of the integration is not request-response, or not otherwise structured to provide this apparent legibility, even if it's causally important for downstream.
Simply thinking about this during post-mortems, having metrics available (which is anyway a given in these complex high-availability systems), and having the option in the SDK seems like the way forward.
(yes, I know this is basically the circuit breaker and other Netflix-evangelized ideas with extra steps :))
Also, circuit breakers have issues of their own:
“Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted.” https://aws.amazon.com/builders-library/timeouts-retries-and...
Consider a situation in which all the clients have circuit breakers. All of them enter the open state once the trigger condition is met, which drops request load on the service to zero. Your autoscaler reduces capacity to the minimum level in response. Then, all the circuit breakers are reset to the closed state. Your service then experiences a sudden rush of normal- or above-normal traffic, causing it to immediately exhaust available capacity. It’s a special case of bimodal behavior, which we try to avoid as a matter of sound operational practice.
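A minimal sketch of the locally rate-limited retry idea quoted above, assuming a simple token bucket: retries are free while tokens remain, then capped at the refill rate. The names are illustrative, not an actual AWS SDK interface.

    import time

    class RetryTokenBucket:
        """Local budget for retries: free while tokens remain, then limited to the refill rate."""
        def __init__(self, capacity=10, refill_per_second=0.5):
            self.capacity = capacity
            self.tokens = float(capacity)
            self.refill_per_second = refill_per_second
            self.last_refill = time.monotonic()

        def try_acquire(self):
            # Refill based on elapsed time, then spend one token if available.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_per_second)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def call_with_retries(call, bucket, max_attempts=3):
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                # Retry only if the local bucket permits it; otherwise fail fast
                # instead of amplifying load on an already struggling dependency.
                if attempt == max_attempts - 1 or not bucket.try_acquire():
                    raise

The property of interest is that the worst-case retry rate is bounded locally, without the open/closed mode switch (and the recovery-time rush described above) that a circuit breaker introduces.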
Tangent, but players of Satisfactory might recognize this condition. If your main power infrastructure somehow goes down, you might not have enough power to start the pumps/downstream machines to power up your main power generators. Thus it's common to have some Tier 0 generators stashed away somewhere to kick start the rest of the system (at least before giant building-sized batteries were introduced a few updates ago).
The article brings up a fairly important point about impact reduction for bugs. Critical systems need sanity checks for states and values that should never occur during normal operation, with some corresponding action in case they happen. Endpoints could have had sanity checks for invalid DNS, such as zero IP addresses or broken resolution, and either reverted to the last valid state or fallen back to a predefined emergency system. Either would have reduced the impact.
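A minimal sketch of that kind of sanity check, assuming a resolver wrapper that treats an empty answer as suspect and reverts to the last known good addresses; the names are illustrative.

    import socket

    class GuardedResolver:
        """Refuses to accept an obviously broken DNS answer; falls back to the last good one."""
        def __init__(self):
            self.last_good = {}  # hostname -> list of IP addresses

        def resolve(self, hostname):
            try:
                addresses = [info[4][0] for info in socket.getaddrinfo(hostname, None)]
            except socket.gaierror:
                addresses = []

            if addresses:
                # Answer looks sane: remember it and use it.
                self.last_good[hostname] = addresses
                return addresses

            if hostname in self.last_good:
                # Empty or broken answer: revert to the last known valid state rather
                # than treating "no addresses" as the new truth.
                return self.last_good[hostname]

            raise RuntimeError("no usable addresses for %s and no cached fallback" % hostname)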
The fact that there was an outage is not unexpected… it happens… but all the stumbling and the time it took to get things under control was concerning.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
1. https://aws.amazon.com/message/41926/
The more you have, the faster the backlog grows in case of an outage, so you need longer to process it all once the system comes back online.
Tech departments running around with their hair on fire / always looking busy isn't a look that builds trust.
Some good information in the comments as well.
That's assuming these are actual coworkers, not say a dozen escalating levels of micromanagers, which I agree would be hell. The where isn't really that important in my experience, as long as you've got reliable Internet access.
A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
I think when there is an extended outage it exposes the shortcuts. If you have 100 systems, and one or two can't start fast from zero but are required to get back to running smoothly, you're going to have a longer outage. How would you deal with that? You'd uniformly subject your teams to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, so scaling issues (how do we handle 10x usage growth in the next year and a half? which are the soft spots that will break?) trump cold-start testing. Then you get a cold-start event, with the last one being 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started again.
You'd be unpleasantly surprised, on that point the COE points to the public write-up for details.
us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.
The cost of a workday's worth of downtime is rarely enough to justify the expense of trying to deploy across multiple regions or clouds.
Especially if you are public-facing and not internal. You just go 'well, everyone else was down too because of AWS' and your customers just go 'ah, okay, fair enough'.
The "cold start" analogy resonates. As an indie builder, I'm constantly making decisions that optimize for "ship fast" vs "what happens when this breaks." The reality is: you can't architect for every failure mode when you're a team of 1-2 people.
But here's what's fascinating about this analysis: the recommendation for control theory and circuit breakers isn't just for AWS-scale systems. Even small products benefit from graceful degradation patterns. The difference is—at AWS scale, a metastable failure can cascade for 14 hours. At indie scale, you just... lose customers and never know why.
The talent exodus point is also concerning. If even AWS—with their resources and institutional knowledge—can struggle when senior folks leave, what chance do startups have? This is why I'm increasingly convinced that boring, well-documented tech stacks matter more than cutting-edge ones. When you're solo, you need to be able to debug at 2am without digging through undocumented microservice chains.
Question: For those running prod systems, how do you balance "dogfooding" (running on your own infra) vs "use boring, reliable stuff" (like AWS for your control plane)? Is there a middle ground?
My personal crusade is trying to convince anyone who will listen to me that the value of a senior employee in any role isn't in their years of experience. It's in their years of experience with your system. More companies should do whatever it takes to retain employees long term. No matter how good your documentation procedures are, no matter how thorough your knowledge base, no matter how many knowledge transfer sessions you have, your long-tenured employees know more than they will ever be able to document, and they've forgotten how much of it they know.
I have never seen a team lose a (productive) long-standing member that wasn't still suffering from that loss years later. We like to pretend that software, and especially reusable libraries, components, and frameworks, makes developers interchangeable widgets: substitutable cogs that can simply be swapped out by having enough documentation. But writing software is converting a mental state into a physical one. And reading code is trying to read someone else's mind (incidentally, this is part of why two excellent developers can find themselves unable to work together; they just don't think enough alike). When you lose those senior people, you lose the person who was around for that one outage 8 years ago, and who, if they were around for the current outage, would have a memory float up from the forgotten depths that causes them to check on the thing no one (yet) suspects.
This isn't to say you shouldn't document things, and that you shouldn't work to spread knowledge around. The "bus factor" is real. People will retire. People will leave for reasons you have no control over. So document everything you can, build good knowledge transfer systems. But always remember that in the end, they will only cover a portion of the knowledge your employees have. And the longer your employees are around, the more knowledge they will have. And unless you are willing to pay your people to document everything they know full time (and I assure you, neither you nor your customers are willing to do that), they will always know more than is documented.
Nor, despite my analogies, does this apply only to software developers. One team I worked on lost a long-term "technical writer" (it wasn't their full-time work, but it was a role they filled for that team). This person was the best person I've ever known at documenting things, and still the team reeled at losing them. Even with all the documentation they left behind, knowing what was documented, where, and how recently, along with all sorts of metadata that wasn't captured in the documentation itself, went with them when they left. Years later their documentation was still useful, but no one was anywhere near their level of encyclopedic knowledge of it all.
To use an analogy, the New York Yankees perform the best when they focus on building up their roster from their farm system. Long tenured players mentor the new ones, and a cohesive team forms that operates as more than the sum of its parts. They perform the worst when success and money take control and they try to hire "all-stars" away from other teams. The intangibles just aren't there, or aren't as strong. A good company is the same. A long term team built from the ground up is worth far more than a similarly "experienced" team made up of outside hires.
Talent exodus is concerning, so don't give your people a reason to leave. You can't stop it all, but you can do a lot to avoid self-inflicted wounds.
Anyone using AWS isn’t concerned with the physical :)
At the end of the day, AWS is still physically made up of physical computers that still have to obey the laws of physics. If you max out the bandwidth of the connection going in or out of, say, an RDS system (which is available via NDA if you are spending enough money with them), suddenly the physical is very important.
The only exception is if you are doing something where your tech stack is integral to your product (e.g. you need crazily high performance or scale or something from the get go).
Split out the front ends into separate services but leave the back end as a monolith. Just try not to get logically separate parts too entangled, so you have a decent chance of separating them later, if and when "later" arrives.
Yeah, right. I'm surprised that anyone involved with software engineering can be surprised by this. I would argue that there are many, if not infinitely many, similar bugs out there. It's just that the right conditions for them to show up haven't been met yet.
Root cause analysis is just like any other tool. Failure to precisely define the nature of the problem is what usually turns RCA into a wild goose chase. Consider the following problem statements:
"The system feels yucky to use. I don't like it >:("
"POST /reports?option=A is slow around 3pm on Tuesdays"
One of these is more likely to provide a useful RCA that proceeds and halts in a reasonable way.
"AWS went down"
Is not a good starting point for a useful RCA session. "AWS" and "down" being the most volatile terms in this context. What parts of AWS? To what extent were they down? Is the outage intrinsic to each service or due to external factors like DNS?
"EC2 instances in region X became inaccessible to public internet users between Y & Z"
This is the level of granularity I would structure my PPTX around if I were working at AWS. You can determine that there was a common thread after the fact; put it in your conclusion / next-steps slide. Starting hyper-specific means that you are less likely to get distracted and can arrive at a good answer much faster. Aggregating the conclusions of many reports, you could then prioritize the strategy for preventing this in the future.
It sounds like fixing broken leases takes a lot more work than renewing functional leases? Funnily enough, there is already an AWS blog post about specifically preventing these types of issues: https://aws.amazon.com/builders-library/reliability-and-cons...
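A minimal sketch of the constant-work idea from that post, assuming a lease manager that re-asserts every lease on every cycle instead of doing extra, unpredictable repair work only when leases break; names are illustrative.

    import time

    def constant_work_loop(leases, renew, interval_seconds=5):
        """Re-assert every lease on every pass, healthy or not.

        The cost per cycle is the same whether zero or all leases are broken, so an
        outage does not create a sudden spike of repair work for the lease manager.
        """
        while True:
            for lease in leases:
                renew(lease)  # idempotent: unconditionally pushes the desired state
            time.sleep(interval_seconds)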