
How when AWS was down, we were not

195 points
74 comments

Mood: thoughtful
Sentiment: positive
Category: tech
Key topics: high availability, cloud infrastructure, AWS, SRE
Debate intensity: 60/100

The article discusses how Authress achieved high availability during an AWS outage, sparking a thoughtful discussion on the strategies and trade-offs involved in building resilient systems.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment: 1h after posting
Peak period: 70 comments (Day 1)
Avg / period: 37
Comment distribution: 74 data points (based on 74 loaded comments)

Key moments

  1. Story posted: 11/17/2025, 5:07:17 PM (2d ago)
  2. First comment: 11/17/2025, 6:37:09 PM (1h after posting)
  3. Peak activity: 70 comments in Day 1 (the hottest window of the conversation)
  4. Latest activity: 11/19/2025, 9:01:11 AM (10h ago)


Discussion (74 comments)
tptacek
2d ago
1 reply
This is a rare case where the original bait-y title is probably better than the de-bait-ified title, because the actual article is much less of a brag and much more of an actual case study.
dang
2d ago
2 replies
Re-how'd, plus I've resisted the temptation to insert a comma that feels missing to me.
wparad
1d ago
1 reply
I spent a long time, trying to figure out, what the title of the article, should be. I'm terrible at SEO and generating click-bait titles, it is unfortunately, what, it, is.
dang
1d ago
You did fine! Your title is clear and I was just being playful.
tptacek
2d ago
"How?! When AWS was down: we were not!"
pinkmuffinere
2d ago
3 replies
> During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there

> [Our service can only go down] five minutes and 15 seconds per year.

I don't have much experience in this area, so please correct me if I'm mistaken:

Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
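
For context, a downtime budget of five minutes and 15 seconds per year is just the arithmetic of a 99.999% availability target (assuming a 365.25-day year):

\[
(1 - 0.99999) \times 365.25 \times 24 \times 60 \approx 5.26 \text{ minutes} \approx 5 \text{ min } 15 \text{ s per year}
\]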

loloquwowndueo
2d ago
1 reply
Depends on what the SLA phrasing is - us-east-1 affinity is a requirement put forth by some customers so I would totally expect the SLA to specifically state it’s subject to us-east-1 availability. Essentially these customers are opting out of Authress’s fault-tolerant infrastructure and the SLA should be clear about that.
dylan604
2d ago
As TFA states, we have to offer services in that region because that's where some users are as well. However, the core of services are not in that region. I have also suggested when the time comes for offering SLAs, that there is explicit wording exempting us-east-1.
wparad
1d ago
1 reply
It's a good point.

We don't actually commit to running infrastructure in one specific AWS region. Customers can't request that the infra runs exactly in us-east-1, but they can request that it runs in "Eastern United States". The problem is that with scenarios that might require VPC peering or low-latency connections, we can't just run the infrastructure in us-east-2 and commit to never having a problem. And by the same token, what happens if us-east-2 has an incident?

We have to assume that our customers need it in a relatively close region, and that at the same time need to plan for the contingency that region can be down.

Then there are the customer's users to think of as well. In some cases, those users might be globally dispersed, even if the customer's infrastructure is in only one major location. So while it would be nice to claim "well, you were also down at that moment", in practice the customer's users will notice, and realistically, we want to make sure we aren't impeding remediation on their side.

That is, even if a customer says "use us-east-1", and then us-east-1 is down, it can't look that way to the customer. This gets a lot more complicated when the services we provide may be impacted differently. Consider us-east-1 DynamoDB being down while everything else is still working. Partial failure modes are much harder to deal with.

macintux
1d ago
> Partial failure modes are much harder to deal with.

Truer words were never spoken.

PaulRobinson
2d ago
The bulk of the article discusses their failover strategy, where they detect failures in a region and how they route requests to a backup region, and how to deal with data consistency and cost issues arising from that.
sharklasers123
2d ago
3 replies
Is there not an inherent risk using an AWS service (Route 53) to do the health check? Wouldn’t it make more sense to use a different cloud provider for redundancy?
indigodaddy
1d ago
1 reply
Had the same thought, eg if things are really down can it even do the check etc
hnuser123456
1d ago
Ask some friends and family if you can install an RPi on their home network that monitors your service.
wparad
1d ago
1 reply
If the check can't be done, then everything stays stable, so I'm guessing the question is, "What happens if Route 53 does the check and incorrectly reports the result?"

In that case, no matter what we are using there is going to be a critical issue. I think the best I could suggest at that point would be to have records in your zone that round robin different cloud providers, but that comes with its own challenges.

I believe there are some articles sitting around regarding how AWS plans for failure and the fallback mechanism actually reduces load on the system rather than makes it worse. I think it would require in-depth investigation on the expected failover mode to have a good answer there.

For instance, just to make it more concrete, what sort of failure mode are you expecting to happen with the Route 53 health check? Depending on that there could be different recommendations.

indigodaddy
1d ago
1 reply
Have you considered the scenario of "everything is so dead in AWS" that the check doesn't happen, plus the backends are dead too (this is assuming the backend services live in AWS as well)? But I'd guess in that case you'd know quickly enough from supplementary alerting (you guys don't seem the type to not have some sort of awesome monitoring in place), and you have a different/worse DR problem on your hands.

As far as the OP's point though, I'm going to probably assume that the health checks need to stay within/from AWS because 3rd party health checks could taint/dilute the point of the in-house AWS HC service to begin with.

wparad
1d ago
2 replies
I think there are two schools of thought on "AWS is totally dead everywhere":
* It is never going to happen, due to the way AWS is designed (or at least the way it's described to us, which explains why it is so hard to execute actions across regions).
* It will happen, but then everything else is going to be dead too, so what's the point?

One problem we've run into, the "DNS is a single point of failure" issue, is that there isn't a clear best strategy for "failing over to a different cloud at the DNS routing level."

I'm not the foremost expert when it comes to ASNs and BGP, but from my understanding that would require some multi-cloud collaboration to get multiple CDNs to still resolve, something that feels like it would require both multiple levels of physical infrastructure and significant cost to implement correctly, compared to the ROI for our customers.

There's a corollary here for me, which is: stay as simple as possible while still achieving the result. Maybe there is a multi-cloud strategy, but the strategies I've seen still rely on having the DNS zone in one provider that fails over or round-robins to specific infra in specific locations.

Third-party health checks are less a problem of "tainting" and more one of further complications: the more complexity you add to resolving your real state, the harder it is to get it right.

For instance, one thing we keep going back and forth on is "After the incident is over, is there a way for us to stay failed-over and not automatically fail back".

And the answer for us so far is "not really". There are a lot of bad options, which all could have catastrophic impacts if we don't get it exactly correct, and haven't come with significant benefits, yet. But I like to think I have an open mind here.

parliament32
1d ago
There are good options if you're willing to pay for them, but they have nothing to do with DNS. You will never get DNS TTLs low enough (and respected) to prevent a multi-minute service interruption in cases like these.

Proper HA is owning your own IP space and anycast advertising it from multiple IXes/colos/clouds to multiple upstreams / backbone networks. BGP hold times are like a dead man's switch and will ensure traffic stops being routed in that direction within a few seconds in case of a total outage, plus your own health-automation should disable those advertisements when certain things happen. Of course, you need to deal with the engineering complexity of your traffic coming into multiple POPs at once, and it won't be cheap at all (to start, you're looking at ~10kUSD capex for a /24 of IP space, plus whatever the upstreams charge you monthly), but it will be very resilient to pretty much any single point of failure, including AWS disappearing entirely.

toast0
1d ago
It's painful, but you can split your DNS across multiple providers. It's not usually done other than during migrations, but if you put two NS names from providerA and two from providerB, you'll get a mix of resolution. If either provider fails and stops responding, most resolvers will use the other provider. If one provider fails and returns bad data (including errors), the redundancy doesn't really help --- you probably went from a full outage that's easy to diagnose to a partial outage that's much harder to diagnose; and if both providers are equally reliable, you increased your chances of having an outage.
kondro
1d ago
While there appears to be some us-east-1 SPoF for Route 53 updates (as shown recently), the actual health checks themselves occur in up to 8 different regions [1] with an 18%[2] agreement of failure required to initiate a failover.

AWS has very good isolation between regions and, while it relies on us-east-1 for control plane updates to Route 53, health checks and failovers are data plane operations[3] and aren't affected by a us-east-1 outage.

Relying on a single provider always seems like a risk, but the increased complexity of designing systems for multi-cloud will usually result in an increased risk of failure, not a decrease.

1. us-east-1, us-west-1, us-west-2, eu-west-1, ap-southeast-1, ap-southeast-2, ap-northeast-1 and sa-east-1 which defaults to all of them.

2. https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dn...

3. https://aws.amazon.com/blogs/networking-and-content-delivery...
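
To make the failover mechanism being discussed concrete, here is a minimal sketch of a Route 53 health check plus a primary/secondary failover record pair, written as TypeScript that emits a CloudFormation fragment. The hostnames, zone, and thresholds are placeholders for illustration, not Authress's actual setup:

  // Sketch: emit a CloudFormation fragment for a DNS failover pair.
  const template = {
    Resources: {
      PrimaryHealthCheck: {
        Type: "AWS::Route53::HealthCheck",
        Properties: {
          HealthCheckConfig: {
            Type: "HTTPS",
            FullyQualifiedDomainName: "api-primary.example.com",
            ResourcePath: "/health",
            RequestInterval: 30, // seconds between checks from each checker
            FailureThreshold: 3, // consecutive failures before "unhealthy"
          },
        },
      },
      PrimaryRecord: {
        Type: "AWS::Route53::RecordSet",
        Properties: {
          HostedZoneName: "example.com.",
          Name: "api.example.com.",
          Type: "CNAME",
          TTL: "60",
          ResourceRecords: ["api-primary.example.com"],
          Failover: "PRIMARY",
          SetIdentifier: "primary",
          HealthCheckId: { Ref: "PrimaryHealthCheck" },
        },
      },
      SecondaryRecord: {
        Type: "AWS::Route53::RecordSet",
        Properties: {
          HostedZoneName: "example.com.",
          Name: "api.example.com.",
          Type: "CNAME",
          TTL: "60",
          ResourceRecords: ["api-secondary.example.com"],
          Failover: "SECONDARY",
          SetIdentifier: "secondary",
        },
      },
    },
  };

  console.log(JSON.stringify(template, null, 2));

When the health check reports the primary unhealthy, Route 53 starts answering queries for api.example.com with the secondary record; as noted above, that evaluation happens in the data plane rather than the us-east-1 control plane.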

indigodaddy
2d ago
2 replies
Back in the day (10-12 years ago) at a telecom/cable we accomplished this with F5 Big IP GSLB DNS (and later migrated to A10's GSLB equivalent devices) as the auth DNS server for services/zones that required or were suitable for HA. (I can't totally remember but I'm guessing we must have had a pretty low TTL for this).

Had no idea that Route 53 had this sort of functionality

indigodaddy
1d ago
1 reply
Speaking of F5 Big IP DNS devices, does anyone know of any auth DNS software solution for GSLB/health checking for DNS (I guess excluding Route 53 or other cloud/SaaS). Last I looked all I could find was the polaris-gslb addon for PowerDNS, but the GitHub for that has no activity in 8 years.
loadbalancer
1d ago
We've been using Polaris for the Loadbalancer.org GSLB for a few years now, and we've found it fast and stable. It's so simple it's never needed any updates:

https://www.loadbalancer.org/blog/gslb-why-global-server-loa...

Although we did patch a dynamic health check a while back, which will be open source of course. But I'll get someone to check if we actually gave it back to the community or not...

wparad
1d ago
Maybe I should have titled the article "AWS Route53 HealthChecks are amazing" :)
iso1631
1d ago
2 replies
I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200ms?

(For what it's worth, for some of my services, 200ms is certainly an impact, not as bad as a 2-second outage but still noticeable and reportable)

smadge
1d ago
1 reply
I think a lot of web services talk about reliability in terms of uptime (e.g. down for less than 5 minutes a year) but in reality operate on failure ratios (less than 0.001% of requests to our service fail).
wparad
1d ago
100%
wparad
1d ago
Good catch. The truth is, while we track downtime for incident reporting, it's much more correct to actually be tracking the number of requests that result in a failure. Our SLAs are based on request volume, and not specifically time. Most customers don't have perfect sustained usage. Being down when they aren't running is irrelevant to everyone.

This is where the grey failures can come into play. It's really hard to tell, often impossible to know what the impact of an incident is to a customer, even if you know you are having an incident, without them telling you.

In order to know that you are "down", our edge of the HTTP request path would need to be able to track requests. For us that is CloudFront, but if there is an issue before that, at DNS, at the network level, etc., we just can't know what the actual impact is.

As far as measuring how you are down: we can pretty accurately know the list of failures that are happening (when we can know) and what the results are.

That's because most components are behind CloudFront in any case. And if CloudFront isn't having a problem, we'll have telemetry that tells us what the HTTP request/response status codes and connection completions look like. Then it's a matter of measuring from our first detection to the actual remediation being deployed (assuming there is one).

Another thing that helps here is that we have multiple other products that also use Authress, and we can run technology in other regions that can report this information, for those accounts (obviously can't be for all customers), which can help us identify with additional accuracy, but is often unnecessary.

wparad
1d ago
1 reply
Hey, I wrote that article!

I'll try to add comments and answer questions where I can.

- Warren

ckozlowski
1d ago
1 reply
Hi Warren! I'm Chris, and I'm with AWS, where among other things, I work on the Well-Architected Framework. Would you be willing to talk with us? You can reach me at kozlowck@amazon.com. Thanks!

Edit: This is a fantastic write-up by the way!

wparad
1d ago
Thank you!
rdoherty
1d ago
2 replies
This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.

I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!

evanmoran
1d ago
3 replies
IaC is definitely a failure point, but the manual alternative is much worse! I’ve had a lot of benefit from using Pulumi, simply because the code can be more compact than the Terraform HCL was.

For example, for the failover regions (from the article) you could make a Pulumi function that parameterizes only the n things that are different per failover env and guarantee / verify the scripts are nearly identical. Of course, many people use modules / Terragrunt for similar reasons, but it ends up being quite powerful.
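
A rough illustration of that parameterization idea in plain TypeScript (the field names here are invented for the example, not taken from the article or from Pulumi's API):

  // Only the handful of per-environment differences are passed in;
  // everything else comes from a shared, identical base.
  interface FailoverEnvParams {
    region: string;       // e.g. "us-east-2"
    isPrimary: boolean;   // primary vs. standby role
  }

  const sharedConfig = {
    runtime: "nodejs20.x",
    memoryMb: 256,
    logRetentionDays: 30,
  };

  function failoverEnv(params: FailoverEnvParams) {
    return { ...sharedConfig, ...params };
  }

  // The environments are guaranteed to differ only in the declared params.
  const environments = [
    failoverEnv({ region: "us-east-1", isPrimary: true }),
    failoverEnv({ region: "us-east-2", isPrimary: false }),
  ];

  console.log(environments);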

xyzzy123
1d ago
2 replies
I actually like terraform for its LACK of power (tho yeah these days when I have a choice I use a lot of small states and orchestrate with tg).

Pulumi or CDK are for sure more powerful (and great tools) but when I need to reach for them I also worry that the infra might be getting too complex.

yearolinuxdsktp
1d ago
3 replies
IMO Pulumi and CDK are an opportunity to simplify your infra by capturing what you’re working with using higher-level abstractions and by allowing you to refactor and extract reusable pieces at any level. You can drive infra definitions easily from typed data structures, you can add conditionals using natural language syntax, and stop trying to program in a configuration language.

You still end up having IaC.

darkwater
1d ago
> and stop trying to program in a configuration language

Many people don't program with a configuration language like HCL. We use it as what it is - a DSL - that covers its main use case in an elegant manner. Maybe I never touched complex enough infra that twists a DSL into a general-use language, but in my experience there are simply no real benefits when using something like CDK (I never tried Pulumi to be fair).

andrewaylett
1d ago
That's how we use CDK. Our CDK (in general) creates CloudFormation which we then deploy. As far as the tooling which we have for IaC is concerned, it's indistinguishable from hand-written CloudFormation — but we're able to declare our intent at a higher level of abstraction.
xyzzy123
1d ago
Absolutely, the best case is it's much better, safer, readable etc. However, the worst case is also worse, and the things I end up having to provide support for tend to cluster on one side of that scale :/
wparad
1d ago
Agreed, it is much too easy to fall into bad habits. The whole goal of OpenTofu is declarative infrastructure. With CDK and Pulumi, it's very easy to end up in a place where you lose that.

But if you need to do something in a particular way, the tools should never be an obstacle.

spyspy
1d ago
If you do use Terraform, for the love of god do NOT use Terraform Cloud. It's up there with GitHub in the list of least reliable cloud vendors. I always have a "break glass" method of deploying from my work machine for that very reason.
wparad
1d ago
I think some people are going to scream when I say this, but we're using mostly CloudFormation templates.

We don't use the CDK because it introduces complexity into the system.

However, to make CloudFormation usable, our tooling is written in TypeScript and generates the templates on the fly. I know that sounds like the CDK, but given the size of our stacks, adding an additional technology in doesn't make things simpler, and there is a lot of waste that can be removed by using a programming language rather than raw JSON/YAML.

There are cases where we have some OpenTofu, but for infrastructure resources that are customer specific, we have deployments that run in TypeScript using the AWS SDK for JavaScript.

It would be nice if we could make a single change and have it roll out everywhere. But the reality is that there are many more states in play than what is represented by a single state file, especially when it comes to interactions between our infra, our customers' configuration, the history of requests to change the configuration, and resources with mutable states.

One example of that is AWS certificates. They expire, and we need them to expire. But expiring certs don't magically update state files or stacks. It's really bad to make assumptions about a customer's environment based on what we thought we knew the last time a change was rolled out.

wparad
1d ago
Thank you!

One of the questions I frequently get is "do you automatically roll back?" And I have to hide in the corner and say "not really". Often, if you knew a rollback would work, you probably could also have known not to roll out in the first place. I've seen a lot of failures that only got worse when automation attempted to turn the thing on and off again.

Luckily from an automation roll-out standpoint, it's not that much harder to test in isolation. The harder parts to validate are things like "Does a Route 53 Failover Record really work in practice at the moment we actually need it to work?"

Usually the answer is yes, but then there's always the "but it too could be broken", and as you said, it's turtles all the way down.

The nice part is that, realistically, the automation for dealing with rollout and IaC is small and simple. We've split up our infrastructure to go with individual services, so each piece of infra is also straightforward.

In practice, our infra is less DRY and more repeated, which has the benefit of avoiding the complexity that often comes from attempting to reduce code duplication. The ancillary benefit is that simple stuff changes less frequently, and less frequent changes mean less opportunity for issues.

Not surprisingly, most incidents come from changes humans make, and the second-largest source of incidents is assumptions humans make about how a system operates in edge conditions. If you know these two things to be 100% true, you spend more time designing simple systems and avoid making changes as much as possible, unless absolutely required.

hartator
1d ago
1 reply
Interesting how engineers like to nerd out about SLAs, but never claim or issue credits when something does occur.
wparad
1d ago
In the last decade, there has been at least one time where we did issue credits to our customers when there was a problem. Issuing credits back to our customers is small compensation for any issue we're responsible for, and doing so is part of our Terms of Service.
scottlamb
1d ago
2 replies
I'm surprised the section about retries doesn't mention correlations. They say:

> P_{total}(Success) = 1 - P_{3rdParty}(Failure)^{RetryCount}

By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing, and not consistent with the way they describe outages in terms of the time they are down (rather than purely the fraction of requests).

In reality, additional retries don't improve reliability as much as that formula says. Given that request 1 failed, request 2 (sent immediately afterward with the same body) probably will too. And there's another important effect: overload. During a major outage, retries often decrease reliability in aggregate—maybe retrying one request makes it more likely to go through, but retrying all the requests causes significant overload, often decreasing the total number of successes.

I think this correlation is a much bigger factor than "the reliability of that retry handler" that they go into instead. Not sure what they mean there anyway—if the retry handler is just a loop within the calling code, calling out its reliability separately from the rest of the calling code seems strange to me. Maybe they're talking about an external queue (SQS and the like) for deferred retries, but that brings in a whole different assumption that they're talking about something that can be processed asynchronously. I don't see that mentioned, and it seems inconsistent with the description of these requests as on the critical path for their customers. Or maybe they're talking about hitting a "circuit breaker" that prevents excessive retries—which is a good practice due to the correlation I mentioned above, but if so it seems strange to describe it so obliquely, and again strange to describe its reliability as an inherent/independent thing, rather than a property of the service being called.

Additionally, a big pet peeve of mine is talking about reliability without involving latency. In practice, there's only so long your client is willing to wait for the request to succeed. If say that's 1 second, and you're waiting 500 ms for an outbound request before timing out and retrying, you can't even quite make it to 2 full (sequential) tries. You can hedge (wait a bit then send a second request in parallel) for many types of requests, but that also worsens the math on overload and correlated failures.

The rest of the article might be much clearer, but I have a fever and didn't make it through.

lorrin
1d ago
1 reply
Agreed, I think the introduction is wrong and detracts from the rest of the article.
wparad
1d ago
Hmmm, which part of the intro did you find an issue with? I want to see if I can fix it.
shoo
1d ago
1 reply
> the section about retries doesn't mention correlations. [...] By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing

Yes, that jumped out at me as well. A slightly more sophisticated model could be to assume there are two possible causes of a failed 3rd party call: (a) a transient issue - failure can be masked by retrying, and (b) a serious outage - where retrying is likely to find that the 3rd party dependency is still unavailable.

Our probabilistic model of this 3rd party dependency could then look something like

  P(first call failure) = 0.10
  P(transient issue | first call failure) = 0.90
  P(serious outage | first call failure) = 0.10
  P(call failure | transient issue, prior call failure) = 0.10
  P(call failure | serious outage, prior call failure) = 0.95
I.e. a failed call is 9x more likely to be caused by a transient issue than a serious outage. If the cause was a transient issue we assume independence between sequential attempts like in the article, but if the failure was caused by a serious outage there's only a 5% chance that each sequential retry attempt will succeed.

In contrast with the math sketched in the article, where retrying a 3rd party call with a 10% failure rate 5 times could suffice for a 99.999% success rate, with the above model of failure modes including a serious outage failure mode producing a string of failures, we'd need to retry 135 times after a first failed call to achieve the same 99.999% success rate.

Your points about the overall latency a client is willing to wait & retries causing additional load are good; in many systems "135 retry attempts" is impractical and would mean "our overall system has failed and is unavailable".

Anyhow, it's still an interesting article. The meat of the argument and logic about 3rd party deps needing to meet some minimum bar of availability to be included still makes sense, but if our failure model considers failure modes like lengthy outages that can cause correlated failure patterns, that raises the bar for how reliable any given 3rd party dep needs to be even further.
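
For the curious, a quick sketch that reproduces the 135-retry figure from the probabilities above (purely illustrative arithmetic using the assumed numbers, not anything measured):

  // Failure model from the comment above: a failed call was caused either by a
  // transient issue or by a serious outage, with different retry behaviour.
  const pFirstFailure = 0.10;
  const pTransientGivenFailure = 0.90;
  const pOutageGivenFailure = 0.10;
  const pRetryFailsIfTransient = 0.10;
  const pRetryFailsIfOutage = 0.95;

  // Probability that the first call and all n retries fail.
  function pStillFailing(retries: number): number {
    return pFirstFailure * (
      pTransientGivenFailure * Math.pow(pRetryFailsIfTransient, retries) +
      pOutageGivenFailure * Math.pow(pRetryFailsIfOutage, retries)
    );
  }

  // Smallest retry count that reaches a 99.999% overall success rate.
  let retries = 0;
  while (pStillFailing(retries) > 1e-5) retries++;
  // 135 under these assumptions, vs. 4 retries (5 attempts total) if every attempt were independent.
  console.log(retries);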

wparad
1d ago
1 reply
This is absolutely true, but the end result is the same. The assumption is "We can fix a third-party component behaving temporarily incorrectly, and therefore we can do something about it". If the third-party component never behaves correctly, then there is nothing we can do to fix it.

Correlations don't have to be talked about, because they don't increase the likelihood of success, but rather the likelihood of failure, meaning that we would need orders-of-magnitude more reliable technology to solve that problem.

In reality, those sorts of failures aren't usually temporary, but rather systemic, such as "we've made an incorrect assumption about how that technology works" - feature not a bug.

In that case, it doesn't really fit into this model. There are certainly things that would better indicate to us that we could use or are not allowed to use a component, but for the sake of the article, I think that was probably going much too far.

TL;DR Yes for sure, individual attempts are correlated, but in most cases, it doesn't make sense to track that because those situations end up in other buckets of "always down = unreliable" or "actually up - more complex story which may not need to be modelled".

scottlamb
1d ago
1 reply
I think the work matters as much as the result, and you had to make at least a couple strange turns to get the "right answer" that retries don't solve the problem:

* the 3rd-party component offering only 90% success—I've never actually seen a system that bad. A 99.9% success SLA is kind of the minimum, and any system that has acceptable mean latency for a critical auth path is likely also gonna have >=99.99% in practice (even if they don't promise refunds based on that).
* the whole "really reliable retry handler" thing—as mentioned in my first comment, I don't understand what you were getting at here.

I would go a whole other way with this section—more realistic, much shorter. Let's say you want to offer 99.999% success within 1 second, and the third-party component offers 99.9% success per try. Then two tries gives you 99.9999% success if the failures are all uncorrelated but does not help at all when the third-party system is down for minutes or hours at a time. Thus, you need to involve an alternative that is believed to be independent of the faulty system—and the primary tool AWS gives you for that is regional independence. This sets up the talk about regional failover much more quickly and with less head-scratching. I probably would have made it through the whole article yesterday even in my feverish state.
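
The independent-failure arithmetic behind that "two tries" figure, for reference:

\[
1 - (1 - 0.999)^2 = 1 - 10^{-6} = 99.9999\%
\]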

wparad
20h ago
Hmm, I never considered using an SLA on latency as a potential way to justify the argument. If I pull this content into a future article or talk, I will definitely consider reframing it for easier understanding.
markdown
1d ago
1 reply
BTW clicking on your website logo takes one to https://authress.io/knowledge-base/ instead of https://authress.io
wparad
1d ago
Ha, thanks, we'll fix that.
thisnullptr
1d ago
5 replies
It’s fascinating to me that people think their services are so important they can’t survive any downtime. Can we all admit that, while annoying, nothing really bad happened even when us-east-1 was down for almost half a working day?
shoo
1d ago
1 reply
In many contexts you are correct & further, as someone in that earlier thread about the AWS us-east-1 outage mentioned, customers can be more forgiving of outages if you as the vendor can point to a widespread AWS us-east-1 outage and note that us-east-1 is down for everyone.

But, as JSR_FDED's sibling comment notes & as is spelled out in the article, Authress' business model of offering an auth service means that their outage may entirely brick their clients' customer-facing auth / machine-to-machine auth.

I've worked in megacorp environments where an outage of certain internal services responsible for auth or issuing JWTs would break tens or hundreds of internal services and break various customer-facing flows. In many business contexts a big messy customer facing outage for a day or so doesn't actually matter but in some contexts it really can. In terms of blast radius, unavailability of a key auth service depended on by hundreds of things is up there with, i dunno, breaking the network.

wparad
1d ago
Absolutely, part of the problem is that a whole region being down is often less of a problem than just one critical service being down. And as you point out, the blast radius of a critical dependency is huge.
bostik
1d ago
1 reply
As other posters have commented, an external auth service is a very special thing indeed. In modern and/or zero-trust systems if auth doesn't work, then effectively nothing works.

My rule of thumb from past experience is that if you demand 99.9% uptime for your own systems and you have in-house auth, then that auth system must have 99.99% reliability. If you are serving auth for OTHERS, then you have a system that can absolutely never be down, and at that point five nines becomes a baseline requirement.

Auth is a critical path component. If your service is in the critical path in both reliability and latency[ß] for third parties, then every one of your failures is magnified by the number of customers getting hit by it.

ß: The current top-voted comment thread includes a mention that latency and response time should also be part of an SLA concern. I agree. For any hot-path system you must be always tracking the latency distribution, both from the service's own viewpoint AND from the point of view of the outside world. The typically useful metrics for that are p95, p99, p999 and max. Yes, max is essential to include: you want to always know what was the worst experience someone/something had during any given time window.
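
A minimal sketch of that kind of windowed percentile tracking (nearest-rank percentiles over a window of latency samples; the sample values are made up):

  // Nearest-rank percentile over a window of latency samples (milliseconds).
  function percentile(samples: number[], p: number): number {
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
  }

  const windowMs = [12, 15, 9, 430, 18, 22, 11, 14, 16, 13];
  console.log({
    p95: percentile(windowMs, 95),
    p99: percentile(windowMs, 99),
    p999: percentile(windowMs, 99.9),
    max: Math.max(...windowMs), // always include max: the worst experience in the window
  });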

wparad
1d ago
2 replies
The sad truth of the world is that in many cases latency isn't the most critical aspect to track. We absolutely do track it, because we have the expectation that authentication requests complete. But there are many moving parts that make reliable tracking not entirely feasible:
* end location of the user
* end location of the customer's service
* third-party login components (login with Google, et al)
* corporate identity providers
* WebAuthn
* customer-specific login mechanism workflows
* custom integrations for those login mechanisms
* the user's user agent
* internet connectivity

All of those significantly influence the response capability in a way that makes tracking latency next to useless. Maybe there is something we could be doing, though. In more than a couple of scenarios we do have tracking, metrics, and alerting in place; it just doesn't end up in our SLA.

scottlamb
1d ago
I imagine you exclude failures of customer-chosen systems from your reliability measurements—for example, if you send a backend request or redirect the user's browser to the customer's corporate identity provider and that persistently fails, you don't call it your own outage.

The same can apply to latency. What is the latency of requests to your system—including dependencies you choose, likely excluding dependencies the customer chooses. The network leg between the customer or user and your system is a bit of a gray area. The simplest thing to do is measure each request's latency from the point of view of your backend rather than the initiator. This is probably good enough, although in theory it lets you off the hook a bit too easily—to some extent you can choose whether you run near the customer or not and how many round trips are required. But it's not fair to fail your SLA because of end-user bufferbloat or whatever.

bostik
1d ago
While I agree with parts of the above, there are bits that I disagree with. It's true that you cannot control the network conditions for third parties, and therefore can never be in a position where you would guarantee an SLA for round-trip experience. But I object to the notion that tracking end-to-end latency is useless. After all, the three Nielsen usability thresholds are all about latency(!)

Funnily enough, looking through your itemisation I spot two groups that would each benefit from their own kinds of latency monitoring. End location and internet connectivity of the client go into the first. Third-party providers go into the second.

For the first, you'd need to have your own probes reporting from the most actively used networks and locations around the world - that would give you a view into the round-trip latency per major network path. For the second, you'd want to track the time spent between the steps that you control - which in turn would give you a good view into the latency-inducing behaviour of the different third-party providers. Neither are SLA material but they certainly would be useful during enterprise contract negotiations. (Shooting impossible demands down by showing hard data tends to fend off even the most obstinate objections.)

User-agent and bespoke integrations/workflows are entirely out of your hands, and I agree it's useless to try to measure latency for them specifically.

Disclaimer: I have worked with systems where the internal authX roundtrip has to complete within 1ms, and the corresponding client-facing side has to complete its response within 3ms.

JSR_FDED
1d ago
If you’re providing auth services to many companies then a failure will increase the likelihood of something bad to an unacceptable degree.
catlifeonmars
1d ago
It’s not that things can’t survive downtime technically, it’s that in _many_ cases (although as you rightly point out, not _most_) downtime is costly to businesses.

I agree that the set of business-critical functions in most shops is going to be vastly overestimated by engineers on the ground.

filearts
1d ago
That's a bit of a naive perspective. There are plenty of situations and industries where access being down has an impact far beyond inconvenience. For example, access to medical files for treatment, allergies and surgery. Or access to financial services.
JSR_FDED
1d ago
2 replies
> We test before deployment. There is no better time to test.

Love the deadpan delivery.

kailden
1d ago
1 reply
On the other hand, saying “Untested code is never released” is a pretty bold statement, even if I understand the good intent
wparad
10h ago
The point here is that it's important to clarify what you mean by "untested code". Some companies release untested code to production all the time by hiding its usage behind feature flags. They admit that the code has not been tested.

Code that we release behind feature flags has been tested, the only reason something goes out on a flag is when we don't want to release it to everyone yet, for product reasons, not technical ones.

Again going back to "untested": when code gets merged, no one ever says "well, that was untested". But what does tested mean? Of course it means that everyone who is accountable for the code believes it was tested. It's subjective, so there can be no other answer.

wparad
1d ago
:)
0xbadcafebee
1d ago
1 reply
It's a very rare day that a professional explanation of real operations best practices lands on HN. Good job, Authress!
wparad
1d ago
Thank you!
DeathArrow
1d ago
1 reply
TLDR: they use dynamic DNS routing and have failover regions.
netdevphoenix
1d ago
1 reply
Can't believe they didn't just mention that in the beginning. There was a lot of showing off imo. None of that would have saved them on 20th Oct if they didn't have the dynamic DNS routing.
wparad
1d ago
Actually, this isn't the only thing that exists. As I pointed out, that only exists for resources that are duplicated between regions. There are also critical fallbacks at the service level that decide which resources to consume. We do that both with CloudFront Origin Groups and with database data replicated to multiple regions.
sam-cop-vimes
1d ago
1 reply
What a well written article! Nothing complex is built overnight, so it is interesting to see how their defenses have evolved to their current state. Requires an engineering team which actually cares about all this and consistency of approach across what seems like 6 years? Impressive.
wparad
1d ago
Thank you.
ID: 45955565 | Type: story | Last synced: 11/19/2025, 7:11:53 PM
