Keep Your Applications Running While AWS Is Down

Posted 2 months ago

stsffap

3 points

1 comment

restate.dev | Tech | story

Key topics

Cloud Computing

High Availability

AWS

The article discusses how to keep applications running during AWS outages using geo-replicated architectures, with the community showing interest in the topic.

Snapshot generated from the HN discussion

Key moments

Story posted: Oct 22, 2025 at 3:35 AM EDT
First comment: Oct 22, 2025 at 3:35 AM EDT (0s after posting)

Discussion (1 comment)

stsffap (Author)

2 months ago

Author here. AWS’s recent us-east-1 outage inspired us to share our approach to geo-replication with Restate.

The core idea: geo-replication should be a deployment concern, not something you architect into every line of application code. You write normal business logic, then configure replication policies at deployment time and let Restate handle the rest.

The configuration is straightforward: `default-replication = "{region: 2, node: 3}"` tells Restate to replicate data to at least 2 regions and at least 3 nodes. With that setting, your apps can tolerate a full region outage or the loss of any two nodes while staying fully available. Behind the scenes, Restate handles leader election, log replication, and state synchronization. We use S3 cross-region replication for snapshots, with delayed log trimming to ensure consistency.
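To make the fault-tolerance claim concrete, here is a minimal Python sketch of the placement rule as described above (at least 2 regions and at least 3 nodes per replica set). It is only a model for reasoning about data survival, not Restate's implementation: the node names, the two-nodes-per-region layout, and all function names are assumptions made for illustration, and it says nothing about leader election or quorum.

```python
from itertools import combinations

# Hypothetical layout (assumption): 6 nodes, two in each of 3 regions.
NODE_REGION = {
    "n1": "us-east-1", "n2": "us-east-1",
    "n3": "us-east-2", "n4": "us-east-2",
    "n5": "us-west-1", "n6": "us-west-1",
}

def satisfies(replicas, min_regions=2, min_nodes=3):
    """True if the replica set spans enough distinct nodes and regions."""
    nodes = set(replicas)
    regions = {NODE_REGION[n] for n in nodes}
    return len(nodes) >= min_nodes and len(regions) >= min_regions

def survives(replicas, failed_nodes):
    """True if at least one copy sits on a node that has not failed."""
    return bool(set(replicas) - set(failed_nodes))

def valid_replica_sets(size=3):
    """All node combinations of the given size that satisfy the rule."""
    return [c for c in combinations(NODE_REGION, size) if satisfies(c)]

def tolerates_region_outage(replicas):
    """Check data survival when any single region fails entirely."""
    for region in set(NODE_REGION.values()):
        failed = [n for n, r in NODE_REGION.items() if r == region]
        if not survives(replicas, failed):
            return False
    return True

def tolerates_two_node_loss(replicas):
    """Check data survival when any two nodes fail at once."""
    return all(survives(replicas, pair)
               for pair in combinations(NODE_REGION, 2))

if __name__ == "__main__":
    sets = valid_replica_sets()
    print(len(sets), "valid replica sets")
    print(all(tolerates_region_outage(s) for s in sets))
    print(all(tolerates_two_node_loss(s) for s in sets))
```

In this model, every replica set that satisfies the rule keeps at least one copy after a full region outage or after any two nodes fail, which is the data-survival half of the claim. Staying available on top of that depends on how the system elects leaders and forms quorums, which the sketch does not cover.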

We tested this with a 6-node cluster spread across 3 AWS regions under a load of 400 req/s. Killing an entire region resulted in automatic failover in under 60 seconds, with zero downtime and no data loss. Only 1% of requests saw latency spikes during the failover window. Once the nodes in us-east-1 were gone, P50 latency increased because replication shifted from the nearby us-east-1/us-east-2 pair to the more distant us-east-2/us-west-1 pair.
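For readers who want to reproduce this kind of measurement, the two numbers quoted above (P50 latency and the share of requests that saw a spike) can be computed from per-request latencies roughly as follows. This is a sketch under stated assumptions: the sample values and the 100 ms spike threshold are made up for illustration and are not data from our benchmark.

```python
import statistics

def p50(latencies_ms):
    """Median (P50) latency of the samples, in milliseconds."""
    return statistics.median(latencies_ms)

def spike_fraction(latencies_ms, threshold_ms):
    """Fraction of requests whose latency exceeded the threshold."""
    spikes = sum(1 for v in latencies_ms if v > threshold_ms)
    return spikes / len(latencies_ms)

if __name__ == "__main__":
    # Hypothetical samples (assumption), not measurements from the test.
    samples = [10, 12, 11, 13, 500]
    print("P50 (ms):", p50(samples))
    print("spike fraction:", spike_fraction(samples, threshold_ms=100))
```

Comparing P50 before and after the region is killed shows the steady-state cost of replicating to a more distant region, while the spike fraction isolates the requests that were in flight during failover.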

Happy to answer technical questions or discuss tradeoffs!

ID: 45665917 | Type: story | Last synced: 11/17/2025, 9:10:41 AM
