Matrix.org – Database Incident
Posted 4 months ago · Active 4 months ago
status.matrix.org · Tech story
Sentiment: supportive / mixed
Debate: 20/100
Key topics
Matrix.org
Database Incident
RAID Failure
Disaster Recovery
Matrix.org experienced a database incident caused by a RAID failure that resulted in a 23-hour downtime; the community offered support and discussed disaster recovery and where to host mission-critical data.
Snapshot generated from the HN discussion
Discussion Activity
- Activity level: Light discussion
- First comment: N/A
- Peak period: 5 comments in 0-2h
- Avg / period: 2.3
- Comment distribution: 25 data points (based on 25 loaded comments)
Key moments
- 01 Story posted: Sep 2, 2025 at 3:13 PM EDT (4 months ago)
- 02 First comment: Sep 2, 2025 at 3:13 PM EDT (0s after posting)
- 03 Peak activity: 5 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Sep 4, 2025 at 1:12 AM EDT (4 months ago)
ID: 45107696 · Type: story · Last synced: 11/20/2025, 4:38:28 PM
Now you're getting it
The stuff of absolute nightmares...
https://mastodon.matrix.org/@matrix/115136245785561439
[Edit] From another comment, 55TB?!? Holy wat-man...
- Probably thousands of large chatrooms, and hundreds of millions of small chatrooms
- Probably hundreds of millions of messages that include a media upload like an image or video, including countless re-posts of random memes
- Overhead from ratchet algorithm cryptography, as well as additional message metadata that is likely in JSON format
- Huge excesses of messages from bridge bots, spam bots, and malfunctioning utility bots. To give a sense of scale... the entirety of Libera.chat (formerly Freenode IRC) used to be bridged to matrix.org, meaning almost every single message from Libera would be copied to matrix.org automatically.
- Everything from other homeservers that federate with matrix.org and have been joined by at least one matrix.org user, including homeservers that no longer exist
However, much of the space is taken up by the Synapse DB schema being particularly denormalised (prioritising performance over disk footprint) - especially caching snapshots of historical key/value state for rooms, which currently takes up ~65x more space than the actual underlying dataset. Ironically, we're looking into that currently, but not fast enough to speed up this DB rebuild.
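To make that ~65x figure tangible, one quick way to see where a Synapse Postgres database spends its space is to rank relations by on-disk size; on most deployments the state-snapshot tables (state_groups_state in particular) tend to dominate. A minimal sketch, assuming psycopg2 and a reachable Postgres instance (the connection parameters below are placeholders, not Matrix.org's setup):

```python
import psycopg2

# Placeholder connection settings; point these at your own Synapse database.
conn = psycopg2.connect(dbname="synapse", user="synapse",
                        password="secret", host="localhost")

# Rank user tables by total on-disk size (table data + indexes + TOAST).
QUERY = """
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for table, size in cur.fetchall():
        print(f"{table:<40} {size}")
```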
Good luck on getting the schema overhead out of the way. I'm sure nowadays you're also using faster SSDs as the underlying storage behind the RAID controllers. Dell/HP keep them overpriced, of course, but I found them very much worth it for databases, as did the DBAs.
I hope your on-call teams get to take a week off after that incident.
I don't immediately see an official doc on this; is it right under my nose?
Is this doc good? https://www.redpill-linpro.com/techblog/2025/04/08/matrix-ba...
If you're happy using kubernetes, https://element.io/server-suite/community should be a good bet (or https://element.io/server-suite/pro if you are actually doing mission-critical stuff and want a version professionally supported by Element)
If you're happy using docker-compose, then https://github.com/element-hq/element-docker-demo is a very simple template for getting going.
Alternatively, https://github.com/spantaleev/matrix-docker-ansible-deploy is quite popular as a 3rd-party distro using ansible-managed docker containers.
Sorry all for the downtime on matrix.org - we're having to do a full 55TB db restore from backup which will take ~17 hours to run. :|
But it is hard to trust a random server if all you know is the name and mean uptime. Mastodon shows the community's posts and an introduction by the local admin before you make an account. Matrix should do the same.
whoops
> Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic.
https://mastodon.matrix.org/@matrix/115136866878237078
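For a rough sense of scale, those figures imply sustained restore throughput on the order of 1 GB/s or more. A back-of-envelope sketch, assuming (the post does not say) that the full 55TB is streamed during the ">10h to recover the data" step, and using decimal terabytes:

```python
# Back-of-envelope throughput implied by the restore timeline quoted above.
# Assumption (not stated in the post): the full 55 TB is streamed during the
# ">10h to recover the data" step; decimal terabytes are used throughout.
TB = 10**12

total_bytes = 55 * TB
recover_hours = 10       # ">10h to recover the data"
end_to_end_hours = 17    # ~17 hours quoted for the whole restore


def gb_per_s(hours: float) -> float:
    """Average throughput (GB/s) needed to move 55 TB in the given hours."""
    return total_bytes / (hours * 3600) / 1e9


print(f"recovery step : ~{gb_per_s(recover_hours):.2f} GB/s")   # ~1.53 GB/s
print(f"end to end    : ~{gb_per_s(end_to_end_hours):.2f} GB/s")  # ~0.90 GB/s
```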
See also
- https://www.theregister.com/2025/09/03/matrixorg_raid_failur...
- https://www.heise.de/en/news/Matrix-main-server-down-million...