Replacing EBS and Rethinking Postgres Storage from First Principles
Posted 2 months ago · Active about 2 months ago
Source: tigerdata.com · Tech story · High profile
Tone: calm, mixed · Debate: 60/100
Key topics: Cloud Storage, Postgres, Database Infrastructure
Tiger Data discusses replacing EBS with a custom storage solution for Postgres, sparking a discussion on the trade-offs and challenges of building a cloud storage alternative.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 6h after posting · Peak period: 33 comments in the 24-36h window · Avg per period: 9.8
Comment distribution: 59 data points (based on 59 loaded comments)
Key moments
- Story posted: Oct 29, 2025 at 11:49 AM EDT (2 months ago)
- First comment: Oct 29, 2025 at 5:46 PM EDT (6h after posting)
- Peak activity: 33 comments in the 24-36h window (hottest stretch of the conversation)
- Latest activity: Nov 6, 2025 at 12:14 PM EST (about 2 months ago)
ID: 45748484 (story) · Last synced: 11/20/2025, 1:51:04 PM
We just launched a bunch of things around “Postgres for Agents” [0]:
forkable databases, an MCP server for Postgres (with semantic + full-text search over the PG docs), a new BM25 text search extension (pg_textsearch), pgvectorscale updates, and a free tier.
[0] https://www.tigerdata.com/blog/postgres-for-agents
To my eye, seeing "Agentic Postgres" at the top of the page, in yellow, is not persuasive; it comes across as bandwagony. (About me: I try to be open but critical about new tech developments; I try out various agentic tooling often.)
But I'm not dismissing the product. I'm just saying this part is what I found persuasive:
> Agents spin up environments, test code, and evolve systems continuously. They need storage that can do the same: forking, scaling, and provisioning instantly, without manual work or waste.
That explains it clearly in my opinion.
* Seems to me, there are taglines that only work after someone is already "on board". I think "Agentic Postgres" is that kind of tagline. I don't have a better suggestion in mind at the moment, though, sorry.
E.g. Micron 7450 PRO 3.84 TB - 4K IOPS: 735k read, 160k write
[0] https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b... [1] https://www.micron.com/products/storage/ssd/data-center-ssd/...
Why EBS didn't work · Why all this matters · What didn't work · Their solution · Performance stats (single volume)

Note that those numbers are terrible vs. a physical disk, especially latency, which should be < 1 ms read and << 1 ms write.
(That assumes async replication of the write-ahead log to a secondary. Otherwise, write latency should be ~1 RTT, which is still << 5 ms.)
Stacking storage like this isn’t great, but PG wasn’t really designed for performance or HA. (I don’t have a better concrete solution for ANSI SQL that works today.)
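For a rough sense of what "physical disk" latency means here, the following is a minimal sketch (my own, not Tiger Data's benchmark; the device path is a placeholder) that measures 4 KiB random-read latency with O_DIRECT so the page cache doesn't hide the device:

    import mmap, os, random, time

    PATH = "/dev/nvme0n1"        # placeholder device; any large file also works
    BLOCK = 4096

    buf = mmap.mmap(-1, BLOCK)   # anonymous mmap gives the page-aligned buffer O_DIRECT needs
    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)

    samples = []
    for _ in range(1000):
        offset = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], offset)                  # one 4 KiB random read
        samples.append((time.perf_counter() - t0) * 1000)
    os.close(fd)

    samples.sort()
    print(f"p50 {samples[500]:.3f} ms, p99 {samples[990]:.3f} ms")

On a local NVMe device this typically lands well under 1 ms; network-attached storage is usually several times higher, which is the gap being discussed above.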
The raw numbers are one thing, but the overall performance of pg is another. If you check out https://planetscale.com/blog/benchmarking-postgres-17-vs-18 for example, in the average QPS chart, you can see that there isn't a very large difference in QPS between GP3 at 10k iops and NVMe at 300k iops.
So currently I wouldn't recommend this new storage for the highest end workloads, but it's also a beta project that's still got a lot of room for growth! I'm very enthusiastic about how far we can take this!
- EBS typically operates in the millisecond range. AWS' own documentation suggests "several milliseconds"; our own experience with EBS is 1-2 ms. Reads/writes to local disk alone are certainly faster, but it's more meaningful to compare this against other forms of network-attached storage.
- If durability matters, async replication isn't really the right baseline for local disk setups. Most production deployments of Postgres/databases rely on synchronous replication -- or "semi-sync," which still waits for at least one or a subset of acknowledgments before committing -- which in the cloud lands you in the single-digit millisecond range for writes again.
Is that even true? I've resized an EBS volume a few minutes after another resize before.
It is used in the first line of the text, but no explanation was given.
The 5ms write latency is because the backend distributed block storage layer is doing synchronous replication to multiple servers for high availability and durability before ack'ing a write. (And this path has not yet been super-performance-optimized for latency, to be honest.)
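To make that cost concrete, here is a toy simulation (my own illustration, not Tiger Data's code, with assumed per-replica latencies around 1-1.5 ms) of how commit-ack latency changes with how many replica acknowledgments a write waits for:

    import random

    def ack_latency_ms(replica_latencies_ms, quorum):
        """Latency until the fastest `quorum` replicas have acknowledged the write."""
        return sorted(replica_latencies_ms)[quorum - 1]

    random.seed(1)
    # 3 replicas; assumed base latency ~1 ms plus an exponential tail
    trials = [[random.gauss(1.0, 0.2) + random.expovariate(2.0) for _ in range(3)]
              for _ in range(10_000)]

    for quorum in (1, 2, 3):
        lat = sorted(ack_latency_ms(t, quorum) for t in trials)
        print(f"wait for {quorum}/3 acks: p50 {lat[5_000]:.2f} ms, p99 {lat[9_900]:.2f} ms")

Waiting on every replica inherits the slowest node's tail, which is a large part of why a durably replicated write ends up in the low single-digit milliseconds even when each individual hop is fast.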
Is there a source for the 20m time limit for failed EBS volumes? I experienced this at work for the first time recently but couldn't find anything documenting the 20m SLA (and it did take just about 20 full minutes).
The docs do say, however, "If the volume has been impaired for more than 20 minutes, you can contact the AWS Support Center." [0] which suggests it's some expected cleanup/remount interval.
That is, it is something that we regularly encounter when EC2 instances fail, so we were sharing from personal experience.
[0] https://docs.aws.amazon.com/ebs/latest/userguide/work_volume...
EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s for other types.
1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2 is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms write.
References: https://cloudlooking.glass/matrix/#aws.ebs.us-east-1--cp--at... https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...
https://cloudlooking.glass/dashboard/#aws.ebs.us-east-1--cp-...
Another interesting bit: last March AWS changed something in the control plane which both triggered a multi-day LSE and ultimately increased attachment times from 2-3s to 10-20s (also visible in the graphs).
I did not set that up myself, but the colleague who worked on it told me that enabling TCP multipath for iSCSI yielded significant performance gains.
Also, were existing network or distributed file systems not suitable? Ceph sounds like it might fit this use case, for example.
Entirely programmable storage has so far allowed us to try a few different things to make it efficient and give us the features we want. We've been able to try different dedup methods, copy-on-write styles, different compression methods and types, different sharding strategies... all just as a start. We can easily and quickly create a new experimental storage backend and see exactly how pg performs with it side-by-side with other backends.
We're a Kubernetes shop, and we have our own CSI plugin, so we can also transparently run a pg HA pair with one pg server using EBS and the other running in our new storage layer, and easily bounce between storage types with nothing but a switchover event.
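As an illustration of what a "side-by-side" backend comparison can look like (a hypothetical harness, not their CSI plugin or actual backend API), the contract can be as small as block-granular read/write, with the same seeded workload replayed against each implementation:

    import random, time

    BLOCK = 8192  # Postgres page size

    class FileBackend:
        """Toy block backend over a pre-sized file; stands in for one storage implementation."""
        def __init__(self, path):
            self.f = open(path, "r+b", buffering=0)
        def read(self, block_no):
            self.f.seek(block_no * BLOCK)
            return self.f.read(BLOCK)
        def write(self, block_no, data):
            self.f.seek(block_no * BLOCK)
            self.f.write(data)

    def compare(backends, n_blocks=1024, ops=5000):
        for name, backend in backends.items():
            random.seed(0)                      # identical access pattern for every backend
            t0 = time.perf_counter()
            for _ in range(ops):
                backend.read(random.randrange(n_blocks))
            print(f"{name}: {(time.perf_counter() - t0) / ops * 1e6:.1f} us/read")

    # e.g. compare({"ebs": FileBackend("/mnt/ebs/test.dat"),
    #               "fluid": FileBackend("/mnt/fluid/test.dat")})
    # where each test file is at least n_blocks * BLOCK bytes (paths are hypothetical).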
You can probably hire people to maintain it.
Was it the ramp-up cost or expertise?
But you think you have resources to maintain a distributed strongly-consistent replicating block store?
The edge cases in RBD are literally why Ceph takes expertise to manage! Things like failure while recovering from failure while trying to maintain performance are inherently tricky.
DAOS seemed promising a couple of years ago, but in terms of popularity it seems to be stuck: no Ubuntu packages, no widespread deployment, and Optane got killed.
Yet the NVMe + metadata approach seemed promising.
Would love to see more databases fork it to do what you need from it.
Or if folks have looked at it and decided not to do it, an analysis of why would be super interesting.
I'm really sad to see them waste the opportunity and instead build an nth managed cloud on top of AWS, chasing buzzword after buzzword.
Had they made deals with cloud providers to offer managed TimescaleDB, so they could focus on their core value proposition, they could have won the time-series business. But ClickHouse made them irrelevant, and Neon has already won the "Postgres for agents" business thanks to a better architecture than this.
We think we're still building great things, and our customers seem to agree.
Usage is at an all-time high, revenue is at an all-time high, and we’re having more fun than ever.
Hopefully we’ll win you back soon.
We're continuing to evaluate demand for multi-region clusters; we'd love to hear from you.
Our existing Postgres fleet, which uses EBS for storage, still serves thousands of customers today; nothing has changed there.
What’s new is Fluid Storage, our disaggregated storage layer that currently powers the new free tier (while in beta). In this architecture, the compute nodes running Postgres still access block storage over the network. But instead of that being AWS EBS, it’s our own distributed storage system.
From a hardware standpoint, the servers that make up the Fluid Storage layer are standard EC2 instances with fast local disks.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
“The storage device driver exposes Fluid Storage volumes as standard Linux block devices mountable with filesystems such as ext4 or xfs. It...allows volumes to be resized dynamically while online.”
Yet an xfs file system cannot be shrunk at all, and an ext4 filesystem cannot be shrunk without first unmounting it.
Are you simply doing thin provisioning of these volumes, so they appear to be massive but aren’t really? I see later that you say you account for storage based on actual consumption.
They can be used with, for example, the listed file systems.
No one claimed the listed file systems would (usefully) cooperate with (all aspects of) the block device's resizing.
Put differently, there is no point in being able to shrink a volume if you can’t safely shrink the filesystem that uses it.
The usual solution to this problem is thin provisioning, where you put a translation layer between the blocks the filesystem thinks it’s using and the actual underlying blocks. With thin provisioning you can allocate only, say, 1GB to the physical storage, but the block device presents itself as much larger than that, so you can pretend to create a 1PB filesystem on top of it.
The advantage is that it’s allocating pages on demand from an elastic pool of storage so it appears as an infinite block device. Another advantage is cheap COW clones.
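A minimal sketch of those two ideas together (thin provisioning plus copy-on-write clones; hypothetical names, not the actual Fluid Storage internals): logical blocks get backing storage only when first written, and a clone simply reads through to its parent until it writes:

    BLOCK = 4096

    class ThinVolume:
        """Toy thin-provisioned volume: blocks consume space only once written."""
        def __init__(self, logical_size, parent=None):
            self.logical_size = logical_size   # can advertise 1 PB over a tiny pool
            self.parent = parent               # clones read through to their parent
            self.blocks = {}                   # logical block number -> stored bytes

        def read(self, block_no):
            if block_no in self.blocks:
                return self.blocks[block_no]
            if self.parent is not None:
                return self.parent.read(block_no)   # unmodified block, shared with parent
            return b"\x00" * BLOCK                  # never written, so no space consumed

        def write(self, block_no, data):
            assert len(data) == BLOCK
            self.blocks[block_no] = data            # allocate on demand (copy-on-write)

        def clone(self):
            # O(1) fork. (A real system would also redirect the parent's future writes
            # into a fresh layer so the clone stays a true snapshot; omitted here.)
            return ThinVolume(self.logical_size, parent=self)

        def used_bytes(self):
            return len(self.blocks) * BLOCK         # billing by actual consumption

A volume that looks like 1 PB but has only ever written one block consumes one block, and forking it is constant-time, which is roughly what makes instant database forks and usage-based storage billing cheap.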
The downside is (probably) specialized tuning for Postgres access patterns. I shudder to think what went into page metadata management. Perhaps it's similar to, e.g., the SQL Server buffer pool manager.
It’s not clear to me why it’s better than Aurora’s design; on the surface, page servers are higher-level concepts and should allow more holistic optimizations (and less page write traffic, since the log is shipped in lieu of whole pages). It’s also not clear what stopped Amazon from doing the same (perhaps EBS serving more diverse access patterns?).
Very cool!
https://tanelpoder.com/posts/testing-the-silk-platform-in-20...
It's a great way to mix copy-on-write with, effectively, logical splitting of physical nodes. It's something I've wanted to build at a previous role.
I'm curious whether you evaluated solutions like ZFS/Gluster? Also curious whether you looked at Oracle Cloud, given their faster block storage?