Replacing EBS and Rethinking Postgres Storage from First Principles
Posted 2 months ago · Active about 2 months ago
Source: tigerdata.com · Tech story · High profile
Tone: calm, mixed · Debate: 60/100
Key topics: Cloud Storage, Postgres, Database Infrastructure
Tiger Data discusses replacing EBS with a custom storage solution for Postgres, sparking a discussion on the trade-offs and challenges of building a cloud storage alternative.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 6h after posting · Peak period: 33 comments in the 24-36h window · Avg per period: 9.8
Comment distribution: 59 data points (based on 59 loaded comments)
Key moments
- Story posted: Oct 29, 2025 at 11:49 AM EDT (2 months ago)
- First comment: Oct 29, 2025 at 5:46 PM EDT (6h after posting)
- Peak activity: 33 comments in the 24-36h window (hottest stretch of the conversation)
- Latest activity: Nov 6, 2025 at 12:14 PM EST (about 2 months ago)
ID: 45748484 (story) · Last synced: 11/20/2025, 1:51:04 PM
We just launched a bunch of things around “Postgres for Agents” [0]:
forkable databases, an MCP server for Postgres (with semantic + full-text search over the PG docs), a new BM25 text search extension (pg_textsearch), pgvectorscale updates, and a free tier.
[0] https://www.tigerdata.com/blog/postgres-for-agents
To my eye, seeing "Agentic Postgres" at the top of the page, in yellow, is not persuasive; it comes across as bandwagony. (About me: I try to be open but critical about new tech developments; I try out various agentic tooling often.)
But I'm not dismissing the product. I'm just saying this part is what I found persuasive:
> Agents spin up environments, test code, and evolve systems continuously. They need storage that can do the same: forking, scaling, and provisioning instantly, without manual work or waste.
That explains it clearly in my opinion.
* Seems to me, there are taglines that only work after someone is already "on board". I think "Agentic Postgres" is that kind of tagline. I don't have a better suggestion in mind at the moment, though, sorry.
E.g. Micron 7450 PRO 3.84 TB - 4K IOPS: 735k read, 160k write
[0] https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b... [1] https://www.micron.com/products/storage/ssd/data-center-ssd/...
Why EBS didn't work · Why all this matters · What didn't work · Their solution · Performance stats (single volume)

Note that those numbers are terrible vs. a physical disk, especially latency, which should be < 1 ms read and << 1 ms write.
(That assumes async replication of the write-ahead log to a secondary. Otherwise, write latency should be ~1 RTT, which is still << 5 ms.)
Stacking storage like this isn’t great, but PG wasn’t really designed for performance or HA. (I don’t have a better concrete solution for ANSI SQL that works today.)
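For a rough sense of what "physical disk" latency means here, the following is a minimal sketch (my own, not Tiger Data's benchmark; the device path is a placeholder) that measures 4 KiB random-read latency with O_DIRECT so the page cache doesn't hide the device:

    import mmap, os, random, time

    PATH = "/dev/nvme0n1"        # placeholder device; any large file also works
    BLOCK = 4096

    buf = mmap.mmap(-1, BLOCK)   # anonymous mmap gives the page-aligned buffer O_DIRECT needs
    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)

    samples = []
    for _ in range(1000):
        offset = random.randrange(size // BLOCK) * BLOCK
        t0 = time.perf_counter()
        os.preadv(fd, [buf], offset)                  # one 4 KiB random read
        samples.append((time.perf_counter() - t0) * 1000)
    os.close(fd)

    samples.sort()
    print(f"p50 {samples[500]:.3f} ms, p99 {samples[990]:.3f} ms")

On a local NVMe device this typically lands well under 1 ms; network-attached storage is usually several times higher, which is the gap being discussed above.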
The raw numbers are one thing, but the overall performance of pg is another. If you check out https://planetscale.com/blog/benchmarking-postgres-17-vs-18 for example, in the average QPS chart, you can see that there isn't a very large difference in QPS between GP3 at 10k iops and NVMe at 300k iops.
So currently I wouldn't recommend this new storage for the highest end workloads, but it's also a beta project that's still got a lot of room for growth! I'm very enthusiastic about how far we can take this!
- EBS typically operates in the millisecond range. AWS' own documentation suggests "several milliseconds"; our own experience with EBS is 1-2 ms. Reads/writes to local disk alone are certainly faster, but it's more meaningful to compare this against other forms of network-attached storage.
- If durability matters, async replication isn't really the right baseline for local disk setups. Most production deployments of Postgres/databases rely on synchronous replication -- or "semi-sync," which still waits for at least one or a subset of acknowledgments before committing -- which in the cloud lands you in the single-digit millisecond range for writes again.
Is that even true? I've resized an EBS volume a few minutes after another resize before.
It is used in the first line of the text, but no explanation was given.
The 5ms write latency is because the backend distributed block storage layer is doing synchronous replication to multiple servers for high availability and durability before ack'ing a write. (And this path has not yet been super-performance-optimized for latency, to be honest.)
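To make that cost concrete, here is a toy simulation (my own illustration, not Tiger Data's code, with assumed per-replica latencies around 1-1.5 ms) of how commit-ack latency changes with how many replica acknowledgments a write waits for:

    import random

    def ack_latency_ms(replica_latencies_ms, quorum):
        """Latency until the fastest `quorum` replicas have acknowledged the write."""
        return sorted(replica_latencies_ms)[quorum - 1]

    random.seed(1)
    # 3 replicas; assumed base latency ~1 ms plus an exponential tail
    trials = [[random.gauss(1.0, 0.2) + random.expovariate(2.0) for _ in range(3)]
              for _ in range(10_000)]

    for quorum in (1, 2, 3):
        lat = sorted(ack_latency_ms(t, quorum) for t in trials)
        print(f"wait for {quorum}/3 acks: p50 {lat[5_000]:.2f} ms, p99 {lat[9_900]:.2f} ms")

Waiting on every replica inherits the slowest node's tail, which is a large part of why a durably replicated write ends up in the low single-digit milliseconds even when each individual hop is fast.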
Is there a source for the 20m time limit for failed EBS volumes? I experienced this at work for the first time recently but couldn't find anything documenting the 20m SLA (and it did take just about 20 full minutes).
The docs do say, however, "If the volume has been impaired for more than 20 minutes, you can contact the AWS Support Center." [0] which suggests it's some expected cleanup/remount interval.
That is, it is something that we regularly encounter when EC2 instances fail, so we were sharing from personal experience.
[0] https://docs.aws.amazon.com/ebs/latest/userguide/work_volume...
EBS volume attachment is typically ~11s for GP2/GP3 and ~20-25s for other types.
1ms read / 5ms write latencies seem high for 4k blocks. IO1/IO2 is typically ~0.5ms RW, and GP2/GP3 ~0.6ms read and ~0.94ms write.
References: https://cloudlooking.glass/matrix/#aws.ebs.us-east-1--cp--at... https://cloudlooking.glass/matrix/#aws.ebs.*--dp--rand-*&aws...
https://cloudlooking.glass/dashboard/#aws.ebs.us-east-1--cp-...
Another interesting bit: last March AWS changed something in the control plane which both triggered a multi-day LSE and ultimately increased attachment times from 2-3s to 10-20s (also visible in the graphs).
I did not set that up myself, but the colleague who worked on it told me that enabling TCP multipath for iSCSI yielded significant performance gains.
Also, were existing network or distributed file systems not suitable? Ceph sounds like it might fit this use case, for example.
Entirely programmable storage has so far allowed us to try a few different things to make it efficient and give us the features we want. We've been able to try different dedup methods, copy-on-write styles, different compression methods and types, different sharding strategies... all just as a start. We can easily and quickly create a new experimental storage backend and see exactly how pg performs with it side-by-side with other backends.
We're a Kubernetes shop, and we have our own CSI plugin, so we can also transparently run a pg HA pair with one pg server using EBS and the other running in our new storage layer, and easily bounce between storage types with nothing but a switchover event.
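As an illustration of what a "side-by-side" backend comparison can look like (a hypothetical harness, not their CSI plugin or actual backend API), the contract can be as small as block-granular read/write, with the same seeded workload replayed against each implementation:

    import random, time

    BLOCK = 8192  # Postgres page size

    class FileBackend:
        """Toy block backend over a pre-sized file; stands in for one storage implementation."""
        def __init__(self, path):
            self.f = open(path, "r+b", buffering=0)
        def read(self, block_no):
            self.f.seek(block_no * BLOCK)
            return self.f.read(BLOCK)
        def write(self, block_no, data):
            self.f.seek(block_no * BLOCK)
            self.f.write(data)

    def compare(backends, n_blocks=1024, ops=5000):
        for name, backend in backends.items():
            random.seed(0)                      # identical access pattern for every backend
            t0 = time.perf_counter()
            for _ in range(ops):
                backend.read(random.randrange(n_blocks))
            print(f"{name}: {(time.perf_counter() - t0) / ops * 1e6:.1f} us/read")

    # e.g. compare({"ebs": FileBackend("/mnt/ebs/test.dat"),
    #               "fluid": FileBackend("/mnt/fluid/test.dat")})
    # where each test file is at least n_blocks * BLOCK bytes (paths are hypothetical).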
You can probably hire people to maintain it.
Was it the ramp-up cost or expertise?
But you think you have resources to maintain a distributed strongly-consistent replicating block store?
The edge cases in RBD are literally why Ceph takes expertise to manage! Things like failure while recovering from failure while trying to maintain performance are inherently tricky.
DAOS seemed promising a couple of years ago, but in terms of popularity it seems to be stuck: no Ubuntu packages, no widespread deployment, and Optane got killed.
Yet the NVMe + metadata approach seemed promising.
Would love to see more databases fork it to do what you need from it.
Or if folks have looked at it and decided not to do it, an analysis of why would be super interesting.
I'm really sad to see them waste the opportunity and instead build an nth managed cloud on top of AWS, chasing buzzword after buzzword.
Had they made deals with cloud providers to offer managed TimescaleDB, so they could focus on their core value proposition, they could have won the time-series business. But ClickHouse made them irrelevant, and Neon has already won the "Postgres for agents" business thanks to a better architecture than this.
We think we're still building great things, and our customers seem to agree.
Usage is at an all-time high, revenue is at an all-time high, and we’re having more fun than ever.
Hopefully we’ll win you back soon.
We're continuing to evaluate demand for multi-region clusters; we'd love to hear from you.
Our existing Postgres fleet, which uses EBS for storage, still serves thousands of customers today; nothing has changed there.
What’s new is Fluid Storage, our disaggregated storage layer that currently powers the new free tier (while in beta). In this architecture, the compute nodes running Postgres still access block storage over the network. But instead of that being AWS EBS, it’s our own distributed storage system.
From a hardware standpoint, the servers that make up the Fluid Storage layer are standard EC2 instances with fast local disks.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
“The storage device driver exposes Fluid Storage volumes as standard Linux block devices mountable with filesystems such as ext4 or xfs. It...allows volumes to be resized dynamically while online.”
Yet an xfs file system cannot be shrunk at all, and an ext4 filesystem cannot be shrunk without first unmounting it.
Are you simply doing thin provisioning of these volumes, so they appear to be massive but aren’t really? I see later that you say you account for storage based on actual consumption.
They can be used with, for example, the listed file systems.
No one claimed the listed file systems would (usefully) cooperate with (all aspects of) the block device's resizing.
Put differently, there is no point in being able to shrink a volume if you can’t safely shrink the filesystem that uses it.
The usual solution to this problem is thin provisioning, where you put a translation layer between the blocks the filesystem thinks it’s using and the actual underlying blocks. With thin provisioning you can allocate only, say, 1GB to the physical storage, but the block device presents itself as much larger than that, so you can pretend to create a 1PB filesystem on top of it.
The advantage is that it’s allocating pages on demand from an elastic pool of storage so it appears as an infinite block device. Another advantage is cheap COW clones.
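A minimal sketch of those two ideas together (thin provisioning plus copy-on-write clones; hypothetical names, not the actual Fluid Storage internals): logical blocks get backing storage only when first written, and a clone simply reads through to its parent until it writes:

    BLOCK = 4096

    class ThinVolume:
        """Toy thin-provisioned volume: blocks consume space only once written."""
        def __init__(self, logical_size, parent=None):
            self.logical_size = logical_size   # can advertise 1 PB over a tiny pool
            self.parent = parent               # clones read through to their parent
            self.blocks = {}                   # logical block number -> stored bytes

        def read(self, block_no):
            if block_no in self.blocks:
                return self.blocks[block_no]
            if self.parent is not None:
                return self.parent.read(block_no)   # unmodified block, shared with parent
            return b"\x00" * BLOCK                  # never written, so no space consumed

        def write(self, block_no, data):
            assert len(data) == BLOCK
            self.blocks[block_no] = data            # allocate on demand (copy-on-write)

        def clone(self):
            # O(1) fork. (A real system would also redirect the parent's future writes
            # into a fresh layer so the clone stays a true snapshot; omitted here.)
            return ThinVolume(self.logical_size, parent=self)

        def used_bytes(self):
            return len(self.blocks) * BLOCK         # billing by actual consumption

A volume that looks like 1 PB but has only ever written one block consumes one block, and forking it is constant-time, which is roughly what makes instant database forks and usage-based storage billing cheap.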
The downside is (probably) specialized tuning for Postgres access patterns. I shudder to think what went into page metadata management. Perhaps it's similar to, e.g., the SQL Server buffer pool manager.
It’s not clear to me why it’s better than Aurora’s design; on the surface, page servers are higher-level concepts and should allow more holistic optimizations (and less page write traffic, since the log is shipped in lieu of whole pages). It’s also not clear what stopped Amazon from doing the same (perhaps EBS serving more diverse access patterns?).
Very cool!
https://tanelpoder.com/posts/testing-the-silk-platform-in-20...
It's a great way to mix copy-on-write with, effectively, logical splitting of physical nodes. It's something I've wanted to build at a previous role.
I'm curious whether you evaluated solutions like ZFS/Gluster? Also curious whether you looked at Oracle Cloud, given their faster block storage?