Tuning Async I/O in PostgreSQL 18
Posted 3 months ago · Active 3 months ago
vondra.me · Tech · story
Tone: calm, positive
Debate: 40/100
Key topics
- PostgreSQL
- Async I/O
- Database Performance
The article discusses tuning async IO in PostgreSQL 18, and the discussion revolves around the performance implications of different settings and potential use cases.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 1h after posting
Peak period: 7 comments in 2-4h
Avg / period: 2.9 comments
Comment distribution: 23 data points
Based on 23 loaded comments
Key moments
1. Story posted: Sep 29, 2025 at 7:42 AM EDT (3 months ago)
2. First comment: Sep 29, 2025 at 8:44 AM EDT (1h after posting)
3. Peak activity: 7 comments in 2-4h, the hottest window of the conversation
4. Latest activity: Sep 30, 2025 at 1:57 PM EDT (3 months ago)
ID: 45412494 · Type: story · Last synced: 11/20/2025, 12:29:33 PM
At least there are only two settings to look at for tuning, though these do seem to be the kind of settings you have to try out with your own DB and workload.
Right now async IO is used for sequential scans and bitmap scans, not for index scans. My initial guess would be that it mostly helps for complex queries (that use multiple indexes, so bitmap scans) and unoptimized queries (sequential scans), not so much for straightforward and/or optimized queries that use a single index. But this is really just a guess, I'm way out of my depth at this point. I'm curious how much it'll help once it is implemented for index scans as well.
> The default is io_method = worker. We did consider defaulting both to sync or io_uring, but I think worker is the right choice. It’s actually “asynchronous”, and it’s available everywhere (because it’s our implementation).
I think a pool of workers is a sensible default. Many environments disable io_uring for security reasons anyway.
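For anyone who wants to play with this, here's a minimal sketch of what trying out the two settings might look like, assuming they are io_method and io_workers (the values are illustrative, not recommendations, and I believe changing io_method requires a restart):

    # postgresql.conf (PostgreSQL 18)
    io_method = worker    # 'worker' is the default; 'io_uring' (Linux only) and 'sync' are the alternatives
    io_workers = 8        # size of the I/O worker pool (default is 3); only relevant for io_method = worker

    -- after restarting, confirm what the server is actually running with
    SHOW io_method;
    SHOW io_workers;

Then rerun a representative workload (pgbench or your own queries) for each combination, since, as noted above, the right choice really depends on the hardware and workload.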
At least that's how I see it right now; we'll see how that works on a much wider range of hardware and systems. The GitHub repo linked from the pgsql-hackers post has a lot more results, and some charts include results for the index prefetching patch, where it makes more of a difference in some cases. But the patch is still fairly rough, it could be a bug in it too, and it has changed a lot since August.
Sorry, should have mentioned that in the blog post.
The first is somewhat obvious, given how new io_uring is and how it's specifically a Linux thing. I don't think that is necessarily bad. My guess is you can get really good performance with relatively little code compared to other options.
The second, though, is a bit tougher to consider. For one, it will be more code. Managing workers means maintaining code that you probably don't otherwise have on your radar. That said, you have full control over the workers, so you can build different priority work queues in ways that I don't see how you could do with io_uring.
Those are good guesses, IMHO.
For sequential scans, some of the "async" work could be done by kernel read-ahead, but AIO makes it explicit and loads the data into shared buffers, not just page cache. For bitmap scans we already had prefetching by fadvise, which is somewhat similar to read-ahead (also into page cache), and there were some bugs that made it ineffective in some cases, and AIO fixes that.
For index scans the difference can be an order of magnitude (say, 5-10x). Doing random I/O block by block is simply awful, prefetching data is important. I was just doing some testing on TPC-H, and on scale 50 I see Q8 going from 130s to 20s, and Q19 from 50s to 8s. And smaller improvements for a couple more queries. Of course, it depends on what else the query is doing - if it's spending 1% on the index, you won't notice a difference.
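Since the current coverage is seq scans and bitmap scans, it's worth checking which scan type your queries actually use before expecting a win. A quick way to do that (the table and predicate below are made up, purely for illustration):

    -- Look for "Seq Scan" or "Bitmap Heap Scan" nodes (covered by AIO in 18)
    -- vs. plain "Index Scan" nodes (not covered yet).
    -- With track_io_timing = on, BUFFERS also reports I/O timings per node.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
      FROM orders                                    -- hypothetical table
     WHERE order_date < now() - interval '30 days';  -- hypothetical predicate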
According to a quick Google search (to refresh my memory), PGSQL compression (e.g. TOAST) targets specific large data values within tables, while ZFS compresses all data written to the ZFS pool.
Depending on your database, you can sometimes reduce the size by 50% with aggressive zstd ZFS configurations.
No lunch is free, though. Aside from the obvious CPU cycles spent compressing, configuring ZFS / Postgres is a pain in the ass and really depends on the trade-offs and use cases.
That said, ZFS presents some challenges for a few reasons:
- As you probably already know, PGSQL relies heavily on system RAM for caching (effective_cache_size). However, the ZFS ARC and the OS page cache are NOT the same thing, so you need to take this into consideration when configuring PGSQL. We normally set PGSQL effective_cache_size=512MB and use the `zfs_arc_min` and `zfs_arc_max` options to adjust the ZFS ARC cache size. We typically get a 95%+ hit rate on ZFS (ARC) caching.
- ZFS is definitely slower than XFS or EXT4, and it took a while to understand how options like `zfs_compressed_arc_enabled`, `zfs_abd_scatter_enabled`, and `zfs_prefetch_disable` affect performance. In particular, the `zfs_compressed_arc_enabled` option determines whether the ZFS cache data is compressed in RAM as well as on disk. When enabled, this option can seriously affect latency since the data has to be decompressed each time it is read/written. That said, a very nice side effect of `zfs_compressed_arc_enabled=on` is the amount of data that fits in the cache. From my understanding, if you get 5:1 data compression on disk, you get the same for the ARC cache. Thus, if you give ZFS 12GB of cache, you get about 60GB of data in the ZFS memory cache.
- Getting ZFS installed requires additional kernel modules + the kernel header files, and these files have to match the version of ZFS you want to run. This is especially important if you update your kernel very often (thus requiring new ZFS modules to be built and installed).
Lots of blog posts are on the 'net describing some of these challenges. It's worth checking them out...
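For what it's worth, a rough sketch of the kind of setup described above; the dataset name and sizes are placeholders, and the zstd level / ARC sizing are very much workload-dependent:

    # enable zstd compression on the dataset holding the data directory
    zfs set compression=zstd tank/pgdata    # zstd-1 .. zstd-19 trade CPU for ratio

    # cap the ARC at 12 GiB (matching the example above); value is in bytes
    echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_max

    # keep the ARC compressed in RAM as well (this is the default, I believe)
    echo 1 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled

    # to persist across reboots, put the same value in /etc/modprobe.d/zfs.conf:
    #   options zfs zfs_arc_max=12884901888

    # postgresql.conf: keep effective_cache_size small, since the ARC (not the
    # OS page cache) is doing the caching, as described above
    effective_cache_size = 512MB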
What's unstated is: why does this happen? Is this a performance bug that can be fixed in a later release, or an inherent io_uring limitation?
You can compare worker pool / io_uring / aio etc. using `fio` to see how the different I/O approaches perform on the same hardware.
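Something along these lines, for example; the paths and sizes are placeholders, and fio can't exactly replicate Postgres's worker pool, but a sync engine with several jobs vs. io_uring with some queue depth gets you in the right neighborhood:

    # synchronous 8K random reads spread across several processes (roughly "worker"-ish)
    fio --name=sync-randread --ioengine=psync --numjobs=8 --rw=randread --bs=8k \
        --size=4G --direct=1 --runtime=60 --time_based --filename=/data/fio.test

    # the same workload driven through io_uring from a single process
    fio --name=uring-randread --ioengine=io_uring --iodepth=32 --rw=randread --bs=8k \
        --size=4G --direct=1 --runtime=60 --time_based --filename=/data/fio.test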
With io_uring everything happens in the backend process, and so consumes some of the CPU time that might otherwise be spent executing the query. All the checksum verification, memcpy into shared buffers, etc. happen in the backend. And those things can be quite expensive. With worker this happens in the other processes, spreading the overhead.
Of course, on a truly I/O-bound workload (actually waiting on the I/O), this may not make a huge difference. For warmed-up cases it may be more significant.
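One way to see where that read work ends up being accounted is pg_stat_io (available since 16). Whether the I/O workers show up under their own backend_type in 18 is something I haven't verified, so treat this as a starting point rather than gospel:

    -- requires track_io_timing = on for the *_time columns to be populated
    SELECT backend_type, object, context, reads, read_time
      FROM pg_stat_io
     WHERE reads > 0
     ORDER BY reads DESC;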
But this is more difficult than spawning thousands of OS threads and running sync workers on them.
Some of these limitations are mostly due to Postgres design, no doubt about that.
If there are no threads, then the equivalent would be processes: you can have a process per CPU core.