High-Performance DBMSs with io_uring: When and How to Use It
Key topics
A recent research paper sheds light on the benefits and best practices of using io_uring in high-performance database management systems, and it sparked a lively discussion among commenters. The authors, including melhindi, engaged with the community, clarifying key points and updating their summary to address concerns about Docker and sandboxing limitations. Notably, topspin pointed out that the paper omitted Docker's default blocking of io_uring for security reasons, prompting the authors to revise their high-level summary. As the conversation wound down, the community was left pondering, as hayd did, whether that limitation might one day be lifted.
Snapshot generated from the HN discussion
Discussion Activity
- Status: active discussion
- First comment: 50m after posting
- Peak period: 13 comments in the 12-15h window
- Average per period: 3.3 comments
- Based on 46 loaded comments
Key moments
1. Story posted: Jan 6, 2026 at 2:29 PM EST (2d ago)
2. First comment: Jan 6, 2026 at 3:19 PM EST (50m after posting)
3. Peak activity: 13 comments in the 12-15h window, the hottest stretch of the conversation
4. Latest activity: Jan 8, 2026 at 4:33 PM EST (20h ago)
The "surface area" argument against io_uring can apply to literally any innovation. Over on LWN, there's an article on path traversal difficulties that mentions people how, because openat2(2) is often banned as inconvenient to whitelist using seccomp, eople have to work around path traversal bugs using fiddly, manual, and slow element-by-element path traversal in user space.
Ridiculous security theater. A new system call had a vulnerability in 2010 and so we're never able to take practical advantage of new kernel features ever?
(It doesn't help that gvisor refuses to acknowledge the modern world.)
Great example of descending into a shitty equilibrium because the great costs of a bad policy are diffuse but the slight benefits are concentrated.
The only effective lever is commercial pressure. All the formal methods in the world won't help when the incentive structure reinforces technical obstinacy.
There is now a memory buffer that both user space and the kernel read, and through that buffer you can _always_ issue any syscall that io_uring supports. And tools like strace, eBPF, and seccomp cannot see the actual syscalls being submitted through that buffer.
And, having something like seccomp or eBPF inspect the stream might slow it down enough to eat the performance gain.
Think of io_uring as a pair of unidirectional pipes. You shove syscalls and (pointers to) data into one pipe and the results (asynchronously) gush out of the other pipe, errors and all. Each pipe is actually a separate block of memory shared between your process and the kernel: you scribble in one and read from the other, and the kernel does the opposite.
https://github.com/axboe/liburing/discussions/1047
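To make the two-pipe picture concrete, here is a minimal liburing sketch of one request flowing through both rings. This is an illustration rather than code from the paper or the linked discussion; the file name is hypothetical and error handling is trimmed.

```c
/* Minimal sketch of io_uring's two shared rings via liburing.
 * Hypothetical file name; build with -luring. */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)   /* maps both rings into the process */
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;

    char buf[128];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); /* grab a slot in the "in" pipe */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* describe the read */
    io_uring_submit(&ring);                             /* hand it to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);             /* result arrives on the "out" pipe */
    printf("read %d bytes\n", cqe->res);        /* res < 0 would be a negated errno */
    io_uring_cqe_seen(&ring, cqe);              /* mark the completion consumed */

    io_uring_queue_exit(&ring);
    return 0;
}
```

Note that cqe->res carries either the result or a negated errno, which is the "errors and all" part of the analogy.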
Here is a fun one from September (CVE-2025-39816): "io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths."
That is an attacker's wet dream right there: bump the length and exfiltrate sensitive data. And it wasn't just some short-lived "Linus's branch" work that no one actually ran: it shipped for a time in, for example, Ubuntu 24.04 LTS (released circa 2024). I just cherry-picked that one from among many.
> With IOPOLL, completion events are polled directly from the NVMe device queue, either by the application or by the kernel SQPOLL thread (cf. Section 2), replacing interrupt-based signaling. This removes interrupt setup and handling overhead but disables non-polled I/O, such as sockets, within the same ring.
> Treating io_uring as a drop-in replacement in a traditional I/O-worker design is inadequate. Instead, io_uring requires a ring-per-thread design that overlaps computation and I/O within the same thread.
1) So does this mean that if you want to take advantage of IOPOLL, you should use two rings per thread: one for network and one for storage?
2) SQPoll is shown in the graph as outperforming IOPoll. I assume both polling options are mutually exclusive?
3) I'd be interested in what the considerations are (if any) for using IOPoll over SQPoll.
4) Additional question: I assume for a modern DBMS you would want to run this as thread-per core?
Regarding your questions:
1) Yes. If you want to take advantage of IOPOLL while still handling network I/O, you typically need two rings per thread: an IOPOLL-enabled ring for storage and a regular ring for sockets and other non-polled I/O.
2) They are not mutually exclusive. SQPOLL was enabled in addition to IOPOLL in the experiments (+SQPoll). SQPOLL affects submission, while IOPOLL changes how completions are retrieved.
3) The main trade-off is CPU usage vs. latency. SQPOLL spawns an additional kernel thread that busy-spins to issue I/O requests from the ring. With IOPOLL, interrupts are not used; instead, the device queues are polled (which does not necessarily result in 100% CPU usage on the core).
4) Yes. For a modern DBMS, a thread-per-core model is the natural fit. Rings should not be shared between threads; each thread should have its own io_uring instance(s) to avoid synchronization and for locality.
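Putting answers 1) and 2) together, a per-thread setup might look like the sketch below: one IOPOLL ring for O_DIRECT storage and one regular ring for sockets. The struct name, ring sizes, and the commented-out SQPOLL flag are assumptions for illustration, not code from the paper.

```c
/* Sketch of a ring-per-thread layout with separate storage and network
 * rings; sizes and names are illustrative. Build with -luring. */
#include <liburing.h>
#include <string.h>

struct worker_rings {
    struct io_uring storage;  /* IOPOLL ring: completions polled from NVMe queues */
    struct io_uring network;  /* regular ring: interrupt-driven, sockets allowed  */
};

static int worker_rings_init(struct worker_rings *w) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_IOPOLL;         /* poll completions instead of interrupts */
    /* p.flags |= IORING_SETUP_SQPOLL;        optionally busy-poll submissions too   */
    if (io_uring_queue_init_params(256, &w->storage, &p) < 0)
        return -1;
    if (io_uring_queue_init(256, &w->network, 0) < 0) {
        io_uring_queue_exit(&w->storage);
        return -1;
    }
    return 0;
}

int main(void) {
    struct worker_rings w;                 /* one instance per thread, never shared */
    return worker_rings_init(&w) == 0 ? 0 : 1;
}
```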
The practical guidelines are useful. Basically “first prove I/O is actually your bottleneck, then change the architecture to use async/batching, and only then reach for features like fixed buffers / zero-copy / passthrough / polling.”
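As one concrete instance of that last step, fixed buffers are registered with the kernel once so each subsequent I/O skips per-operation page pinning. Here is a minimal liburing sketch; the file name and sizes are illustrative.

```c
/* Sketch of registered (fixed) buffers; hypothetical file name.
 * Build with -luring. */
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
    io_uring_register_buffers(&ring, &iov, 1);  /* pin the pages once, up front */

    int fd = open("data.db", O_RDONLY);         /* hypothetical data file */
    if (fd < 0) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, 4096, 0, 0); /* last arg: buffer index */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```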
I'm curious, how sensitive are the results to kernel version & deployment environments? Some folks run LTS kernels and/or containers where io_uring may be restricted by default.
Not requiring fsync sounds strange to me. I may be wrong, but if the claim is that enterprise SSDs have buffers and power-failure safety modes that work fine without an explicit fsync, that seems like too optimistic a view.
However, in our experiments (including Figure 9), we bypass the page cache and issue writes using O_DIRECT to the block device. In this configuration, write completion reflects device-level persistence. For consumer SSDs without PLP, completions do not imply durability and a flush is still required.
> "When an SSD has Power-Loss Protection (PLP) -- for example, a supercapacitor that backs the write-back cache -- then the device's internal write-back cache contents are guaranteed durable even if power is lost. Because of this guarantee, the storage controller does not need to flush the cache to media in a strict ordering or slow way just to make data persistent." (Won et al., FAST 2018) https://www.usenix.org/system/files/conference/fast18/fast18...
We will make this more explicit in the next revision. Thanks.
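For drives without PLP, the durability pattern described above might look like the following sketch: an O_DIRECT write linked to an fsync so persistence is explicit. The file name and 4096-byte alignment are assumptions, and on PLP-backed enterprise drives the linked fsync could be dropped.

```c
/* Sketch: O_DIRECT write linked to an fsync for non-PLP drives.
 * Hypothetical file name; build with -luring. */
#define _GNU_SOURCE                         /* for O_DIRECT */
#include <liburing.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    int fd = open("wal.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)  /* O_DIRECT requires aligned I/O */
        return 1;
    memset(buf, 0, 4096);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, 4096, 0);
    sqe->flags |= IOSQE_IO_LINK;                /* order the fsync after the write */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);            /* the flush consumer SSDs still need */

    io_uring_submit_and_wait(&ring, 2);         /* wait for both completions */
    io_uring_queue_exit(&ring);
    return 0;
}
```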
I guess you are thinking of io_uring being disabled in Docker-esque runtimes. This restriction does not apply to a VM, especially if you can run your own kernel (e.g. EC2).
Experiment away!
Which userspace libraries support this? Liburing does, but Zig's standard library (relevant because a TigerBeetler wrote the article) does not; it just silently hands out corrupt values from the completion queue.
On the Rust side, rustix_uring does not support this, but wisely doesn't let you set kernel flags for features it doesn't support. tokio-rs/io-uring looks like it might, based on the docs, but I can't figure out how (if anyone uses it there, let me know).
I am foolishly trying to make pure Rust bindings myself [2].
[1] https://docs.rs/axboe-liburing/
[2] https://docs.rs/hringas
But yeah, the liburing test suite is massive, way more than anyone else has.