MD RAID or DRBD Can Be Broken from Userspace When Using O_DIRECT
Posted 3 months ago · Active 3 months ago
bugzilla.kernel.org · Tech · Story
Tone: heated, negative · Debate: 80/100
Key topics: Linux, RAID, Filesystem Corruption
A long-standing Linux kernel bug allows userspace to corrupt MD RAID or DRBD mirrored filesystems using O_DIRECT, sparking concerns about data integrity and Linux's reliability for mission-critical data.
Snapshot generated from the HN discussion
Discussion activity
- Engagement: moderate
- First comment: 2h after posting
- Peak period: 8 comments in the 2-4h window
- Average per period: 3 comments
- Comment distribution: 30 data points (based on 30 loaded comments)
Key moments
- 01 Story posted: Oct 18, 2025 at 7:39 AM EDT (3 months ago)
- 02 First comment: Oct 18, 2025 at 9:24 AM EDT (2h after posting)
- 03 Peak activity: 8 comments in the 2-4h window, the hottest stretch of the conversation
- 04 Latest activity: Oct 19, 2025 at 4:55 PM EDT (3 months ago)
ID: 45626549 · Type: story · Last synced: 11/20/2025, 6:56:52 PM
Read the primary article or dive into the live Hacker News thread when you're ready.
(FWIW, I appreciate that the performance impact of a full fix here might be brutal, but the suggestion of requiring a boot-arg opt-in for O_DIRECT in these cases should not have been ignored: plenty of people don't actively need, or even use, O_DIRECT, and the people who do should be required to know what they're getting into.)
(Oh, unless you are maybe talking about something orthogonal to the fixes mentioned in the discussion thread, such as some property of the extra checksumming done by these filesystems? That is, even if the disks de-synchronize, ZFS might detect an error if it reads "the wrong one" off the underlying MD RAID, rather than silently ending up with the other copy's content?)
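For anyone wondering how a mirror can end up with two different copies of the same block in the first place, here is a minimal C sketch, assuming (as this class of problem is usually described) that the trouble is the source buffer changing while an O_DIRECT write is still in flight, so each mirror leg can DMA different bytes. The path, block size, and loop counts are hypothetical; this is an illustration, not the reproducer from the bug report.

    /* Illustrative sketch (not the actual reproducer): an O_DIRECT write whose
     * source buffer is concurrently modified by another thread. Each mirror leg
     * may DMA the buffer at a slightly different moment, so the legs can end up
     * holding different data for the same block. Build with: gcc -O2 -pthread */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK 4096                          /* typical O_DIRECT alignment */

    static char *buf;
    static volatile int stop;

    static void *scribbler(void *arg)
    {
        (void)arg;
        while (!stop)                         /* keep changing the buffer while */
            memset(buf, rand() & 0xff, BLK);  /* the write may still be in flight */
        return NULL;
    }

    int main(void)
    {
        /* "/mnt/md0/testfile" is a hypothetical path on a mirrored volume. */
        int fd = open("/mnt/md0/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0 || posix_memalign((void **)&buf, BLK, BLK) != 0)
            return 1;

        pthread_t t;
        pthread_create(&t, NULL, scribbler, NULL);

        for (int i = 0; i < 100000; i++)      /* many racing writes to block 0 */
            pwrite(fd, buf, BLK, 0);

        stop = 1;
        pthread_join(t, NULL);
        close(fd);
        return 0;
    }

After a run like this, the legs of the mirror can disagree about block 0, and which version a later read returns depends on which leg services it.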
I run btrfs on top of mdraid in RAID6 so I can incrementally grow it while still having copy-on-write, checksums, snapshots, etc.
I hope that one day btrfs fixes its parity RAID, or that bcachefs becomes stable enough to fully replace mdraid. In the meantime I'll continue using mdraid with a copy-on-write filesystem on top.
Indeed out of date: that was merged a long time ago and shipped in a stable version earlier this year.
When the actual checksum of what was read from storage doesn't match the expected value, it will try reading alternate locations (if there are any), and it will write back the corrected block if it succeeds in reconstructing a block with the expected checksum.
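As a toy model of that read-repair flow, here is a self-contained C sketch: two in-memory "copies" stand in for mirrored storage and an FNV-1a hash stands in for crc32c. None of this is btrfs code; it just follows the logic described above (verify against the expected checksum, fall back to another copy, write the good data back over the bad one).

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLK     4096
    #define NCOPIES 2

    static uint8_t copies[NCOPIES][BLK];       /* toy stand-in for mirrored storage */

    static uint32_t checksum(const uint8_t *p, size_t n)   /* stand-in for crc32c */
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 16777619u; }
        return h;
    }

    /* Verify against the expected checksum; on mismatch try the other copy,
     * and overwrite any copy that fails the check with the good data. */
    static int read_with_repair(uint32_t want, uint8_t *out)
    {
        for (int c = 0; c < NCOPIES; c++) {
            memcpy(out, copies[c], BLK);
            if (checksum(out, BLK) != want)
                continue;                      /* bad copy, try the next one */
            for (int d = 0; d < NCOPIES; d++)
                if (checksum(copies[d], BLK) != want)
                    memcpy(copies[d], out, BLK);   /* repair the bad copy */
            return c;                          /* index of the copy that was good */
        }
        return -1;                             /* every copy failed its checksum */
    }

    int main(void)
    {
        memset(copies[0], 'A', BLK);
        memset(copies[1], 'A', BLK);
        uint32_t want = checksum(copies[0], BLK);

        copies[0][123] ^= 0xff;                /* corrupt the first copy */

        uint8_t out[BLK];
        int good = read_with_repair(want, out);
        printf("served from copy %d; copy 0 repaired: %s\n", good,
               checksum(copies[0], BLK) == want ? "yes" : "no");
        return 0;
    }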
This has only recently been fixed by disabling O_DIRECT for files with checksums (which is the default): https://lore.kernel.org/linux-btrfs/54c7002136a047b7140c3647...
ZFS treats O_DIRECT as a no-op as well, as far as I know.
No wonder O_DIRECT never saw much love.
"I hope some day we can just rip the damn disaster out."
-- Linus Torvalds, 2007
https://lkml.org/lkml/2007/1/10/235
Something like O_DIRECT is critical for high-performance storage software, for well-understood reasons. It enables entire categories of optimization by breaking a kernel abstraction that is intrinsically unfit for purpose; there is no way to fix this in the kernel, because the existence of the abstraction is itself the problem as a matter of theory.
As a database performance enjoyer, I've been using O_DIRECT for 15+ years. Something like it will always exist because removing it would make some high-performance, high-scale software strictly worse.
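For anyone who has not used it directly, here is a minimal C example of what open(O_DIRECT) demands of the caller. The flag is standard on Linux; the path and the 4 KiB figure are assumptions for illustration (the real alignment requirement is the device's logical block size).

    #define _GNU_SOURCE                        /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK 4096                           /* assumed logical block size */

    int main(void)
    {
        int fd = open("/var/tmp/odirect-demo", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLK, BLK) != 0)   /* buffer must be aligned */
            return 1;
        memset(buf, 'x', BLK);

        /* The file offset and transfer size must also be multiples of the
         * block size, or the kernel rejects the I/O with EINVAL. */
        if (pwrite(fd, buf, BLK, 0) != BLK) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }

The write bypasses the page cache entirely, which is exactly the property the database crowd is after: the kernel neither caches nor schedules the data, the application does.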
"So is the original requirement for O_DIRECT addressed completely by O_SYNC and O_DSYNC"
I'm guessing you'd say "no"
The practical purpose of O_DIRECT is to have precise visibility and control over what is in memory, what is on disk, and any inflight I/O operations. This opens up an entire category of workload-aware execution scheduling optimizations that become crucial for performance as storage sizes increase.
But that method doesn't necessarily have to be "something like O_DIRECT", which turns into a composition/complexity nightmare all for the sake of preserving the traditional open()/write()/close() interface. If you're really that concerned about performance, it's probably better to use an API that reflects the OS-level view of your data, as Linus pointed out in this ancient (2002!) thread:
https://yarchive.net/comp/linux/o_direct.html
Or, as noted in the 2007 thread that someone else linked above, at least posix_fadvise() lets the user specify a definite extent for the uncached region, which is invaluable information for the block and FS layers but not usually communicated at the time of open().
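A sketch of that alternative, assuming a simple single-pass scan: read through the page cache as usual, but tell the kernel the exact extent you are done with via posix_fadvise(POSIX_FADV_DONTNEED). The path and chunk size are placeholders.

    #define _POSIX_C_SOURCE 200112L            /* for posix_fadvise() */
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                    /* drop cache in 1 MiB extents */

    /* Scan a file once, releasing the cached pages behind us as we go. */
    int stream_once(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        char buf[64 * 1024];
        off_t done = 0;
        ssize_t n;

        while ((n = read(fd, buf, sizeof buf)) > 0) {
            done += n;
            if (done % CHUNK == 0)             /* a definite (offset, length), */
                posix_fadvise(fd, done - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
        }                                      /* unlike a blanket open(O_DIRECT) */
        close(fd);
        return n < 0 ? -1 : 0;
    }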
I think it's quite reasonable to consider the real problem to be the user code that, after 20 years, hasn't managed to migrate to something more sophisticated than open(O_DIRECT), rather than Linux's ability to handle every single cache-invalidation corner case in every possible composition of block-device wrappers. It really is a poorly-thought-out API from the OS implementor's perspective, even if it seems simple and welcoming to an unsophisticated user at first.
O_DIRECT is used to disable cache replacement algorithms entirely in contexts where their NP-hardness becomes unavoidably pathological. You can't fix a "fundamentally broken algorithm" with more knobs.
The canonical solution for workloads that break cache replacement is to dynamically rewrite the workload execution schedule in realtime at a very granular level. A prerequisite for this when storage is involved is to have perfect visibility and control over what is in memory, what is on disk, and any inflight I/O operations. The execution sequencing and I/O schedule are intertwined to the point of being essentially the same bit of code. For things like database systems this provides qualitative integer factor throughput improvements for many workloads, so very much worth the effort.
Without O_DIRECT, Linux will demonstrably destroy the performance of the carefully orchestrated schedule by obliviously running it through cache replacement algorithms in an attempt to be helpful. More practically, O_DIRECT also gives you fast, efficient visibility over the state of all storage the process is working with, which you need regardless.
Even if Linux handed strict, explicit control of the page cache to the database process, it wouldn't solve the problem. Rewriting the execution schedule requires running algorithms across the internal page-cache metadata. In modern systems this may be done 100 million times per second in userspace; you aren't gatekeeping analysis of that metadata with a syscall, and the way Linux organizes and manages this metadata couldn't support that operation rate regardless.
Linux still needs to work well for processes that are well-served by normal cache replacement algorithms. O_DIRECT is perfectly adequate for disabling cache replacement algorithms in contexts where no one should be using cache replacement algorithms.
The way I was told it, if the database engine implements its own cache (like InnoDB, and presumably Oracle), you're just "doubling up" if you also use the OS cache. Perhaps Oracle is happy with its own internal caching (for reads).
I've seen a DB guy insist on O_DIRECT without implementing array-controller cache-battery alerting, or checking whether the drives themselves had their caches disabled. Nope: "O_DIRECT fixes everything!" These days enterprise-class SSDs have little batteries and capacitors to handle power loss, so in the right circumstances that part is kinda resolved too, but like the array-controller cache batteries, it's one more thing you have to monitor if you're running your own hardware.
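One hedged way to address that concern in code rather than with faith in batteries: pair O_DIRECT with O_DSYNC, so each write asks for data-integrity completion, which on Linux means the kernel issues the cache flush/FUA that devices with volatile write caches need (subject to the filesystem and device actually honoring it). The path below is a placeholder.

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Open for direct, synchronous I/O: every pwrite() on this descriptor
     * should complete only once the data has reached stable storage, not
     * just the drive's or controller's volatile cache. */
    int open_for_durable_io(const char *path)
    {
        return open(path, O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    }

Calling fdatasync() after a batch of plain O_DIRECT writes gets you the same durability at a coarser granularity.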
Overall, there are grossly insufficient comprehensive testing tools, techniques, and culture in FOSS; FreeBSD, Linux, and most projects rely upon informal, under-documented, ad-hoc, meat-based scream testing rather than proper, formal verification of correctness. No one ever said high-confidence software engineering was easy, but it's essential for avoiding entire classes of CVEs and unexpected-operation bugs.
0: https://www.freebsd.org/releases/13.0R/relnotes/
1: https://lists.freebsd.org/pipermail/freebsd-fs/2018-December...