Disks Lie: Building a WAL That Actually Survives
Key topics
The question of how to build a reliable Write-Ahead Log (WAL) that survives disk failures sparked a lively discussion, with commenters weighing in on the nuances of ensuring data durability. Some, like compressedgas, pointed out the importance of fsyncing directory entries, while others, like jmpman, shared harrowing tales of disk failures, including off-track writes and scrubbed platters, highlighting the need for robust data-integrity measures. The conversation took a meta turn when commenters began questioning the author's writing style, with some, like breakingcups and nmilo, suspecting AI-generated content due to phrases like "it's a contract." Amid the discussion, jmpman's suggestions for paranoid data validation, including LBA-seeded CRCs and checks on the SCSI FUA bit, underscored the complexity of crafting a production-grade WAL.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 6h after posting
- Peak period: 37 comments (66-72h after posting)
- Avg per period: 7.4
Based on 52 loaded comments
Key moments
1. Story posted: Dec 11, 2025 at 8:16 PM EST (22 days ago)
2. First comment: Dec 12, 2025 at 2:22 AM EST (6h after posting)
3. Peak activity: 37 comments in the 66-72h window, the hottest stretch of the conversation
4. Latest activity: Dec 15, 2025 at 2:35 PM EST (18 days ago)
> A production-grade WAL isn't just code, it's a contract.
I hate that I'm now suspicious of this formulation.
The language technique of negative parallel construction is a classic signal of AI writing.
This, along with RAID-1, is probably sufficient to catch the majority of errors. But realize that these are just probabilities - if the failure can happen on the first drive, it can also happen on the second. A merkle tree is commonly used to also protect against these scenarios.
Notice that using something like RAID-5 can result in data corruption migrating throughout the stripe when using certain write algorithms:
https://www.usenix.org/legacy/event/fast08/tech/full_papers/...
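For illustration, here is a minimal sketch of the LBA-seeded CRC idea raised in the thread, using zlib's crc32(); the layout and names are illustrative, not from the article:

```c
// Sketch: a block checksum seeded with the block's logical address (LBA),
// so a structurally valid block that landed at the wrong LBA (an off-track
// or misdirected write) fails verification on read. Uses zlib's crc32().
#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Fold the LBA into the CRC before hashing the payload. */
static uint32_t block_crc(uint64_t lba, const unsigned char *data, size_t len)
{
    uint32_t crc = crc32(0L, Z_NULL, 0);           /* initial CRC value */
    unsigned char lba_bytes[8];
    memcpy(lba_bytes, &lba, sizeof lba);           /* use a fixed endianness in real code */
    crc = crc32(crc, lba_bytes, sizeof lba_bytes);
    return (uint32_t)crc32(crc, data, (uInt)len);
}

/* On read: recompute with the LBA we *asked* for and compare to the checksum
 * stored alongside the block. A mismatch means corruption or a misdirected write. */
static int block_verify(uint64_t lba, const unsigned char *data, size_t len,
                        uint32_t stored_crc)
{
    return block_crc(lba, data, len) == stored_crc;
}
```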
The most vexing storage failure is phantom writes. A disk read returns a "valid" page, just not the last written/fsync-ed version of that page. Reliably detecting this case is very expensive, particularly on large storage volumes, so it is rarely done for storage where performance is paramount.
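One common mitigation is to track, out of band, the version each page should be at and compare it on read; a rough sketch of that check, with hypothetical names:

```c
// Sketch: detecting a "phantom" (lost) write by comparing the version number
// stamped into a page header against the version we expect, tracked elsewhere
// (e.g. recorded in the WAL or a page map). All names are illustrative.
#include <stdint.h>

struct page_header {
    uint64_t page_no;   /* which page this claims to be         */
    uint64_t version;   /* monotonically increasing write count */
    uint32_t crc;       /* checksum of the rest of the page     */
};

/* expected_version comes from separately persisted metadata; keeping that
 * metadata durable and cheap to consult is what makes this expensive. */
static int page_is_stale(const struct page_header *h, uint64_t expected_version)
{
    return h->version < expected_version;  /* disk returned an older, "valid" page */
}
```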
It would suddenly become blank. You have an OS and some data today, and tomorrow you wake up and everything claims it is empty. It would still work, though. You could still install a new OS and keep going, and it would work until next time.
What a friendly surprise on exam week.
Sold it to a friend for really cheap with a warning about what had been happening.
I have been passing my anxieties about hard drives to junior engineers for a decade now.
I don't know. Now async I/O is all the rage and that is the same idea.
> Link fsync to that write (IOSQE_IO_LINK)
> The fsync's completion queue entry only arrives after the write completes
> Repeat for secondary file
Wait, so the OS can re-order the `fsync` to happen before the write request it is syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.
> O_DSYNC: Synchronous writes. Don't return from write() until the data is actually stable on the disk.
If you call `fsync` this isn't needed correct? And if you use this, then `fsync` isn't needed right?
This is an io_uring-specific thing. It doesn't guarantee any ordering between operations submitted at the same time, unless you explicitly ask it to with the `IOSQE_IO_LINK` they mentioned.
Otherwise it's as if you called write() from one thread and fsync() from another, before waiting for the write() call to return. That obviously defeats the point of using fsync() so you wouldn't do that.
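As a concrete illustration of that linked submission (a sketch assuming liburing; error handling is minimal), the write and its fsync can be chained like this:

```c
// Sketch (assuming liburing): submit a write and an fsync together, linked
// with IOSQE_IO_LINK so the fsync only starts after the write completes.
// Without the link flag, the kernel may run them in any order.
#include <liburing.h>
#include <fcntl.h>

int append_and_sync(int fd, const void *buf, unsigned len, __u64 offset)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, offset);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);      /* chain to the next SQE */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    io_uring_submit(&ring);

    /* Reap both completions; the fsync CQE arrives only after the write.
     * If the write fails, the linked fsync completes with -ECANCELED. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res < 0) {
            io_uring_queue_exit(&ring);
            return -1;
        }
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```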
> If you call fsync(), [O_DSYNC] isn't needed correct? And if you use [O_DSYNC], then fsync() isn't needed right?
I believe you're right.
Related: I would think that grouping your writes and then fsyncing, rather than fsyncing every time, would be more efficient, but it looks like a previous commenter did some testing and found that isn't always the case: https://news.ycombinator.com/item?id=15535814
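For reference, the batching idea ("group commit") usually looks something like this minimal sketch: several serialized records go out in one writev() and share a single fdatasync():

```c
// Sketch of group commit: batch several log records into one writev() and
// amortize a single fdatasync() over all of them. Whether this wins depends
// on the device and workload, as the linked thread notes.
#include <sys/uio.h>
#include <unistd.h>

/* records[]: already-serialized WAL records waiting to be committed.
 * A real implementation would also handle short writes from writev(). */
int commit_batch(int fd, const struct iovec *records, int nrecords)
{
    ssize_t n = writev(fd, records, nrecords);  /* one syscall for the batch */
    if (n < 0)
        return -1;
    return fdatasync(fd);                       /* one flush for the batch   */
}
```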
Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.
I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:
* If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).
* If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
btw, a clarification about my earlier comment: `O_SYNC` (no `D`) should be equivalent to calling `fsync` after every write. `O_DSYNC` should be equivalent to calling the weaker `fdatasync` after every write. The difference is whether the metadata stored in the inode is also flushed.
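In code, the two flavors being compared look roughly like this sketch (paths and error handling are illustrative):

```c
// Sketch: the two roughly equivalent ways to get durable appends discussed
// above. O_DSYNC makes each write() behave as if followed by fdatasync();
// without it you issue the fdatasync() yourself. (O_SYNC corresponds to fsync().)
#include <fcntl.h>
#include <unistd.h>

int variant_a(const char *path, const void *buf, size_t len)
{
    /* Every write() returns only once the data is stable on the device. */
    int fd = open(path, O_WRONLY | O_APPEND | O_DSYNC);
    if (fd < 0) return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

int variant_b(const char *path, const void *buf, size_t len)
{
    /* Ordinary buffered write into the page cache, then an explicit flush. */
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) return -1;
    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? fdatasync(fd) : -1;
    close(fd);
    return rc;
}
```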
> * If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).
I guess I meant exclusively in terms of writing to the WAL. As I understand it, most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API, or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.
> * If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
Makes sense
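A sketch of that rename-into-place chain, assuming liburing and a kernel with IORING_OP_RENAMEAT (roughly 5.11+); names are illustrative:

```c
// Sketch: the rename-a-temporary-file-into-place trick as one linked chain --
// write the temp file, fsync it, then rename it over the target. Each
// IOSQE_IO_LINK ties the next step to the previous one's completion.
#include <liburing.h>
#include <fcntl.h>

int replace_atomically(struct io_uring *ring, int tmp_fd,
                       const void *buf, unsigned len,
                       const char *tmp_path, const char *final_path)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, tmp_fd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, tmp_fd, 0);          /* full fsync of the temp file */
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_renameat(sqe, AT_FDCWD, tmp_path, AT_FDCWD, final_path, 0);

    /* Submit the whole chain; if any step fails, the rest are canceled.
     * A real implementation would reap the three CQEs and check cqe->res,
     * and would also fsync the containing directory so the rename itself
     * is durable -- the directory-entry point raised elsewhere in the thread. */
    return io_uring_submit(ring) == 3 ? 0 : -1;
}
```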
I think they do need to ensure that the page doesn't get flushed before the log entry in some manner. This might happen naturally if they're doing it in single-threaded code without io_uring (or any other form of async IO). With io_uring, it could be a matter of waiting for the completion entry for the log write before submitting the page write, but it could be the link instead.
Yes, I agree. I meant that they synchronously write the log entries, then return success to the caller, and then deal with the dirty data pages.
If you can't be bothered to write it, I can't be bothered to read it.
These do not make the writing better! They obscure whatever the insight is behind LinkedIn-engagement tricks and turns of phrase that obfuscate rather than clarify.
I’ll keep flagging and see if the community ends up agreeing with me, but this is making more and more of my hn experience disappointing instead of delightful.
https://www.usenix.org/legacy/event/fast08/tech/full_papers/...
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
The OS page cache is not a "problem"; it's a basic feature with well-documented properties that you need to learn if you want to persist data. The writing style seems off in general (e.g. "you're lying to yourself").
AFAIK fsync is the best practice, not O_DIRECT + O_DSYNC. The article mentions O_DSYNC in some places and fsync in others, which is confusing. You don't need both.
Personally I would prefer to use the filesystem (RAID or ditto) to handle latent sector errors (LSEs) rather than duplicating files at the app level. A case could be made for dual WALs if you don't know or control what filesystem will be used.
Due to the page cache, attempting to verify writes by reading the data back won't verify anything. Maaaybe this will work when using O_DIRECT.
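For completeness, a sketch of what that O_DIRECT read-back check might look like; note that even this only proves what the device chooses to return, not what is on the platter:

```c
// Sketch: reading data back through O_DIRECT to bypass the page cache and
// compare against what was written. O_DIRECT requires aligned buffers,
// offsets, and lengths, hence posix_memalign and the 4096-multiple length.
// The device's own cache can still answer, so this remains a weak check.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int verify_block(const char *path, off_t offset,
                 const void *expected, size_t len /* multiple of 4096 */)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, len) != 0) {
        close(fd);
        return -1;
    }

    int ok = pread(fd, buf, len, offset) == (ssize_t)len &&
             memcmp(buf, expected, len) == 0;

    free(buf);
    close(fd);
    return ok ? 0 : -1;
}
```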