Disks Lie: Building a WAL That Actually Survives
Key topics
The question of how to build a reliable Write-Ahead Log (WAL) that survives disk failures sparked a lively discussion, with commenters weighing in on the nuances of ensuring data durability. Some, like compressedgas, pointed out the importance of fsyncing directory entries, while others, like jmpman, shared harrowing tales of disk failures, including off-track writes and scrubbed platters, highlighting the need for robust data-integrity measures. The conversation took a meta turn when commenters began questioning the author's writing style, with some, like breakingcups and nmilo, suspecting AI-generated content due to phrases like "it's a contract." Amid the discussion, jmpman's suggestions for paranoid data validation, including LBA-seeded CRCs and checks on the SCSI FUA bit, underscored the complexity of crafting a production-grade WAL.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 6h after posting
- Peak period: 37 comments (66-72h after posting)
- Avg per period: 7.4
Based on 52 loaded comments
Key moments
1. Story posted: Dec 11, 2025 at 8:16 PM EST (22 days ago)
2. First comment: Dec 12, 2025 at 2:22 AM EST (6h after posting)
3. Peak activity: 37 comments in the 66-72h window, the hottest stretch of the conversation
4. Latest activity: Dec 15, 2025 at 2:35 PM EST (18 days ago)
> A production-grade WAL isn't just code, it's a contract.
I hate that I'm now suspicious of this formulation.
The language technique of negative parallel construction is a classic signal of AI writing.
This, along with RAID-1, is probably sufficient to catch the majority of errors. But realize that these are just probabilities - if the failure can happen on the first drive, it can also happen on the second. A merkle tree is commonly used to also protect against these scenarios.
Notice that using something like RAID-5 can result in data corruption migrating throughout the stripe when using certain write algorithms:
https://www.usenix.org/legacy/event/fast08/tech/full_papers/...
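For illustration, here is a minimal sketch of the LBA-seeded CRC idea raised in the thread, using zlib's crc32(); the layout and names are illustrative, not from the article:

```c
// Sketch: a block checksum seeded with the block's logical address (LBA),
// so a structurally valid block that landed at the wrong LBA (an off-track
// or misdirected write) fails verification on read. Uses zlib's crc32().
#include <stdint.h>
#include <string.h>
#include <zlib.h>

/* Fold the LBA into the CRC before hashing the payload. */
static uint32_t block_crc(uint64_t lba, const unsigned char *data, size_t len)
{
    uint32_t crc = crc32(0L, Z_NULL, 0);           /* initial CRC value */
    unsigned char lba_bytes[8];
    memcpy(lba_bytes, &lba, sizeof lba);           /* use a fixed endianness in real code */
    crc = crc32(crc, lba_bytes, sizeof lba_bytes);
    return (uint32_t)crc32(crc, data, (uInt)len);
}

/* On read: recompute with the LBA we *asked* for and compare to the checksum
 * stored alongside the block. A mismatch means corruption or a misdirected write. */
static int block_verify(uint64_t lba, const unsigned char *data, size_t len,
                        uint32_t stored_crc)
{
    return block_crc(lba, data, len) == stored_crc;
}
```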
The most vexing storage failure is phantom writes. A disk read returns a "valid" page, just not the last written/fsync-ed version of that page. Reliably detecting this case is very expensive, particularly on large storage volumes, so it is rarely done for storage where performance is paramount.
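One common mitigation is to track, out of band, the version each page should be at and compare it on read; a rough sketch of that check, with hypothetical names:

```c
// Sketch: detecting a "phantom" (lost) write by comparing the version number
// stamped into a page header against the version we expect, tracked elsewhere
// (e.g. recorded in the WAL or a page map). All names are illustrative.
#include <stdint.h>

struct page_header {
    uint64_t page_no;   /* which page this claims to be         */
    uint64_t version;   /* monotonically increasing write count */
    uint32_t crc;       /* checksum of the rest of the page     */
};

/* expected_version comes from separately persisted metadata; keeping that
 * metadata durable and cheap to consult is what makes this expensive. */
static int page_is_stale(const struct page_header *h, uint64_t expected_version)
{
    return h->version < expected_version;  /* disk returned an older, "valid" page */
}
```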
It would suddenly become blank. You have an OS and some data today, and tomorrow you wake up and everything claims it is empty. It would still work, though. You could still install a new OS and keep going, and it would work until next time.
What a friendly surprise on exam week.
Sold it to a friend for really cheap with a warning about what had been happening.
I have been passing my anxieties about hard drives to junior engineers for a decade now.
I don't know. Now async I/O is all the rage and that is the same idea.
> Link fsync to that write (IOSQE_IO_LINK)
> The fsync's completion queue entry only arrives after the write completes
> Repeat for secondary file
Wait, so the OS can re-order the `fsync` to happen before the write request it is syncing? Is there a citation or link to some code for that? It seems too ridiculous to be real.
> O_DSYNC: Synchronous writes. Don't return from write() until the data is actually stable on the disk.
If you call `fsync` this isn't needed correct? And if you use this, then `fsync` isn't needed right?
This is an io_uring-specific thing. It doesn't guarantee any ordering between operations submitted at the same time, unless you explicitly ask it to with the `IOSQE_IO_LINK` they mentioned.
Otherwise it's as if you called write() from one thread and fsync() from another, before waiting for the write() call to return. That obviously defeats the point of using fsync() so you wouldn't do that.
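As a concrete illustration of that linked submission (a sketch assuming liburing; error handling is minimal), the write and its fsync can be chained like this:

```c
// Sketch (assuming liburing): submit a write and an fsync together, linked
// with IOSQE_IO_LINK so the fsync only starts after the write completes.
// Without the link flag, the kernel may run them in any order.
#include <liburing.h>
#include <fcntl.h>

int append_and_sync(int fd, const void *buf, unsigned len, __u64 offset)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, offset);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);      /* chain to the next SQE */

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    io_uring_submit(&ring);

    /* Reap both completions; the fsync CQE arrives only after the write.
     * If the write fails, the linked fsync completes with -ECANCELED. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) < 0 || cqe->res < 0) {
            io_uring_queue_exit(&ring);
            return -1;
        }
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```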
> If you call fsync(), [O_DSYNC] isn't needed correct? And if you use [O_DSYNC], then fsync() isn't needed right?
I believe you're right.
Related: I would think that grouping your writes and then fsyncing, rather than fsyncing every time, would be more efficient, but it looks like a previous commenter did some testing and found that isn't always the case: https://news.ycombinator.com/item?id=15535814
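For reference, the batching idea ("group commit") usually looks something like this minimal sketch: several serialized records go out in one writev() and share a single fdatasync():

```c
// Sketch of group commit: batch several log records into one writev() and
// amortize a single fdatasync() over all of them. Whether this wins depends
// on the device and workload, as the linked thread notes.
#include <sys/uio.h>
#include <unistd.h>

/* records[]: already-serialized WAL records waiting to be committed.
 * A real implementation would also handle short writes from writev(). */
int commit_batch(int fd, const struct iovec *records, int nrecords)
{
    ssize_t n = writev(fd, records, nrecords);  /* one syscall for the batch */
    if (n < 0)
        return -1;
    return fdatasync(fd);                       /* one flush for the batch   */
}
```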
Even if you were doing primary and secondary log file writes, they are to different files so it doesn't matter if they race.
I think there are a lot of reasons to use this flag besides a write()+f(data)sync() sequence:
* If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).
* If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
btw, a clarification about my earlier comment: `O_SYNC` (no `D`) should be equivalent to calling `fsync` after every write. `O_DSYNC` should be equivalent to calling the weaker `fdatasync` after every write. The difference is whether the metadata stored in the inode is also flushed.
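In code, the two flavors being compared look roughly like this sketch (paths and error handling are illustrative):

```c
// Sketch: the two roughly equivalent ways to get durable appends discussed
// above. O_DSYNC makes each write() behave as if followed by fdatasync();
// without it you issue the fdatasync() yourself. (O_SYNC corresponds to fsync().)
#include <fcntl.h>
#include <unistd.h>

int variant_a(const char *path, const void *buf, size_t len)
{
    /* Every write() returns only once the data is stable on the device. */
    int fd = open(path, O_WRONLY | O_APPEND | O_DSYNC);
    if (fd < 0) return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

int variant_b(const char *path, const void *buf, size_t len)
{
    /* Ordinary buffered write into the page cache, then an explicit flush. */
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) return -1;
    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? fdatasync(fd) : -1;
    close(fd);
    return rc;
}
```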
> * If you're putting something in a write-ahead log then applying it to the primary storage, you want it to be fully committed to the write-ahead log before you start changing the primary storage, so if there's a crash halfway through the primary storage change you can use the log to get to a consistent state (via undo or redo).
I guess I meant exclusively in terms of writing to the WAL. As I understand it, most DBMSes synchronously write the log entries for a transaction and asynchronously write the data pages to disk via a separate API, or just mark the pages as dirty and let the buffer pool manager flush them to disk at its discretion.
> * If you're trying to atomically replace a file via the rename-a-temporary-file-into-place trick, you can submit the whole operation to the ring at once, but you'd want to use `IOSQE_IO_LINK` to ensure the temporary file is fully written/synced before the rename happens.
Makes sense
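A sketch of that rename-into-place chain, assuming liburing and a kernel with IORING_OP_RENAMEAT (roughly 5.11+); names are illustrative:

```c
// Sketch: the rename-a-temporary-file-into-place trick as one linked chain --
// write the temp file, fsync it, then rename it over the target. Each
// IOSQE_IO_LINK ties the next step to the previous one's completion.
#include <liburing.h>
#include <fcntl.h>

int replace_atomically(struct io_uring *ring, int tmp_fd,
                       const void *buf, unsigned len,
                       const char *tmp_path, const char *final_path)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, tmp_fd, buf, len, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, tmp_fd, 0);          /* full fsync of the temp file */
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_renameat(sqe, AT_FDCWD, tmp_path, AT_FDCWD, final_path, 0);

    /* Submit the whole chain; if any step fails, the rest are canceled.
     * A real implementation would reap the three CQEs and check cqe->res,
     * and would also fsync the containing directory so the rename itself
     * is durable -- the directory-entry point raised elsewhere in the thread. */
    return io_uring_submit(ring) == 3 ? 0 : -1;
}
```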
I think they do need to ensure that the page doesn't get flushed before the log entry in some manner. This might happen naturally if they're doing it in single-threaded code without io_uring (or any other form of async IO). With io_uring, it could be a matter of waiting for the completion entry for the log write before submitting the page write, but it could be the link instead.
Yes, I agree. I meant that they synchronously write the log entries, then return success to the caller, and then deal with the dirty data pages.
If you can't be bothered to write it, I can't be bothered to read it.
These do not make the writing better! They obscure whatever the insight is behind LinkedIn-engagement tricks and turns of phrase that obfuscate rather than clarify.
I’ll keep flagging and see if the community ends up agreeing with me, but this is making more and more of my hn experience disappointing instead of delightful.
https://www.usenix.org/legacy/event/fast08/tech/full_papers/...
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
The OS page cache is not a "problem"; it's a basic feature with well-documented properties that you need to learn if you want to persist data. The writing style seems off in general (e.g. "you're lying to yourself").
AFAIK fsync is the best practice, not O_DIRECT + O_DSYNC. The article mentions O_DSYNC in some places and fsync in others, which is confusing. You don't need both.
Personally I would prefer to use the filesystem (RAID or ditto) to handle latent sector errors (LSEs) rather than duplicating files at the app level. A case could be made for dual WALs if you don't know or control what filesystem will be used.
Due to the page cache, attempting to verify writes by reading the data back won't verify anything. Maaaybe this will work when using O_DIRECT.
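For completeness, a sketch of what that O_DIRECT read-back check might look like; note that even this only proves what the device chooses to return, not what is on the platter:

```c
// Sketch: reading data back through O_DIRECT to bypass the page cache and
// compare against what was written. O_DIRECT requires aligned buffers,
// offsets, and lengths, hence posix_memalign and the 4096-multiple length.
// The device's own cache can still answer, so this remains a weak check.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int verify_block(const char *path, off_t offset,
                 const void *expected, size_t len /* multiple of 4096 */)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, len) != 0) {
        close(fd);
        return -1;
    }

    int ok = pread(fd, buf, len, offset) == (ssize_t)len &&
             memcmp(buf, expected, len) == 0;

    free(buf);
    close(fd);
    return ok ? 0 : -1;
}
```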