Put a Ring on It: A Lock-Free MPMC Ring Buffer
Key topics
The quest for a lock-free MPMC (multi-producer, multi-consumer) ring buffer has sparked a lively discussion, with the original author revealing their 5-year-old work and commenters chiming in with related research and implementations. Some pointed out similar existing work, such as the FASTER key-value store and the LMAX Disruptor, while others shared alternative implementations, like the Nim-loony library and looqueue-rs. The conversation highlights the complexity of designing efficient concurrent data structures, with the author clarifying that their work was developed independently, and others noting that the ideas aren't entirely trivial but follow naturally from the problem space. As the discussion unfolds, it becomes clear that the pursuit of lock-free data structures remains an active area of research and innovation.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 14m after posting
- Peak period: 16 comments in 0-2h
- Avg / period: 4.1
Based on 45 loaded comments
Key moments
- 01 Story posted: Dec 16, 2025 at 8:32 AM EST (17 days ago)
- 02 First comment: Dec 16, 2025 at 8:46 AM EST (14m after posting)
- 03 Peak activity: 16 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Dec 17, 2025 at 4:34 PM EST (16 days ago)
However, this is well written and very easy to read.
However, I definitely had not seen the paper you cited before, but I just spent a few minutes with it. Section 5.2 seems to cover their version. It's far from clear what their algorithm is, but they are talking about page eviction; it doesn't seem like they're using even a single fixed array for the ring, though I'm not 100% sure because the paper is pretty light on detail.
There are definitely designs that can deal with non-POD data of variable size, although that does imply heterogeneous types, which will need some sort of type erasure to destroy safely.
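For illustration, here is one common way to get that type erasure, as a minimal C++ sketch; `ErasedSlot`, its size cap, and its method names are hypothetical, not from the article or any library mentioned in the thread. The idea is to construct the value in-place and record a function pointer that knows how to run the right destructor.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical type-erased slot: stores any sufficiently small T in-place,
// plus a function pointer that remembers how to destroy it.
struct ErasedSlot {
    alignas(std::max_align_t) unsigned char storage[64];
    void (*destroy)(void*) = nullptr;  // type-erased destructor

    template <typename T>
    void emplace(T&& value) {
        using U = std::decay_t<T>;
        static_assert(sizeof(U) <= sizeof(storage), "value too large for slot");
        static_assert(alignof(U) <= alignof(std::max_align_t), "over-aligned type");
        ::new (storage) U(std::forward<T>(value));
        destroy = [](void* p) { static_cast<U*>(p)->~U(); };
    }

    void reset() {  // safe to call even when the slot is empty
        if (destroy) { destroy(storage); destroy = nullptr; }
    }
};
```

A consumer that pops a slot can then call `reset()` without ever knowing which concrete type a producer put in.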
Which is based on: https://ieeexplore.ieee.org/document/9490347
Virtually zero? I have to go out of my way to remind HN it exists while everyone is 300+ comments deep into reinventing the wheel on a ~quarterly basis.
> have to go out of my way
Yeah, that's exactly the annoying part. You can't ever mention a ring buffer without someone bringing up LMAX. "But did you know that some Java developers somewhere once wrote something that was not completely slow?!"
[0]: https://h4x0r.org/futex/ discussion: https://news.ycombinator.com/item?id=44951563
maybe that is what you want.
In the JS ecosystem, buffers that allow data loss (aka ring buffers) are more common, but ringbuf.js [1] is the only complete implementation to my knowledge. In my use case of I/O between WASM modules, where data must be transferred as-is, the process must block on buffer overrun/underrun in a synchronous manner; therefore a circular buffer is required. I could not find such a niche library written in JS, so I decided to bite the bullet and reinvent the wheel [2].
[1]: https://github.com/padenot/ringbuf.js
[2]: https://github.com/andy0130tw/spsc
Btw, kaze-core uses a `used` atomic variable to avoid reading both readPos/writePos in its routines; the two positions themselves are not atomic at all.
In JavaScript, atomic operations are relatively lightweight, so their overhead is likely acceptable. Given that, I am open to adjusting my code to your suggested approach and seeing how it performs in practice.
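For context, here is a minimal C++ rendering of that single-counter idea (the JS original would presumably do the equivalent with Atomics on a SharedArrayBuffer; `SpscUsedRing` and its method names are illustrative). Each side keeps its own position as a plain variable it alone owns, so only `used` needs cross-thread synchronization:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// SPSC ring with a single atomic `used` counter. readPos/writePos are each
// touched by exactly one thread, so they can stay non-atomic.
template <typename T, size_t N>
class SpscUsedRing {
    T buf_[N];
    size_t readPos_ = 0, writePos_ = 0;  // privately owned, one per thread
    std::atomic<size_t> used_{0};        // the only shared state

public:
    bool try_push(const T& v) {  // producer thread only
        if (used_.load(std::memory_order_acquire) == N) return false;
        buf_[writePos_] = v;
        writePos_ = (writePos_ + 1) % N;
        used_.fetch_add(1, std::memory_order_release);  // publish the write
        return true;
    }

    std::optional<T> try_pop() {  // consumer thread only
        if (used_.load(std::memory_order_acquire) == 0) return std::nullopt;
        T v = buf_[readPos_];
        readPos_ = (readPos_ + 1) % N;
        used_.fetch_sub(1, std::memory_order_release);  // release the slot
        return v;
    }
};
```

The acquire/release pairing on `used_` is what makes the non-atomic slot accesses safe: each side only touches a slot after observing that the other side has released it.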
<3
https://en.wikipedia.org/wiki/False_sharing
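A minimal sketch of the standard mitigation, assuming 64-byte cache lines (the struct and field names are illustrative; C++17 also offers std::hardware_destructive_interference_size where implemented):

```cpp
#include <atomic>
#include <cstdint>

// Pad the producer-side and consumer-side counters onto separate cache
// lines, so a write to one doesn't invalidate the line holding the other.
struct Counters {
    alignas(64) std::atomic<uint64_t> head{0};  // consumers write this line
    alignas(64) std::atomic<uint64_t> tail{0};  // producers write this line
};
static_assert(sizeof(Counters) >= 128, "head and tail must not share a line");
```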
Rather than incrementing each counter by one, dither the counters to reduce cache conflicts? So what if the dequeue becomes a bit fuzzy. Make the queue a bit longer, so everyone survives at least as long as they would have survived before.
Some NICs in the past have simply detected this scenario rather than handling it properly, and poisoned the packet as it leaves the card.
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
The epoch isn’t CAS’d; it is FAA’d. The epoch is then used to determine whether contention is due to the tail meeting the head, or to a wrap-around caused by slow writes.
There’s also a back-off scheme to ease contention for a full queue.
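As an illustration of that kind of back-off, here is a generic exponential spin-then-yield sketch; it is not the article's actual scheme, and `Backoff`/`kMaxSpins` are hypothetical names:

```cpp
#include <atomic>
#include <thread>

// Exponential back-off: spin a growing number of iterations, then fall back
// to yielding, so threads hammering a full queue stop saturating the
// contended cache line.
struct Backoff {
    unsigned spins = 1;
    static constexpr unsigned kMaxSpins = 1u << 10;

    void pause() {
        if (spins <= kMaxSpins) {
            for (unsigned i = 0; i < spins; ++i)
                std::atomic_signal_fence(std::memory_order_seq_cst);
                // compiler barrier keeps the spin from being optimized away;
                // a real implementation would use a CPU pause/yield hint
            spins *= 2;
        } else {
            std::this_thread::yield();  // give up the core once spinning is futile
        }
    }
};
```

A caller would typically keep one `Backoff` per attempted operation and call `pause()` before each retry.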
Though, I did originally have a variant that adds a fairly straightforward ‘help’ mechanism that makes the algorithm wait-free and reduces the computational complexity.
However, the extra overhead didn’t seem worth it, so I took it out pretty quickly. Iirc, the only place where the ring in the queue wouldn’t out-perform it is on tiny queues with a huge write imbalance.
If you go run the tests in the repo associated with the article, you will probably see that a ring with only 16 entries tends to start being non-performant at about a 4:1 writer-to-reader ratio. But iirc that effect goes away before 128 slots in the ring.
There, the ring still fits in a page, and even with a big imbalance I can’t remember seeing less than 1m ops per second on my Mac laptop.
Real world observational data beats worst case analysis, and I’ve never seen an issue for scenarios I consider reasonable.
But, if my unrealistic is realistic to you, let me know and I can break out the wait free version for you.
fetch_add requires taking a cache line exclusive. If producers and consumers are both doing this on the same cache line, it is very expensive and does not scale.
SPSC ring buffers in particular are simple but not particularly efficient in the case where the consumer is keeping up with the producer (the ideal case), due to all the atomic synchronization and cache-line ping-pong.
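A common mitigation for that ping-pong, sketched below under the assumption of a power-of-two capacity (the class and member names are illustrative, not any particular library's): each side caches its last observed copy of the other side's index and only re-reads the shared atomic when the cached value says the ring looks full or empty.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
class SpscCachedRing {
    T buf_[N];
    alignas(64) std::atomic<size_t> head_{0};  // written by consumer
    alignas(64) std::atomic<size_t> tail_{0};  // written by producer
    alignas(64) size_t cachedHead_ = 0;        // producer's stale view of head_
    alignas(64) size_t cachedTail_ = 0;        // consumer's stale view of tail_

public:
    bool try_push(const T& v) {                // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - cachedHead_ == N) {            // looks full: refresh the cache
            cachedHead_ = head_.load(std::memory_order_acquire);
            if (t - cachedHead_ == N) return false;  // genuinely full
        }
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

    bool try_pop(T& out) {                     // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (cachedTail_ == h) {                // looks empty: refresh the cache
            cachedTail_ = tail_.load(std::memory_order_acquire);
            if (cachedTail_ == h) return false;  // genuinely empty
        }
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

In the steady state each side mostly reads its own cached copy, so the shared atomics are touched far less often than once per operation.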
A CAS implemented with LL/SC (ARM, POWER) is weak, as LL/SC can spuriously fail, so it always needs to be retried in a loop. Such a weak CAS might only be lock-free, not wait-free, as it might not provide global progress guarantees; in practice some platforms give stronger progress guarantees, as they might convert an LL/SC loop to a strong CAS via idiom recognition.
A strong CAS (x86, SPARC I think) is implemented directly in the architecture and cannot fail spuriously. It also usually gives strong fairness guarantees.
If your algorithm needs to CAS in a loop anyway, you might as well use a weak CAS to avoid a loop-of-loops. Otherwise a strong CAS might generate better code on some architectures.
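A small C++ illustration of that guideline (the function names are illustrative): weak CAS inside a retry loop, strong CAS for a one-shot attempt.

```cpp
#include <atomic>

std::atomic<int> counter{0};

void add_retry_loop(int delta) {
    int expected = counter.load(std::memory_order_relaxed);
    // We're already looping, so compare_exchange_weak avoids a hidden inner
    // retry loop on LL/SC machines; on failure it reloads `expected` for us.
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // expected now holds the current value; just retry
    }
}

bool claim_once(int old_value, int new_value) {
    int expected = old_value;
    // One-shot attempt: strong CAS, so a spurious LL/SC failure can't make
    // us report "lost the race" when nobody actually changed the value.
    return counter.compare_exchange_strong(expected, new_value,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed);
}
```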
Without immersion in the "Why" of each technological niche it can be hard to judge whether you're reading advice that really hasn't been relevant in decades ("The ASCII character set is not supported on all computers") or that's still important to your work today ("The file naming conventions may vary from one system to another")
Good question actually! ARM64 has a proper CAS it seems, but I think smaller ARMs still have only LL/SC, and that's still relevant for embedded. I can't find a definitive answer for POWER; it is possible that it still only has LL/SC (I leave it to you whether POWER is still relevant, but certainly IBM cares that the standard supports it). Can't find a definite answer for RISC-V either: it seems that originally it only had LL/SC, but there are AMO extensions that add it.
I think most GPUs have native CAS instructions.
There are probably other embedded processors that are still relevant for C++; who knows what they support.