Put a Ring on It: A Lock-Free MPMC Ring Buffer
Key topics
The quest for a lock-free MPMC (multi-producer, multi-consumer) ring buffer has sparked a lively discussion, with the original author revealing their 5-year-old work and commenters chiming in with related research and implementations. Some pointed out similar existing work, such as the FASTER key-value store and the LMAX Disruptor, while others shared alternative implementations, like the Nim-loony library and looqueue-rs. The conversation highlights the complexity of designing efficient concurrent data structures, with the author clarifying that their work was developed independently, and others noting that the ideas aren't entirely trivial but follow naturally from the problem space. As the discussion unfolds, it becomes clear that the pursuit of lock-free data structures remains an active area of research and innovation.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 14m after posting
- Peak period: 16 comments in 0-2h
- Avg / period: 4.1
Based on 45 loaded comments
Key moments
- 01 Story posted: Dec 16, 2025 at 8:32 AM EST (17 days ago)
- 02 First comment: Dec 16, 2025 at 8:46 AM EST (14m after posting)
- 03 Peak activity: 16 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Dec 17, 2025 at 4:34 PM EST (16 days ago)
However, this is well written and very easy to read.
However, I definitely had not seen the paper you cited before, but I just spent a few minutes with it. Section 5.2 seems to cover their version. It's far from clear what their algorithm is, but they are talking about page eviction; it doesn't seem like they're using even a single fixed array for the ring, though I'm not 100% sure because the paper is pretty light on detail.
There are definitely designs that can deal with non-POD data of variable size, although that does imply heterogeneous types, which will need some sort of type erasure to destroy safely.
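For illustration, here is one common way to get that type erasure, as a minimal C++ sketch; `ErasedSlot`, its size cap, and its method names are hypothetical, not from the article or any library mentioned in the thread. The idea is to construct the value in-place and record a function pointer that knows how to run the right destructor.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical type-erased slot: stores any sufficiently small T in-place,
// plus a function pointer that remembers how to destroy it.
struct ErasedSlot {
    alignas(std::max_align_t) unsigned char storage[64];
    void (*destroy)(void*) = nullptr;  // type-erased destructor

    template <typename T>
    void emplace(T&& value) {
        using U = std::decay_t<T>;
        static_assert(sizeof(U) <= sizeof(storage), "value too large for slot");
        static_assert(alignof(U) <= alignof(std::max_align_t), "over-aligned type");
        ::new (storage) U(std::forward<T>(value));
        destroy = [](void* p) { static_cast<U*>(p)->~U(); };
    }

    void reset() {  // safe to call even when the slot is empty
        if (destroy) { destroy(storage); destroy = nullptr; }
    }
};
```

A consumer that pops a slot can then call `reset()` without ever knowing which concrete type a producer put in.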
Which is based on: https://ieeexplore.ieee.org/document/9490347
Virtually zero? I have to go out of my way to remind HN it exists while everyone is 300+ comments deep into reinventing the wheel on a ~quarterly basis.
> have to go out of my way
Yeah, that's exactly the annoying part. You can't ever mention a ring buffer without someone bringing up LMAX. "But did you know that some Java developers somewhere once wrote something that was not completely slow?!"
[0]: https://h4x0r.org/futex/ discussion: https://news.ycombinator.com/item?id=44951563
maybe that is what you want.
In the JS ecosystem, buffers that allow data loss (aka ring buffers) are more common, but ringbuf.js [1] is the only complete implementation to my knowledge. In my use case of I/O between WASM modules, where data must be transferred as-is, the process must block on buffer overrun/underrun in a synchronous manner; therefore a circular buffer is required. I could not find such a niche library written in JS, so I decided to bite the bullet and reinvent the wheel [2].
[1]: https://github.com/padenot/ringbuf.js
[2]: https://github.com/andy0130tw/spsc
Btw, kaze-core uses a `used` atomic variable to avoid reading both readPos/writePos in its routines; the two positions themselves are not atomic at all.
In JavaScript, atomic operations are relatively lightweight, so their overhead is likely acceptable. Given that, I am open to adjusting my code to your suggested approach and seeing how it performs in practice.
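For context, here is a minimal C++ rendering of that single-counter idea (the JS original would presumably do the equivalent with Atomics on a SharedArrayBuffer; `SpscUsedRing` and its method names are illustrative). Each side keeps its own position as a plain variable it alone owns, so only `used` needs cross-thread synchronization:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// SPSC ring with a single atomic `used` counter. readPos/writePos are each
// touched by exactly one thread, so they can stay non-atomic.
template <typename T, size_t N>
class SpscUsedRing {
    T buf_[N];
    size_t readPos_ = 0, writePos_ = 0;  // privately owned, one per thread
    std::atomic<size_t> used_{0};        // the only shared state

public:
    bool try_push(const T& v) {  // producer thread only
        if (used_.load(std::memory_order_acquire) == N) return false;
        buf_[writePos_] = v;
        writePos_ = (writePos_ + 1) % N;
        used_.fetch_add(1, std::memory_order_release);  // publish the write
        return true;
    }

    std::optional<T> try_pop() {  // consumer thread only
        if (used_.load(std::memory_order_acquire) == 0) return std::nullopt;
        T v = buf_[readPos_];
        readPos_ = (readPos_ + 1) % N;
        used_.fetch_sub(1, std::memory_order_release);  // release the slot
        return v;
    }
};
```

The acquire/release pairing on `used_` is what makes the non-atomic slot accesses safe: each side only touches a slot after observing that the other side has released it.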
<3
https://en.wikipedia.org/wiki/False_sharing
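A minimal sketch of the standard mitigation, assuming 64-byte cache lines (the struct and field names are illustrative; C++17 also offers std::hardware_destructive_interference_size where implemented):

```cpp
#include <atomic>
#include <cstdint>

// Pad the producer-side and consumer-side counters onto separate cache
// lines, so a write to one doesn't invalidate the line holding the other.
struct Counters {
    alignas(64) std::atomic<uint64_t> head{0};  // consumers write this line
    alignas(64) std::atomic<uint64_t> tail{0};  // producers write this line
};
static_assert(sizeof(Counters) >= 128, "head and tail must not share a line");
```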
Rather than incrementing each counter by one, dither the counters to reduce cache conflicts? So what if the dequeue becomes a bit fuzzy. Make the queue a bit longer, so everyone survives at least as long as they would have survived before.
Some NICs in the past have simply detected this scenario rather than handling it properly, and poisoned the packet as it leaves the card.
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
The epoch isn’t CAS’d; it is FAA’d. The epoch is then used to determine whether contention is due to the tail meeting the head, or to a wrap-around caused by slow writes.
There’s also a back-off scheme to ease contention for a full queue.
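As an illustration of that kind of back-off, here is a generic exponential spin-then-yield sketch; it is not the article's actual scheme, and `Backoff`/`kMaxSpins` are hypothetical names:

```cpp
#include <atomic>
#include <thread>

// Exponential back-off: spin a growing number of iterations, then fall back
// to yielding, so threads hammering a full queue stop saturating the
// contended cache line.
struct Backoff {
    unsigned spins = 1;
    static constexpr unsigned kMaxSpins = 1u << 10;

    void pause() {
        if (spins <= kMaxSpins) {
            for (unsigned i = 0; i < spins; ++i)
                std::atomic_signal_fence(std::memory_order_seq_cst);
                // compiler barrier keeps the spin from being optimized away;
                // a real implementation would use a CPU pause/yield hint
            spins *= 2;
        } else {
            std::this_thread::yield();  // give up the core once spinning is futile
        }
    }
};
```

A caller would typically keep one `Backoff` per attempted operation and call `pause()` before each retry.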
Though, I did originally have a variant that adds a fairly straightforward ‘help’ mechanism that makes the algorithm wait-free and reduces the computational complexity.
However, the extra overhead didn’t seem worth it, so I took it out pretty quickly. Iirc, the only place where the ring in the queue wouldn’t out-perform it is on tiny queues with a huge write imbalance.
If you go run the tests in the repo associated with the article, you will probably see that a ring with only 16 entries tends to start being non-performant at about a 4:1 writer-to-reader ratio. But iirc that effect goes away before 128 slots in the ring.
There, the ring still fits in a page, and even with a big imbalance I can’t remember seeing less than 1m ops per second on my Mac laptop.
Real world observational data beats worst case analysis, and I’ve never seen an issue for scenarios I consider reasonable.
But, if my unrealistic is realistic to you, let me know and I can break out the wait free version for you.
fetch_add requires taking a cache line exclusive. If producers and consumers are both doing this on the same cache line, it is very expensive and does not scale.
SPSC ring buffers in particular are simple but not particularly efficient in the case where the consumer is keeping up with the producer (the ideal case), due to all the atomic synchronization and cache-line ping-pong.
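A common mitigation for that ping-pong, sketched below under the assumption of a power-of-two capacity (the class and member names are illustrative, not any particular library's): each side caches its last observed copy of the other side's index and only re-reads the shared atomic when the cached value says the ring looks full or empty.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, size_t N>  // N must be a power of two
class SpscCachedRing {
    T buf_[N];
    alignas(64) std::atomic<size_t> head_{0};  // written by consumer
    alignas(64) std::atomic<size_t> tail_{0};  // written by producer
    alignas(64) size_t cachedHead_ = 0;        // producer's stale view of head_
    alignas(64) size_t cachedTail_ = 0;        // consumer's stale view of tail_

public:
    bool try_push(const T& v) {                // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - cachedHead_ == N) {            // looks full: refresh the cache
            cachedHead_ = head_.load(std::memory_order_acquire);
            if (t - cachedHead_ == N) return false;  // genuinely full
        }
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }

    bool try_pop(T& out) {                     // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (cachedTail_ == h) {                // looks empty: refresh the cache
            cachedTail_ = tail_.load(std::memory_order_acquire);
            if (cachedTail_ == h) return false;  // genuinely empty
        }
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

In the steady state each side mostly reads its own cached copy, so the shared atomics are touched far less often than once per operation.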
A CAS implemented with LL/SC (ARM, POWER) is weak, as LL/SC can spuriously fail, so it always needs to be retried in a loop. Such a weak CAS might only be lock-free, not wait-free, as it might not provide global progress guarantees; in practice some platforms give stronger progress guarantees, as they might convert an LL/SC loop to a strong CAS via idiom recognition.
A strong CAS (x86, SPARC I think) is implemented directly in the architecture and cannot fail spuriously. It also usually gives strong fairness guarantees.
If your algorithm needs to CAS in a loop anyway, you might as well use a weak CAS to avoid a loop-of-loops. Otherwise a strong CAS might generate better code on some architectures.
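A small C++ illustration of that guideline (the function names are illustrative): weak CAS inside a retry loop, strong CAS for a one-shot attempt.

```cpp
#include <atomic>

std::atomic<int> counter{0};

void add_retry_loop(int delta) {
    int expected = counter.load(std::memory_order_relaxed);
    // We're already looping, so compare_exchange_weak avoids a hidden inner
    // retry loop on LL/SC machines; on failure it reloads `expected` for us.
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        // expected now holds the current value; just retry
    }
}

bool claim_once(int old_value, int new_value) {
    int expected = old_value;
    // One-shot attempt: strong CAS, so a spurious LL/SC failure can't make
    // us report "lost the race" when nobody actually changed the value.
    return counter.compare_exchange_strong(expected, new_value,
                                           std::memory_order_acq_rel,
                                           std::memory_order_relaxed);
}
```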
Without immersion in the "Why" of each technological niche it can be hard to judge whether you're reading advice that really hasn't been relevant in decades ("The ASCII character set is not supported on all computers") or that's still important to your work today ("The file naming conventions may vary from one system to another")
Good question actually! ARM64 has a proper CAS it seems, but I think smaller ARMs still have only LL/SC, and that's still relevant for embedded. I can't find a definitive answer for POWER; it is possible that it still only has LL/SC (I leave it to you whether POWER is still relevant, but certainly IBM cares that the standard supports it). Can't find a definite answer for RISC-V either: it seems that originally it only had LL/SC, but there are AMO extensions that add it.
I think most GPUs have native CAS instructions.
There are probably other embedded processors that are still relevant for C++; who knows what they support.