I've Been Writing Ring Buffers Wrong All These Years (2016)

Posted17 days agoActive10 days ago

flaghacker

144 points

62 comments

snellman.netTech DiscussionstoryHigh profile

informativeneutral

Debate

20/100

Data StructuresProgramming

Key topics

Data Structures

Programming

Diving into the nuances of ring buffer implementations, commenters debate the merits of different approaches, with some defending the "waste an element" method as lock-free and efficient, particularly in microcontrollers. Others argue over the definition of "lock-free," with some pointing out that atomic compare-and-swap instructions can still be considered lock-free despite acquiring a cache line lock. The discussion highlights the trade-offs between different implementations, with some favoring simplicity and others prioritizing efficiency. A 2016 blog post sparked this lively discussion, which remains relevant today due to its exploration of fundamental data structure design.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

48-60h

Avg / period

9.6

Comment distribution67 data points

Loading chart...

Based on 67 loaded comments

Key moments

01Story posted
Dec 16, 2025 at 2:11 PM EST
17 days ago
Step 01
02First comment
Dec 16, 2025 at 3:44 PM EST
2h after posting
Step 02
03Peak activity
36 comments in 48-60h
Hottest window of the conversation
Step 03
04Latest activity
Dec 23, 2025 at 6:02 PM EST
10 days ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (62 comments)

Showing 67 comments

codeworse

17 days ago

2 replies

As far as I know, the last approach is the only way to implement efficient lock-free ring-buffer

mrcode007

15 days ago

3 replies

There is one more way that is truly lock free. Most lock free implementations relying on atomic compare and swap instructions are not lock free afaik; they have a lock on the cache line in the CPU (in a way you go away from global lock to many distributed locks).

There is one more mechanism that allows implementing ring buffers without having to compare head and tail buffers at all (and doesn’t rely on counters or empty/full flags etc) that piggybacks on the cache consistency protocol

spockz

15 days ago

1 reply

Interesting! Do you know of an example implementation of this?

mrcode007

14 days ago

Yes. [1] has background, [2] has the implementation (fig 2. pseudocode). Since you understood my comment I trust you can figure out the rest :) it’s a very neat trick.

[1]https://www.microsoft.com/en-us/research/publication/concurr... [2]https://arxiv.org/pdf/1012.1824

dooglius

15 days ago

1 reply

That's not how "lock free" is defined/used. If you are considering the MESI M state to be a "lock" then you also have to grant that any write instruction is a "lock".

mrcode007

14 days ago

In fact this a crux of the problem in low latency code and there are ways to combat this.

I know there is an academic wait-free and lock-free definition but folks use those often incorrectly as a slogan that something is magically better because it’s „lockfree”.

Imagine how _you_ would implement a read-modify-write atomic in the CPU and why E stands for exclusive (sort of like exclusive in a mutex)

wat10000

15 days ago

Those hardware-level locks are typically not considered because they work quite differently. A standard software mutex can cause other threads to block indefinitely if, for example, the thread holding the mutex gets preempted for a long time. "Lock free" isn't really about the locks, it's about a guarantee that the system makes progress.

In this sense, the hardware locks used for atomic instructions don't really count, because they're implemented such that they can only be held for a brief, well defined time. There's no equivalent to suspending a thread while it holds a lock, causing all other threads to wait for an arbitrary amount of time.

zephen

15 days ago

1 reply

The middle approach is the only one that is not lock-free.

The first approach is lock-free, but as the author says, it wastes an element.

But here's the thing. If your element is a character, and your buffer size is, say, 256 bytes, and you are using 8-bit unsigned characters for indices, the one wasted byte is less than one percent of your buffer space, and also is compensated for by the simplicity and reduced code size.

fullstop

15 days ago

1 reply

I've used the "Waste an element" one for ages on microcontrollers where I don't want to deal with the overhead in an ISR.

zephen

15 days ago

Agreed.

The article author claims that the "don't waste an element" code is also more efficient, but that claim seems to be based on a hard-on about the post-increment operator, rather than any kind of dive into the cyclometric complexity, or even, y'know, just looking at the assembler output from the compiler.

Someone

15 days ago

3 replies

[delayed]

andrepd

15 days ago

1 reply

> C++ has std::bitset and std::vector

Notably, this is not the case. C++ std::vector is specialised for bools to pack bits into words, causing an untold array (heh) of headaches.

And "wasteful" is doing a lot of lifting here. In terms of memory usage? Yes. In terms of CPU? The other way around.

mbel

15 days ago

> In terms of CPU? The other way around.

That depends on your architecture and access pattern. In case of sequential access, packed bools may perform better due to arithmetic being usually way cheaper than memory operations.

cpgxiii

15 days ago

> C++ has std::bitset and std::vector and Java similarly has BitSet and Array because using the generic code for arrays of bits is too wasteful.

Rather infamously, C++ tried to be clever here and std::vector<bool> is not just a vector-of-bools but instead a totally different vector-ish type that lacks many of the important properties of every other instantiation of std::vector. Yes, a lot of the time you want the space efficiency of a dynamic bitset, rather than wasting an extra 7 bits per element. But also quite often you do want the behavior of a "real" std::vector for true/false values, and then you have to work around it manually (usually via std::vector<uint8_t> or similar) to get the expected behavior.

jsnell

15 days ago

It was for a dynamically growing ring buffer that also did short-object optimization. The natural implementation was to have the capacity and the offsets stored in fixed locations and with a fixed width, and have the variable part be a union of pointer or inline byte buffer.

Depending on the element width, you'd have space for different amounts of data in the inline buffer. Sometimes 1, sometimes a lot more. Specializing for a one-element inline buffer would be quite complex with limited gains.

In retrospect trying to use that as a running gag for the blog post did not work well without actually giving the full context, but the full context would have been a distraction.

RossBencina

15 days ago

7 replies

[delayed]

azemetre

15 days ago

1 reply

Your link has an invalid cert FYI, but do appreciate the knowledge drop. Rung buffers are some of the cooler data structures out there.

RossBencina

14 days ago

Unfortunately the original source is now behind a sign-in-wall.

hinkley

15 days ago

2 replies

I think unfortunately we sometimes ascribe to powers of two supernatural powers that are really about caches being built in powers of two.

Intel is still 64 byte cache lines as they have been for quite a long time but they also do some shenanigans on the bus where they try to fetch two lines when you ask for one. So there’s ostensibly some benefit of aligning data particularly on linear scans to 128 byte alignment for cold cache access.

rcoveson

15 days ago

3 replies

But there's a reason that caches are always sized in powers of two as well, and that same reason is applicable to high-performance ring buffers: Division by powers of two is easy and easy is fast. It's reliably a single cycle, compared to division by arbitrary 32bit integers which can be 8-30 cycles depending on CPU.

Also, there's another benefit downstream of that one: Powers of two work as a schelling point for allocations. Picking powers of two for resizable vectors maximizes "good luck" when you malloc/realloc in most allocators, in part because e.g. a buddy allocator is probably also implemented using power-of-two allocations for the above reason, but also for the plain reason that other users of the same allocator are more likely to have requested power of two allocations. Spontaneous coordination is a benefit all its own. Almost supernatural! :)

loeg

15 days ago

Fwiw in this application you would never need to divide by an arbitrary integer each time; you'd pick it once and then plumb it into libdivide and get something significantly cheaper than 8-30 cycles.

kevin_thibedeau

15 days ago

powers-of-two are problematic with growable arrays on small heaps. You risk ending up with fragmented space you can't allocate unless you keep growth less than 1.61x, which would necessitate data structures that can deal with arbitrary sizes.

hinkley

14 days ago

CPU Caches are powers of two because retrieval involves a logarithmic number of gates have to fire in a clock cycle. There is a saddle point where more cache starts to make the instructions per second start to go back down again, and that number will be a power of two.

That has next to nothing to do with how much of your 128 GB of RAM should be dedicated to any one data structure, because working memory for a task is the sum of a bunch of different data structures that have to fit into both the caches and main memory, which used to be powers of two but now main memory is often 2^n x 3.

And as someone else pointed out, the optimal growth factor for resizable data structures is not 2, but the golden ratio, 1.61. But most implementations use 1.5 aka 3/2.

KeplerBoy

14 days ago

It's not just Intel is it, AMD is also using 64 byte cache lines afaik.

tom_

15 days ago

2 replies

A couple of the comments to the article suggest using 64-bit numbers, which is exactly the right solution. 2^64 nanoseconds=584.55 years - overflow is implausible for any realistic use case. Even pathological cases will struggle to induce wraparound at a human timescale.

(People will probably moan at the idea of restarting the process periodically rather than fixing the issue properly, but when the period would be something like 50 years I don't think it's actually a problem.)

thaumasiotes

15 days ago

2 replies

> but when the period would be something like 50 years I don't think it's actually a problem

I think you have that backwards. If something needs to be done every week, it will get done every week. That's not a problem.

If something needs to be done every fifty years, you'll be lucky if it happens once.

tom_

15 days ago

1 reply

My parting shot was slightly tongue in cheek, apologies. Fifty years is a long time. The process, whatever it is, will have been replaced or otherwise become irrelevant long before the period is up. 64 bits will be sufficient.

gpderetta

14 days ago

At some point a random bitflip becomes more likely than the counter overflowing.

reincarnate0x14

14 days ago

I agree with that sentiment in general but even though I've seen systems in continuous operation for 15 years, I've never seen anything make it to 20. I wouldn't write something with the external expectation it never made it that far, but in practical terms, that's probably about as safe as it gets. Even like embedded medical devices expect to get restarted every now and again.

Just as an example the Voyager computers have been restarted and that's been almost 60 years.

ale42

14 days ago

> using 64-bit numbers, which is exactly the right solution

On a 64-bit platform, sure. When you're working on ring buffers with an 8-bit microcontroller, using 64-bit numbers would be such an overhead that nobody would even think of it.

ErroneousBosh

14 days ago

1 reply

> The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.

I don't see why it wouldn't be, it's just computationally expensive to take the modulo value of the pointer rather than just masking off the appropriate number of bits.

myrmidon

14 days ago

2 replies

Replacing just the mask operation is not enough.

The problem is incrementing past the index integer type limit.

Consider a simple example with ring buffer size 9, and 16bit indices:

When you increment the write index from 0xffff to 0, your "masked index" jumps from 6 (0xffff % 9) to 0 (instead of 7).

There is no elegant fix that I'm aware of (using a very wide index type, like possibly a uint64, is extremely non-elegant).

ErroneousBosh

14 days ago

1 reply

Yes, that's what I'm saying. You can't just use a quick and easy mask, you have to use a modulo operator which is computationally expensive enough that it's probably killing the time savings you made elsewhere.

There's probably no good reason to make your buffer sizes NOT a power of two, though. If memory's that tight, maybe look elsewhere first.

myrmidon

14 days ago

1 reply

What I mean is: This ringbuffer implementation (and its simplicity) relies on the index range being a multiple of the buffer size (which is only true for powers of two, when the index is e.g. a 32bit unsigned integer).

If you swap bitmasking for modulo operations then that does work at first glance, but breaks down when the index wraps around. This forces you to abandon the simple "increment" operation for something more complex, too.

The requirement for a power-of-two size is more intrinsic to the approach than just the bitmasking operation itself.

ErroneousBosh

13 days ago

Yes, I get you now. If you let it roll over and you apply a modulo operation, now you have two modulo operations :-)

aidenn0

10 days ago

Everything still works if you unconditionally modulo by any multiple of 9 between 18 and 0xffff, but that's very expensive.

aidenn0

15 days ago

Non-power-of-two is only really feasible of the total number of inserts will fit in your post/ack counters. Otherwise you have to implement overflow manually which may or may not be possible to do with the available atomic primitives on your architecture.

I first encountered this structure at a summer internship at a company making data switches.

waffletower

14 days ago

Regardless of correctness, as a DSP dork I really identified with the question: "What kind of a monster would make a non-power of two ring anyway?" I remember thinking similarly when requesting a power of two buffer from a 3rd party audio hardware device and having it correct to a nearby non-power of two. Latency adding ringbuffer to the rescue.

zephen

15 days ago

> It is not just a way of writing ring buffers. It's a way of implementing concurrent non-blocking single-reader single-writer atomic ring buffers with only atomic load and store (and memory barriers).

That may or may not be part of the actual definition of a ring buffer, but every ring buffer I have written had those goals in mind.

And the first method mentioned in the article fully satisfies this, except for the one missing element mentioned by the author. Which in practice, often is not only not a problem, but simplifies the logic so much that you make up for it in code space.

Or, for example, say you have a 256 character buffer. You really, really want to make sure you don't waste that one character. So you increase the size of your indices. Now they are 16 bits each instead of 8 bits, so you've gained the ability to store 256 bytes by having 260 bytes of data, rather than 255 bytes by having 258 bytes of data.

Obviously, if you have a 64 byte buffer, there is no such tradeoff, and the third example wins (but, whether your are doing the first or third example, you still have to mask the index data off at some point, whether it's on an increment or a read).

> The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.

There's "not possible" and then "not practical."

Sure, you could have a 50 byte buffer, but now, if your indices are ever >= 50, you're subtracting 50 before accessing the array, so this will increase the code space (and execution time).

> The [index size > array size] technique is also widely known in FPGA/hardware circles

Right, but in those hardware circles, power-of-two _definitely_ matters. You allocate exactly one extra bit for your pointers, and you never bother manually masking them or taking a modulo or anything like that -- they simply roll over.

If you really, really need to construct something like a 6 entry FIFO in hardware, then you have techniques available to you that mere mortal programmers could not use efficiently at all. For example, you could construct a drop-through FIFO, where every element traverses every storage slot (with a concomitant increase in minimum latency to 6 clock cycles), or you could construct 4 bit indices that counted 0-1-2-3-4-5-8-9-10-11-12-13-0-1-2 etc.

Most ring buffers, hardware or software, are constructed as powers of two, and most ring buffers either (a) have so much storage that one more element wouldn't make any difference, or (b) have the ability to apply back pressure, so one more element wouldn't make any difference.

ekropotin

15 days ago

3 replies

I’m jealous of people, who have to write ring buffers for work.

It feels like 90% swe jobs these days are about writing CRUD wrappers.

avadodin

15 days ago

3 replies

Sorry.

Mostly Type 1 and overflow is a diagnostic log at most. Losing all stale unprocessed data and leaving a ready empty buffer behind is often the desired outcome.

Type 3 is probably banned on most codebases because of the integer overflow.

Krssst

15 days ago

Unsigned integer arithmetic operations don't overflow but are done modulo 2^n (https://en.cppreference.com/w/c/language/operator_arithmetic...). The author does use unsigned integers so I don't think there is a problem there.

Signed integer overflow is definitely a problem however. Something as simple as incrementing a user-provided int can lead to UB (if the user provides INT_MAX).

RealityVoid

15 days ago

Banned is a bit strong, maybe discouraged. MISRA might yell but it's valid technique, IMO, unsigned integer overflow will be fine.

zephen

15 days ago

> Type 3 is probably banned on most codebases because of the integer overflow.

Not only this, but the purported code reduction benefits associated with type 3 are only superficial, and won't actually appear in any assembly listing.

Neywiny

15 days ago

2 replies

And yet here I sit, writing ring buffers, and never thinking about this idea. Probably because of the power of two issue. Which isn't actually a problem because as he points out, who would do that? But it makes me think that it's a restriction that it just isn't.

But in all honesty, look for more embedded jobs, then. We can certainly use the help.

ekropotin

15 days ago

2 replies

For some unexplainable reason, CRUD job’s pay is better than embedded, on average.

Neywiny

15 days ago

I mean idk, I'm living comfortably and as the adage says, not working a day in my life. But if you're at a spot where you need the pay more than you want to write ring buffers, I understand.

IsTom

14 days ago

Probably there are less people excited to make another CRUD app than to write embedded code.

nathan_douglas

15 days ago

1 reply

What do you work with (if you don't mind answering)? I'm looking for a change and like low-level stuff about as much as I like any other level. I've done some cycle-accurate NES emulation and VM implementation stuff - I'm not much of a DSA guy but performance and efficiency appeal to me.

Neywiny

14 days ago

I work with pretty much everything (except GPUs I guess). Embedded is extremely relative. To some, embedded means a rack mount server that's idk embedded in a vehicle instead of a datacenter. That's not me. To others, embedded means a 4-bit low power, mask-rom fed micro inside a sensor IC. That's also not me.

So I work with microcontrollers of various vendors, I do FPGA with hard and soft processors, recently did just past the smoke test through embedded Linux on a SoC, and I've done plenty of desktop code on Linux and Windows for interfacing. I get to work with a wide range of devices and a wide range of tasks for them. Might not pay as much but my goodness is it fun

RealityVoid

15 days ago

Jokes on me, when I need them, I don't feel like writing them so I just pick up an old one and tweak it. Or just tell Claude to build me one and it one shots it.

Mikhail_Edoshin

15 days ago

1 reply

Technically each side needs an index plus a single bit. The bit is a counter, you toggle it on every wrap. It overflows, but this is correct, we only need the last bit. Initially it is 0. By comparing the indexes and the bit you tell apart all cases and do not lose an entry.

(I think this was published in one of Llang's papers but in a rather obscure language.)

grumbelbart2

15 days ago

1 reply

One of the comments in the article proposes that: Just wrap both counters at 2capacity (instead of capacity or UINT_MAX).

Mikhail_Edoshin

15 days ago

Which does that and that's what Llang suggests as well, I remember 2 in some power in his formula. I myself find the separate one-bit counter easier to understand: each side counts pages, but they don't need the full number, only the difference between their counters and the difference can be at most one, so one bit is sufficient. If the counters are same, the actors are on the same page, if they are different, then the writer is one page ahead.

z3t4

14 days ago

2 replies

[delayed]

danhau

14 days ago

They are efficient FIFOs (queues). You‘ll find them in many places. I know them from multimedia / audio, where you often have unsynchronized readers and writers.

If you‘ve ever gamed on a PC, you might have heard one. When a game freezes, sometimes you hear a short loop of audio playing until the game unfreezes. That‘s a ringbuffer whose writer has stopped, but the async reader is still reading.

aldonius

14 days ago

It's a somewhat different kind of ring buffer, because there's just one index, but I used it in my signal processing class for a finite-impulse-response filter.

Choose N to be a power of two >= the length of your filter.

Increment index i mod N, write the sample at buffer position x[i], output sum of x[i+k mod N] * a[i+k mod N] where a[k] are your filter coefficients, repeat with next sample at next time step.

nly

14 days ago

1 reply

Most people implement them now in my field using mmap tricks so the CPU can do the wraparound for you in virtual memory.

Makes the code trivial

thrtythreeforty

14 days ago

Not only that, but you can also always form a normal (ptr, size) slice reference to any piece of the ring buffer, even when it wraps. This is really helpful for Eigen arrays that you need to rotate.

atq2119

14 days ago

1 reply

It's a silly offhand remark at the end of the article, but anybody who is genuinely interested in whether they've been tying their shoes wrong will enjoy Ian's shoelace site: https://www.fieggen.com/shoelace/

msm_

14 days ago

I thought you're joking, but then I opened https://www.fieggen.com/shoelace/grannyknot.htm

cuno

14 days ago

This stuff dates way, way before 2004.

For non-power of two, just checked our own very old circular byte buffer library code and using the notation from this article, it is:

  entriesAllocated() { return ((wrPtr-rdPtr+2*bufSize) % (2*bufSize)); }
  remainingSpace() { return bufSize - ((wrPtr-rdPtr+2*bufSize) % (2*bufSize)); }
  isEmpty() { return entriesAllocated()==0);
  isFull() { return remainingSpace()==0);
  incWr(int n) { wrPtr = (wrPtr+n) % (2*bufSize); }
  incRd(int n) { rdPtr = (rdPtr+n) % (2*bufSize); }

The 2*bufSize gives you an extra bit (beyond representing bufSize) that lets you disambiguate empty vs full. And if it is a constant power of two (e.g. via C++ template), then you can see how this just compiles into a bitmask instead, like the author's version.

zephen

13 days ago

The article author claims that his new version is simpler than the old version.

But the new version is really only simpler TEXTUALLY, because of the post-increment operators:

push(val) { assert(!full()); array[mask(write++)] = val; }

shift() { assert(!empty()); return array[mask(read++)]; }

(If you look at assembly output, it's probably the same or more code.)

But, at least in some languages, those increments might happen before the array access, which could mean that using them causes a race condition.

In fact, in C or C++, those increments are GUARANTEED to happen before the access to array, because they are guaranteed to happen before the calls to mask.

tl;dr -- dude claims to insure he can utilize one more character of his buffer, while writing code that ensures that if he is truly operating at the margins, he will be doing things in the wrong order.

dang

15 days ago

Related. Others?

I've been writing ring buffers wrong all these years - https://news.ycombinator.com/item?id=13175832 - Dec 2016 (167 comments)

gpderetta

14 days ago

That's how TCP sequence numbers work as well.

kybernetikos

15 days ago

Every implementation of "the lmax disrupter" I've come across uses this trick.

spacechild1

14 days ago

> All of those seem like non-issues. What kind of a monster would make a non-power of two ring anyway?

Huh? Anytime you want to restrict the buffer to a specific size, you will have to support non-power-of-two capacities. There are cases where the capacity of the ring buffer determines the latency of the system, e.g. an audio ringbuffer or a network jitter buffer.

View full discussion on Hacker News

ID: 46292937Type: storyLast synced: 12/19/2025, 7:05:34 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN