The Weird Concept of Branchless Programming
Mood: calm
Sentiment: mixed
Category: other
Key topics: The article discusses the concept of branchless programming, its benefits, and examples, sparking a discussion on its relevance, implementation, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1h after posting
Peak period: 83 comments in Day 1
Avg / period: 22.5
Based on 90 loaded comments
Key moments
- Story posted: Sep 28, 2025 at 12:40 PM EDT (about 2 months ago)
- First comment: Sep 28, 2025 at 1:54 PM EDT (1h after posting)
- Peak activity: 83 comments in Day 1, the hottest window of the conversation
- Latest activity: Oct 3, 2025 at 10:01 AM EDT (about 2 months ago)
Most code should focus on readability, then profile for busy areas under use, and finally refactor the busy areas through hand optimization or register hints as required.
If one creates something that looks suspect (inline Assembly macro), a peer or llvm build will come along and ruin it later for sure. Have a great day =3
In some architectures, both branch paths are executed in parallel, and one is simply tossed after the dependent operations finish. We can't be sure exactly how branch predictors and prefetch are implemented, as that falls under manufacturer NDA. =3
At a 5 ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.
For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:
== Benchmarking ABS ==
ABS (branch): 0.260 sec
ABS (branchless): 0.264 sec
== Benchmarking CLAMP ==
CLAMP (branch): 0.332 sec
CLAMP (branchless): 0.538 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.043 sec
PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).
I had to drop down to -O0 to see branchless be faster in any case:
== Benchmarking ABS ==
ABS (branch): 0.743 sec
ABS (branchless): 0.948 sec
== Benchmarking CLAMP ==
CLAMP (branch): 4.275 sec
CLAMP (branchless): 1.429 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.156 sec
PARTITION (branchless): 0.164 sec
https://gist.github.com/Stefan-JLU/3925c6a73836ce841860b55c8...
Did you check whether your branchy code actually still was branchy after the compiler processed it at higher optimization levels?
The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article. I assume the minimal/nonexistent speedups in the article are due to modern CPU branch prediction being quite good. But branch predictors inherently fail miserably at CRC because the conditional is on whether the input bit is 1 or 0, which is essentially random.
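As a rough sketch of that contrast (illustrative only, not the exact code from the video; bit-at-a-time CRC-32 with the common reflected polynomial), the branchy inner loop decides per input bit, while the branchless version turns the bit into a mask:
#include <cstdint>
#include <cstddef>

// Branchy: one data-dependent, essentially random branch per bit.
uint32_t crc32_branchy(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

// Branchless: the low bit becomes an all-ones/all-zeros mask that gates the XOR.
uint32_t crc32_branchless(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            uint32_t mask = 0u - (crc & 1u);          // 0x00000000 or 0xFFFFFFFF
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
Both compute the same checksum; the only difference is whether the per-bit decision is a jump or an AND.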
>If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance
Sure, but this is over a lot more than a couple years. I'd expect a 2023 mobile i9 to be considerably more than twice as fast as a 2007 desktop Core 2 Duo.
When you get to think about branchless programming, especially for SIMD optimizations in the real world, you always learn a lot, and it's as if you get a +1 level on your algorithmic skills. The hardest part then is making sure the tricks are clearly laid out so that someone else can take it from here next time.
These types of exercises should be mandatory for all compute-intensive jobs, especially for data science people (I know some of them know this stuff, but that does not seem to be the majority).
Real-world high-performance matrix multiplication functions do contain branches internally, even on GPUs. If you are ever curious about what that looks like, NVidia maintains an open-source library called CUTLASS.
(not a critique)
Obviously that requires a lot of extra transistors and you are doing computation that will be thrown away, so it's not free in terms of space or power/heat/energy. But perhaps it could handle cases that other approaches can't.
Even more of a wild idea is to pair up two cores and have them work together this way. When you have a core that would have been idle anyway, it can shadow an active core and be its doppelganger that takes the other branch. You'd need to have very fast communication between cores so the shadow core can spring into action instantly when you hit a branch.
My gut instinct is it's not worth it overall, but I'm curious whether it's been tried in the real world.
It doesn't require duplicating everything. You just need to add some additional bookkeeping of dependencies and what to retire vs kill at the end of the pipeline.
In practice branch predictors are so good that speculating off the "spine" of most likely path just isn't worth it.
In fact there were a lot of interesting microarchitectural ideas from the late 90s to early 00s that just ended up being moot because the combination of OoO speculation, branch predictors, and big caches proved so effective.
They already do it (edit: they don’t). It is difficult to get security right, however (see https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...).
> Even more of a wild idea is to pair up two cores and have them work together this way
I don't think that'll be profitable, because...
> When you have a core that would have been idle anyway
...you'll just schedule in another process. A modern OS rarely runs short of tasks to run.
As others said: yes, it has been tried and it works, but it costs a lot in hardware and power usage. A problem is that lots of code has a branch every 10 or so instructions. Fast high-end CPUs (the only realistic target for this feature) can dispatch multiple instructions per cycle. Combined that means you will hit a branch every two or three cycles. Because of that, you do not end up with two of everything but with way more.
So, you’re throwing away not 50% of your work but easily 80%.
Some code has fewer branches, but that often can easily be parallelized or vectorized.
After thinking through this carefully, though, I do not see UB (except for signed overflow in a corner case):
Step 1, bit shift.
I understand that, until C++20, left shift of a signed int was UB. But this right shift appears to be implementation defined, which is ok if documented in the code.
Step 2: Add.
Then, (x + mask) is defined behavior (a) for positive x, since then mask=0, and (b) for most negative x, since then mask=-1. However, if x is numeric_limits::lowest, then you get signed integer overflow, which is UB.
Step 3, XOR.
Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as of C++20, when signed integers became two's complement officially. It might have been implementation defined before then, which would be almost as good for something that ubiquitous, but I'm not sure.
Ok, so I think this does not involve UB except for an input of numeric_limits::lowest.
Sound about right?
To fix this, perhaps you would need to make that + an unsigned one?
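A minimal sketch of that fix, assuming a 32-bit int (illustrative, the function name is mine, not code from the article): carry out both the shift and the add on uint32_t so neither step can overflow, and convert back at the end (well defined in C++20, implementation defined before that).
#include <cstdint>

// Same (x + mask) ^ mask idea, done in unsigned arithmetic so neither step can hit UB.
// abs(INT_MIN) still has no representable result; here it simply wraps back to INT_MIN.
int32_t abs_branchless_u(int32_t x) {
    uint32_t ux   = static_cast<uint32_t>(x);
    uint32_t mask = 0u - (ux >> 31);                  // 0 if x >= 0, 0xFFFFFFFF if x < 0
    return static_cast<int32_t>((ux + mask) ^ mask);
}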
It bothers me how hard you need to think to do this language lawyering. C++ is a system of rules. Computers interpret rules. The language lawyer should be a piece of software. I get it, not everything can be done statically, so, fine, do it dynamically? UBSan comes closest in practice, but doesn't detect everything. I understand formally modeled versions of C and C++ have been developed commercially, but these are not open source so they effectively do not exist. It's a strange situation.
Just the other day I considered writing something branchless and said, "nah", because of uncertainty around UB. How much performance is being left on the table around the world because of similar thinking, driven by the existence of UB?
Maybe I was supposed to write OCaml or Pascal or Rust.
The UB on the add happens in cases where all incarnations of abs() would fail as well, because there simply isn't a correct return value.
cargo miri run
And if your code actually touches UB, Miri will most likely point out exactly where and why.
mov eax, edi
sar eax, 31
mov ecx, eax
add edi, ecx
xor eax, edi
Has this been generated by a C compiler? If yes, it's a bit puzzling, because can't you remove "mov ecx, eax", replace "add edi, ecx" by "add edi, eax" and have the exact same result?
But rules of thumb are like this. If you know enough to question the rule of thumb, go ahead. Hand assembly in hot code can be worth the cost.
It's also possible the value in ecx is used again outside the snippet?
Yes, there's the possibility that ecx is used elsewhere, and in that case my second comment is irrelevant, because I was addressing the possibility that such a big wart is to be expected from compilers, i.e. that they crop up regularly.
But then again, it's unlikely that it's used elsewhere, because eax has the return value of the C snippet, there's nothing else to do, the function can return. So the original question remains: did this come from a C compiler? If yes, it's crappy code.
Huge in space sure. Not in execution time.
int main (void) {
    for (int i = 0; i < 1000000000; ++i) {
        asm volatile (
            ".intel_syntax noprefix\n"
            "mov eax, edi\n"
            "sar eax, 31\n"
            "add edi, eax\n"
            "xor eax, edi\n"
            ".att_syntax prefix\n"
            ::: "eax", "edi", "cc");
    }
    return 0;
}
option1 has the extraneous mov ecx, eax, and then the add with ecx.
I confirmed with objdump -d that the assembly hadn't been touched and that the loops were the same. On my otherwise mostly idle dual L5640 system, pinned to a single CPU (just in case), option1 consistently runs in 3.14 seconds and option2 consistently runs in 3.15 seconds.
Adding an extra zero, both option1 and option2 run in 30.94-30.95 user seconds. The extraneous move doesn't seem to cost any actual time.
But if you look at your program that must go faster, and you see unnecessary moves in the hot section(s), go ahead and remove them, but don't be surprised if it doesn't change much.
If you went and did your whole program by hand, the debloating might also not change much. That's why there's a rule of thumb.
If you have the skill to make a change to the compiler so it can output a better sequence of instructions (I suspect that's pretty difficult), it may make enough of a difference over a large number of programs to be worthwhile.
abs_branchless(int):
mov eax, edi
neg eax
cmovs eax, edi
ret
abs_branch(int):
mov eax, edi
neg eax
cmovs eax, edi
ret
The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:
// branch version
switch (var1)
{
    case 1:
        var4 = var2 + var3;
        break;
    case 2:
        var4 = var2 - var3;
        break;
    case 3:
        var4 = var2 * var3;
        break;
    case 4:
        var4 = var2 / var3;
        break;
    // ...
    default:
        var4 = 0; // optional and arbitrary
        break;
}
// branchless version
var4 = magic(var1, var2, var3);
I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions to the larger set of functions.
So for now, I may just use a switch() statement in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to allow that performantly, or if they end up just calculating everything and throwing away the paths not taken, which would suggest that the max speed would be O(1/N) for the number of cases.
Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.
Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.
E.g., if we had just 1 or 2 as options for the case statements, note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will, with integer rounding, equal ‘1’ if var1 is one or ‘0’ if var1 is 2, 3 or 4. (If it goes to 5 you need another xor at the next highest bit.)
Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1<<1 equals ‘0’ if var1 is 1, 3 or 4 but it equals ‘1’ if it’s two.
So (test1 * (value on var1 being 1)) + (test2 * (value on var1 being 2)) will get you a branchless version for the simple either-1-or-2 case. This will pretty much always be much slower than the branching version, but some types of systems out there don’t really do branching and are linear. This type of zeroing trick is really common on matrix multipliers.
Essentially you're creating binary logic gates that give 0 or 1 for the exact value you want and blindly multiplying that by the value it would have produced if 1.
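A compact sketch of that zeroing trick, covering just the add and subtract cases from the switch above (names are illustrative): the comparisons produce 0 or 1 without a jump, and multiplying zeroes out the unselected result.
// (var1 == 1) and (var1 == 2) lower to setcc, not jumps; any other var1 yields 0.
int select_branchless(int var1, int var2, int var3) {
    int add = var2 + var3;
    int sub = var2 - var3;
    return (var1 == 1) * add + (var1 == 2) * sub;
}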
vars[1] = var2 + var3;
vars[2] = var2 - var3;
...
vars[0] = 0;
var4 = vars[var1];
https://godbolt.org/z/14h5djYe8
This looks branchless, if you don't care about divide-by-zero; the ALU might not be too happy.
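One way to sketch that lookup so the eagerly evaluated divide cannot trap (purely illustrative, reusing the case set and the magic() name from the switch example above):
// Compute every candidate, then index with the selector. The divisor is forced
// to 1 when var3 == 0 so the unconditional divide cannot fault (INT_MIN / -1 is
// a remaining hazard, ignored here).
int magic(int var1, int var2, int var3) {
    int den = var3 | (var3 == 0);
    int vars[5] = { 0, var2 + var3, var2 - var3, var2 * var3, var2 / den };
    return vars[(var1 >= 1 && var1 <= 4) ? var1 : 0];
}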
In case anyone ever finds this, I realized that a brute force solution would be a lookup table of all combinations. For three 8 bit variables, that's 2^24=16,777,216 or 16 MB of EPROM.
Then we could apply truth table simplification techniques like Karnaugh maps, sum of products (SOP), products of sums (POS), etc:
https://workforce.libretexts.org/Bookshelves/Electronics_Tec...
https://tma.main.jp/logic/index_en.html
https://karnaughmapsolver.com/
I asked AI, and if the bits approximate a pseudorandom pattern, then we would need approximately log2(2^24)=24 levels of 2-input boolean logic gates. For NAND or NOR gates, apparently that's something like 3x2^27=400 million gates. If we could manage 24-input gates somehow, that might be reduced to ceil(log2(24))=5 levels, but internally that would occupy about the same amount of die area. Perhaps using a hybrid of gates, multiplexers and EPROMS could result in a much smaller circuit at the cost of higher propagation delay (latency), which is the goal of FPGAs.
We could then play with the ordering of the control variable var1 to explore all 256 possibilities and see if any result in a reduced number of logic levels (by maximizing the size of Karnaugh map loops). I suspect that might only save a few levels at most though, but even that could cut the number of gates by 50-90%.
Thankfully the control variable var1 might only have 16-64 possibilities, which lowers it to 4-6 bits or just 2^20=1,048,576 to 2^22=4,194,304 or 1-4 MB of EPROM, but still about 3*2^25=100 million gates worst case.
For 16, 32 or 64 bit integers or floats, it might be better to just calculate all cases with dedicated hardware and select the right one.
var4 = (var1==1)*(var2+var3) | (var1==2)*(var2-var3) | ...
Of course this is basically the slowest possible option, but it might work as a starting point for a general-purpose optimizer to find a faster solution. If it doesn't, this will likely compile to conditional move instructions.
It might help if you replace the comparison and multiplication with an equivalent expression made from bitwise operations, but I believe most compilers already know how to do this transformation.
"Branchy" abs() translates into
abs_branch:
mov eax, edi
neg eax
cmovs eax, edi
ret
on x64, and into abs_branch:
srai a5,a0,31
xor a0,a5,a0
sub a0,a0,a5
ret
on RISC-V if you use a C compiler with a half-decent codegen. And "branchy" clamp() translates into clamp:
cmp edi, edx
mov eax, esi
cmovle edx, edi
cmp edi, esi
cmovge eax, edx
ret
Seriously, the automatic transformation between ?: and if-then-else (in both directions) is quite well studied by now. And if you try to benchmark the difference between branching and branchless implementations, please make sure that the branches you expect are actually there in the compiler's output.
> int partition_branchless(int* arr, int low, int high) {... for (int j = low; j < high; j++) { ... }... }
That for loop is just a sugared while loop which is just a sugared cmp and jmp
There is this common misconception that conditional moves == branching. On actually relevant architectures, they very much are not. Replacing a p=0.5 branch with a conditional move is in itself a significant optimization.
add<<<1, 1>>>(N, x, y);
All N adds are conceptually done in parallel, with no side effects. In practice, hundreds or thousands of adds are done simultaneously, depending on the available hardware.
This is true branchless programming.
[1] https://developer.nvidia.com/blog/even-easier-introduction-c...
JZ, JE - Jump if Zero, Jump if Equal
JNZ, JNE - Jump if Not Zero, Jump if Not Equal
JC - Jump if Carry
JNC - Jump if No Carry
JO - Jump if Overflow
JNO - Jump if No Overflow
JS - Jump if Signed (Negative)
JNS - Jump if Not Signed (Positive or Zero)
JP, JPE - Jump if Parity, Jump if Parity is Even
JNP, JPO - Jump if Not Parity, Jump if Parity is Odd
Say in Rust:
let foo = if bar { 1 } else { 2 };
And
let mut foo; if bar { foo = 1; } else { foo = 2; }
Although they look the same, function the same, and are effectively the same, the first one is the conditional move, and the second one would initially be a jump (until further compiler optimizations kick in).
You will notice that with the conditional move you "get" a predictable expression for the result, but with the branched jump it's like you "get" a bunch of arbitrary statements that write to the variable. Both may end up folding so that they essentially compile to cmov, but the way the assignment is represented is different. You can be certain with conditional instructions, but you can't be certain with a branched jump; otherwise we wouldn't need branch prediction.
In fact, the way conditional instructions work is due to Church encoding: you create a lambda function that calls the left or right function depending on the input evaluation, which can be seen as implicitly embedding the branch.
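A tiny illustration of that Church-encoding view in C++ (my sketch, not from the thread): a boolean is a function that, given two alternatives, returns one of them, so selection takes the place of branching.
#include <functional>
#include <iostream>

using ChurchBool = std::function<int(int, int)>;

const ChurchBool church_true  = [](int t, int)  { return t; };
const ChurchBool church_false = [](int, int f)  { return f; };

int main() {
    ChurchBool bar = church_true;
    int foo = bar(1, 2);        // plays the role of `let foo = if bar { 1 } else { 2 };`
    std::cout << foo << '\n';   // prints 1
}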
On leetcode the "elegant" and fastest solution was something like this:
class Solution {
public:
    int trap(vector<int>& height) {
        int i = 1;
        int n = height.size();
        int j = n - 1;
        int leftmax = height[0];
        int rightmax = height[n - 1];
        int tw = 0;
        while (i <= j) {
            if (leftmax <= rightmax) {
                if (leftmax > height[i]) {
                    tw += leftmax - height[i];
                } else {
                    leftmax = height[i];
                }
                i++;
            } else {
                if (rightmax > height[j]) {
                    tw += rightmax - height[j];
                } else {
                    rightmax = height[j];
                }
                j--;
            }
        }
        return tw;
    }
};
My solution was this:
class Solution {
public:
    int trap(const vector<int>& height) {
        int mx = 0;
        long long sum = 0;
        for (const auto h : height) {
            mx = max(mx, h);
            sum += mx - h;
        }
        int rMx = 0, h, i;
        for (i = height.size() - 1; (h = height[i]) != mx; --i) {
            rMx = max(h, rMx);
            sum += rMx;
        }
        return sum - (height.size() - 1 - i) * mx;
    }
};
However, I compiled and ran some benchmarks on my local machine, and it turned out that while on average my solution was slower, in the worst case it was beating that solution. My solution was less sensitive to the random input, and the degradation was far less dramatic, around 2x, while that solution degraded much, much more. I also thought that this was probably because my solution seems to be more predictable for the CPU branch predictor than that one, despite iterating twice.
class Solution:
    def trap(self, height: List[int]) -> int:
        l, r, s = [*accumulate(height, max)], [*accumulate(height[::-1], max)], len(height)
        return sum(max(0, min(l[i], r[s-i-1]) - height[i]) for i in range(s))
For instance, all the sorting algorithms turn into, effectively, bubble sort, since without branches you always go with the worst case, and the sorting complexity is always O(n^2). But it's okay, since the algorithm becomes predictable. You just swap a conditional swap:
if a < b then swap(a, b)
with arithmetic swap:
c = (a+b)/2
d = |a-b|/2
a = c + d
b = c - d
and that's it.
I suspect everyone turns to Bubblesort since the inputs are small enough that it doesn't matter (evident by the fact that it should fit within microseconds).
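For comparison, here is the more usual branchless compare-exchange used in sorting networks (a sketch, not from the thread): std::min/std::max on ints typically lower to cmov, and there is no rounding or overflow issue from the averaging.
#include <algorithm>

// After the call, a holds the larger value and b the smaller, matching the
// arithmetic version above, but with no data-dependent jump.
inline void compare_exchange(int& a, int& b) {
    int hi = std::max(a, b);
    int lo = std::min(a, b);
    a = hi;
    b = lo;
}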
For example, I can give it an array A, a vector V, and a mask, and it outputs another vector O: if position i in the bit mask is 1, then O[i] picks the element A[V[i]], otherwise just A[i].
In Python this might look like [A[V[i]] if M[i] else V[i] for i in range(len(V))], so it is very branchy, but in SIMD this would just be a bunch of SIMD operations without a branch!
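A possible AVX2 sketch of that per-lane select (my illustration; the function name is made up, and it follows the "else V[i]" convention of the Python one-liner): a masked gather picks A[V[i]] where the mask lane's high bit is set and passes V[i] through otherwise, with no branch per element. Compile with -mavx2.
#include <immintrin.h>

__m256i gather_or_passthrough(const int* A, __m256i V, __m256i M) {
    // src = V supplies the lanes where the mask is clear; scale 4 = sizeof(int).
    return _mm256_mask_i32gather_epi32(/*src=*/V, A, /*vindex=*/V, /*mask=*/M, /*scale=*/4);
}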
Speaking of which, this is particularly informative about what "branchless programming" really is: https://en.m.wikipedia.org/wiki/Predication_(computer_archit...
If you want to learn about the ultimate form of branchless programming, check out Church encoding: https://en.m.wikipedia.org/wiki/Church_encoding and https://gautier.difolco.dev/2025-09/extreme-branchless-expr-...
Why? I once took over a massive statistics codebase with hundreds of configuration variables. That meant, in theory, upwards of 2^100 possible execution paths, a combinatorial explosion that turned testing into a nightmare. After I linearized the system, removing the exponential branching and reducing it to a straightforward flow, things became dramatically simpler. What had once been a messy codebase that took years to stabilize became easy to reason about and, in practice, guaranteed bug-free.
Some people dismissed the result as "monolithic," which is a meaningless label if you think about it. Yes, the code did one thing and only one thing, but it did that thing perfectly, every single time. It wasn't pretending to be a bloated, half-tested "jack of all trades" statistics library with hidden modes and brittle edge cases.
I'm proud of writing branchless (or, if you prefer, "monolithic") code. To me, it's a hallmark of programming maturity: choosing correctness and clarity over endless configurability, complexity and hidden modes.