The Weird Concept of Branchless Programming
Mood: calm
Sentiment: mixed
Category: other
Key topics: The article discusses the concept of branchless programming, its benefits, and examples, sparking a discussion on its relevance, implementation, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1h after posting
Peak period: 83 comments in Day 1
Avg / period: 22.5
Based on 90 loaded comments
Key moments
- Story posted: Sep 28, 2025 at 12:40 PM EDT (about 2 months ago)
- First comment: Sep 28, 2025 at 1:54 PM EDT (1h after posting)
- Peak activity: 83 comments in Day 1, the hottest window of the conversation
- Latest activity: Oct 3, 2025 at 10:01 AM EDT (about 2 months ago)
Most code should focus on readability, then profile for busy areas under use, and finally refactor the busy areas through hand optimization or register hints as required.
If one creates something that looks suspect (inline Assembly macro), a peer or llvm build will come along and ruin it later for sure. Have a great day =3
In some architectures, both branch paths are executed in parallel, and one is simply tossed after the dependent operations finish. We can't be sure exactly how branch predictors and prefetch are implemented, as that falls under manufacturer NDA. =3
At a 5 ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.
For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:
== Benchmarking ABS ==
ABS (branch): 0.260 sec
ABS (branchless): 0.264 sec
== Benchmarking CLAMP ==
CLAMP (branch): 0.332 sec
CLAMP (branchless): 0.538 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.043 sec
PARTITION (branchless): 0.091 sec
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).
I had to drop down to -O0 to see branchless be faster in any case:
== Benchmarking ABS ==
ABS (branch): 0.743 sec
ABS (branchless): 0.948 sec
== Benchmarking CLAMP ==
CLAMP (branch): 4.275 sec
CLAMP (branchless): 1.429 sec
== Benchmarking PARTITION ==
PARTITION (branch): 0.156 sec
PARTITION (branchless): 0.164 sec
https://gist.github.com/Stefan-JLU/3925c6a73836ce841860b55c8...
Did you check whether your branchy code actually still was branchy after the compiler processed it at higher optimization levels?
The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article. I assume the minimal/nonexistent speedups in the article are due to modern CPU branch prediction being quite good. But branch predictors inherently fail miserably at CRC because the conditional is on whether the input bit is 1 or 0, which is essentially random.
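As a rough sketch of that contrast (illustrative only, not the exact code from the video; bit-at-a-time CRC-32 with the common reflected polynomial), the branchy inner loop decides per input bit, while the branchless version turns the bit into a mask:
#include <cstdint>
#include <cstddef>

// Branchy: one data-dependent, essentially random branch per bit.
uint32_t crc32_branchy(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

// Branchless: the low bit becomes an all-ones/all-zeros mask that gates the XOR.
uint32_t crc32_branchless(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            uint32_t mask = 0u - (crc & 1u);          // 0x00000000 or 0xFFFFFFFF
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
Both compute the same checksum; the only difference is whether the per-bit decision is a jump or an AND.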
>If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance
Sure, but this is over a lot more than a couple years. I'd expect a 2023 mobile i9 to be considerably more than twice as fast as a 2007 desktop Core 2 Duo.
When you get to think about branchless programming, especially for SIMD optimizations in the real world, you always learn a lot, and it's as if you get a +1 level on your algorithmic skills. The hardest part then is making sure the tricks are clearly laid out so that someone else can take it from here next time.
These types of exercises should be mandatory for all compute-intensive jobs, especially for data science people (I know some of them know this stuff, but that does not seem to be the majority).
Real-world high-performance matrix multiplication functions do contain branches internally, even on GPUs. If you are ever curious about what that looks like, NVidia maintains an open-source library called CUTLASS.
(not a critique)
Obviously that requires a lot of extra transistors and you are doing computation that will be thrown away, so it's not free in terms of space or power/heat/energy. But perhaps it could handle cases that other approaches can't.
Even more of a wild idea is to pair up two cores and have them work together this way. When you have a core that would have been idle anyway, it can shadow an active core and be its doppelganger that takes the other branch. You'd need to have very fast communication between cores so the shadow core can spring into action instantly when you hit a branch.
My gut instinct is it's not worth it overall, but I'm curious whether it's been tried in the real world.
It doesn't require duplicating everything. You just need to add some additional bookkeeping of dependencies and what to retire vs kill at the end of the pipeline.
In practice branch predictors are so good that speculating off the "spine" of most likely path just isn't worth it.
In fact there were a lot of interesting microarchitectural ideas from the late 90s to early 00s that just ended up being moot because the combination of OoO speculation, branch predictors, and big caches proved so effective.
They already do it (edit: they don’t). It is difficult to get security right, however (see https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...).
> Even more of a wild idea is to pair up two cores and have them work together this way
I don't think that'll be profitable, because...
> When you have a core that would have been idle anyway
...you'll just schedule in another process. A modern OS rarely runs short of tasks to run.
As others said: yes, it has been tried and it works, but it costs a lot in hardware and power usage. A problem is that lots of code has a branch every 10 or so instructions. Fast high-end CPUs (the only realistic target for this feature) can dispatch multiple instructions per cycle. Combined that means you will hit a branch every two or three cycles. Because of that, you do not end up with two of everything but with way more.
So, you’re throwing away not 50% of your work but easily 80%.
Some code has fewer branches, but that often can easily be parallelized or vectorized.
After thinking through this carefully, though, I do not see UB (except for signed overflow in a corner case):
Step 1, bit shift.
I understand that, until C++20, left shift of a signed int was UB. But this right shift appears to be implementation defined, which is ok if documented in the code.
Step 2: Add.
Then, (x + mask) is defined behavior (a) for positive x, since then mask=0, and (b) for most negative x, since then mask=-1. However, if x is numeric_limits::lowest, then you get signed integer overflow, which is UB.
Step 3, XOR.
Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as of C++20, when signed integers became two's complement officially. It might have been implementation defined before then, which would be almost as good for something that ubiquitous, but I'm not sure.
Ok, so I think this does not involve UB except for an input of numeric_limits::lowest.
Sound about right?
To fix this, perhaps you would need to make that + an unsigned one?
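A minimal sketch of that fix, assuming a 32-bit int (illustrative, the function name is mine, not code from the article): carry out both the shift and the add on uint32_t so neither step can overflow, and convert back at the end (well defined in C++20, implementation defined before that).
#include <cstdint>

// Same (x + mask) ^ mask idea, done in unsigned arithmetic so neither step can hit UB.
// abs(INT_MIN) still has no representable result; here it simply wraps back to INT_MIN.
int32_t abs_branchless_u(int32_t x) {
    uint32_t ux   = static_cast<uint32_t>(x);
    uint32_t mask = 0u - (ux >> 31);                  // 0 if x >= 0, 0xFFFFFFFF if x < 0
    return static_cast<int32_t>((ux + mask) ^ mask);
}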
It bothers me how hard you need to think to do this language lawyering. C++ is a system of rules. Computers interpret rules. The language lawyer should be a piece of software. I get it, not everything can be done statically, so, fine, do it dynamically? UBSan comes closest in practice, but doesn't detect everything. I understand formally modeled versions of C and C++ have been developed commercially, but these are not open source so they effectively do not exist. It's a strange situation.
Just the other day I considered writing something branchless and said, "nah", because of uncertainty around UB. How much performance is being left on the table around the world because of similar thinking, driven by the existence of UB?
Maybe I was supposed to write OCaml or Pascal or Rust.
The UB on the add happens in cases where all incarnations of abs() would fail as well, because there simply isn't a correct return value.
cargo miri run
And if your code actually touches UB, Miri will most likely point out exactly where and why.
mov eax, edi
sar eax, 31
mov ecx, eax
add edi, ecx
xor eax, edi
Has this been generated by a C compiler? If yes, it's a bit puzzling, because can't you remove "mov ecx, eax", replace "add edi, ecx" by "add edi, eax" and have the exact same result?
But rules of thumb are like this. If you know enough to question the rule of thumb, go ahead. Hand assembly in hot code can be worth the cost.
It's also possible the value in ecx is used again outside the snippet?
Yes, there's the possibility that ecx is used elsewhere, and in that case my second comment is irrelevant, because I was addressing the possibility that such a big wart is to be expected from compilers, i.e. that they crop up regularly.
But then again, it's unlikely that it's used elsewhere, because eax has the return value of the C snippet, there's nothing else to do, the function can return. So the original question remains: did this come from a C compiler? If yes, it's crappy code.
Huge in space sure. Not in execution time.
int main (void) {
    for (int i = 0; i < 1000000000; ++i) {
        asm volatile (
            ".intel_syntax noprefix\n"
            "mov eax, edi\n"
            "sar eax, 31\n"
            "add edi, eax\n"
            "xor eax, edi\n"
            ".att_syntax prefix\n"
            ::: "eax", "edi", "cc");
    }
    return 0;
}
option1 has the extraneous mov ecx, eax, and then the add with ecx.
I confirmed with objdump -d that the assembly hadn't been touched and that the loops were the same. On my otherwise mostly idle dual L5640 system, pinned to a single CPU (just in case), option1 consistently runs in 3.14 seconds and option2 consistently runs in 3.15 seconds.
Adding an extra zero, both option1 and option2 run in 30.94-30.95 user seconds. The extraneous move doesn't seem to cost any actual time.
But if you look at your program that must go faster, and you see unnecessary moves in the hot section(s), go ahead and remove them, but don't be surprised if it doesn't change much.
If you went and did your whole program by hand, the debloating might also not change much. That's why there's a rule of thumb.
If you have the skill to make a change to the compiler so it can output a better sequence of instructions (I suspect that's pretty difficult), it may make enough of a difference over a large number of programs to be worthwhile.
abs_branchless(int):
mov eax, edi
neg eax
cmovs eax, edi
ret
abs_branch(int):
mov eax, edi
neg eax
cmovs eax, edi
ret
The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:
// branch version
switch (var1)
{
    case 1:
        var4 = var2 + var3;
        break;
    case 2:
        var4 = var2 - var3;
        break;
    case 3:
        var4 = var2 * var3;
        break;
    case 4:
        var4 = var2 / var3;
        break;
    // ...
    default:
        var4 = 0; // optional and arbitrary
        break;
}
// branchless version
var4 = magic(var1, var2, var3);
I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions to the larger set of functions.
So for now, I may just use a switch() statement in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to allow that performantly, or if they end up just calculating everything and throwing away the paths not taken, which would suggest that the max speed would be O(1/N) for the number of cases.
Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.
Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.
E.g., if we had just 1 or 2 as options for the case statements, note that test1 = xor(var1 & 0x1, var1>>1 & 0x1) & var1 will, with integer rounding, equal ‘1’ if var1 is one or ‘0’ if var1 is 2, 3 or 4. (If it goes to 5 you need another xor at the next highest bit.)
Likewise test2 = xor(var1 & 0x1, var1>>1 & 0x1) & var1<<1 equals ‘0’ if var1 is 1, 3 or 4 but it equals ‘1’ if it’s two.
So (test1 * (value on var1 being 1)) + (test2 * (value on var1 being 2)) will get you a branchless version for the simple either-1-or-2 case. This will pretty much always be much slower than the branching version, but some types of systems out there don’t really do branching and are linear. This type of zeroing trick is really common on matrix multipliers.
Essentially you're creating binary logic gates that give 0 or 1 for the exact value you want and blindly multiplying that by the value it would have produced if 1.
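A compact sketch of that zeroing trick, covering just the add and subtract cases from the switch above (names are illustrative): the comparisons produce 0 or 1 without a jump, and multiplying zeroes out the unselected result.
// (var1 == 1) and (var1 == 2) lower to setcc, not jumps; any other var1 yields 0.
int select_branchless(int var1, int var2, int var3) {
    int add = var2 + var3;
    int sub = var2 - var3;
    return (var1 == 1) * add + (var1 == 2) * sub;
}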
vars[1] = var2 + var3;
vars[2] = var2 - var3;
...
vars[0] = 0;
var4 = vars[var1];
https://godbolt.org/z/14h5djYe8
This looks branchless, if you don't care about divide-by-zero; the ALU might not be too happy.
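One way to sketch that lookup so the eagerly evaluated divide cannot trap (purely illustrative, reusing the case set and the magic() name from the switch example above):
// Compute every candidate, then index with the selector. The divisor is forced
// to 1 when var3 == 0 so the unconditional divide cannot fault (INT_MIN / -1 is
// a remaining hazard, ignored here).
int magic(int var1, int var2, int var3) {
    int den = var3 | (var3 == 0);
    int vars[5] = { 0, var2 + var3, var2 - var3, var2 * var3, var2 / den };
    return vars[(var1 >= 1 && var1 <= 4) ? var1 : 0];
}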
In case anyone ever finds this, I realized that a brute force solution would be a lookup table of all combinations. For three 8 bit variables, that's 2^24=16,777,216 or 16 MB of EPROM.
Then we could apply truth table simplification techniques like Karnaugh maps, sum of products (SOP), products of sums (POS), etc:
https://workforce.libretexts.org/Bookshelves/Electronics_Tec...
https://tma.main.jp/logic/index_en.html
https://karnaughmapsolver.com/
I asked AI, and if the bits approximate a pseudorandom pattern, then we would need approximately log2(2^24)=24 levels of 2-input boolean logic gates. For NAND or NOR gates, apparently that's something like 3x2^27=400 million gates. If we could manage 24-input gates somehow, that might be reduced to ceil(log2(24))=5 levels, but internally that would occupy about the same amount of die area. Perhaps using a hybrid of gates, multiplexers and EPROMS could result in a much smaller circuit at the cost of higher propagation delay (latency), which is the goal of FPGAs.
We could then play with the ordering of the control variable var1 to explore all 256 possibilities and see if any result in a reduced number of logic levels (by maximizing the size of Karnaugh map loops). I suspect that might only save a few levels at most though, but even that could cut the number of gates by 50-90%.
Thankfully the control variable var1 might only have 16-64 possibilities, which lowers it to 4-6 bits or just 2^20=1,048,576 to 2^22=4,194,304 or 1-4 MB of EPROM, but still about 3*2^25=100 million gates worst case.
For 16, 32 or 64 bit integers or floats, it might be better to just calculate all cases with dedicated hardware and select the right one.
var4 = (var1==1)*(var2+var3) | (var1==2)*(var2-var3) | ...
Of course this is basically the slowest possible option, but it might work as a starting point for a general-purpose optimizer to find a faster solution. If it doesn't, this will likely compile to conditional move instructions.
It might help if you replace the comparison and multiplication with an equivalent expression made from bitwise operations, but I believe most compilers already know how to do this transformation.
"Branchy" abs() translates into
abs_branch:
mov eax, edi
neg eax
cmovs eax, edi
ret
on x64, and into abs_branch:
srai a5,a0,31
xor a0,a5,a0
sub a0,a0,a5
ret
on RISC-V if you use a C compiler with a half-decent codegen. And "branchy" clamp() translates into clamp:
cmp edi, edx
mov eax, esi
cmovle edx, edi
cmp edi, esi
cmovge eax, edx
ret
Seriously, the automatic transformation between ?: and if-then-else (in both directions) is quite well studied by now. And if you try to benchmark the difference between branching and branchless implementations, please make sure that the branches you expect are actually there in the compiler's output.
> int partition_branchless(int* arr, int low, int high) {... for (int j = low; j < high; j++) { ... }... }
That for loop is just a sugared while loop which is just a sugared cmp and jmp
There is this common misconception that conditional moves == branching. On actually relevant architectures, they very much are not. Replacing a p=0.5 branch with a conditional move is in itself a significant optimization.
add<<<1, 1>>>(N, x, y);
All N adds are conceptually done in parallel, with no side effects. In practice, hundreds or thousands of adds are done simultaneously, depending on the available hardware.
This is true branchless programming.
[1] https://developer.nvidia.com/blog/even-easier-introduction-c...
JZ, JE - Jump if Zero, Jump if Equal
JNZ, JNE - Jump if Not Zero, Jump if Not Equal
JC - Jump if Carry
JNC - Jump if No Carry
JO - Jump if Overflow
JNO - Jump if No Overflow
JS - Jump if Signed (Negative)
JNS - Jump if Not Signed (Positive or Zero)
JP, JPE - Jump if Parity, Jump if Parity is Even
JNP, JPO - Jump if Not Parity, Jump if Parity is Odd
Say in Rust:
let foo = if bar { 1 } else { 2 };
And
let mut foo; if bar { foo = 1; } else { foo = 2; }
Although they look the same, function the same, and are effectively the same, the first one is the conditional move, and the second one would initially be a jump (until further compiler optimizations kick in).
You will notice that with the conditional move you "get" a predictable expression for the result, but with the branched jump it's like you "get" a bunch of arbitrary statements that write to the variable. Both may end up folding so that they essentially compile to cmov, but the way the assignment is represented is different. You can be certain with conditional instructions, but you can't be certain with a branched jump; otherwise we wouldn't need branch prediction.
In fact, the way conditional instructions work is due to Church encoding: you create a lambda function that calls the left or right function depending on the input evaluation, which can be seen as implicitly embedding the branch.
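A tiny illustration of that Church-encoding view in C++ (my sketch, not from the thread): a boolean is a function that, given two alternatives, returns one of them, so selection takes the place of branching.
#include <functional>
#include <iostream>

using ChurchBool = std::function<int(int, int)>;

const ChurchBool church_true  = [](int t, int)  { return t; };
const ChurchBool church_false = [](int, int f)  { return f; };

int main() {
    ChurchBool bar = church_true;
    int foo = bar(1, 2);        // plays the role of `let foo = if bar { 1 } else { 2 };`
    std::cout << foo << '\n';   // prints 1
}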
On leetcode the "elegant" and fastest solution was something like this:
class Solution {
public:
    int trap(vector<int>& height) {
        int i = 1;
        int n = height.size();
        int j = n - 1;
        int leftmax = height[0];
        int rightmax = height[n - 1];
        int tw = 0;
        while (i <= j) {
            if (leftmax <= rightmax) {
                if (leftmax > height[i]) {
                    tw += leftmax - height[i];
                } else {
                    leftmax = height[i];
                }
                i++;
            } else {
                if (rightmax > height[j]) {
                    tw += rightmax - height[j];
                } else {
                    rightmax = height[j];
                }
                j--;
            }
        }
        return tw;
    }
};
My solution was this:
class Solution {
public:
    int trap(const vector<int>& height) {
        int mx = 0;
        long long sum = 0;
        for (const auto h : height) {
            mx = max(mx, h);
            sum += mx - h;
        }
        int rMx = 0, h, i;
        for (i = height.size() - 1; (h = height[i]) != mx; --i) {
            rMx = max(h, rMx);
            sum += rMx;
        }
        return sum - (height.size() - 1 - i) * mx;
    }
};
However, I compiled and ran some benchmarks on my local machine, and it turned out that while on average my solution was slower, in the worst case it was beating that solution. My solution was less sensitive to the random input, and the degradation was far less dramatic, around 2x, while that solution degraded much, much more. I also thought that this was probably because my solution seems to be more predictable for the CPU branch predictor than that one, despite iterating twice.
class Solution:
    def trap(self, height: List[int]) -> int:
        l, r, s = [*accumulate(height, max)], [*accumulate(height[::-1], max)], len(height)
        return sum(max(0, min(l[i], r[s-i-1]) - height[i]) for i in range(s))
For instance, all the sorting algorithms turn into, effectively, bubble sort, since without branches you always go with the worst case, and the sorting complexity is always O(n^2). But it's okay, since the algorithm becomes predictable. You just swap a conditional swap:
if a < b then swap(a, b)
with arithmetic swap:
c = (a+b)/2
d = |a-b|/2
a = c + d
b = c - d
and that's it.
I suspect everyone turns to Bubblesort since the inputs are small enough that it doesn't matter (evident by the fact that it should fit within microseconds).
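For comparison, here is the more usual branchless compare-exchange used in sorting networks (a sketch, not from the thread): std::min/std::max on ints typically lower to cmov, and there is no rounding or overflow issue from the averaging.
#include <algorithm>

// After the call, a holds the larger value and b the smaller, matching the
// arithmetic version above, but with no data-dependent jump.
inline void compare_exchange(int& a, int& b) {
    int hi = std::max(a, b);
    int lo = std::min(a, b);
    a = hi;
    b = lo;
}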
For example, I can give it an array A, a vector V, and a mask, and it outputs another vector O: if position i in the bit mask is 1, then O[i] picks the element A[V[i]], otherwise just A[i].
In Python this might look like [A[V[i]] if M[i] else V[i] for i in range(len(V))], so it is very branchy, but in SIMD this would just be a bunch of SIMD operations without a branch!
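A possible AVX2 sketch of that per-lane select (my illustration; the function name is made up, and it follows the "else V[i]" convention of the Python one-liner): a masked gather picks A[V[i]] where the mask lane's high bit is set and passes V[i] through otherwise, with no branch per element. Compile with -mavx2.
#include <immintrin.h>

__m256i gather_or_passthrough(const int* A, __m256i V, __m256i M) {
    // src = V supplies the lanes where the mask is clear; scale 4 = sizeof(int).
    return _mm256_mask_i32gather_epi32(/*src=*/V, A, /*vindex=*/V, /*mask=*/M, /*scale=*/4);
}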
Speaking of which, this is particularly informative about what "branchless programming" really is: https://en.m.wikipedia.org/wiki/Predication_(computer_archit...
If you want to learn about the ultimate form of branchless programming, check out Church encoding: https://en.m.wikipedia.org/wiki/Church_encoding and https://gautier.difolco.dev/2025-09/extreme-branchless-expr-...
Why? I once took over a massive statistics codebase with hundreds of configuration variables. That meant, in theory, upwards of 2^100 possible execution paths, a combinatorial explosion that turned testing into a nightmare. After I linearized the system, removing the exponential branching and reducing it to a straightforward flow, things became dramatically simpler. What had once been a messy codebase that took years to stabilize became easy to reason about and, in practice, guaranteed bug-free.
Some people dismissed the result as "monolithic," which is a meaningless label if you think about it. Yes, the code did one thing and only one thing, but it did that thing perfectly, every single time. It wasn't pretending to be a bloated, half-tested "jack of all trades" statistics library with hidden modes and brittle edge cases.
I'm proud of writing branchless (or, if you prefer, "monolithic") code. To me, it's a hallmark of programming maturity: choosing correctness and clarity over endless configurability, complexity and hidden modes.