86 GB/s Bitpacking with ARM SIMD (Single Thread)
Posted 3 months ago · Active 3 months ago
github.com · Tech · story
calm · positive
Debate: 40/100
Key topics
SIMD
ARM
Bitpacking
Performance Optimization
The post showcases an impressive 86 GB/s bitpacking performance using ARM SIMD on a single thread, sparking discussions on implementation details, related research, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 1h after posting
Peak period: 8 comments in 0-2h
Avg / period: 3.2
Comment distribution: 29 data points (based on 29 loaded comments)
Key moments
- 01 Story posted: Oct 5, 2025 at 8:27 AM EDT (3 months ago)
- 02 First comment: Oct 5, 2025 at 9:39 AM EDT (1h after posting)
- 03 Peak activity: 8 comments in 0-2h, the hottest window of the conversation
- 04 Latest activity: Oct 6, 2025 at 10:44 AM EDT (3 months ago)
ID: 45481008 · Type: story · Last synced: 11/20/2025, 5:39:21 PM
Are the benchmark results in the README real? (The README itself feels very AI-generated)
Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!
Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)
For the baseline you need the SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous state of the art (https://arxiv.org/abs/1209.2137), which happens to be x86-based; compiling it through SIMDe was the highest-integrity way I could think of to compare against the previous SOTA.
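For readers unfamiliar with SIMDe, a minimal sketch of the approach (a generic illustration, not code from this repo): with native aliases enabled, x86 intrinsic names compile on AArch64 and lower to NEON.

```c
/* Illustrative only: SIMDe's native-alias mode lets x86 SSE2 intrinsics
 * compile on AArch64, lowering to NEON under the hood.
 * Requires the headers from github.com/simd-everywhere/simde. */
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse2.h>
#include <stdint.h>

/* Add four 32-bit ints using x86 intrinsic names; on ARM this becomes NEON. */
void add4(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```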
Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this, and M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.
The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).
I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
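For clarity, a sketch of how a geometric mean over per-configuration throughputs yields one top-line number (illustrative only; the function name is mine):

```c
#include <math.h>
#include <stddef.h>

/* Geometric mean of n throughput figures (GB/s), computed in log space
 * to avoid overflow and to keep the result scale-invariant. */
double geomean_gbps(const double *gbps, size_t n) {
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(gbps[i]);
    return exp(log_sum / (double)n);
}
```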
Pretty sure anyone going into this kind of post about SIMD would prefer your writing to an LLM's.
There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating popular bitpacking use cases with NEON in 1-2 instructions. It does not cover the unpacking cases, but it can be great for real-world applications like compare+find, compare+iterate, and compare+test.
https://community.arm.com/arm-community-blogs/b/servers-and-...
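I believe the commonly cited trick here is the SHRN-by-4 "nibble mask"; a sketch, assuming the input is a byte-wise comparison result of all-ones/all-zeros (details may differ from the linked post):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Movemask substitute on NEON: instead of 1 bit per lane (x86 PMOVMSKB),
 * produce 4 bits per lane in a single 64-bit scalar. One SHRN after the
 * compare; test with != 0, find the first match with ctz(mask) / 4. */
static inline uint64_t neon_nibble_mask(uint8x16_t cmp_result) {
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(cmp_result), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
}
```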
NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.
I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.
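A sketch of the multi-register TBL point (assumes AArch64; the helper name is mine): a single vqtbl2q_u8 looks up 16 byte indices in a 32-byte table, and out-of-range indices come back as zero, so one instruction covers lookup plus masking.

```c
#include <arm_neon.h>

/* One two-register TBL: 16 parallel lookups into a 32-byte table.
 * Indices >= 32 return 0, which TBL guarantees architecturally. */
static inline uint8x16_t lookup32(uint8x16x2_t table, uint8x16_t idx) {
    return vqtbl2q_u8(table, idx);
}
```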
Also: AVX-512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).
Also, AVX-512 is way more common than SVE2; all Zen4 & Zen5 chips support it.
E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1, k), -1)) you could do vandn.vv(a, vsll.vv(-1, k)).
AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
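To make that concrete, a sketch (mine, assuming AVX-512F) of the same low-k-bits mask as the RVV example, done as one variable shift plus one vpternlog; imm8 0x30 encodes the two-input function A AND NOT B:

```c
#include <immintrin.h>

/* Keep only the low k bits of each 64-bit lane: a & ~(~0 << k).
 * vpternlog evaluates any 3-input boolean function chosen by imm8;
 * 0x30 selects A AND NOT B (the C operand is ignored here). */
static inline __m512i mask_low_k(__m512i a, __m512i k) {
    __m512i hi = _mm512_sllv_epi64(_mm512_set1_epi64(-1), k); /* ~0 << k per lane */
    return _mm512_ternarylogic_epi64(a, hi, hi, 0x30);        /* a & ~hi */
}
```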
I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.
If you loop through an array once and then iterate through it again, you can figure out where it will be cached based on the array size.
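A rough sketch of that measurement idea (illustrative; thresholds and numbers are machine-dependent): warm the array with one pass, time a second pass, and watch throughput drop as the working set outgrows each cache level.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static volatile uint64_t g_sink;  /* defeats dead-code elimination */

static double second_pass_gbps(size_t bytes) {
    uint8_t *buf = malloc(bytes);
    memset(buf, 1, bytes);                       /* first pass: warm the caches */

    struct timespec t0, t1;
    uint64_t acc = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < bytes; i++)           /* timed second pass */
        acc += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    g_sink = acc;
    free(buf);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)bytes / sec / 1e9;
}

int main(void) {
    /* Sweep from well inside L1 to well past typical L3 sizes. */
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 2)
        printf("%8zu KiB: %.1f GB/s\n", kib, second_pass_gbps(kib * 1024));
    return 0;
}
```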
The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multi-cycle integer operations like the insert variant of BFM (essential for scalar packing).
More importantly, NEON processes 16 elements per instruction instead of 1.
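For context, a sketch of the scalar inner step being described (hypothetical helper, not from the repo): the masked-insert pattern below is what compilers typically turn into BFI/BFM on AArch64 when the position and width are constants.

```c
#include <stdint.h>

/* Insert the low k bits of `value` into `word` at bit offset `bit_pos`.
 * With constant bit_pos/k this usually compiles to a single BFI/BFM,
 * the multi-cycle scalar op the comment above says only 2 of the 6
 * scalar integer ports can execute on Neoverse V2. */
static inline uint64_t pack_one(uint64_t word, uint64_t value,
                                unsigned bit_pos, unsigned k) {
    uint64_t mask = (k < 64) ? ((UINT64_C(1) << k) - 1) : ~UINT64_C(0);
    return (word & ~(mask << bit_pos)) | ((value & mask) << bit_pos);
}
```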