86 GB/s Bitpacking with ARM SIMD (Single Thread)
Posted 3 months ago · Active 3 months ago
github.com · Tech · story
calm · positive
Debate: 40/100
Key topics
SIMD
ARM
Bitpacking
Performance Optimization
The post showcases an impressive 86 GB/s bitpacking performance using ARM SIMD on a single thread, sparking discussions on implementation details, related research, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 1h after posting
Peak period: 8 comments in 0-2h
Avg / period: 3.2
Comment distribution: 29 data points (based on 29 loaded comments)
Key moments
- 01 Story posted: Oct 5, 2025 at 8:27 AM EDT (3 months ago)
- 02 First comment: Oct 5, 2025 at 9:39 AM EDT (1h after posting)
- 03 Peak activity: 8 comments in 0-2h, the hottest window of the conversation
- 04 Latest activity: Oct 6, 2025 at 10:44 AM EDT (3 months ago)
ID: 45481008 · Type: story · Last synced: 11/20/2025, 5:39:21 PM
Are the benchmark results in the README real? (The README itself feels very AI-generated)
Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!
Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)
For the baseline you need the SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous state of the art (https://arxiv.org/abs/1209.2137), which happens to be x86-based; compiling it through SIMDe was the highest-integrity way I could think of to compare against the previous SOTA.
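For readers unfamiliar with SIMDe, a minimal sketch of the approach (a generic illustration, not code from this repo): with native aliases enabled, x86 intrinsic names compile on AArch64 and lower to NEON.

```c
/* Illustrative only: SIMDe's native-alias mode lets x86 SSE2 intrinsics
 * compile on AArch64, lowering to NEON under the hood.
 * Requires the headers from github.com/simd-everywhere/simde. */
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/sse2.h>
#include <stdint.h>

/* Add four 32-bit ints using x86 intrinsic names; on ARM this becomes NEON. */
void add4(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```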
Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this, and M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.
The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).
I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
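For clarity, a sketch of how a geometric mean over per-configuration throughputs yields one top-line number (illustrative only; the function name is mine):

```c
#include <math.h>
#include <stddef.h>

/* Geometric mean of n throughput figures (GB/s), computed in log space
 * to avoid overflow and to keep the result scale-invariant. */
double geomean_gbps(const double *gbps, size_t n) {
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(gbps[i]);
    return exp(log_sum / (double)n);
}
```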
Pretty sure anyone going into this kind of post about SIMD would prefer your writing to an LLM's.
There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating popular bitpacking use cases with NEON in 1-2 instructions. It does not cover the unpacking cases, but it can be great for real-world applications like compare+find, compare+iterate, and compare+test.
https://community.arm.com/arm-community-blogs/b/servers-and-...
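I believe the commonly cited trick here is the SHRN-by-4 "nibble mask"; a sketch, assuming the input is a byte-wise comparison result of all-ones/all-zeros (details may differ from the linked post):

```c
#include <arm_neon.h>
#include <stdint.h>

/* Movemask substitute on NEON: instead of 1 bit per lane (x86 PMOVMSKB),
 * produce 4 bits per lane in a single 64-bit scalar. One SHRN after the
 * compare; test with != 0, find the first match with ctz(mask) / 4. */
static inline uint64_t neon_nibble_mask(uint8x16_t cmp_result) {
    uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(cmp_result), 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
}
```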
NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.
I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.
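A sketch of the multi-register TBL point (assumes AArch64; the helper name is mine): a single vqtbl2q_u8 looks up 16 byte indices in a 32-byte table, and out-of-range indices come back as zero, so one instruction covers lookup plus masking.

```c
#include <arm_neon.h>

/* One two-register TBL: 16 parallel lookups into a 32-byte table.
 * Indices >= 32 return 0, which TBL guarantees architecturally. */
static inline uint8x16_t lookup32(uint8x16x2_t table, uint8x16_t idx) {
    return vqtbl2q_u8(table, idx);
}
```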
Also: AVX-512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).
Also, AVX-512 is way more common than SVE2; all Zen4 & Zen5 chips support it.
E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1, k), -1)) you could do vandn.vv(a, vsll.vv(-1, k)).
AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
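To make that concrete, a sketch (mine, assuming AVX-512F) of the same low-k-bits mask as the RVV example, done as one variable shift plus one vpternlog; imm8 0x30 encodes the two-input function A AND NOT B:

```c
#include <immintrin.h>

/* Keep only the low k bits of each 64-bit lane: a & ~(~0 << k).
 * vpternlog evaluates any 3-input boolean function chosen by imm8;
 * 0x30 selects A AND NOT B (the C operand is ignored here). */
static inline __m512i mask_low_k(__m512i a, __m512i k) {
    __m512i hi = _mm512_sllv_epi64(_mm512_set1_epi64(-1), k); /* ~0 << k per lane */
    return _mm512_ternarylogic_epi64(a, hi, hi, 0x30);        /* a & ~hi */
}
```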
I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.
If you loop through an array once and then iterate through it again, you can figure out where it will be cached based on the array size.
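A rough sketch of that measurement idea (illustrative; thresholds and numbers are machine-dependent): warm the array with one pass, time a second pass, and watch throughput drop as the working set outgrows each cache level.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static volatile uint64_t g_sink;  /* defeats dead-code elimination */

static double second_pass_gbps(size_t bytes) {
    uint8_t *buf = malloc(bytes);
    memset(buf, 1, bytes);                       /* first pass: warm the caches */

    struct timespec t0, t1;
    uint64_t acc = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < bytes; i++)           /* timed second pass */
        acc += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    g_sink = acc;
    free(buf);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)bytes / sec / 1e9;
}

int main(void) {
    /* Sweep from well inside L1 to well past typical L3 sizes. */
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 2)
        printf("%8zu KiB: %.1f GB/s\n", kib, second_pass_gbps(kib * 1024));
    return 0;
}
```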
The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multi-cycle integer operations like the insert variant of BFM (essential for scalar packing).
More importantly, NEON processes 16 elements per instruction instead of 1.
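For context, a sketch of the scalar inner step being described (hypothetical helper, not from the repo): the masked-insert pattern below is what compilers typically turn into BFI/BFM on AArch64 when the position and width are constants.

```c
#include <stdint.h>

/* Insert the low k bits of `value` into `word` at bit offset `bit_pos`.
 * With constant bit_pos/k this usually compiles to a single BFI/BFM,
 * the multi-cycle scalar op the comment above says only 2 of the 6
 * scalar integer ports can execute on Neoverse V2. */
static inline uint64_t pack_one(uint64_t word, uint64_t value,
                                unsigned bit_pos, unsigned k) {
    uint64_t mask = (k < 64) ? ((UINT64_C(1) << k) - 1) : ~UINT64_C(0);
    return (word & ~(mask << bit_pos)) | ((value & mask) << bit_pos);
}
```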