Engineering a Fixed-Width Bit-Packed Integer Vector in Rust
Posted 4 months ago · Active 3 months ago
lukefleed.xyz · Tech · story
Tone: calm, positive · Debate: 40/100
Key topics
Rust
Data Compression
Bit-Packing
Performance Optimization
The article discusses implementing a fixed-width bit-packed integer vector in Rust, and the comments explore various aspects of the implementation, including performance, edge cases, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 6h after posting
Peak period: 6 comments (18-20h)
Avg / period: 2.3
Comment distribution: 21 data points
Based on 21 loaded comments
Key moments
- Story posted: Sep 24, 2025 at 11:17 AM EDT (4 months ago)
- First comment: Sep 24, 2025 at 5:13 PM EDT (6h after posting)
- Peak activity: 6 comments in 18-20h (hottest window of the conversation)
- Latest activity: Sep 25, 2025 at 5:30 PM EDT (3 months ago)
ID: 45361578 · Type: story · Last synced: 11/20/2025, 12:44:40 PM
Read the primary article or dive into the live Hacker News thread for the full context.
In the case of a 63-bit width, reading the value at offset 1 requires two reads: it needs 1 bit from the first byte, then 56 bits from the next 7 bytes, and finally 6 bits from the 9th byte. Nine bytes cannot be loaded in a single u64 load.
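For reference, a minimal sketch of that two-read path (not the crate's actual code; it assumes a Vec<u64> backing store and bit widths of at most 64):

    // Illustrative helper: read the `index`-th value of `bit_width` bits from a
    // packed Vec<u64>, issuing a second word load only when the value straddles
    // a 64-bit boundary.
    fn get(words: &[u64], bit_width: usize, index: usize) -> u64 {
        debug_assert!(bit_width >= 1 && bit_width <= 64);
        let bit_pos = index * bit_width;
        let word_idx = bit_pos / 64;
        let bit_off = bit_pos % 64;
        let mask = if bit_width == 64 { u64::MAX } else { (1u64 << bit_width) - 1 };

        let lo = words[word_idx] >> bit_off;
        if bit_off + bit_width <= 64 {
            lo & mask // fast path: the value lives entirely in one word
        } else {
            // slow path: the remaining high bits live in the next word
            let hi = words[word_idx + 1] << (64 - bit_off);
            (lo | hi) & mask
        }
    }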
I already patched it, thanks again.
Instead of doing stuff like (word >> bit_offset) & self.mask, in C or C++ I usually write _bextr_u64, or Bmi1.X64.BitFieldExtract in modern C#. Note, however, that these intrinsics compile into 2 instructions, not one, because the CPU instruction takes its start/length arguments from the lower/higher bytes of a single 16-bit number.
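For what it's worth, Rust exposes the same intrinsic in core::arch; a rough sketch of the two spellings (the names word, bit_offset, and width are illustrative, not the article's API):

    // Portable version: shift, then mask.
    fn extract_shift_mask(word: u64, bit_offset: u32, width: u32) -> u64 {
        let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
        (word >> bit_offset) & mask
    }

    // BMI1 version: still two instructions, because the start/length pair has
    // to be packed into BEXTR's 16-bit control operand first.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "bmi1")]
    unsafe fn extract_bextr(word: u64, bit_offset: u32, width: u32) -> u64 {
        core::arch::x86_64::_bextr_u64(word, bit_offset, width)
    }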
I chose u128 instead of a [u64; 2] array because of the boundary-crossing issue I saw would happen, which u128 lets me avoid (rough sketch below).
I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?
Edit: https://godbolt.org/z/rrhW6T7Mc
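A rough sketch of that u128 approach, as I read it (illustrative only, not the crate's actual code):

    // Widen two adjacent words into a u128 so a value of up to 64 bits can
    // never straddle the loaded region.
    fn get_via_u128(words: &[u64], bit_width: usize, index: usize) -> u64 {
        debug_assert!(bit_width >= 1 && bit_width <= 64);
        let bit_pos = index * bit_width;
        let word_idx = bit_pos / 64;
        let lo = words[word_idx] as u128;
        // The last value may not have a following word, so read 0 in that case.
        let hi = words.get(word_idx + 1).copied().unwrap_or(0) as u128;
        let both = lo | (hi << 64);
        let mask = (1u128 << bit_width) - 1;
        ((both >> (bit_pos % 64)) & mask) as u64
    }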
Not as far as I'm aware, but I think your use case is handled rather well by the u64 version. Instead of u128, use an array of two uint64 integers and pack the length into the unused high bits of one of them.
Here's example C++: https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2; the unpack is scalar code based on that BMI1 instruction.
Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh It might be faster if you have a lot of these packed vectors, extracting numbers in a tight loop, and the s_extractElements lookup table stays in L1D cache.
P.S. I've only tested that code a couple of times; there might be bugs.
Using the intrinsic directly would also kill portability. I'd need #[cfg] for ARM and runtime checks for older x86 CPUs, which adds complexity for a tiny, if any, gain. The current code just lets the compiler pick the best instruction on any architecture.
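As a rough illustration of that complexity, a hypothetical runtime-dispatching version (not code from the crate) would look something like this:

    // Use BEXTR where the CPU supports it, otherwise fall back to the plain
    // shift-and-mask that the compiler can optimize on any architecture.
    fn extract(word: u64, bit_offset: u32, width: u32) -> u64 {
        #[cfg(target_arch = "x86_64")]
        {
            if std::is_x86_feature_detected!("bmi1") {
                // SAFETY: guarded by the runtime feature check above.
                return unsafe { core::arch::x86_64::_bextr_u64(word, bit_offset, width) };
            }
        }
        let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
        (word >> bit_offset) & mask
    }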
Vec<u8> may well be the right call for most use cases. This library, however, is for when even 8 bits compared to 4 is too much. Another example: if all your values fit in 9 bits, you'd be forced to use a Vec<u16> and waste 7 bits for every single number. With this structure, you just use 9 bits. That's almost a 2x space saving, and at scale that makes a difference.
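A quick back-of-the-envelope check of that claim (illustrative numbers, assuming a Vec<u64>-backed layout):

    // Bytes needed to store `len` values of `bit_width` bits, rounded up to the
    // whole 64-bit words a Vec<u64>-backed structure would allocate.
    fn packed_bytes(len: usize, bit_width: usize) -> usize {
        (len * bit_width).div_ceil(64) * 8
    }

    fn main() {
        let n = 1_000_000;
        println!("Vec<u16>:     {} bytes", n * 2);              // 2,000,000
        println!("9-bit packed: {} bytes", packed_bytes(n, 9)); // 1,125,000
    }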
For floats, you'd use fixed-point math, just like the other replies said. Your example of +/- 10.0 with 1e-6 precision would mean multiplying by 1,000,000 and storing the result as a 25-bit signed integer. When you read it back, you just divide. It's a decent saving over a 32-bit float.
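A small sketch of that round trip (the scale factor and helper names are assumptions, not code from the article):

    const SCALE: f64 = 1_000_000.0; // 1e-6 precision
    const BITS: u32 = 25;           // +/- 10.0 * 1e6 fits in a 25-bit signed field

    // Quantize into the low 25 bits of a u64, two's-complement style, so the
    // result can go straight into a 25-bit-wide packed vector.
    fn encode(x: f64) -> u64 {
        let q = (x * SCALE).round() as i64; // in [-10_000_000, 10_000_000]
        (q as u64) & ((1u64 << BITS) - 1)
    }

    // Sign-extend the 25-bit field back to i64, then undo the scaling.
    fn decode(bits: u64) -> f64 {
        let shift = 64 - BITS;
        (((bits << shift) as i64) >> shift) as f64 / SCALE
    }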
Most of the time it will suffice and be the recommended solution (less complexity; no "oops, I thought I only needed 3 bits but at runtime I need 4" situations; one less dependency to review for soundness issues/bugs with every update).
While using Vec<u64> for 3-bit numbers is a pretty strange/misleading example (you could just use a Vec<u8>), my guess is that they used it because many packed vectors operate on u64s internally, which allows values of up to 64 bits.
Anyway, while it often doesn't matter, sometimes storing a u8 where you only need 3-bit numbers (or similar) makes a huge difference (e.g. when, with packing, your application state fits in memory, and without it, it doesn't and starts swapping).
You can do similar things for non-integer values, but it gets more complicated, as you can't just truncate/mask floats the way you can integers. Further complications come from decimal-based precision bounds and the question of whether you need uniform precision over the whole range (floats aren't evenly spaced).
A common solution is to use fixed-point representation, or in general to convert your floats to integers. For example, assuming a uniform precision requirement over your range, you could store a number in the range [-1e7, 1e7] and remember, by context or in code, that there is an implicit division by 1e6. Then you can store it in a 25-bit integer and bit-pack it (ceil(log2(1e7)) + 1 sign bit).
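The bit-width arithmetic from that parenthetical, spelled out (hypothetical helper, not from the article):

    // Bit length of the magnitude, plus one sign bit.
    fn bits_needed_signed(max_abs: u64) -> u32 {
        (64 - max_abs.leading_zeros()) + 1
    }

    fn main() {
        // Values in [-1e7, 1e7] (with an implicit divide-by-1e6) need 25 bits.
        assert_eq!(bits_needed_signed(10_000_000), 25);
    }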
The benchmark does a million random reads. For the FixedVec implementation with bit_width=8, the underlying storage is a Vec<u64>. This means the data is 8x more compact than the Vec<u8> baseline for the same number of elements.
When the random access indices are generated, they are more likely to fall within the same cache lines for the Vec<u64>-backed structure than for the Vec<u8>. Even though Vec<u8> has a simpler access instruction, it suffers more cache misses across the entire benchmark run.
This is common in fields like bioinformatics, search engine indexing, or implementing other succinct data structures. In these areas, the entire game is about squeezing massive datasets into RAM to avoid slow disk I/O. Wasting 6 out of 16 bits means almost 40% of your memory is pure overhead. That can be the difference between a server needing 64GB of RAM versus 100GB.
On top of that, as I mentioned in another comment, packing the data more tightly often makes random access faster than a standard Vec, not slower. Better cache locality means the CPU spends less time waiting for data from main memory, and that performance gain often outweighs the tiny cost of the bit-fiddling instructions.
How deliciously rusty