Engineering a Fixed-Width Bit-Packed Integer Vector in Rust
Posted 4 months ago · Active 3 months ago
lukefleed.xyz · Tech · story
Tone: calm, positive · Debate: 40/100
Key topics
Rust
Data Compression
Bit-Packing
Performance Optimization
The article discusses implementing a fixed-width bit-packed integer vector in Rust, and the comments explore various aspects of the implementation, including performance, edge cases, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 6h after posting
Peak period: 6 comments (18-20h)
Avg / period: 2.3
Comment distribution: 21 data points
Based on 21 loaded comments
Key moments
- Story posted: Sep 24, 2025 at 11:17 AM EDT (4 months ago)
- First comment: Sep 24, 2025 at 5:13 PM EDT (6h after posting)
- Peak activity: 6 comments in 18-20h (hottest window of the conversation)
- Latest activity: Sep 25, 2025 at 5:30 PM EDT (3 months ago)
ID: 45361578 · Type: story · Last synced: 11/20/2025, 12:44:40 PM
Read the primary article or dive into the live Hacker News thread for the full context.
In the case of a 63-bit width, reading the value at offset 1 requires two reads: it needs 1 bit from the first byte, then 56 bits from the next 7 bytes, and finally 6 bits from the 9th byte. Nine bytes cannot be loaded in a single u64 load.
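For reference, a minimal sketch of that two-read path (not the crate's actual code; it assumes a Vec<u64> backing store and bit widths of at most 64):

    // Illustrative helper: read the `index`-th value of `bit_width` bits from a
    // packed Vec<u64>, issuing a second word load only when the value straddles
    // a 64-bit boundary.
    fn get(words: &[u64], bit_width: usize, index: usize) -> u64 {
        debug_assert!(bit_width >= 1 && bit_width <= 64);
        let bit_pos = index * bit_width;
        let word_idx = bit_pos / 64;
        let bit_off = bit_pos % 64;
        let mask = if bit_width == 64 { u64::MAX } else { (1u64 << bit_width) - 1 };

        let lo = words[word_idx] >> bit_off;
        if bit_off + bit_width <= 64 {
            lo & mask // fast path: the value lives entirely in one word
        } else {
            // slow path: the remaining high bits live in the next word
            let hi = words[word_idx + 1] << (64 - bit_off);
            (lo | hi) & mask
        }
    }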
I already patched it, thanks again.
Instead of doing stuff like (word >> bit_offset) & self.mask, in C or C++ I usually write _bextr_u64, or Bmi1.X64.BitFieldExtract in modern C#. Note, however, that these intrinsics compile into 2 instructions, not one, because the CPU instruction takes its start/length arguments from the lower/higher bytes of a single 16-bit number.
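For what it's worth, Rust exposes the same intrinsic in core::arch; a rough sketch of the two spellings (the names word, bit_offset, and width are illustrative, not the article's API):

    // Portable version: shift, then mask.
    fn extract_shift_mask(word: u64, bit_offset: u32, width: u32) -> u64 {
        let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
        (word >> bit_offset) & mask
    }

    // BMI1 version: still two instructions, because the start/length pair has
    // to be packed into BEXTR's 16-bit control operand first.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "bmi1")]
    unsafe fn extract_bextr(word: u64, bit_offset: u32, width: u32) -> u64 {
        core::arch::x86_64::_bextr_u64(word, bit_offset, width)
    }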
I chose u128 instead of a [u64; 2] array because of the boundary-crossing issue I saw would happen, which u128 lets me avoid (rough sketch below).
I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?
Edit: https://godbolt.org/z/rrhW6T7Mc
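A rough sketch of that u128 approach, as I read it (illustrative only, not the crate's actual code):

    // Widen two adjacent words into a u128 so a value of up to 64 bits can
    // never straddle the loaded region.
    fn get_via_u128(words: &[u64], bit_width: usize, index: usize) -> u64 {
        debug_assert!(bit_width >= 1 && bit_width <= 64);
        let bit_pos = index * bit_width;
        let word_idx = bit_pos / 64;
        let lo = words[word_idx] as u128;
        // The last value may not have a following word, so read 0 in that case.
        let hi = words.get(word_idx + 1).copied().unwrap_or(0) as u128;
        let both = lo | (hi << 64);
        let mask = (1u128 << bit_width) - 1;
        ((both >> (bit_pos % 64)) & mask) as u64
    }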
Not as far as I'm aware, but I think your use case is handled rather well by the u64 version. Instead of u128, use an array of two uint64 integers and pack the length into the unused high bits of one of them.
Here's example C++: https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2; the unpack is scalar code based on that BMI1 instruction.
Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh It might be faster if you have a lot of these packed vectors, extracting numbers in a tight loop, and the s_extractElements lookup table stays in L1D cache.
P.S. I've only tested that code a couple of times; there might be bugs.
Using the intrinsic directly would also kill portability. I'd need #[cfg] for ARM and runtime checks for older x86 CPUs, which adds complexity for a tiny, if any, gain. The current code just lets the compiler pick the best instruction on any architecture.
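As a rough illustration of that complexity, a hypothetical runtime-dispatching version (not code from the crate) would look something like this:

    // Use BEXTR where the CPU supports it, otherwise fall back to the plain
    // shift-and-mask that the compiler can optimize on any architecture.
    fn extract(word: u64, bit_offset: u32, width: u32) -> u64 {
        #[cfg(target_arch = "x86_64")]
        {
            if std::is_x86_feature_detected!("bmi1") {
                // SAFETY: guarded by the runtime feature check above.
                return unsafe { core::arch::x86_64::_bextr_u64(word, bit_offset, width) };
            }
        }
        let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
        (word >> bit_offset) & mask
    }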
Vec<u8> may well be the right call for most use cases. This library, however, is for when even 8 bits compared to 4 is too much. Another example: if all your values fit in 9 bits, you'd be forced to use a Vec<u16> and waste 7 bits for every single number. With this structure, you just use 9 bits. That's almost a 2x space saving, and at scale that makes a difference.
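A quick back-of-the-envelope check of that claim (illustrative numbers, assuming a Vec<u64>-backed layout):

    // Bytes needed to store `len` values of `bit_width` bits, rounded up to the
    // whole 64-bit words a Vec<u64>-backed structure would allocate.
    fn packed_bytes(len: usize, bit_width: usize) -> usize {
        (len * bit_width).div_ceil(64) * 8
    }

    fn main() {
        let n = 1_000_000;
        println!("Vec<u16>:     {} bytes", n * 2);              // 2,000,000
        println!("9-bit packed: {} bytes", packed_bytes(n, 9)); // 1,125,000
    }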
For floats, you'd use fixed-point math, just like the other replies said. Your example of +/- 10.0 with 1e-6 precision would mean multiplying by 1,000,000 and storing the result as a 25-bit signed integer. When you read it back, you just divide. It's a decent saving over a 32-bit float.
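A small sketch of that round trip (the scale factor and helper names are assumptions, not code from the article):

    const SCALE: f64 = 1_000_000.0; // 1e-6 precision
    const BITS: u32 = 25;           // +/- 10.0 * 1e6 fits in a 25-bit signed field

    // Quantize into the low 25 bits of a u64, two's-complement style, so the
    // result can go straight into a 25-bit-wide packed vector.
    fn encode(x: f64) -> u64 {
        let q = (x * SCALE).round() as i64; // in [-10_000_000, 10_000_000]
        (q as u64) & ((1u64 << BITS) - 1)
    }

    // Sign-extend the 25-bit field back to i64, then undo the scaling.
    fn decode(bits: u64) -> f64 {
        let shift = 64 - BITS;
        (((bits << shift) as i64) >> shift) as f64 / SCALE
    }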
Most of the time it will suffice and be the recommended solution (less complexity; no "oops, I thought I only needed 3 bits but at runtime I need 4" situations; one less dependency to review for soundness issues/bugs with every update).
While using Vec<u64> for 3-bit numbers is a pretty strange/misleading example (you could just use a Vec<u8>), my guess is that they used it because many packed vectors operate on u64s internally, which allows values of up to 64 bits.
Anyway, while it often doesn't matter, sometimes storing a u8 where you only need 3-bit numbers (or similar) makes a huge difference (e.g. when, with packing, your application state fits in memory, and without it, it doesn't and starts swapping).
You can do similar things for non-integer values, but it gets more complicated, as you can't just truncate/mask floats the way you can integers. Further complications come from decimal-based precision bounds and the question of whether you need uniform precision over the whole range (floats aren't evenly spaced).
A common solution is to use fixed-point representation, or in general to convert your floats to integers. For example, assuming a uniform precision requirement over your range, you could store a number in the range [-1e7, 1e7] and remember, by context or in code, that there is an implicit division by 1e6. Then you can store it in a 25-bit integer and bit-pack it (ceil(log2(1e7)) + 1 sign bit).
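The bit-width arithmetic from that parenthetical, spelled out (hypothetical helper, not from the article):

    // Bit length of the magnitude, plus one sign bit.
    fn bits_needed_signed(max_abs: u64) -> u32 {
        (64 - max_abs.leading_zeros()) + 1
    }

    fn main() {
        // Values in [-1e7, 1e7] (with an implicit divide-by-1e6) need 25 bits.
        assert_eq!(bits_needed_signed(10_000_000), 25);
    }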
The benchmark does a million random reads. For the FixedVec implementation with bit_width=8, the underlying storage is a Vec<u64>. This means the data is 8x more compact than the Vec<u8> baseline for the same number of elements.
When the random access indices are generated, they are more likely to fall within the same cache lines for the Vec<u64>-backed structure than for the Vec<u8>. Even though Vec<u8> has a simpler access instruction, it suffers more cache misses across the entire benchmark run.
This is common in fields like bioinformatics, search engine indexing, or implementing other succinct data structures. In these areas, the entire game is about squeezing massive datasets into RAM to avoid slow disk I/O. Wasting 6 out of 16 bits means almost 40% of your memory is pure overhead. That can be the difference between a server needing 64GB of RAM versus 100GB.
On top of that, as I mentioned in another comment, packing the data more tightly often makes random access faster than a standard Vec, not slower. Better cache locality means the CPU spends less time waiting for data from main memory, and that performance gain often outweighs the tiny cost of the bit-fiddling instructions.
How deliciously rusty