My Favourite Small Hash Table

Posted 24 days ago · Active 22 days ago

speckx

157 points

36 comments

corsix.org · Tech Discussion · story

Key topics

Data Structures

Hash Tables

AI Performance Analysis

The thread digs into the code and implementation details of the author's favourite small hash table. One debate concerns the decision to store each key-value pair as a single `uint64_t` instead of a struct: some commenters suggest it avoids struct padding, while others point out that the different alignment constraint is the likelier reason. Commenters also share alternative layouts and implementations, along with the trade-offs each one makes for performance.

Snapshot generated from the HN discussion

Discussion Activity

Active discussion. Peak period: 0-6h. Avg / period: 5.6. Comment distribution: 28 data points, based on 28 loaded comments.

Key moments

Story posted: Dec 9, 2025 at 9:47 AM EST (24 days ago)
First comment: Dec 9, 2025 at 11:20 AM EST (2h after posting)
Peak activity: 15 comments in 0-6h (hottest window of the conversation)
Latest activity: Dec 11, 2025 at 5:07 PM EST (22 days ago)

Discussion (36 comments)

Showing 28 comments of 36

clbrmbr

24 days ago

2 replies

Awesome blog! Looking at the code I feel like there’s a kindred soul behind that keyboard, but there’s no About page afaict. Who beeth this mysterious writer?

loeg

24 days ago

He’s done some interesting work with crc32 ~recently:

https://www.corsix.org/content/fast-crc32c-4k

https://github.com/corsix/fast-crc32/

LargoLasskhyfv

24 days ago

https://github.com/corsix

judofyr

24 days ago

4 replies

Is there a specific reason to store the key + value as an `uint64_t` instead of just using a struct like this?

    struct slot {
      uint32_t key;
      uint32_t value;
    };

zimpenfish

24 days ago

1 reply

Maybe trying to avoid struct padding? Although having done a quick test on {arm64, amd64} {gcc, clang}, they all give the same `sizeof` for a struct with 2x`uint32_t`, a struct with a single `uint64_t`, or a bare `uint64_t`.

simonask

24 days ago

1 reply

In any struct where all fields have the same size (and no field type requires higher alignment than its size), it is guaranteed on every (relevant) ABI that there are no padding bytes.

zimpenfish

24 days ago

TIL! Thanks!

nitnelave

24 days ago

2 replies

The alignment constraint is different, which they use to be able to load both as a 64-bit integer and compare to 0 (the empty slot).

You could work around that with a union or casts with explicit alignment constraints, but this is the shortest way to express that.
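A minimal sketch of the single-word layout described above. It assumes the key sits in the high 32 bits and the value in the low 32 bits (the thread does not say which half is which), and the helper names are hypothetical. An all-zero word can stand for "empty" only because of the article's property that key 0, if present, never has value 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout for this sketch: key in the high half, value in the low half. */
static inline uint64_t slot_pack(uint32_t key, uint32_t value) {
    return ((uint64_t)key << 32) | value;
}

static inline uint32_t slot_key(uint64_t slot)   { return (uint32_t)(slot >> 32); }
static inline uint32_t slot_value(uint64_t slot) { return (uint32_t)slot; }

/* One 64-bit load and one compare: empty means key == 0 and value == 0. */
static inline bool slot_is_empty(uint64_t slot) { return slot == 0; }
```

With a `struct` of two `uint32_t` fields the same check would need either two 32-bit compares or a cast that relies on 8-byte alignment the struct does not promise.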

Asooka

24 days ago

1 reply

In that case you can use bit fields in a union:

    union slot {
        uint64_t keyvalue;
        struct {
            uint64_t key: 32;
            uint64_t value: 32;
        };
    };

Since both members of the union are effectively the exact same type, there is no issue. C99: "If the member used to access the contents of a union is not the same as the member last used to store a value, the object representation of the value that was stored is reinterpreted as an object representation of the new type". Meaning, you can initialise keyvalue and that will initialise both key and value, so writing "union slot s = {0}" initialises everything to 0. One issue is that the exact layout for bit fields is implementation-defined, so if you absolutely need to know where key and value are in memory, you will have to read GCC's manual (or just experiment). Another is that you cannot take the address of key or value individually, but if your code was already using uint64_t, you probably don't need to.

nitnelave

24 days ago

You can probably get away with just a union between a 64 bit and 2 32 bit integers.

crest

24 days ago

C has finally gained `alignas`, so you can avoid the union hack, or you could just rely on malloc to always return the maximum alignment anyway.
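The `alignas` route might look like the following in C11, as a sketch rather than anything taken from the article. The struct and helper names are hypothetical, and the zero check goes through `memcpy` to stay clear of aliasing questions.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* alignas on the first member raises the whole struct's alignment to 8. */
struct slot {
    alignas(8) uint32_t key;
    uint32_t value;
};

/* Copy the 8 bytes into an integer and compare once; compilers
   commonly lower this memcpy to a single 64-bit load. */
static inline bool slot_is_empty(const struct slot *s) {
    uint64_t bits;
    memcpy(&bits, s, sizeof bits);
    return bits == 0;
}
```

The struct keeps named fields while gaining the 8-byte alignment that a bare `uint64_t` usually has on 64-bit targets.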

mwkaufma

24 days ago

Or better, just store keys and values in separate arrays, so you can have compact cache lines of just keys when probing.
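A sketch of that structure-of-arrays idea, with hypothetical names. It assumes a power-of-two capacity, key 0 reserved to mean "empty", plain linear probing from `key & mask`, and at least one empty slot so the probe terminates. None of this is the article's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct soa_table {
    uint32_t *keys;    /* probed during lookup: 16 keys per 64-byte cache line */
    uint32_t *values;  /* touched only after a key matches */
    uint32_t mask;     /* capacity - 1, capacity being a power of two */
};

/* Returns true and writes *out if key is present. Key 0 is never stored. */
static bool soa_lookup(const struct soa_table *t, uint32_t key, uint32_t *out) {
    if (key == 0) return false;
    for (uint32_t i = key & t->mask;; i = (i + 1) & t->mask) {
        if (t->keys[i] == key) { *out = t->values[i]; return true; }
        if (t->keys[i] == 0) return false;  /* empty slot ends the probe */
    }
}
```

The trade-off is that a successful lookup touches two cache lines (one in each array), whereas the packed layout gets key and value from the same 8 bytes.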

loeg

24 days ago

No real reason. Slightly terser to compare with zero to find an empty slot.

Aardwolf

24 days ago

3 replies

> The table occupies at most 32 GiB of memory.

This constraint allows simply making a linear array of all the 4 billion values for each key with the key as index, which fits in 16 GiB. Another 500 MiB is enough to have a bit indicating present or not for each.

Perhaps strings as keys and values would give a more interesting example...
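The sizes quoted in this comment can be checked with a few lines of arithmetic. The function names below are hypothetical. The presence bitmap comes to exactly 512 MiB, which the comment rounds to 500 MiB.

```c
#include <stdint.h>

/* Bytes needed for one 32-bit value per possible 32-bit key: 2^32 * 4 = 2^34. */
static uint64_t direct_values_bytes(void) {
    return (UINT64_C(1) << 32) * sizeof(uint32_t);
}

/* Bytes needed for one "present" bit per possible 32-bit key: 2^32 / 8 = 2^29. */
static uint64_t direct_bitmap_bytes(void) {
    return (UINT64_C(1) << 32) / 8;
}
```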

kazinator

24 days ago

[delayed]

24 days ago

> a linear array of all the 4 billion values, with the key as array index, which fits in 16 GiB

The hash table has the significant advantage of having a much smaller minimum size.

> Perhaps text strings as keys and values would give a more interesting example

Keep reading to "If keys and values are larger than 32 bits"

dragontamer

24 days ago

This hashtable implements a multiset. Not (merely) a simple set.

Joker_vD

24 days ago

1 reply

> If the Zba extension is present, sh3add.uw is a single instruction for zero-extending idx from 32 bits to 64, multiplying it by sizeof(uint64_t), and adding it to slots.

Yay, we've got an equivalent of SIB byte but as three (six?) separate opcodes. Well, sub-opcodes.

It's a shame though that Zcmp extension didn't get into RVA23 even as an optional extension.
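For readers unfamiliar with the instruction, the quoted address computation corresponds roughly to the C below. This is a sketch with a hypothetical helper name; whether a compiler actually emits a single `sh3add.uw` for it depends on the target and flags.

```c
#include <stdint.h>

/* Hypothetical helper: address of slots[idx] for a 32-bit index. */
static inline const uint64_t *slot_addr(const uint64_t *slots, uint32_t idx) {
    /* Zero-extend idx to 64 bits, then shift left by 3 (multiply by 8). */
    uint64_t offset = (uint64_t)idx << 3;
    /* Add the byte offset to the base pointer. */
    return (const uint64_t *)((const char *)slots + offset);
}
```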

camel-cdr

23 days ago

1 reply

> It's a shame though that Zcmp extension didn't get into RVA23 even as an optional extension

Zcmp is only for embedded applications without D support.

You wouldn't want an instruction with up to 13 destinations in high performance designs anyways.

If you want load/store pair, we already have that, you can just interpret two adjacent 16-bit load or stores as a single 32-bit instruction.

Joker_vD

22 days ago

1 reply

> You wouldn't want an instruction with up to 13 destinations in high performance designs anyways.

Why not? Code density matters even in high-performance designs although I guess the "millicode routines" can help with that somewhat. Still, the ordering of stores/loads is undefined, and they are allowed to be re-done however many times, so... it shouldn't be onerous to implement? Expanding it into μops during the decoding stages seems straightforward.

camel-cdr

22 days ago

> Expanding it into μops during the decoding stages seems straightforward.

I wouldn't say so, because if you want to be able to crack an instruction into up to N uops, now the second instruction could be placed in any slot from the 2nd to the 1+Nth and you now have to create huge shuffle hardware to support this.

Apple, for example, can only crack instructions that generate up to 3 μops at decode (or before rename); anything beyond that needs to be microcoded and stalls decoding of other instructions.

air217

24 days ago

Used Gemini to explain this. Very interesting and I think Gemini did a good job explaining the Robin hood fairness mechanism

https://gemini.google.com/share/5add15a1c12f

submeta

24 days ago

Great article; really enjoyed it!

One observation: the implementation is a simple in-memory, single-threaded hash table. It’s perfect for its intended use, but concurrent access or very large tables would need further adaptation.

beeforpork

23 days ago

Concerning hash algorithms, for me, it was cuckoo hashing that blew me away. The algorithm is so simple and yet has O(1) guaranteed lookup complexity, i.e., it is better on a theoretical level than other open addressing schemes. It is also versatile, with a configurable trade-off (the number of buckets) between time constant and fill ratio. Most impressive is its simplicity, given that it was invented so recently (2001).

https://en.wikipedia.org/wiki/Cuckoo_hashing
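A minimal sketch of the two-table variant of cuckoo hashing, to show where the guaranteed O(1) lookup comes from. The table size, the two multiplicative hash functions, the kick limit, and the use of key 0 as "empty" are all assumptions of this sketch, not part of any particular library.

```c
#include <stdbool.h>
#include <stdint.h>

#define CUCKOO_SLOTS 8u      /* slots per table (sketch value, power of two) */
#define CUCKOO_MAX_KICKS 16  /* evictions allowed before giving up */

struct cuckoo {
    uint32_t t1[CUCKOO_SLOTS];
    uint32_t t2[CUCKOO_SLOTS];
};

/* Illustrative hashes: multiply, then keep the top 3 bits (0..7). */
static uint32_t h1(uint32_t k) { return (k * 2654435761u) >> 29; }
static uint32_t h2(uint32_t k) { return (k * 2246822519u) >> 29; }

/* Lookup inspects at most two slots, whatever the table contents. */
static bool cuckoo_contains(const struct cuckoo *c, uint32_t k) {
    return k != 0 && (c->t1[h1(k)] == k || c->t2[h2(k)] == k);
}

/* Inserts a nonzero key. Returns 0 on success, otherwise the key left
   without a home (possibly not the one passed in); a real table would
   then grow or rehash and reinsert it. */
static uint32_t cuckoo_insert(struct cuckoo *c, uint32_t k) {
    if (k == 0 || cuckoo_contains(c, k)) return 0;
    for (int kicks = 0; kicks < CUCKOO_MAX_KICKS; kicks++) {
        uint32_t evicted = c->t1[h1(k)];   /* take the slot in table 1 */
        c->t1[h1(k)] = k;
        if (evicted == 0) return 0;
        k = evicted;                       /* re-home the old occupant in table 2 */
        evicted = c->t2[h2(k)];
        c->t2[h2(k)] = k;
        if (evicted == 0) return 0;
        k = evicted;                       /* and bounce back to table 1 */
    }
    return k;
}
```

The cost moves to insertion: a chain of evictions can be long or can cycle, which is why the sketch bounds the number of kicks.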

unixhero

23 days ago

I am dipping my toes in programming. But I can't follow this without graphical representations of the tables as the author (brilliantly) walks through it.

attractivechaos

24 days ago

This is a smart implementation of Robin Hood hashing that I was not aware of. In my understanding, a standard implementation keeps the probe length of each entry. This one avoids that due to its extra constraints. I don't quite understand the following strategy, though:

> To meet property (3) [if the key 0 is present, its value is not 0] ... "array index plus one" can be stored rather than "array index".

If the hash code can take any value in [0,2^32), how do you define a special value for empty buckets? The more common solution is to have a special key (not hash code) for empty slots, which is easier to achieve. In addition, as the author points out, supporting generic keys requires storing 32-bit hash values. With the extra 4 bytes per bucket, it is not clear whether this implementation is better than plain linear probing. The fastest hash table implementations like boost and abseil don't often use Robin Hood hashing.

ww520

24 days ago

This hash table is pretty good. It has at best one memory read if there’s no collision. Randomized key might introduce any level of key collision though.

Panzerschrek

23 days ago

I don't understand the part about key removal. Is it correct to shift values to the left? Can't it break lookups for keys inserted into their natural positions?
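One way to see why shifting left is safe is the general backward-shift deletion for linear probing, sketched below. This is not the article's code: the names, the fixed capacity, the `key & MASK` home slot, and key 0 as "empty" are assumptions, and the table must always keep at least one empty slot. An entry is moved into the hole only if its home slot lies at or before the hole, so an entry sitting in its natural position (distance 0 from home) is never moved. In a Robin Hood ordered table this check commonly simplifies to stopping at the first entry that is in its home slot.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP 8u            /* power of two */
#define MASK (CAP - 1u)

static uint32_t home(uint32_t key) { return key & MASK; }

static void lp_insert(uint32_t *slots, uint32_t key) {
    uint32_t i = home(key);
    while (slots[i] != 0 && slots[i] != key) i = (i + 1) & MASK;
    slots[i] = key;
}

static bool lp_contains(const uint32_t *slots, uint32_t key) {
    for (uint32_t i = home(key); slots[i] != 0; i = (i + 1) & MASK)
        if (slots[i] == key) return true;
    return false;
}

static void lp_remove(uint32_t *slots, uint32_t key) {
    if (key == 0) return;
    uint32_t i = home(key);
    while (slots[i] != key) {
        if (slots[i] == 0) return;          /* key not present */
        i = (i + 1) & MASK;
    }
    /* i is now the hole; walk the rest of the cluster. */
    for (uint32_t j = (i + 1) & MASK; slots[j] != 0; j = (j + 1) & MASK) {
        uint32_t dist_home = (j - home(slots[j])) & MASK; /* how far j is from its home */
        uint32_t dist_hole = (j - i) & MASK;              /* how far j is from the hole */
        if (dist_home >= dist_hole) {       /* home is at or before the hole */
            slots[i] = slots[j];
            i = j;                          /* the hole moves to j */
        }
    }
    slots[i] = 0;
}
```

For example, with keys 1, 9 and 3 in an 8-slot table, removing 1 shifts 9 (home slot 1) back by one, while 3 stays in slot 3 and remains findable.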

8 more comments available on Hacker News

ID: 46205461 · Type: story · Last synced: 12/12/2025, 2:35:30 PM
