Building an Efficient Hash Table in Java
Key topics
Delving into the intricacies of Java hash tables, a recent article sparked a lively discussion on optimization techniques, with commenters dissecting the author's approach and sharing insights on open addressing, SIMD, and bloom filters. The conversation revealed a mix of praise and puzzlement, with some commenters, like EmberTwin, questioning why SWAR outperformed SIMD, prompting the author, bluuewhale, to clarify that this is a topic for their next post. Meanwhile, others, like yardstick, appreciated the core idea behind the hash table implementation, likening it to a bloom filter, and offering constructive feedback on article structure. As the discussion unfolded, it became clear that the thread is relevant now due to its timely exploration of performance optimization in a popular programming language.
Snapshot generated from the HN discussion
Discussion Activity
Active discussionFirst comment
1h
Peak period
11
Day 3
Avg / period
4.2
Key moments
- 01Story posted
Dec 13, 2025 at 12:41 PM EST
21 days ago
Step 01 - 02First comment
Dec 13, 2025 at 2:01 PM EST
1h after posting
Step 02 - 03Peak activity
11 comments in Day 3
Hottest window of the conversation
Step 03 - 04Latest activity
Dec 23, 2025 at 3:38 AM EST
11 days ago
Step 04
Generating AI Summary...
Analyzing up to 500 comments to identify key contributors and discussion patterns
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
The Java SIMD API used here must not result in using actual SIMD machine code.
Thanks for the great point. This is actually the main topic I'm working on for the next post.
It's understandable to expect SIMD to win purely because it's wider, but in practice the end-to-end cost matters more than raw VL.
With the Java Vector API, the equality compare can indeed be compiled down to real SIMD instructions, yet the overall path may still lose if turning a VectorMask into a scalar bitmask is expensive. The "best case" is a vector compare followed by a single instruction that packs the result into a bitmask; if the JIT doesn't hit that lowering, it can fall back to extra work such as materializing the mask and repacking it in scalar code. From what I can tell, they have been working on intrinsic for VectorMask.toLong (https://bugs.openjdk.org/browse/JDK-8273949).
Also, SWAR avoids that entire transition by staying in GPR and producing the bitmask directly with a small, predictable sequence of bit operations. For small fixed-size probes, that simplicity often outweighs SIMD's theoretical advantage, and on some CPUs heavier vector usage can even come with frequency effects that further narrow the gap. So, I guess the more likely explanation isn't that the Vector API never uses SIMD.
I'll take a closer look at how it compiles down to machine code and share what I find.
P.S. Benchmark results can vary a lot depending on the environment (OS, CPU and JDK/JIT version and flags), so it’s also possible the benchmark picture changes on a different setup.
You've nailed the core idea. I'll tweak the structure a bit to make the concepts clearer up front.
@dang perhaps auto-flagged as was top of front-page for a few minutes then disappeared, shame as a fun read.
If we just wrap every API call with synchronized, I'd expect heavy contention (some adaptive spinning and then OS-level park/unpark), so it'll likely bottleneck pretty quickly.
Doing something closer to ConcurrentHashMap (locking per bin rather than globally) could mitigate that.
For the open-addressing table itself, I'm also considering adding lightweight locking at the group level (e.g., a small spinlock per group) so reads stay cheap and writes only lock a narrow region along the probe path.
I'm still figuring out Hugo, so I'm not sure I can fix it right away, but I'll take a look.
For fun, I swapped it for a bitwise calculation from Hacker’s Delight. It benchmarked surprisingly well, multiple percent improvement at various map sizes (even after JIT warmup). Swapping a 3 line loop for a one line bit twiddle wasn’t the most beautiful change, but I was surprised I could beat the original with such a trivial change. It also made me wonder why it hadn’t already been done.
I didn’t have any interest in trying to push it back upstream, but it was a fun to do.
I honestly love hearing about these hidden gem micro-optimizations.
Thanks a lot for sharing!
The problem is that for a regular hash table, eventually keys must be compared because two keys could have the same hash value. So maybe we relegate key comparisons only in cases when we encounter a collision.
The only case where this can work is when the set of keys that could ever be looked up is static. Otherwise we could always get a lookup for a new, non-existent key that creates a new collision and return the wrong value. Still, there are cases where this could be useful, e.g. looking up enum values, or a frozendict implementation. Something like minimal perfect hashing, but simpler.
So I came up with this silly project and benchmarked it in Java, C# and a hacky CPython "extension": https://github.com/kunalkandekar/Picadillo
A very micro optimization, but turned out there was a 5% - 30% speedup on a PC of that era.
In SwissTable, the main difference is that you only get a probabilistic signal of presence (kind of like a Bloom filter), but the core idea feels very similar.
I wanted to share a follow-up to this post. https://bluuewhale.github.io/posts/further-optimizing-my-jav...
This time I went back with a profiler and optimized the actual hot path.
A huge chunk of time was going to Objects.equals() because of profile pollution / missed devirtualization.
After fixing that, the next bottleneck was ARM/NEON “movemask” pain (VectorMask.toLong()), so I tried SWAR… and it ended up faster (even on x86, which I did not expect).
Can see the rest of that file and the adjacent `raw_hashtable.h` for the rest of the SwissTable-like implementation and `hashing.h` for the hash function.
FWIW, it consistently out-performs SwissTable in some respects, but uses a weaker but faster hash function that is good enough for the hash table, but not good for other use cases.
Don’t worry, my New Year’s Resolution will be to stop complaining and just flag.