Full Unicode Search at 50× ICU Speed with AVX-512
Key topics
Diving into the complexities of Unicode search, a developer recently achieved a 50x speedup using AVX-512, sparking a lively debate about the nuances of case-insensitive matching. Commenters pointed out that ICU's handling of certain characters, like the German "ß" and its uppercase equivalent, can be misleading, with some arguing that "Maß" and "Mass" shouldn't be considered equivalent. The discussion highlights the challenges of localized case insensitivity, with examples like Turkish dotless "i" and German "ß" showing that "correct" handling depends on context. As the conversation unfolds, it becomes clear that Unicode's complexity demands careful consideration, and even the ICU standard, while imperfect, remains a crucial benchmark.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 20h after posting
- Peak period: 52 comments in 18-24h
- Avg / period: 11.1
Based on 78 loaded comments
Key moments
- 01 Story posted: Dec 15, 2025 at 11:42 AM EST (18 days ago)
- 02 First comment: Dec 16, 2025 at 7:37 AM EST (20h after posting)
- 03 Peak activity: 52 comments in the 18-24h window, the hottest period of the conversation
- 04 Latest activity: Dec 18, 2025 at 11:23 AM EST (15 days ago)
Funnily enough, Mass means a one-liter beer.
Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.
Then German-language users from the de-CH region consider Mass the correct way.
Yeah, localization and internationalization are a mess to get right.
In practice you can do pretty well with a universal approach, but it can’t be 100% correct.
However, it is likely that it has never been pronounced "sz", but always "ss" and the habit of writing "sz" for the double consonant may have had the same reason as the writing of "ck" instead of the double "kk".
Unicode avoids "different" and "same", https://www.unicode.org/reports/tr15/ uses phrases like compatibility equivalence.
The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.
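A small check of that asymmetry, using Python's built-in str.casefold() (which implements Unicode full case folding); this is just an editorial illustration, not code from the article or any library discussed here:

```python
# Lowercasing keeps "ß", so it does not equate the two spellings ...
print("Gießen".lower() == "Giessen".lower())        # False
# ... but Unicode case folding maps "ß" -> "ss", so a folded comparison does.
print("Gießen".casefold() == "Giessen".casefold())  # True
# The mapping is one-directional: nothing folds back to "ß", which is why
# "Maßachusetts" stays a plain misspelling rather than a variant of "Massachusetts".
```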
Interestingly enough, this library doesn't provide grapheme cluster tokenization and/or boundary checking, which is one of the most useful primitives for this.
If you’re in control of all data representations in your entire stack, then yes, of course, but that’s hardly ever the case, and different tradeoffs are made at different times (e.g. storage in UTF-8 for efficiency, but in-memory representation in UTF-32 for speed).
First normalizing everything and then comparing normalized versions isn’t as fast.
You’re running the exact same code, but you are more efficient in terms of “I immediately use the data for comparison after converting it”, which means it’s likely either in a register or in the L1 cache already.
Modern processors are generally computing stuff way faster than they can load and store bytes from main memory.
The code which does on-the-fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single-cycle access latency and ridiculously high throughput, like 500GB/sec. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows which fit in the in-core L1D cache. The access cost for L1D is ~5 cycles of latency, with equally high throughput, because many modern processors can load two 64-byte vectors and store one vector every cycle.
StringZilla added full Unicode case folding in an earlier release, and that implementation was state-of-the-art for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.
The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.
I go into the details in the “Ideation & Challenges in Substring Search” section of the article.
Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some of its characters are case-folding targets for older scripts, which currently forces too many fallbacks to the serial path.
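As a rough sketch of the filter-then-verify idea described above (this is not StringZilla's actual algorithm, and the helper name casefold_find is made up for illustration): rule out most positions with a cheap probe, and case-fold only a small window around the surviving candidates instead of normalizing the whole haystack.

```python
def casefold_find(haystack: str, needle: str, max_expansion: int = 3) -> int:
    """Return the index of the first case-insensitive match, or -1."""
    needle_folded = needle.casefold()
    first = needle_folded[0]
    for pos, ch in enumerate(haystack):
        # Cheap probe: a candidate must start with a character whose fold
        # begins with the needle's first folded character. (A real SIMD
        # kernel compares raw bytes at a few anchor positions instead.)
        if not ch.casefold().startswith(first):
            continue
        # Expensive step, run only on candidates: fold a small window,
        # since folds expand by at most a few characters (e.g. "ß" -> "ss").
        window = haystack[pos : pos + len(needle_folded) * max_expansion]
        if window.casefold().startswith(needle_folded):
            return pos
    return -1

print(casefold_find("Die Maße des Zimmers", "MASSE"))  # 4
```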
The article's use of U+212A (the Kelvin symbol) as sample text worried me (I find it confusing when Unicode "shadows" of normal letters exist, and they are of course also dangerous in some cases, since they can be misinterpreted as the letter they look more or less exactly like), so I had to look it up [1].
Anyway, according to Wikipedia the dedicated symbol should not be used:
> However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.
That was comforting, to me. :)
[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography
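For what it's worth, normalization already rewrites this particular character: the Kelvin sign has a singleton decomposition to the ordinary capital K, which a quick check with Python's standard unicodedata module (independent of the article's code) confirms:

```python
import unicodedata

kelvin = "\u212A"                                     # KELVIN SIGN
print(unicodedata.name(kelvin))                       # "KELVIN SIGN"
print(unicodedata.normalize("NFC", kelvin) == "K")    # True: canonical singleton
print(unicodedata.normalize("NFKC", kelvin) == "K")   # True
```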
Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.
If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:
The numbers in the brackets are values on level 1 (base), level 2 (typically used for accents), and level 3 (typically used for case). So they compare as identical under the UCA in almost every case, except if you really need a tiebreaker. Compare e.g.:
which would compare equal to those under a case-insensitive, accent-sensitive collation, but _not_ a case-sensitive one (case-sensitive collations are always accent-sensitive, too).

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today, from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.
StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.
Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.
Do the Go bindings require cgo?
Thanks for the work you do
There are only two "easy" solutions I can see: switch to an N:N threading model or make the C code goroutine-aware. The former would speed up C calls at the expense of slowing down lots of ordinary Go code. Personally, I can still see some scenarios where that's beneficial, but it's pretty niche. The latter would greatly complicate the use of cgo, and defeat one of its core purposes, namely having access to large hard-to-translate C codebases without requiring extensive modifications of them.
However, the "normal" execution model on all of them is using heavyweight native threads, not green threads. As far as I can tell, FFI is either unsupported entirely or has the same kind of overhead as Go and Erlang do, when used from those languages' green threads.
My impression is that the Go FFI has a big overhead because of deliberate choices not to prioritize FFI, since those choices benefit ordinary Go code more?
My point was that there are other GC languages/environments that have good FFI and have somehow been able, all these decades, to create scalable multithreaded applications.
Both Java and Rust flirted with green threads in their early days. Java abandoned them because the hardware wasn't ready yet, and Rust abandoned them because they require a heavyweight runtime that wasn't appropriate for many applications Rust was targeting. And yet, both languages (and others besides) ended up adding something like them in later anyway, albeit sitting beside, rather than replacing, the traditional threading they primarily support.
Your question might just be misdirected; one could view it as operating systems, and not programming languages per se, that screwed it all up. Their threads, which were conservatively designed to be as compatible as possible with existing code, have too much overhead for many tasks. They were good enough for a while, especially as multicore systems started to enter the scene, but their limitations became apparent after e.g. nginx could handle 10x the requests of Apache httpd on the same hardware. This gap would eventually be narrowed, to some extent, but it required a significant amount of rework in Apache.
If you can answer the question why ThreadPoolExecutor exists in Java, then you are halfway to answering the question about why M:N threading exists. The other half is ergonomics, ThreadPoolExecutor is great for fanning out pieces of a single, subdividable task, but it isn't great for handling a stream of unrelated tasks that ebb and flow over time.
But why are you using the case-folding rules and not the collation rules?
""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").
The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """
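Those identities can be reproduced with Python's built-in string methods (str.casefold() implements Unicode full case folding); this is an editorial aside, not ICU or StringZilla code:

```python
sharp_s = "ß"
kelvin = "\u212A"  # KELVIN SIGN, distinct from the Latin letter "K"

print(sharp_s.lower() == "ss".lower())        # False: "ß" lowercases to itself
print(sharp_s.upper() == "ss".upper())        # True:  both uppercase to "SS"
print(sharp_s.casefold() == "ss".casefold())  # True:  "ß" case-folds to "ss"

print(kelvin.upper() == "K".upper())          # False: the Kelvin sign stays itself
print(kelvin.lower() == "K".lower())          # True:  both lowercase to "k"
print(kelvin.casefold() == "K".casefold())    # True
```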
One question I have is: why have a distinct Kelvin sign, and other visually indistinguishable symbols, at all? To make quantities machine-readable (oh, this is not a 100K license plate or a money amount, but a temperature)? Or to make it easier for specialized software to display it in the correct place/units?
IMO the confusing bit is giving it a lower case. It is a symbol, not a letter.
But, I dunno. Why would anybody apply upper or lower case operators to a temperature measurement? It just seems like a nonsense thing to do.
IMO this is somewhere where if we were really doing something, we might as well go all the way and double check the relevant standards, right? The filesystem should accept some character set for use as names, and if we’re generating a name inside our program we should definitely find a character set that fits inside what the filesystem expects and that captures what we want to express… my gut says upper case Latin K would be the best pick if we needed to most portably represent Kelvin in a filename on a reasonably modern, consumer filesystem.
> The Kelvin sign (K, Unicode U+212A) is included as a distinct character in certain legacy East Asian character encodings, including those based on the South Korean national standard KS X 1001 (formerly KS C 5601), which influenced IBM code pages supporting Korean text. ... The Kelvin sign was added to support distinct representation of the kelvin unit in scientific contexts, possibly reflecting typographic conventions where a stylized or script-like "K" distinguished the unit from the ordinary letter "K".
- https://grok.com/share/bGVnYWN5_62023ec2-efb0-4897-b8cf-6f71...
However, those symbols don't have lowercase variants. Moreover, a lowercase k means kilo-, not a «smaller Kelvin».
To maximally confuse things, I suggest we start using little k alone to resolve another annoying unit issue: let’s call 1 kilocalorie “k.”
To allow round-tripping.
Unicode did not win by being better than all previously existing encodings, even though it clearly was.
It won by being able to coexist with all those other encodings for years (decades) while the world gradually transitioned. That required the ability to take text in any of those older encodings and transcode it to Unicode and back again without loss (or "gain"!).
Unicode has the goal of being a 1:1 mapping for all other character encodings. Usually, weird things like this exist so there can be a 1:1 reversible mapping to some ancient character encoding.
> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:
Python's ICU support is based on ICU4C. The Rust ICU "bindings" are actually a new implementation called ICU4X.
Also a very cool and approachable guy.
(Best wishes if you're reading this.)