Unicode Footguns in Python
Key topics
The article walks through pitfalls of handling Unicode in Python; the discussion centers on grapheme clusters, normalization forms, ligatures, casefolding, and the bytes/str split.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 13d after posting
Peak period: 15 comments (Days 13-14)
Avg / period: 6.7
Based on 20 loaded comments
Key moments
- Story posted: Oct 23, 2025 at 8:51 PM EDT (3 months ago)
- First comment: Nov 5, 2025 at 7:31 PM EST (13d after posting)
- Peak activity: 15 comments in Days 13-14 (hottest window of the conversation)
- Latest activity: Nov 10, 2025 at 11:44 AM EST (about 2 months ago)
Fortunately, you can usually outsource this to a UI toolkit that handles it.
(Fonts may disagree on supported ligatures, for instance, or not support an emoji grapheme cluster and fall back to displaying its component emoji separately, or lay out multiple graphemes in two dimensions for a given language [CJK, Arabic, even math symbols, even before factoring in whether a font layout engine supports the optional but cool UnicodeMath [0]], or make any number of other tweaks and distinctions between encoding a block of text and displaying a block of text.)
[0] https://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.1....
However, ligatures are part of grapheme clustering, and the two aren't entirely independent: examples above and elsewhere include ligatures that have their own Unicode encodings (e.g., the ﬁ ligature, U+FB01). Ligatures have been part of character encodings and have affected grapheme clustering since EBCDIC (from which Unicode directly inherits a lot of its single-codepoint Latin ligatures). There have been far too many debates over whether normal forms should encode ligatures, always decompose them, or take some other strategy. The normal forms themselves (NFC and NFD, NFKC and NFKD) are embedded as a step in several of the standard grapheme clustering algorithms.
Some people think ligatures should be left entirely to fonts and that Unicode ligatures are a relic of the IBM bitmap-font past. Some font designers think it would be nice if some ligatures were more directly encoded and the normal forms could do a first pass for them. Unicode has had portions of the specification on both sides of that argument. These days it generally leans away from the backward-compatibility ligatures and normalizes them toward decomposition, especially in locales like "en-US", but not always and not in every locale. (All of that is before you start considering languages that are almost nothing but long sequences of ligatures, including but not limited to Arabic.)
You can't do grapheme clustering and entirely ignore ligatures. You certainly can't count "display width" or "display position" without paying attention to ligatures, which was the point of bringing them up alongside mentions of grapheme clustering length and why it is insufficient for "display position" and "display width".
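As a rough illustration of how these different notions of "length" diverge, here is a minimal sketch. It assumes the third-party regex package (whose \X pattern implements UAX #29 extended grapheme clusters); the sample strings are purely illustrative.

    import regex  # third-party; its \X pattern matches extended grapheme clusters

    samples = [
        "e\u0301",                                     # "é" as e + combining acute
        "\U0001F468\u200D\U0001F469\u200D\U0001F467",  # family emoji ZWJ sequence
        "\uFB01fteen",                                 # starts with the fi-ligature code point
    ]
    for s in samples:
        clusters = regex.findall(r"\X", s)
        print(f"{s!r}: {len(s.encode('utf-8'))} bytes, "
              f"{len(s)} code points, {len(clusters)} grapheme clusters")
    # None of these counts is the number of columns a renderer will use:
    # the ligature is one code point, yet a font may also fuse a plain "fi"
    # into one glyph, so display width depends on the font and shaping engine.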
I’m puzzled about the assertion of EBCDIC having Latin ligatures because I never encountered them in my IBM mainframe days and a casual search didn’t turn any up. The only 8-bit encoding that I was aware of that included any ligatures was Mac extended ASCII which included fi and fl (well, putting aside TeX’s 7-bit encoding which was only used by TeX and because of its using x20 to hold the acute accent was unusable by other applications which expected a blank character in that spot).
The question of dealing with ligatures for non-Latin scripts generally came down to compatibility with existing systems more than anything else. This is why, for example, the Devanagari and Thai encodings, which both have vowel markings that can occur before, after, or around a consonant, handle the sequence of input differently. Assembly of jamo into characters in Hangul is another case where, theoretically, one could have all the displayed characters handled through ligatures and not encode the syllables directly (which is, in fact, how most modern Korean is encoded, as evidenced by Korean Wikipedia), but because the original Korean font encoding has all the syllables encoded in it² those syllables are part of Unicode as well.
But the bottom line is that you seem to be confusing terminology here. You can very much do grapheme clustering without paying attention to ligatures, and rules for things like normalized composition/decomposition are entirely independent of grapheme clustering (the K-forms of both manage transitions like ¹ to 1 or ﬁ to fi and represent a one-way transition).
⸻
1. I wrote a Rust library implementing this and I’ve been following Unicode since it was a proposal from Microsoft and Apple in the 90s, so I know a little about this.
2. I think the more accurate term would be “most” of the syllables as I’ve seen additional syllables added to Unicode over time.
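To make the "one-way transition" point concrete, here is a small standard-library sketch; the specific code points are just examples.

    import unicodedata

    print(unicodedata.normalize("NFKC", "\uFB01"))  # 'fi'  -- the ligature decomposes
    print(unicodedata.normalize("NFKC", "\u00B9"))  # '1'   -- superscript one flattens
    print(unicodedata.normalize("NFC", "fi"))       # 'fi'  -- never re-composed into the ligature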
> I’m puzzled about the assertion of EBCDIC having Latin ligatures because I never encountered them in my IBM mainframe days and a casual search didn’t turn any up. The only 8-bit encoding that I was aware of that included any ligatures was Mac extended ASCII which included fi and fl (well, putting aside TeX’s 7-bit encoding which was only used by TeX and because of its using x20 to hold the acute accent was unusable by other applications which expected a blank character in that spot).
EBCDIC had multiple "code pages" to handle various international encodings. (DOS and Windows inherited the "code page" concept from IBM mainframes.) Of the code pages EBCDIC supported in the mainframe era, many were "Publishing" code pages intended for "pretty printing" text on printers that presumably supported primarily bitmap fonts. A low-ID example is IBM Code Page (CCSID) 361: https://web.archive.org/web/20130121103957/http://www-03.ibm...
You can see that this code page includes the ligatures fi, fl, ff, ffi, and ij, among others even less common in English text.
Most but not all of the IBM Code Pages were a part of the earliest Unicode standards encoding efforts.
Why endorse a bad winner when you can instead make the trade-offs more obvious and give programmers a better chance of asking for the right information, rather than using the wrong information because it is the default and assuming it is correct?
By the way, try looking up the standardized Unicode casefolding algorithm sometime; it is a thing to behold.
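For a taste of why casefolding is its own algorithm rather than just lower(), a minimal sketch using only built-in str methods:

    print("straße".lower())      # 'straße'  -- lower() leaves ß alone
    print("straße".casefold())   # 'strasse' -- casefold() expands ß to ss
    print("STRASSE".casefold() == "straße".casefold())  # True
    print("STRASSE".lower() == "straße".lower())        # False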
In particular, the differences between NFC and NFKC are "fun", and rather meaningful in many cases. E.g., NFC says that "fi" and "ﬁ" are different and not equal, though the latter is just a ligature of the former and is literally identical in meaning; this applies to ﬃ too. Half-width vs. full-width Chinese characters are also "different" under NFC. NFKC makes those examples equal, though... at the cost of saying "2⁵" is equal to "25".
language is fun!
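A minimal standard-library sketch of those differences (using full-width Latin "Ａ" as a stand-in for the width example):

    import unicodedata as ud

    def same(form, a, b):
        return ud.normalize(form, a) == ud.normalize(form, b)

    print(same("NFC", "\uFB01", "fi"))    # False -- NFC keeps the ligature distinct
    print(same("NFKC", "\uFB01", "fi"))   # True  -- NFKC decomposes it
    print(same("NFC", "\uFF21", "A"))     # False -- full-width 'Ａ' vs ASCII 'A'
    print(same("NFKC", "\uFF21", "A"))    # True
    print(same("NFKC", "2\u2075", "25"))  # True  -- the cost: "2⁵" now equals "25"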
With `bytes` it was obvious that byte length was not the same as $whatever length, and that was really the only semi-common bug (and was mostly limited to English speakers who are new to programming). All other bugs come from blindly trusting `unicode` whose bugs are far more subtle and numerous.
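The classic mismatch the parent is alluding to, as a minimal built-ins-only sketch:

    s = "naïve"
    b = s.encode("utf-8")
    print(len(s))                                   # 5 code points
    print(len(b))                                   # 6 bytes -- 'ï' needs two bytes in UTF-8
    print(b[:3].decode("utf-8", errors="replace"))  # 'na�' -- slicing bytes can split a character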
https://docs.python.org/3/library/stdtypes.html#binary-seque...
"The core built-in types for manipulating binary data are bytes and bytearray."
Additionally, Python 2 supported a much richer set of operations on its bytes type than Python 3 does.
Python 3 introduced the bytes type that you like so much. It sounds like you would enjoy a Python 4 with only a bytes type and no string type, and presumably with a strong convention to only use UTF-8 or with required encoding arguments everywhere.
In both Python 2 and Python 3, you still have to learn how to handle grapheme clusters carefully.
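One small example of the kind of care meant here; the string is just illustrative:

    s = "cafe\u0301"  # "café" spelled with a combining acute accent
    print(len(s))     # 5 code points, though it reads as 4 characters
    print(s[::-1])    # naive reversal detaches the accent from its base letter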
UnicodeDecodeError