UTF-8 History (2003)
doc.cat-v.org · Tech · story
Posted 4 months ago · Active 4 months ago
Tone: calm, positive · Debate: 40/100
Key topics: UTF-8, Character Encoding, Computing History
The story of UTF-8's creation is shared, sparking discussion on its history, impact, and the factors that contributed to its adoption, as well as related topics like the influence of language on computing.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment after 3 days · Peak period: 26 comments in the 72-84h window · Average per period: 12
Comment distribution: 36 data points (based on 36 loaded comments)
Key moments
- Story posted: Sep 13, 2025 at 4:56 AM EDT (4 months ago)
- First comment: Sep 16, 2025 at 1:06 PM EDT (3 days after posting)
- Peak activity: 26 comments in the 72-84h window (the hottest stretch of the conversation)
- Latest activity: Sep 20, 2025 at 1:03 AM EDT (4 months ago)
There was a nine-month window between the invention of UTF-8 and the first release of Windows NT (Sep 1992 to Jul 1993).
But OK, fine: UTF-8 didn't really become popular until the web became popular.
But then missing the other opportunity to make the transition, the release of the first consumer version of the NT line (Windows XP) nearly a decade later, is inexcusable.
On the other hand, Cyrillic and Greek are two examples of short alphabets that could be combined with ASCII into a single-byte encoding for countries like Greece, Bulgaria, and Russia. For those locales, switching to UTF-8 meant extra bytes for every character in the local language, and thus higher storage, memory, and bandwidth requirements across the board. So non-Unicode encodings stuck around there for a lot longer.
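A minimal sketch of that size difference, assuming CP1251 as the legacy single-byte Cyrillic encoding (the byte values are the standard CP1251 and UTF-8 encodings of the word "мир"):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "мир" ("world"): three Cyrillic letters.
       In CP1251 each letter is one byte; in UTF-8 each needs two. */
    const char *utf8   = "\xD0\xBC\xD0\xB8\xD1\x80"; /* UTF-8 bytes for "мир" */
    const char *cp1251 = "\xEC\xE8\xF0";             /* CP1251 bytes for "мир" */

    printf("UTF-8 : %zu bytes\n", strlen(utf8));     /* prints 6 */
    printf("CP1251: %zu bytes\n", strlen(cp1251));   /* prints 3 */
    return 0;
}
```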
Hey team, we're working to release an ambitious new operating system in about 6 months, but I've decided we should burn the midnight oil to rip out all of the text handling we worked on and redo it with something that was just introduced at a conference...
Oh, and all the folks building their software against the beta for the last few months? Well, they knew what they were getting themselves into; after all, it is a beta (https://books.google.com/books?id=elEEAAAAMBAJ&pg=PA1#v=onep...)
As for Windows XP, so now we're going to add a third version of the A/W APIs?
More background: https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...
https://news.ycombinator.com/item?id=45265240
Changing an in-progress system design to a similar chip that was much less expensive ($25 at the convention vs $175 for a 6800, dropped to $69 the month after the convention) is a leap of faith, but the difference in cost is obvious justification, and the Apple I had no legacy to work with.
It would have been great if Windows NT could have picked up UTF-8, but it was a bigger leap and the benefit wasn't as clear; variable-width code points are painful in a lot of ways, and 16 bits per code point seemed like it would be enough for anybody.
Potential source: https://ia802804.us.archive.org/13/items/os2developmentrelat...
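For context on why 16 bits per code point didn't hold up: any code point above U+FFFF has to be represented in UTF-16 as a surrogate pair, making UTF-16 variable-width after all. A small illustrative sketch (the chosen code point is just an example):

```c
#include <stdio.h>
#include <stdint.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair. */
int main(void) {
    uint32_t cp = 0x1F600;                 /* GRINNING FACE emoji, outside the BMP */
    uint32_t v  = cp - 0x10000;            /* 20 bits to spread over two 16-bit units */
    uint16_t hi = 0xD800 | (v >> 10);      /* high surrogate */
    uint16_t lo = 0xDC00 | (v & 0x3FF);    /* low surrogate  */

    /* prints: U+1F600 -> 0xD83D 0xDE00 */
    printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
    return 0;
}
```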
I think the actual OS was still codepage-based (with the "multibyte" versions for things like Eastern languages being pretty much forks), and Windows 95 wasn't really much different.
16-bit IE brings its own MSNLS.DLL for converting the various codepages to the ACP (active codepage) on Win3.1x,
and Win9x also works mainly in the ANSI codepage, with some kernel-side Unicode support.
Furthermore, the development of Windows NT had already begun in 1989 (then planned as OS/2 3.0) and proceeded in parallel with the finalization of Unicode 1.0 and its eventual adoption by ISO, which led to Unicode 1.1 and ISO/IEC 10646-1:1993. It was natural to adopt that standardization effort.
Once established, the 16-bit encoding used by Windows NT was ingrained in kernel and userspace APIs, notably the BSTR string type used by Visual Basic and COM, and importantly in NTFS. Adopting UTF-8 for Windows XP would have provided little benefit at that point, while causing a lot of complications. For backwards compatibility, something like WTF-8 would effectively have been required, and there would have been an additional performance penalty for converting back and forth between the existing WCHAR/BSTR APIs and serializations (see the sketch after the references below). It wasn't remotely a viable opportunity for such a far-reaching change.
Lastly, my recollection is that UTF-8 only became really widespread on the web some time after the release of Windows XP (2001), maybe roughly around Vista.
[0] https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#...
[1] "Internationalization and character set standards", September 1993, https://dl.acm.org/doi/pdf/10.1145/174683.174687
Take the final (second) form, where the use of multiple placeholder letters was eliminated in favor of a single "v" to indicate the bits of the encoded character.
I also chuckle at the initial implementation's note about the desire to delete support for the 4/5/6-byte versions. Someone was still laboring under the UCS/UTF-16 delusion that 16 bits were sufficient.
The RFC that restricted it: https://www.rfc-editor.org/rfc/rfc3629#page-11
A UTF-8 playground: https://utf8-playground.netlify.app/
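As a companion to the bit patterns above, here is a small encoder sketch following the "v"-bit layout with the RFC 3629 restriction to four bytes; the function name is mine, not something from the original document:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8 (1-4 bytes, max U+10FFFF per RFC 3629).
   Returns the number of bytes written, or 0 for an invalid code point. */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                      /* 0vvvvvvv */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {              /* 110vvvvv 10vvvvvv */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {             /* 1110vvvv 10vvvvvv 10vvvvvv */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {           /* 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;  /* the original 5- and 6-byte forms were dropped by RFC 3629 */
}

int main(void) {
    unsigned char b[4];
    int n = utf8_encode(0x1F600, b);       /* a code point that needs all four bytes */
    for (int i = 0; i < n; i++) printf("%02X ", b[i]);
    printf("\n");                          /* prints: F0 9F 98 80 */
    return 0;
}
```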
That simplicity made early character encodings like 7-bit ASCII feasible, which in turn lowered the hardware and software barriers for building computers, keyboards, and programming languages. In other words, the Latin alphabet's compactness may have given English-speaking engineers a "low-friction" environment for both computation and communication. And now it's the lingua franca for most computing, on top of which support for other languages is built.
It's very interesting to think about how written scripts give different cultures advantages in computing and elsewhere. I wonder, for instance, how scripts and AI interact: LLMs trained on Chinese are working with a high-density orthography backed by a stable dataset spanning some 3,500 years.
A zero byte is ^@ because 0x00 + 64 = '@'. The same pattern holds for all C0 control codes.
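A tiny sketch of that rule applied to all 32 C0 codes:

```c
#include <stdio.h>

/* Caret notation: each C0 control code 0x00..0x1F is displayed as '^' plus
   the character 64 positions later, so 0x00 -> ^@, 0x01 -> ^A, ..., 0x1B -> ^[. */
int main(void) {
    for (int c = 0x00; c <= 0x1F; c++)
        printf("0x%02X -> ^%c\n", c, c + 64);
    return 0;
}
```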
https://en.wikipedia.org/wiki/Nabataean_script
People like Torres Quevedo exist everywhere, as long as there aren't anti-scientific people dragging education down to the level of the 18th century and earlier. I am not kidding: under Franco it was pure creationism, by law. If you said something against religion, you were either fined, jailed, or beaten up.
The English language has diacritics (see words like naïve, façade, résumé, or café). It's just that the English language uses them so rarely that they are largely dropped in any context where they are hard to introduce. Note that this adaptation to lack-of-diacritic can be found in other Latin script languages: French similarly is prone to loss-of-diacritic (especially in capital letters), whereas German has alternative spelling rules (e.g., Schroedinger instead of Schrödinger).
Early character encoding was 6-bit ASCII, with no lower case.
https://prestigediner.com/
While I love the Hacker News purity (it takes me back to Usenet), it makes me wonder if a little AI could take a repost and automatically insert links to the previous postings, so people could see earlier discussions.