Utf-8 Is a Brilliant Design

Posted 4 months ago | Active 4 months ago

vishnuharidas

849 points

348 comments

iamvishnu.com | Tech story | High profile

Key topics

Utf-8

Unicode

Character Encoding

The article discusses the brilliant design of UTF-8, a character encoding standard that has become ubiquitous, and the HN discussion highlights its benefits, design decisions, and comparisons with other encoding schemes.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment: 29m after posting

Peak period: 118 comments in 0-12h

Avg / period: 17.8

Key moments

01 Story posted: Sep 12, 2025 at 2:30 PM EDT (4 months ago)
02 First comment: Sep 12, 2025 at 2:59 PM EDT (29m after posting)
03 Peak activity: 118 comments in 0-12h (hottest window of the conversation)
04 Latest activity: Sep 20, 2025 at 12:41 AM EDT (4 months ago)

Discussion (348 comments)

Showing 160 comments of 348

happytoexplain

4 months ago

4 replies

I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

amluto

4 months ago

6 replies

> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1.

I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.

throw0101d

4 months ago

1 reply

> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code units up to 2^21-1.

Yes, it is 'truncated' to the "UTF-16 accessible range":

* https://datatracker.ietf.org/doc/html/rfc3629#section-3

* https://en.wikipedia.org/wiki/UTF-8#History

Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:

* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

gpvos

4 months ago

You could even extend UTF-8 to make 0xFE and 0xFF valid starting bytes, with 6 and 7 following bytes each, and get 42 bits of space. I seem to remember Perl allowed that for a while in its v-strings notation.

Edit: just tested this, Perl still allows this, but with an extra twist: v-notation goes up to 2^63-1. From 2^31 to 2^36-1 is encoded as FE + 6 bytes, and everything above that is encoded as FF + 12 bytes; the largest value it allows is v9223372036854775807, which is encoded as FF 80 87 BF BF BF BF BF BF BF BF BF BF. It probably doesn't allow that one extra bit because v-notation doesn't work with negative integers.
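
The bit counts mentioned in this subthread (31 bits for six bytes, 36 and 42 bits for hypothetical 0xFE and 0xFF lead bytes) follow from the bit layout. A minimal sketch in Python; `payload_bits` is a made-up helper, and it models only the straightforward extension described above, not Perl's actual FF + 12 byte scheme:

```python
def payload_bits(total_bytes):
    """Payload bits in a sequence of `total_bytes` bytes under the original
    six-byte UTF-8 layout, extended hypothetically to lead bytes
    0xFE (7 bytes total) and 0xFF (8 bytes total)."""
    if total_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends its high bits on the length prefix; whatever is
    # left over (possibly nothing) is payload. Each continuation byte
    # (10xxxxxx) carries 6 payload bits.
    lead_bits = max(0, 7 - total_bytes)
    return lead_bits + 6 * (total_bytes - 1)


for n in range(1, 9):
    print(n, payload_bits(n))
```

The four-byte row gives the 21 bits that RFC 3629 kept, and the six-byte row gives the 31 bits of the original design.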

mort96

4 months ago

2 replies

That isn't really a case of UTF-8 sacrificing anything to be compatible with UTF-16. It's Unicode, not UTF-8 that made the sacrifice: Unicode is limited to 21 bits due to UTF-16. The UTF-8 design trivially extends to support 6 byte long sequences supporting up to 31-bit numbers. But why would UTF-8, a Unicode character encoding, support code points which Unicode has promised will never and can never exist?

GuB-42

4 months ago

1 reply

Is 21 bits really a sacrifice? It is 2 million code points; we currently use about a tenth of that.

Even with all Chinese characters, de-unified, all the notable historical and constructed scripts, technical symbols, and all the submitted emoji, including rejections, you are still way short of a million.

We will probably never need more than 21 bits unless we start stretching the definition of what text is.

moefh

4 months ago

1 reply

It's not 2 million, it's a little over 1 million.

The exact number is 1112064 = 2^16 - 2048 + 16*2^16: in UTF-16, 2 bytes can encode 2^16 - 2048 code points, and 4 bytes can encode 16*2^16 (the 2048 surrogates are not counted because they can never appear by themselves, they're used purely for UTF-16 encoding).
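
The count is easy to check. A minimal sketch in Python (the constant names are made up for illustration):

```python
BMP = 2**16                 # code points reachable with one UTF-16 code unit
SURROGATES = 2048           # U+D800..U+DFFF, reserved for UTF-16 pairs
SUPPLEMENTARY = 16 * 2**16  # 16 further planes, reachable via surrogate pairs

scalar_values = BMP - SURROGATES + SUPPLEMENTARY
print(scalar_values)  # 1112064

# Same number, counted from the top of the code space (U+10FFFF).
assert scalar_values == 0x110000 - SURROGATES
```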

chuckadams

4 months ago

1 reply

Even with just 1 million codepoints, why did they feel the need for CJK unification? Was it so it would all fit in UCS-2 or something?

rwallace

4 months ago

Yes, that was exactly the reason. CJK unification happened during the few years when we were all trying to convince ourselves that 16 bits would be enough. By the time we acknowledged otherwise, it was too late.

MyOutfitIsVague

4 months ago

In an ideal future (read: fantasy), utf-16 gets formally deprecated and trashed, freeing the surrogate sequences and full range for utf-8.

Or utf-16 is officially considered a second class citizen, and some code points are simply out of its reach.

Analemma_

4 months ago

1 reply

It's always dangerous to stick one's neck out and say "[this many bits] ought to be enough for anybody", but I think it's very unlikely we'll ever run out of UTF-8 sequences. UTF-8 can represent about 1.1 million code points, of which we've assigned about 160,000 actual characters, plus another ~140,000 in the Private Use Area, which won't expand. And that's after getting nearly all of the world's known writing systems: the last several Unicode updates have added a few thousand characters here and there for very obscure and/or ancient writing systems, but those won't go on forever (and things like emojis usually only get a handful of new code points per update, because most new emojis are existing code points with combining characters).

If I had to guess, I'd say we'll run out of IPv6 addresses before we run out of unassigned UTF-8 sequences.

lyu07282

4 months ago

The oldest script in Unicode, Sumerian cuneiform, is ~5,200 years old; if we were to invent new scripts at the same rate, we would hit 1.1 million code points in around 31,000 years. So yeah, nothing to worry about, you are absolutely right. Unless we join some intergalactic federation of planets, although they probably already have their own encoding standards we could just adopt.

cryptonector

4 months ago

> It sacrifices the ability to encode more than 21 bits

No, UTF-8's design can encode up to 31 bits of codepoints. The limitation to 21 bits comes from UTF-16, which was then adopted for UTF-8 too. When UTF-16 dies we'll be able to extend UTF-8 (well, compatibility will be a problem).

layer8

4 months ago

That limitation will be trivial to lift once UTF-16 compatibility can be disregarded. This won’t happen soon, of course, given JavaScript and Windows, but the situation might be different in a hundred or thousand years. Until then, we still have a lot of unassigned code points.

In addition, it would be possible to nest another surrogate-character-like scheme into UTF-16 to support a larger character set.

1oooqooq

4 months ago

the limitation tomorrow will be today's implementations, sadly.

procaryote

4 months ago

1 reply

> I love when an entity in a position of power is willing to break things in the name of advancement.

It's less fun when you have things that need to keep working break because someone felt like renaming a parameter, or that a part of the standard library looks "untidy"

happytoexplain

4 months ago

2 replies

I agree! And yet I lovingly sacrifice my man-hours to it when I decide to bump that major version number in my dependency manifest.

procaryote

4 months ago

1 reply

Or minor versions of python...

Honestly, Python is probably one of the worst offenders in this, as they combine happily making breaking changes for low-value rearranging of deck chairs with a dynamic language where you might only find out at runtime.

The fact that they've also decided to use an unconventional interpretation of minor version shows how little they care.

chuckadams

4 months ago

2 replies

The term "semantic versioning" didn't even exist until 2010, which is well after the birth of Python. Sure, it semi-formalized a convention from long before, but it was hardly universal.

procaryote

4 months ago

They of course get to break their thing however much they like, but it sure sucks

account42

4 months ago

The ideals behind semantic versioning existed long before the marketing term.

account42

4 months ago

The key words here being "I decide". I'm going to express a lot less love when someone else decides.

mort96

4 months ago

Yeah I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi byte character encoding format, it seems completely optimal even in isolation.

cryptonector

4 months ago

> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

There were apps that completely rejected non-7-bit data back in the day. Backwards compatibility wasn't the only point. The point of UTF-8 is more (IMO) that UTF-32 is too bulky, UCS-2 was insufficient, UTF-16 was an abortion, and only UTF-8 could have the right trade-offs.

3pt14159

4 months ago

3 replies

I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.

linguae

4 months ago

2 replies

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

pezezin

4 months ago

1 reply

I live in Japan and I still receive the random email or work document encoded in Shit-JIS. Mojibake is not as common as it once was, but still a problem.

rmunn

4 months ago

I'm assuming you misspelled Shift-JIS on purpose because you're sick and tired of dealing with it. If that was an accidental misspelling, it was inspired. :-)

layer8

4 months ago

On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...

glxxyz

4 months ago

I worked on an email client. Many many character set headaches.

acdha

4 months ago

I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and being a stereotypical monolingual American I took them at their word that they were sending us simplified Chinese and had it loaded into our PHP app which dutifully served it with that encoding. It was clearly Chinese so I figured we had that feed working.

A couple of days later, I got an email from someone explaining that it was gibberish — apparently our content partner who claimed to be sending GB2312 simplified Chinese was in fact sending us Big5 traditional Chinese so while many of the byte values mapped to valid characters it was nonsensical.

bruce511

4 months ago

1 reply

While the backward compatibility of utf-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.

In other words, yes it's backward compatible, but UTF-8 is also compact and elegant even without that.

nextaccountic

4 months ago

1 reply

UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

adgjlsfhk1

4 months ago

1 reply

How is that UTF8 specific?

ubitaco

4 months ago

1 reply

It's slightly buried in the readme on Github:

> how can we store a 24 byte long string, inline? Don't we also need to store the length somewhere?

> To do this, we utilize the fact that the last byte of our string could only ever have a value in the range [0, 192). We know this because all strings in Rust are valid UTF-8, and the only valid byte pattern for the last byte of a UTF-8 character (and thus the possible last byte of a string) is 0b0XXXXXXX aka [0, 128) or 0b10XXXXXX aka [128, 192)
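
The property quoted from the README can be spot-checked from Python. This is only a sketch of the invariant (the final byte of valid UTF-8 is always below 192), not of how compact_str actually lays out its bytes; `last_byte` and `samples` are hypothetical names:

```python
def last_byte(text):
    """Return the final byte of the UTF-8 encoding of `text`, or None if empty."""
    data = text.encode("utf-8")
    return data[-1] if data else None


samples = ["", "a", "bl\u00e5b\u00e6rgr\u00f8d", "\u6f22\u691c", "\U0001F4A9"]
for s in samples:
    b = last_byte(s)
    # The last byte is either ASCII (0..127) or a continuation byte (128..191),
    # so the values 192..255 never occur there and are free to carry a tag.
    assert b is None or b < 192
    print(repr(s), b)
```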

Dylan16807

4 months ago

Any Unicode encoding would allow that.

UTF-32 has an entire spare byte to put flags into. 24 or 21 bit encodings have spare bits that could act as flags. UTF-16 has plenty of invalid code units, or you could use a high surrogate in the last 2 bytes as your flag.

quectophoton

4 months ago

11 replies

Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

If the characters were instead encoded like EBML's variable size integers[1] (but inverting 1 and 0 to keep ASCII compatibility for the single-byte case), and you do a random seek, it wouldn't be as easy (or maybe not even possible) to know if you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.

[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4
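
A minimal sketch of that resynchronization in Python, assuming the input is valid UTF-8 (`char_start` is a hypothetical helper, not a standard library function):

```python
def char_start(data, pos):
    """Return the offset of the first byte of the character containing
    byte `pos`. Assumes `data` is valid UTF-8."""
    # Continuation bytes look like 10xxxxxx; step back until we leave them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "bl\u00e5b\u00e6rgr\u00f8d".encode("utf-8")  # "blåbærgrød", 13 bytes
print(char_start(data, 3))  # 2: byte 3 is the second byte of "å"
print(char_start(data, 4))  # 4: byte 4 is "b", already a character start
```

Because a sequence is at most four bytes long, the loop steps back at most three times on valid input.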

1oooqooq

4 months ago

1 reply

so you replace one costly sweeping with a costly sweeping. i wouldn't call that an advantage in any way over jumping n bytes.

what you describe is the bare minimum so you even know what you are searching for while you scan pretty much everything every time.

hk__2

4 months ago

1 reply

What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.

gertop

4 months ago

10 replies

UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 didn't win on technical merits, it won because it was mostly backwards compatible with all American software that previously used ASCII only.

When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

ISV_Damocles

4 months ago

2 replies

UTF-16 is also just as complicated as UTF-8 requiring multibyte characters to cover the entirety of Unicode, so it doesn't avoid the issue you're complaining about for the newest languages added, and it has the added complexity of a BOM being required to be sure you have the pairs of bytes in the right order, so you are more vulnerable to truncated data being unrecoverable versus UTF-8.

UTF-32 would be a fair comparison, but it is 4 bytes per character and I don't know what, if anything, uses it.

Mikhail_Edoshin

4 months ago

1 reply

No, UTF-16 is much simpler in that aspect. And its design is no less brilliant. (I've written a state machine encoder and decoder for both these encodings.) If an application works a lot with text I'd say UTF-16 looks more attractive for the main internal representation.

rmunn

4 months ago

UTF-16 is simpler most of the time, and that's precisely the problem. Anyone working with UTF-8 knows they will have to deal with multibyte codepoints. People working with UTF-16 often forget about surrogate characters, because they're a lot rarer in most major languages, and then end up with bugs when their users put emoji into a text field.

adgjlsfhk1

4 months ago

python does (although it will use 8 or 16 bits per character if all characters in the string fit)

syncsynchalt

4 months ago

1 reply

UTF-16 has endian concerns and surrogates.

Both UTF-8 and UTF-16 have negatives but I don't think UTF-16 comes out ahead.

Mikhail_Edoshin

4 months ago

1 reply

Here is what a UTF-8 decoder needs to handle:

1. Invalid bytes. Some bytes cannot appear in a UTF-8 string at all. There are two ranges of these.

2. Conditionally invalid continuation bytes. In some states you read a continuation byte and extract the data, but in some other cases the valid range of the first continuation byte is further restricted.

3. Surrogates. They cannot appear in a valid UTF-8 string, so if they do, this is an error and you need to mark it so. Or maybe process them as in CESU but this means making sure they are correctly paired. Or maybe process them as in WTF-8, read and let go.

4. Form issues: an incomplete sequence or a continuation byte without a starting byte.

It is much more complicated than UTF-16. UTF-16 only has surrogates that are pretty straightforward.

syncsynchalt

4 months ago

I've written some Unicode transcoders; UTF-8 decoding devolves to a quartet of switch statements and each of the issues you've mentioned ends up being a case statement where the solution is to replace the offending sequence with U+FFFD.

UTF-16 is simple as well but you still need code to absorb BOMs, perform endian detection heuristically if there's no BOM, and check surrogate ordering (and emit a U+FFFD when an illegal pair is found).

I don't think there's an argument for either being complex, the UTFs are meant to be as simple and algorithmic as possible. -8 has to deal with invalid sequences, -16 has to deal with byte ordering, other than that it's bit shifting akin to base64. Normalization is much worse by comparison.

My preference for UTF-8 isn't one of code complexity, I just like that all my 70's-era text processing tools continue working without too many surprises. The features like self-synchronization are nice too compared to what we _could_ have gotten as UTF-8.
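
The error classes listed upthread can be poked at with Python's built-in UTF-8 codec. This is a rough illustration, not a hand-written decoder; `BAD_INPUTS` and `classify` are made-up names, and the byte strings are just one example per class:

```python
BAD_INPUTS = {
    "byte that never appears in UTF-8": b"\xff",
    "overlong encoding of U+0000": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "continuation byte with no lead byte": b"\x80",
}


def classify(data):
    """Return "ok" if `data` is valid UTF-8, else the codec's error reason."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return "ok"
    except UnicodeDecodeError as exc:
        return exc.reason


for label, raw in BAD_INPUTS.items():
    # errors="replace" substitutes U+FFFD for the offending bytes.
    repaired = raw.decode("utf-8", errors="replace")
    print(f"{label}: {classify(raw)} -> {repaired!r}")
```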

cyphar

4 months ago

1 reply

UTF-16 is absolutely not easier to work with. The vast majority of bugs I remember having to fix that were directly related to encoding were related to surrogate pairs. I suspect most programs do not handle them correctly because they come up so rarely but the bugs you see are always awful. UTF-8 doesn't have this problem and I think that's enough reason to avoid UTF-16 (though "good enough" compatibility with programs that only understand 8-bit-clean ASCII is an even better practical reason). Byte ordering is also a pernicious problem (with failure modes like "all of my documents are garbled") that UTF-8 also completely avoids.

UTF-16 is 33% more compact for most (but not all) CJK characters, but that's not the case for all non-English characters. However, one important thing to remember is that most computer-based documents contain large amounts of ASCII text purely because the formats themselves use English text and ASCII punctuation. I suspect that most UTF-8 files with CJK contents are much smaller than UTF-16 files, but I'd be interested in an actual analysis from different file formats.

The size argument (along with a lot of understandable contention around UniHan) is one of the reasons why UTF-8 adoption was slower in Japan and Shift-JIS is not completely dead (though mainly for esoteric historical reasons like the 漢検 test rather than active or intentional usage) but this is quite old history at this point. UTF-8 now makes up 99% of web pages.

cyphar

4 months ago

I went through a Japanese ePUB novel I happened to have on hand (the Japanese translation of 1984) and 65% of the bytes are ASCII bytes. So in this case UTF-16 would end up resulting in something like 53% more bytes (going by napkin math).

You could argue that because it will be compressed (and UTF-16 wastes a whole NUL byte for all ASCII) that the total file-size for the compressed version would be better (precisely because there are so many wasted bytes) but there are plenty of examples where files aren't compressed and most systems don't have compressed memory so you will pay the cost somewhere.

But in the interest of transparency, a very crude test of the same ePUB yields a 10% smaller file with UTF-16. I think a 10% size penalty (in a very favourable scenario for UTF-16) in exchange for all of the benefits of UTF-8 is more than an acceptable tradeoff, and the incredibly wide proliferation of UTF-8 implies most people seem to agree.
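
The napkin math for the uncompressed case can be reproduced in a few lines. A sketch under a stated assumption: every non-ASCII character takes 3 bytes in UTF-8 and 2 in UTF-16, which is roughly right for Japanese text but ignores non-BMP characters. `utf16_growth` is a hypothetical helper:

```python
def utf16_growth(ascii_byte_share, other_bytes_per_char=3):
    """Ratio of UTF-16 size to UTF-8 size for a file in which
    `ascii_byte_share` of the UTF-8 bytes are ASCII."""
    ascii_utf16 = ascii_byte_share * 2  # each 1-byte char becomes 2 bytes
    other_utf16 = (1 - ascii_byte_share) / other_bytes_per_char * 2
    return ascii_utf16 + other_utf16


print(round(utf16_growth(0.65), 2))  # 1.53, i.e. about 53% more bytes
```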

kbolino

4 months ago

1 reply

Thanks to UTF-16, which came out after UTF-8, there are 2048 wasted 3-byte sequences in UTF-8.

And unlike the short-sighted authors of the first version of Unicode, who thought the whole world's writing systems could fit in just 65,536 distinct values, the authors of UTF-8 made it possible to encode up to 2 billion distinct values in the original design.

xigoi

4 months ago

1 reply

Thanks to UTF-8, there are 13 wasted 1-byte sequences in UTF-8 :P

kbolino

4 months ago

1 reply

Assuming your count is accurate, then 9 (edit: corrected from 11) of those 13 are also UTF-16's fault. The only bytes that were impossible in UTF-8's original design were 0b11111110 and 0b11111111. Remember that UTF-8 could handle up to 6-byte sequences originally.

Now all of this hating on UTF-16 should not be misconstrued as some sort of encoding religious war. UTF-16 has a valid purpose. The real problem was Unicode's first version getting released at a critical time and thus its 16-bit delusion ending up baked into a bunch of important software. UTF-16 is a pragmatic compromise to adapt that software so it can continue to work with a larger code space than it originally could handle. Short of rewriting history, it will stay with us forever. However, that doesn't mean it needs to be transmitted over the wire or saved on disk any more often than necessary.

Use UTF-8 for most purposes especially new formats, use UTF-16 only when existing software requires it, and use UTF-32 (or some other sequence of full code points) only internally/ephemerally to convert between the other two and perform high-level string functions like grapheme cluster segmentation.

xigoi

4 months ago

1 reply

Pretty sure 0b11000000 and 0b11000001 are also UTF-8’s fault. Good point with the others, I guess. And I agree about UTF-8 being the best, just found it funny.

kbolino

4 months ago

Yep, you're right. Those two bytes are forbidden to prevent overlong encodings. A number of multibyte sequences are forbidden for the same reason too.

A true flaw of UTF-8 in the long run. They should have biased the values of multibyte sequences to remove redundant encodings.
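
The count of 13 unused byte values can be confirmed by brute force. A sketch that encodes every Unicode scalar value with Python's codec and records which byte values ever show up (it loops over about 1.1 million code points, so it takes a moment):

```python
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable in UTF-8
    used.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - used)
print(len(unused), [hex(b) for b in unused])
# 13 byte values: 0xc0, 0xc1 (overlong lead bytes) and 0xf5 through 0xff
```

Of these, 0xC0 and 0xC1 are excluded by the overlong rule, 0xF5 to 0xFD by the U+10FFFF limit, and 0xFE and 0xFF were unused even in the original six-byte design, matching the breakdown given above.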

jcranmer

4 months ago

2 replies

> UTF-16 is both simpler to parse and more compact than utf-8 when writing non-english characters.

UTF-8 and UTF-16 take the same number of bytes to encode a non-BMP character or a character in the range U+0080-U+07FF (which includes most of the Latin supplements, Greek, Cyrillic, Arabic, Hebrew, Aramaic, Syriac, and Thaana). For ASCII characters--which include most whitespace and punctuation--UTF-8 takes half as much space as UTF-16, while for characters in the range U+0800-U+FFFF, UTF-8 takes 50% more space than UTF-16. Thus, for most European languages, and even Arabic (which ain't European), UTF-8 is going to be more compact than UTF-16.

The Asian languages (CJK-based languages, Indic languages, and South-East Asian, largely) are the ones that are more compact in UTF-16 than UTF-8, but if you embed those languages in a context likely to have significant ASCII content--such as an HTML file--well, it turns out the UTF-8 still wins out!
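
The per-range sizes are quick to confirm. A small sketch with one sample character per range (`SAMPLES` and `sizes` are made-up names; `utf-16-le` is used so no BOM is counted):

```python
SAMPLES = {
    "ASCII (U+0041)": "A",
    "U+0080-U+07FF (U+00E9)": "\u00e9",
    "U+0800-U+FFFF (U+3042)": "\u3042",
    "outside the BMP (U+1F600)": "\U0001F600",
}


def sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for label, ch in SAMPLES.items():
    print(label, sizes(ch))
# (1, 2), (2, 2), (3, 2), (4, 4) in that order
```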

> When you leave the anglosphere you'll find that some languages still default to other encodings due to how large utf-8 ends up for them (Chinese and Japanese, to name two).

You'll notice that the encodings that are used are not UTF-16 either. Also, my understanding is that China generally defaults to UTF-8 nowadays despite a government mandate to use GB18030 instead, so it's largely Japan that is the last redoubt of the anti-Unicode club.

amake

4 months ago

Even Japan is mostly Unicode these days.

GoblinSlayer

4 months ago

And when you download many megabytes of jabbascript to render 4kb of text, how does it matter what encoding you use?

airza

4 months ago

1 reply

It's all fun and games until you hit an astral plane character in utf-16 and one of the library designers didn't realize not all characters are 2 bytes.

rmunn

4 months ago

1 reply

Which is why I've seen lots of people recommend testing your software with emojis, particularly recently-added emojis (many of the earlier emojis were in the basic multilingual plane, but a lot of newer emojis are outside the BMP, i.e. the "astral" planes). It's particularly fun to use the (U+1F4A9) emoji for such testing, because of what it implies about the libraries that can't handle it correctly.

EDIT: Heh. The U+1F4A9 emoji that I included in my comment was stripped out. For those who don't recognize that codepoint by hand (can't "see" the Matrix just from its code yet?), that emoji's official name is U+1F4A9 PILE OF POO.
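
For anyone who wants to see why that code point trips up 2-bytes-per-character code, here is a minimal sketch of the surrogate-pair arithmetic (`surrogate_pair` is a hypothetical helper; the cross-check uses Python's own UTF-16 codec):

```python
def surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F4A9)
print(hex(high), hex(low))  # 0xd83d 0xdca9

# Cross-check against the built-in codec: two 16-bit code units, four bytes.
assert "\U0001F4A9".encode("utf-16-be") == bytes([0xD8, 0x3D, 0xDC, 0xA9])
```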

GoblinSlayer

4 months ago

For more fun you can use flag characters.

adgjlsfhk1

4 months ago

With BOM issues, UTF-16 is way more complicated. For Chinese and Japanese, UTF8 is a maximum of 50% bigger, but can actually end up smaller if used within standard file formats like JSON/HTML since all the formatting characters and spaces are single bytes.

simonask

4 months ago

All of Europe outside of the UK and English-speaking Ireland need characters outside of ASCII, but most letters are ASCII. For example, the string "blåbærgrød" in Danish (blueberry porridge) has about the densest occurrence of non-ASCII characters, but that's still only 30%. It takes 13 bytes in UTF-8, but 20 bytes in UTF-16.

Spanish has generally at most one accented vowel (á, ó, ü, é, ...) per word, and generally at most one ñ per word. German rarely has more than two umlauts per word, and almost never more than one ß.

UTF-16 is a wild pessimization for European languages, and UTF-8 is only slightly wasteful in Asian languages.
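
The Danish example checks out. A quick sketch, assuming the word is in precomposed (NFC) form, which is why it is spelled with escapes here:

```python
word = "bl\u00e5b\u00e6rgr\u00f8d"  # "blåbærgrød", precomposed characters

non_ascii = sum(1 for ch in word if ord(ch) > 0x7F)
utf8_len = len(word.encode("utf-8"))
utf16_len = len(word.encode("utf-16-le"))  # little-endian, no BOM

print(len(word), non_ascii, utf8_len, utf16_len)  # 10 3 13 20
```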

Iwan-Zotow

4 months ago

There are no sane Chinese or Japanese people who use old encodings. None

kccqzy

4 months ago

Two decades ago the typical simplified Chinese website did in fact use GB2312 and not UTF-8; traditional Chinese websites used Big5; Japanese sites used Shift JIS. These days that's not true at all. Your comment is twenty years out of date.

Animats

4 months ago

7 replies

Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.

Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.

I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.

btown

4 months ago

1 reply

This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!

In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.

meindnoch

4 months ago

>without worrying about their abstractions around “a 5 character subslice” being more complicated than that

Combining characters still exist.

nostrademons

4 months ago

2 replies

PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. Cut memory usage in typical Django applications by 2/3 when it was implemented.

https://peps.python.org/pep-0393/

I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use-cases involve chopping off a small prefix (eg. "hex_digits[2:]") or suffix (eg. "filename[-3:]"), and you can easily just linear search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals, eg. .find(substr) can just do Boyer-Moore over bytes, .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.

barrkel

4 months ago

4 replies

You usually want O(1) indexing when you're implementing views over a large string. For example, a string containing a possibly multi-megabyte text file and you want to avoid copying out of it, and work with slices where possible. Anything from editors to parsing.

I agree though that usually you only need iteration, but string APIs need to change to return some kind of token that encapsulates both logical and physical index. And you probably want to be able to compute with those - subtract to get length and so on.

ori_b

4 months ago

1 reply

You don't particularly want indexing for that, but cursors. A byte offset (wrapped in an opaque type) is sufficient for that need.

bjoli

4 months ago

You could add a LUT for decently fast indexing as well. I believe Java does that.

nostrademons

4 months ago

Sure, but for something like that whatever constructs the view can use an opaque index type like Animats suggested, which under the hood is probably a byte index. The slice itself is kinda the opaque index, and then it can just have privileged access to some kind of unsafe_byteIndex accessor.

There are a variety of reasons why unsafe byte indexing is needed anyway (zero-copy?), it just shouldn’t be the default tool that application programmers reach for.

MrBuddyCasino

4 months ago

If you have multi-MB strings in an editor, that’s the problem right there. People use ropes instead of strings for a reason.

naniwaduni

4 months ago

You really just very rarely want codepoint indexing. A byte index is totally fine for view slices.

masklinn

4 months ago

1 reply

> PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally.

UTF8 is used for C level interactions, if it were just that being used there would be no need to know the highest code point.

For Python semantics it uses one of ASCII, iso-8859-1, ucs2, or ucs4.

nostrademons

4 months ago

Interesting. You're right. Code pointer:

https://github.com/python/cpython/blob/main/Objects/unicodeo...

Also implies that Animats is correct that including an emoji in a Python string can bloat the memory consumption by a factor of 4.
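
The effect is observable with `sys.getsizeof`. A rough sketch that assumes CPython 3.3 or later with the PEP 393 string layout; exact byte counts vary by version and platform, so none are shown:

```python
import sys

ascii_text = "a" * 1000
with_kana = ascii_text + "\u3042"       # one BMP character: 2 bytes per char
with_emoji = ascii_text + "\U0001F600"  # one non-BMP character: 4 bytes per char

small = sys.getsizeof(ascii_text)
medium = sys.getsizeof(with_kana)
large = sys.getsizeof(with_emoji)

print(small, medium, large, round(large / small, 2))
```

On CPython the ratio of `large` to `small` comes out close to 4, since a single wide character sets the storage width for the whole string.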

kccqzy

4 months ago

1 reply

Indexing into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it is provided by the user or a localized user-facing string. You don't generally need indices.

Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.

mjevans

4 months ago

They can happen to _be_ Unicode. Composition operations (for fully terminated Unicode strings) should work, but require eventual normalization.

That's the other benefit of being able to resume UTF-8 strings midway: even after combining broken strings, all the good characters are still present.

Substring operations are more dicey; those should be operating with known strings. In pathological cases they might operate against portions of Unicode bits... but that's as silly as using raw pointers and directly mangling the bytes without any protection or design plans.

cryptonector

4 months ago

2 replies

Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem. Basically just keep the indices small and there's no problem.

account42

4 months ago

1 reply

Unicode itself is variable width due to combining characters, variation selectors, etc.

cryptonector

4 months ago

Yes, quite.

bjoli

4 months ago

Or just use immutable strings and look-up tables. Say, every 32 characters, combined with cursors. This is going to make indexing fast enough for randomly jumping into a string and then using cursors.
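
A minimal sketch of that idea, assuming valid UTF-8 input and an in-range index; `build_checkpoints` and `byte_offset` are made-up names for illustration, not any library's API:

```python
CHECKPOINT_EVERY = 32  # code points between table entries

def build_checkpoints(data: bytes) -> list[int]:
    """Byte offset of every 32nd code point in valid UTF-8 `data`."""
    offsets = []
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            if count % CHECKPOINT_EVERY == 0:
                offsets.append(pos)
            count += 1
    return offsets

def byte_offset(data: bytes, checkpoints: list[int], index: int) -> int:
    """Byte offset of code point number `index`, scanning at most 31 code points."""
    pos = checkpoints[index // CHECKPOINT_EVERY]
    remaining = index % CHECKPOINT_EVERY
    while remaining:
        pos += 1
        while data[pos] & 0xC0 == 0x80:  # skip continuation bytes
            pos += 1
        remaining -= 1
    return pos
```

The table costs one integer per 32 code points, and a lookup is a table read plus a short bounded scan, so the string has to be immutable (or the table rebuilt on edit) for the offsets to stay valid.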

otabdeveloper4

4 months ago

1 reply

"Unicode" aka "wide characters" is the dumbest engineering debacle of the century.

> ascii and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years

> oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever

Dylan16807

4 months ago

1 reply

What about Unicode isn't forward compatible?

UCS-2 was an encoding mistake, but even it was pretty forward compatible

otabdeveloper4

4 months ago

1 reply

"Unicode" here means the OG Unicode that was supposed to fit all of past, current and future languages in exactly 16 bits.

Yes, it's a silly idea but it's exactly the reason why Python, JavaScript and Java use the most braindead way of storing text known to man. (UCS-2)

Dylan16807

4 months ago

> "Unicode" here means the OG Unicode that was supposed to fit all of past, current and future languages in exactly 16 bits.

Well... it explicitly wasn't supposed to fit all past characters when they decided on 16 bits.

And they weren't sure on size for a while, and only kept it for a couple years, so I would make the fact that you're complaining about the 16 bits more explicit.

But also it did turn out to be forward compatible. That's part of why we're stuck with it!

johncolanduoni

4 months ago

Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.

zahlman

4 months ago

> If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.

What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.

deepsun

4 months ago

2 replies

That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.

Quick googling (not all of them are on-topic tho):

https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...

https://www.cve.org/CVERecord/SearchResults?query=utf-8

s1mplicissimus

4 months ago

3 replies

I was just wondering a similar thing: If 10 implies start of character, doesn't that require 10 to never occur inside the other bits of a character?

gavinsyancey

4 months ago

Generally you can assume byte-aligned access. So every byte of UTF-8 either starts with 0 or 11 to indicate an initial byte, or 10 to indicate a continuation byte.

pklausler

4 months ago

10 never implies the start of a character; those begin with 0 or 11.

dbaupp

4 months ago

UTF-8 encodes each character into a whole number of bytes (8, 16, 24, or 32 bits), and the 10 continuation marker is only at the start of the extra continuation bytes, it is just data when that pattern occurs within a byte.

You are correct that it never occurs at the start of a byte that isn’t a continuation byte: the first byte in each encoded code point starts with either 0 (ASCII code points) or 11 (non-ASCII).
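
As a small illustration of the rule, here is a sketch that classifies a byte by its leading bits only (`classify` is a made-up helper name, and it does no validity checking beyond the prefix):

```python
def classify(byte: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits."""
    if byte & 0x80 == 0x00:
        return "ascii"         # 0xxxxxxx
    if byte & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx, first byte of a multi-byte sequence

print([classify(b) for b in "aé€".encode("utf-8")])
```

The check looks only at the top two bits, so whatever `10` patterns appear deeper inside a byte are never consulted.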

restalis

4 months ago

This tendency of requirement overloading, for what can otherwise be a simple solution for a simple problem, is the bane of engineering. In this case, if security is important, it can be addressed separately, e.g. for the underlying text treated as an abstract information block that has to be packaged with corresponding error codes then checked for integrity before consumption. The UTF-8 encoding/decoding process itself doesn't necessarily have to answer the security concerns. Please let the solutions be simple, whenever they can be.

thesz

4 months ago

1 reply

> so you can easily find the beginning of the next or previous character.

It is not true [1]. While it is not a UTF-8 problem per se, it is a problem with how UTF-8 is being used.

[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...

layer8

4 months ago

Parent means “character” as defined here in Unicode: https://www.unicode.org/versions/Unicode17.0.0/core-spec/cha..., effectively code points. Meanings 2 and 3 in the Unicode glossary here: https://www.unicode.org/glossary/#character

spankalee

4 months ago

1 reply

Wouldn't you only need to read backwards at most 3 bytes to see if you were currently at a continuation byte? With a max multi-byte size of 4 bytes, if you don't see a multi-byte start character by then you would know it's a single-byte char.

I wonder if a reason is similar though: error recovery when working with libraries that aren't UTF-8 aware. If you naively slice an array of UTF-8 bytes, a UTF-8-aware library can ignore malformed leading and trailing bytes and get some reasonable string out of it.

Sharlin

4 months ago

1 reply

It’s not always possible to read backwards.

Dylan16807

4 months ago

1 reply

Okay so you seek by 3 less bytes.

Or you accept that if you're randomly losing chunks, you might lose an extra 3 bytes.

The real problem is that seeking a few bytes won't work with EBML. If continuation bytes store 8 payload bits, you can get into a situation where every single byte could be interpreted as a multi-byte start character and there are 2 or 3 possible messages that never converge.

Sharlin

4 months ago

1 reply

The point is that you don’t have a "seek" operation available. You are given a bytestream and aren’t told if you’re at the start, in a valid position between code points, or in the middle of a code point. UTF-8’s self-synchronizing property means that by reading a single byte you immediately know if you’re in the middle of a code point, and that by reading and discarding at most two additional bytes you’re synchronized and can start/return decoding. That wouldn’t be possible if continuation bytes used all the bits for payload.
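
A sketch of that synchronization step for a forward-only stream, with `resync` as a hypothetical helper name:

```python
from typing import Iterable, Iterator

def resync(stream: Iterable[int]) -> Iterator[int]:
    """Yield bytes from `stream`, dropping any continuation bytes at the front."""
    synced = False
    for byte in stream:
        if not synced and byte & 0xC0 == 0x80:
            continue  # still inside a code point that started before we joined
        synced = True
        yield byte

# "€" is E2 82 AC; pretend we joined the stream after its first byte.
tail = "€uro".encode("utf-8")[1:]
print(bytes(resync(tail)).decode("utf-8"))
```

No seeking or lookbehind is needed: the decision to drop or keep each byte depends on that byte alone.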

Dylan16807

4 months ago

Yes, the point is being able to synchronize.

But it doesn't matter if it takes 1 byte or 3 bytes to synchronize. And being unable to read backwards is not a problem.

(EBML doesn't synchronize in three bytes but other encodings do.)

jridgewell

4 months ago

1 reply

This isn't quite right. In invalid UTF8, a continuation byte can also emit a replacement char if it's the start of the byte sequence. Eg, `0b01100001 0b10000000 0b01100001` outputs 3 chars: a�a. Whether you're at the beginning of an output char depends on the last 1-3 bytes.
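
Python's decoder with `errors="replace"` is one way to reproduce this:

```python
data = bytes([0b01100001, 0b10000000, 0b01100001])
text = data.decode("utf-8", errors="replace")

# The stray continuation byte becomes U+FFFD REPLACEMENT CHARACTER.
print(len(text), [hex(ord(c)) for c in text])
```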

rockwotj

4 months ago

1 reply

> outputs 3 chars

You mean codepoints or maybe grapheme clusters?

Anyways yeah it’s a little more complicated but the principle of being able to truncate a string without splitting a codepoint in O(1) is still useful

jridgewell

4 months ago

Yah, I was using char interchangeably with code point. I also used byte instead of code unit.

> truncate a string without splitting a codepoint in O(1) is still useful

Agreed!
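
For what it's worth, a sketch of such a truncation, assuming the input is valid UTF-8 (`truncate_utf8` is a made-up name):

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `max_bytes` without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte, the cut falls inside a code point, so back up to
    # that code point's lead byte (at most 3 steps in valid UTF-8).
    while end > 0 and data[end] & 0xC0 == 0x80:
        end -= 1
    return data[:end]

print(truncate_utf8("a€b".encode("utf-8"), 2))
```

It only protects code points; a cut can still land between a base letter and a combining mark, which is the grapheme-cluster problem discussed elsewhere in the thread.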

sparkie

4 months ago

1 reply

VLQ/LEB128 are a bit better than the EBML's variable size integers. You test the MSB in the byte - `0` means it's the end of a sequence and the next byte is a new sequence. If the MSB is `1`, to find the start of the sequence you walk back until you find the first zero MSB at the end of the previous sequence (or the start of the stream). There are efficient SIMD-optimized implementations of this.

The difference between VLQ and LEB128 is endianness, basically whether the zero MSB is the start or end of a sequence.

    0xxxxxxx                   - ASCII
    1xxxxxxx 0xxxxxxx          - U+0080 .. U+3FFF
    1xxxxxxx 1xxxxxxx 0xxxxxxx - U+4000 .. U+10FFFD

                      0xxxxxxx - ASCII
             0xxxxxxx 1xxxxxxx - U+0080 .. U+3FFF
    0xxxxxxx 1xxxxxxx 1xxxxxxx - U+4000 .. U+10FFFD

It's not self-synchronizing like UTF-8, but it's more compact - any unicode codepoint can fit into 3 bytes (which can encode up to 0x1FFFFF), and ASCII characters remain 1 byte. Can grow to arbitrary sizes. It has a fixed overhead of 1/8, whereas UTF-8 only has overhead of 1/8 for ASCII and 1/3 thereafter. Could be useful compressing the size of code that uses non-ASCII, since most of the mathematical symbols/arrows are < U+3FFF. Also languages like Japanese, since Katakana and Hiragana are also < U+3FFF, and could be encoded in 2 bytes rather than 3.
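
A rough sketch of the first (big-endian) layout applied to code points, to compare sizes against UTF-8. `vlq_encode` and `vlq_decode` are illustrative names, and this is not a standardized Unicode encoding form:

```python
def vlq_encode(cp: int) -> bytes:
    """Big-endian VLQ: 7 payload bits per byte, MSB set on every byte but the last."""
    out = [cp & 0x7F]  # final byte, MSB clear
    cp >>= 7
    while cp:
        out.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(out))

def vlq_decode(data: bytes) -> list[int]:
    """Decode a concatenation of VLQ-encoded code points."""
    result, value = [], 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # MSB clear: end of this code point
            result.append(value)
            value = 0
    return result

for ch in "A→あ😀":
    print(f"U+{ord(ch):04X}", len(vlq_encode(ord(ch))), len(ch.encode("utf-8")))
```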

kstenerud

4 months ago

2 replies

Unfortunately, VLQ/LEB128 is slow to process due to all the rolling decision points (one decision point per byte, with no ability to branch predict reliably). It's why I used a right-to-left unary code in my stuff: https://github.com/kstenerud/bonjson/blob/main/bonjson.md#le...

  | Header     | Total Bytes | Payload Bits |
  | ---------- | ----------- | ------------ |
  | `.......1` |      1      |       7      |
  | `......10` |      2      |      14      |
  | `.....100` |      3      |      21      |
  | `....1000` |      4      |      28      |
  | `...10000` |      5      |      35      |
  | `..100000` |      6      |      42      |
  | `.1000000` |      7      |      49      |
  | `10000000` |      8      |      56      |
  | `00000000` |      9      |      64      |

The full value is stored little endian, so you simply read the first byte (low byte) in the stream to get the full length, and it has the exact same compactness of VLQ/LEB128 (7 bits per byte).

Even better: modern chips have instructions that decode this field in one shot (callable via builtin):

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

    static inline size_t decodeLengthFieldTotalByteCount(uint8_t header) {
        return (size_t)__builtin_ctz(header) + 1;
    }

After running this builtin, you simply re-read the memory location for the specified number of bytes, then cast to a little-endian integer, then shift right by the same number of bits to get the final payload - with a special case for `00000000`, although numbers that big are rare. In fact, if you limit yourself to max 56 bit numbers, the algorithm becomes entirely branchless (even if your chip doesn't have the builtin).

https://github.com/kstenerud/ksbonjson/blob/main/library/src...

It's one of the things I did to make BONJSON 35x faster to decode/encode compared to JSON.

https://github.com/kstenerud/bonjson

If you wanted to maintain ASCII compatibility, you could use a 0-based unary code going left-to-right, but you lose a number of the speed benefits of a little endian friendly encoding (as well as the self-synchronization of UTF-8 - which admittedly isn't so important in the modern world of everything being out-of-band enveloped and error-corrected). But it would still be a LOT faster than VLQ/LEB128.
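
To make the header table concrete, here is a sketch of the scheme as I read it from the description above, not a port of the linked C code. The helper names are mine, and the 9-byte `00000000` case is deliberately left out:

```python
def encode_length_field(value: int) -> bytes:
    """Encode 0 <= value < 2**56 with a trailing-zeros length header."""
    if not 0 <= value < 1 << 56:
        raise ValueError("this sketch skips the 9-byte `00000000` case")
    total = max(1, -(-value.bit_length() // 7))  # ceil(bits / 7), at least 1 byte
    # Payload sits above `total` header bits: (total - 1) zeros, then a 1.
    word = (value << total) | (1 << (total - 1))
    return word.to_bytes(total, "little")

def decode_length_field(data: bytes) -> tuple[int, int]:
    """Return (value, bytes consumed) for a field at the start of `data`."""
    header = data[0]
    if header == 0:
        raise ValueError("this sketch skips the 9-byte `00000000` case")
    total = (header & -header).bit_length()  # count trailing zeros, plus 1
    word = int.from_bytes(data[:total], "little")
    return word >> total, total
```

`(header & -header).bit_length()` stands in for the `__builtin_ctz(header) + 1` call in the C snippet; the rest is one little-endian read and one shift, with no per-byte loop.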

sparkie

4 months ago

1 reply

We can do better than one branch per byte - we can have it per 8-bytes at least.

We'd use `vpmovb2m`[1] on a ZMM register (64-bytes at a time), which fills a 64-bit mask register with the MSB of each byte in the vector.

Then process the mask register 1 byte at a time, using it as an index into a 256-entry jump table. Each entry would be specialized to process the next 8 bytes without branching, and finish with a conditional branch to the next entry in the jump table or to the next 64 bytes. Any trailing ones in each byte would simply be added to a carry, which would be consumed up to the most significant zero in the next eight bytes.

[1]:https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

kstenerud

4 months ago

1 reply

Sure, but with the above algorithm you could do it in zero branches, and in parallel if you like.

sparkie

4 months ago

Decoding into integers may be faster, but it's kind of missing the point why I suggested VLQs as opposed to EBML's variable length integers - they're not a good fit for string handling. In particular, if we wanted to search for a character or substring we'd have to start from the beginning of the stream and traverse linearly, because there's no synchronization - the payload bytes are indistinguishable from header bytes, making a parallel search not practical.

While you might be able to have some heuristic to determine whether a character is a valid match, it may give false positives and it's unlikely to be as efficient as "test if the previous byte's MSB is zero". We can implement parallel search with VLQs because we can trivially synchronize the stream to next nearest character in either direction - it's partially-synchronizing.

Obviously not as good as UTF-8 or UTF-16 which are self-synchronizing, but it can be implemented efficiently and cut encoding size.
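
The "previous byte's MSB" test can be sketched like this for the big-endian VLQ layout (continuation bit set on every byte except the last); `vlq_start` is a hypothetical name:

```python
def vlq_start(data: bytes, pos: int) -> int:
    """Start offset of the VLQ sequence that contains data[pos]."""
    # A sequence starts right after a byte whose MSB is clear, or at offset 0.
    while pos > 0 and data[pos - 1] & 0x80:
        pos -= 1
    return pos

# 'A', then U+3042 as E0 42, then U+1F600 as 87 EC 00
data = bytes([0x41, 0xE0, 0x42, 0x87, 0xEC, 0x00])
print([vlq_start(data, i) for i in range(len(data))])
```

Unlike UTF-8, nothing bounds the backward walk except the data itself, so corrupted input with a long run of set MSBs would be scanned all the way back.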

roytam87

4 months ago

"what about using VLQ as a Unicode transformation format?"

A rough implementation is not hard. (For writing, my implementation writes a BOM at the beginning and only handles 28 bits.)

https://github.com/roytam1/rtoss/commit/b09bd53d7f4166f34c8b...

procaryote

4 months ago

also, the redundancy means that you get a pretty good heuristic for "is this utf-8". Random data or other encodings are pretty unlikely to also be valid utf-8, at least for non-tiny strings
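
A quick sketch of that heuristic using Python's strict decoder (`looks_like_utf8` is a made-up name):

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("naïve café".encode("utf-8")))    # valid UTF-8
print(looks_like_utf8("naïve café".encode("latin-1")))  # ï and é become lone high bytes
```

Pure ASCII passes trivially, so the test only says something when the data contains bytes of 0x80 and above.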

cryptonector

4 months ago

This is referred to as UTF-8 being "self-synchronizing". You can jump to the middle and find a codepoint boundary. You can read it backwards. You can read it forwards.

jancsika

4 months ago

> Having the continuation bytes always start with the bits `10` also make it possible to seek to any random byte, and trivially know if you're at the beginning of a character or at a continuation byte like you mentioned, so you can easily find the beginning of the next or previous character.

Given four byte maximum, it's a similarly trivial algo for the other case you mention.

The main difference I see is that UTF8 increases the chance of catching and flagging an error in the stream. E.g., any non-ASCII byte that is missing from the stream is highly likely to cause an invalid sequence. Whereas with the other case you mention the continuation bytes would cause silent errors (since an ASCII character would be indistinguishable from continuation bytes).

Encoding gurus-- am I right?

PaulHoule

4 months ago

It's not uncommon when you want variable length encodings to write the number of extension bytes used in unary encoding

https://en.wikipedia.org/wiki/Unary_numeral_system

and also use whatever bits are left over encoding the length (which could be in 8 bit blocks so you write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic

https://archive.org/details/managinggigabyte0000witt

together with other methods that let you compress a text + a full text index for the text into less room than text and not even have to use a stopword list. As you say, UTF-8 does something similar in spirit but ASCII compatible and capable of fast synchronization if data is corrupted or truncated.

twbarr

4 months ago

1 reply

It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.

hu3

4 months ago

1 reply

I wonder if that placemat still exists today. It would be such an important piece of computer history.

4 months ago

> It was so easy once we saw it that there was no reason to keep the placemat for notes, and we left it behind. Or maybe we did bring it back to the lab; I'm not sure. But it's gone now.

https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...

cyberax

4 months ago

3 replies

UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).

The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.

Oh yes, and Python 3 should have known better when it went through the string-bytes split.

wrs

4 months ago

1 reply

UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.

As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.

So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.
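
A small Python illustration of the `str[3]` trap, using a decomposed "café" (Python strings index by code point, so this is the best case, not the worst):

```python
import unicodedata

s = "cafe\u0301"  # "café" written as 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(s))     # counts code points, not user-perceived characters
print(s[3])       # the bare 'e'; the accent is a separate code point at s[4]

composed = unicodedata.normalize("NFC", s)
print(len(composed), composed == "caf\u00e9")
```

Normalization happens to rescue this particular string, but many combining sequences have no precomposed form, so it is not a general fix.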

UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.

rowls66

4 months ago

6 replies

I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.

dudeinjapan

4 months ago

1 reply

CJK unification (https://en.wikipedia.org/wiki/CJK_Unified_Ideographs) i.e. combining "almost same" Chinese/Japanese/Korean characters into the same codepoint, was done for this reason, and we are now living with the consequence that we need to load separate Traditional/Simplified Chinese, Japanese, and Korean fonts to render each language. Total PITA for apps that are multi-lingual.

mort96

4 months ago

2 replies

This feels like it should be solvable by introducing a few more marker characters, like one code point representing "the following text is traditional Chinese", "the following text is Japanese", etc? It would add even more statefulness to Unicode, but I feel like that ship has already sailed with the U+202D LEFT-TO-RIGHT OVERRIDE and U+202E RIGHT-TO-LEFT OVERRIDE characters...

cyberax

4 months ago

1 reply

There is a way to do it: https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_b...

However, it's not used widely and has problems with variant-naïve fonts.

dudeinjapan

4 months ago

Yeah. I would have favored something like introducing new codepoints with "automatic fallbacks" if the font doesn't support that codepoint, to ensure backward compatibility. There would be a one-time hardcoded mapping table introduced that font renderers would have to adopt.

fanf2

4 months ago

Unicode used to have a system of in-band language tags, but it was deprecated https://www.unicode.org/faq//languagetagging.html

mort96

4 months ago

1 reply

The silly thing is, lots of emoji these days aren't even a single code point. So many emoji these days are two other code points combined with a zero width joiner. Surely we could've introduced one code point which says "the next code point represents an emoji from a separate emoji set"?
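
For example, the "woman technologist" emoji is three code points, which is easy to check in Python:

```python
technologist = "\U0001F469\u200D\U0001F4BB"  # WOMAN + ZERO WIDTH JOINER + PERSONAL COMPUTER

print(len(technologist))                  # code points
print(len(technologist.encode("utf-8")))  # UTF-8 bytes
print([f"U+{ord(c):04X}" for c in technologist])
```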

wongarsu

4 months ago

1 reply

With that approach you could no longer look at a single code point and decide if it's e.g. a space. You would always have to look back at the previous code point to see if you are now in the emoji set. That would bring its own set of issues for tools like grep.

But what if instead of emojis we take the CJK set and make it more compositional? Instead of >100k characters with different glyphs we could have defined a number of brush stroke characters and compositional characters (like "three of the previous character in a triangle formation"). We could still make distinct code points for the most common couple thousand characters, just like ä can be encoded as one code point or two (umlaut dots plus a).

Alas, in the 90s this would have been seen as too much complexity

nathanhammond

4 months ago

Seeing your handle I am at risk of explaining something you may already know, but, this exists! And it was standardized in 1993, though I don't know when Unicode picked it up.

Ideographic Description Characters: https://www.unicode.org/charts/PDF/U2FF0.pdf

The fine people over at Wenlin actually have a renderer that generates characters based on this sort of programmatic definition, their Character Description Language: https://guide.wenlininstitute.org/wenlin4.3/Character_Descri... ... in many cases, they are the first digital renderer for new characters that don't yet have font support.

Another interesting bit, the Cantonese linguist community I regularly interface with generally doesn't mind unification. It's treated the same as a "single-storey a" (the one you write by hand) and a "two-storey a" (the one in this font). Sinitic languages fractured into families in part because the graphemes don't explicitly encode the phonetics + physical distance, and the graphemes themselves fractured because somebody's uncle had terrible handwriting.

I'm in Hong Kong, so we use 説 (8AAC, normalized to 8AAA) while Taiwan would use 說 (8AAA). This is a case my linguist friends consider a mistake, but it happened early enough that it was only retroactively normalized. Same word, same meaning, grapheme distinct by regional divergence. (I think we actually have three codepoints that normalize to 8AAA because of radical variations.)

The argument basically reduces to "should we encode distinct graphemes, or distinct meanings?" Unicode has never been fully consistent on either side of that. The latest example: we're getting ready to do Seal Script as a separate non-unified code point. https://www.unicode.org/roadmaps/tip/

In Hong Kong, some old government files just don't work unless you have the font that has the specific author's Private Use Area mapping (or happen to know the source encoding and can re-encode it). I've regularly had to pull up old Windows in a VM to grab data about old code pages.

In short: it's a beautiful mess.

daneel_w

4 months ago

1 reply

I entirely agree that we could've cared better for the leading 16 bit space. But protocol-wise adding a second component (images) to the concept of textual strings would've been a terrible choice.

The grand crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification, where we already had a whopping 1.1 million code points at our disposal.

duskwuff

4 months ago

> The grand crime was that we squandered the space we were given by placing emojis outside the UTF-8 specification

I'm not sure what you mean by this. The UTF-8 specification was written long before emoji were included in Unicode, and generally has no bearing on what characters it's used to encode.

jasonwatkinspdx

4 months ago

1 reply

You are mistaken. Chinese Hanzi and the languages that derive from or incorporate them require way more than 65,536 code points. In particular a lot of these characters are formal family or place names. UCS-2 failed because it couldn't represent these, and people using these languages justifiably objected to having to change how their family name is written to suit computers, vs computers handling it properly.

This "two bytes should be enough" mistake was one of the biggest blind spots in Unicode's original design, and is cited as an example of how standards groups can have cultural blind spots.

duskwuff

4 months ago

UTF-16 also had a bunch of unfortunate ramifications on the overall design of Unicode, e.g. requiring a substantial chunk of BMP to be reserved for surrogate characters and forcing Unicode codepoints to be limited to U+10FFFF.
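
The U+10FFFF ceiling falls straight out of the surrogate arithmetic: two 10-bit halves on top of the BMP. A sketch, with helper names of my own choosing:

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    offset = cp - 0x10000  # 20 bits remain
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Combine a high (D800..DBFF) and low (DC00..DFFF) surrogate."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogates(0x1F600)])
print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # the largest possible pair
```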

gred

4 months ago

> My understanding is that codepoints beyond 65,536 are only used for languages that are no longer in use, and emojis

This week's Unicode 17 announcement [1] mentions that of the ~160k existing codepoints, over 100k are CJK codepoints, so I don't think this can be true...

[1] https://blog.unicode.org/2025/09/unicode-170-release-announc...

duskwuff

4 months ago

Your understanding is incorrect; a substantial number of the ranges allocated outside BMP (i.e. above U+FFFF) are used for CJK ideographs which are uncommon, but still in use, particularly in names and/or historical texts.

KerrAvon

4 months ago

NeXTstep was also UTF-16 through OpenStep 4.0, IIRC. Apple was later able to fix this because the string abstraction in the standard library was complete enough no one actually needed to care about the internal representation, but the API still retains some of the UTF-16-specific weirdnesses.

wongarsu

4 months ago

Yeah, Java and Windows NT 3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added Unicode back when it was 16-bit and the need for something like UTF-8 was less clear.

hyperman1

4 months ago

7 replies

One thing I always wonder: It is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest encoding is valid. E.g. 00000001 is the same as 11000000 10000001.

So why not make the alternatives impossible by adding the start of the last valid option? So 11000000 10000001 would give codepoint 128+1 as values 0 to 127 are already covered by a 1 byte sequence.

The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?

UPDATE: Last bitsequence should of course be 10000001 and not 00000001. Sorry for that. Fixed it.

rhet0rica

4 months ago

1 reply

See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.

rhet0rica

4 months ago

Addendum: This was posted to the front page today: https://doc.cat-v.org/bell_labs/utf-8_history

It also notes that UTF-8 protects against the dangers of NUL and '/' appearing in filenames, which would kill C strings and DOS path handling, respectively.
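
That property follows from every byte of a multi-byte sequence having its high bit set, which can be spot-checked in Python (`high_bytes_only` is a made-up helper, and this samples the code space, it is not a proof):

```python
def high_bytes_only(cp: int) -> bool:
    """True if every byte of the UTF-8 encoding of `cp` is >= 0x80."""
    return all(b >= 0x80 for b in chr(cp).encode("utf-8"))

# Sample the non-ASCII range, skipping surrogates (not encodable in UTF-8).
sample = (cp for cp in range(0x80, 0x110000, 97) if not 0xD800 <= cp <= 0xDFFF)
print(all(high_bytes_only(cp) for cp in sample))
```

So a byte equal to 0x00 or 0x2F in well-formed UTF-8 can only ever be a real NUL or a real '/'.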

toast0

4 months ago

2 replies

The siblings so far talk about the synchronizing nature of the indicators, but that's not relevant to your question. Your question is more of

Why is U+0080 encoded as c2 80, instead of c0 80, which is the lowest sequence after 7f?

I suspect the answer is

a) the security impacts of overlong encodings were not contemplated; lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings

b) utf-8 as standardized allows for encode and decode with bitmask and bitshift only. Your proposed encoding requires bitmask and bitshift, in addition to addition and subtraction

You can find a bit of email discussion from 1992 here [1] ... at the very bottom there's some notes about what became utf-8:

> 1. The 2 byte sequence has 2^11 codes, yet only 2^11-2^7 are allowed. The codes in the range 0-7f are illegal. I think this is preferable to a pile of magic additive constants for no real benefit. Similar comment applies to all of the longer sequences.

The included FSS-UTF that's right before the note does include additive constants.

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
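
To make point b) concrete, here is a sketch of two-byte decoding both ways. `decode2_standard` follows UTF-8 as standardized; `decode2_offset` is the hypothetical scheme from the question, not anything that shipped:

```python
def decode2_standard(b0: int, b1: int) -> int:
    """Standard UTF-8: payload bits only; results below 0x80 are overlong and invalid."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)

def decode2_offset(b0: int, b1: int) -> int:
    """Hypothetical variant: two-byte forms start counting at 0x80."""
    return (((b0 & 0x1F) << 6) | (b1 & 0x3F)) + 0x80

print(hex(decode2_standard(0xC2, 0x80)))  # real UTF-8 for U+0080
print(hex(decode2_offset(0xC0, 0x80)))    # same code point in the hypothetical scheme
```

The offset version needs one "magic additive constant" per sequence length, which is the trade-off the 1992 note is describing.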

layer8

4 months ago

1 reply

A variation of a) is comparing strings as UTF-8 byte sequences if overlong encodings are also accepted (before and/or later). This leads to situations where strings tested as unequal are actually equal in terms of code points.

torstenvl

4 months ago

1 reply

Ehhh I view things slightly differently. Overlong encodings are per se illegal, so they cannot encode code points, even if a naive algorithm would consistently interpret them as such.

I get what you mean, in terms of Postel's Law, e.g., software that is liberal in what it accepts should view 01001000 01100101 01101010 01101010 01101111 as equivalent to 11000001 10001000 11000001 10100101 11000001 10101010 11000001 10101010 11000001 10101111, despite the sequence not being byte-for-byte identical. I'm just not convinced Postel's Law should be applied wrt UTF-8 code units.
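
Those two byte sequences can be run through a deliberately lax decoder and a strict one. `naive_decode_pairs` is written badly on purpose for illustration; Python's built-in decoder is the strict one:

```python
plain = bytes([0b01001000, 0b01100101, 0b01101010, 0b01101010, 0b01101111])
overlong = bytes([0b11000001, 0b10001000, 0b11000001, 0b10100101,
                  0b11000001, 0b10101010, 0b11000001, 0b10101010,
                  0b11000001, 0b10101111])

def naive_decode_pairs(data: bytes) -> str:
    """Deliberately lax: treats every byte pair as a 2-byte sequence, no validity checks."""
    return "".join(chr(((data[i] & 0x1F) << 6) | (data[i + 1] & 0x3F))
                   for i in range(0, len(data), 2))

print(naive_decode_pairs(overlong) == plain.decode("utf-8"))  # lax decoder sees the same text

try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoder rejects it:", exc.reason)
```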

layer8

4 months ago

The context of my comment was (emphasis mine): “lots of fun to be had there if something accepts overlong encodings but is scanning for things with only shortest encodings”.

Yes, software shouldn’t accept overlong encodings, and I was pointing out another bad thing that can happen with software that does accept overlong encodings, thereby reinforcing the advice to not accept them.

hyperman1

4 months ago

Oops yeah. One of my bit sequences is of course wrong and seems to have derailed this discussion. Sorry for that. Your interpretation is correct.

I've seen the first part of that mail, but your version is a lot longer. It is indeed quite convincing in declaring b) moot. And security was not that big of a thing then as it is now, so you're probably right.

rightbyte

4 months ago

I think that would garble random access?

nostrademons

4 months ago

I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in]. Without that property, UTF-8 loses self-synchronicity, the property that given a truncated UTF-8 stream, you can always find the codepoint boundaries, and will lose at most one codepoint's worth rather than having the whole stream be garbled.

In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.
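
Roughly what that looks like, as a sketch that assumes well-formed input and does no validation at all (`utf8_decode_one` is an illustrative name):

```python
def utf8_decode_one(data: bytes) -> tuple[int, int]:
    """Decode the code point at the start of `data`; returns (code point, length)."""
    b0 = data[0]
    if b0 < 0x80:  # 0xxxxxxx
        return b0, 1
    if b0 < 0xE0:  # 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F), 2
    if b0 < 0xF0:  # 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F), 3
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (((b0 & 0x07) << 18) | ((data[1] & 0x3F) << 12)
            | ((data[2] & 0x3F) << 6) | (data[3] & 0x3F)), 4
```

A real decoder also has to reject overlong forms, surrogates, and values above U+10FFFF; the point here is only that the payload reassembly itself is masks, shifts, and ORs.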

gpvos

4 months ago

That would make the calculations more complicated and a little slower. Now you can do a few quick bit shifts. This was more of an issue back in the '90s when UTF-8 was designed and computers were slower.

variadix

4 months ago

https://en.m.wikipedia.org/wiki/Self-synchronizing_code

umanwizard

4 months ago

Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.

rmccue

4 months ago

1 reply

Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/

Would be great if it was possible to enter codepoints directly; you can do it via the URL (`/F8FF` eg), but not in the UI. (Edit, the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)

vishnuharidasAuthor

4 months ago

Thanks for the contribution, this is now merged and live.

alberth

4 months ago

1 reply

I’ve re-read Joel’s article on Unicode so many times. It’s also very helpful.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

mixmastamyk

4 months ago

2 replies

Read that a few times back then as well, but that and other pieces of the day never told you how to actually write a program that supported Unicode. Just facts about it.

So I went around fixing UnicodeErrors in Python at random, for years, despite knowing all that stuff. It wasn't until I read Batchelder's piece on the "Unicode Sandwich," about a decade later that I finally learned how to write a program to support it properly, rather than playing whack-a-mole.
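
As I understand the "sandwich", the idea is bytes on the outside and text on the inside: decode as soon as data comes in, work only with `str`, and encode as late as possible on the way out. A toy sketch, where `shout` and the default encoding are my own assumptions:

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    text = raw.decode(encoding)     # bread: bytes -> str at the input edge
    result = text.upper()           # filling: work only with str
    return result.encode(encoding)  # bread: str -> bytes at the output edge

print(shout("héllo".encode("utf-8")).decode("utf-8"))
```

The part that usually needs care is knowing which encoding the incoming bytes are actually in; the code cannot discover that reliably on its own.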

rmunn

4 months ago

1 reply

> ... Batchelder's piece on the "Unicode Sandwich," ...

Is this the piece you mean? https://nedbatchelder.com/text/unipain.html

mixmastamyk

4 months ago

I think so, looks like it was from a presentation.

mixmastamyk

4 months ago

^Necessary but not sufficient.

librasteve