RFC 9839 and Bad Unicode
Posted 4 months ago · Active 4 months ago
Source: tbray.org · Tech story · High profile
Tone: calm/mixed · Debate: 80/100
Key topics: Unicode, Character Encoding, String Validation
The article discusses RFC 9839 and the challenges of handling 'bad Unicode' characters, sparking a discussion on the complexities of Unicode and the need for careful string validation.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 1h after posting · Peak period: 81 comments in 0-6h · Average: 14.8 comments per period
Based on 148 loaded comments
Key moments
- Story posted: Aug 23, 2025 at 8:54 AM EDT (4 months ago)
- First comment: Aug 23, 2025 at 10:00 AM EDT (1h after posting)
- Peak activity: 81 comments in 0-6h (the hottest window of the conversation)
- Latest activity: Aug 26, 2025 at 3:39 AM EDT (4 months ago)
Unicode validation/cleanup should be done separately because it's needed in multiple places, not just JSON.
All the letters in this string are “just text”:
JSON itself allows putting sequences of escape characters in the string that don't unescape to valid Unicode. That's fine, because the strings aren't required to represent any particular encoding: it's up to a layer higher than JSON to be opinionated about that.

I wouldn't want my shell's pipeline buffers to reject data they don't like, so why should a JSON serializer?
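For instance, a quick Python illustration of that point (using the standard json module; the behavior shown is CPython's): the parser happily accepts an escape that decodes to a lone surrogate, leaving any stricter policy to a higher layer.

    import json

    # JSON text containing an escaped lone surrogate: syntactically valid JSON.
    raw = '"half a smiley: \\ud83d"'

    s = json.loads(raw)           # no error; the escape is accepted as-is
    print(repr(s[-1]))            # '\ud83d' -- an unpaired surrogate in the str
    print(json.dumps(s))          # "half a smiley: \ud83d" -- round-trips back out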
I would almost always go for “signaling an error”.
The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.
Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”
On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.
You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “’”.
If you want to define such a key, you can probably still do it, you’ll just have to press it in the opposite order and use backspace if you want to cancel it.
The fact that dead keys happen to be prefix is in principle arbitrary, they could as well be suffix. On physical typewriters, suffix was more customary I think, i.e. you’d backspace over the character you want to accent and type the accent on top of it. To obtain just the accent, you combine it with Space either way.
Why would anyone type like that? Instead of pressing two keys (the accent key followed by the letter key), you'd need to press four (letter, backspace, accent, space bar) for no reason.
* European-style combining characters, as well as precomposed versions for some arbitrary subset of legal combinations, and nothing preventing you from stacking them arbitrarily (as in Zalgo text) or on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)
* Jamo for Hangul that are three pseudo-characters representing the parts of a larger character, that have to be in order (and who knows what you're supposed to do with an invalid jamo sequence)
* Emoji that are produced by applying a "variation selector" to a normal character
* Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance
* Some subset of emoji that can have a skin-tone modifier applied as a direct suffix
* Some other subset of emoji that are formed by combining other emoji, which requires a zero-width-joiner in between (because they'd also be valid separately), which might be rendered as the base components anyway if no joined glyph is available
* National flags that use a pair of abstract characters used to spell a country code; neither can be said to be the base vs the modifier (this lets them say that they never removed or changed the meaning of a "character" while still allowing for countries to change their country codes, national flags or existence status)
* Other flags that use a base flag character, followed by "tag letter" characters that were originally intended for a completely different purpose that never panned out; and also there was temporary disagreement about which base character should be used
* Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences
And surely more that I've forgotten about or not learned about yet.
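(For a feel of how these stack up, here is a rough Python sketch counting the code points behind a few of the constructs above; how they render is up to your fonts, but the code point structure is stable.)

    # Each of these "single" emoji is really several code points under the hood.
    samples = {
        "flag (two regional indicators)":  "\U0001F1FA\U0001F1F8",   # US flag
        "thumbs up + skin tone modifier":  "\U0001F44D\U0001F3FD",
        "heart + variation selector":      "\u2764\uFE0F",
        "family built from ZWJ sequence":  "\U0001F468\u200D\U0001F469\u200D\U0001F467",
        "zalgo-ish stacked combining":     "e\u0301\u0301\u0301",
    }
    for name, s in samples.items():
        print(f"{name}: {len(s)} code points -> {[hex(ord(c)) for c in s]}")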
I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with the existing text corpus out there, but it's a fair observation.
1: https://trojansource.codes/
The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same.
I’m imagining a listing of regex rules for the various gotchas, and then a validation-level use that unions the ones you want.
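Something along these lines, as a hypothetical Python sketch (the rule names and character classes here are invented for illustration, not taken from the RFC):

    import re

    # Hypothetical per-gotcha rules; each pattern matches code points you might reject.
    RULES = {
        "surrogates":     re.compile(r"[\ud800-\udfff]"),
        "c0_controls":    re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]"),  # keeps tab/LF/CR
        "c1_and_del":     re.compile(r"[\x7f-\x9f]"),
        "bidi_overrides": re.compile(r"[\u202a-\u202e\u2066-\u2069]"),
    }

    def violations(text, *rule_names):
        """Return the names of the selected rules that the text violates."""
        return [name for name in rule_names if RULES[name].search(text)]

    print(violations("Ann\u202etxt.exe", "surrogates", "bidi_overrides"))
    # ['bidi_overrides']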
Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters.
I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork - good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. That OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help.
It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.
https://news.ycombinator.com/item?id=44971254
This is the rejected proposal.
https://www.unicode.org/L2/L2003/03215-n2593-umlaut-trema.pd...
If you read the thread above, you will find more examples from other people.
ah, and now I understand what the hell people mean when they put dots on "coordinate"! (but they are obviously wrong; they should use the flying point from Catalan :)
... hm, so this issue is easily more than 20 years old. and since then there's no solution (or the German libraries consider the problem "solved" and ... no one else is making proposals to the WG about this nowadays)?
also, technically - since there are already more than 150K allocated code points - adding a different combining mark seems like the correct thing to do, right?
or it's now universally accepted that people who want to type ambigüité need to remember to type U+034F before the ü? (... or, of course it's up to their editor/typesetter software to offer this distinction)
regarding the Han unification, is there some kind of effort to "fix" that? (adding language-start language-end markers perhaps? or virtual code points for languages to avoid the need for searching strings for the begin-end markers?)
The versioning is actually almost completely backwards by semver reasoning; 1.1 should have been 2.0, 2.0 should have been 3.0 and we should still be on 3.n now (since they have since kept the promise not to remove anything).
In the example, username validation is the job of another layer. For example, I want to make sure the username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail on pre-validation at a completely different layer.
And for usernames some classes are obviously bad, as explained. But what if I send text files that actually use those weird tabs? I expect things that work in my language's utf8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.
On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.
- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"
- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"
- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"
- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
This sequence of characters is a work of art.
Either way, I think the bitter lesson is a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by same parser).
[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...
If you run into this when encoding to UTF-8 then your source data isn't valid Unicode and it depends on what it really is if not proper Unicode. If you can validate at other boundaries then you won't have to deal with it there.
If you don't actively make a choice, then decoding à la WTF-8 comes naturally. Anything else is going to need additional branches.
Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.
Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.
Using those codepoints makes for invalid Unicode, not just invalid UTF-16. Rust, which uses utf-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.
It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.
edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146
I thought WTF-8 was just, "UTF-8, but without the restriction to not encode unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8.
Surrogate pairs were only added with Unicode 2.0 in 1996, at which point Windows NT and Java already existed. The fact that those continue to allow unpaired surrogate characters is in part due to backwards compatibility.
Linux got to adopt UTF-8 because they just stuck their heads in the sand and stayed on ASCII well past the time they needed to change. Even now, a lot of programs only support ASCII character streams.
(Some people instead encode each WTF-16 surrogate independently regardless of whether it participates in a valid pair or not, yielding an UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don’t talk about those people.)
So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and — of course — one can convert between them infallibly.
But you can't encode arbitrary Go strings in WTF-8 (some are not representable), you can't encode arbitrary Python strings in UTF-8 or WTF-8 (n.b. that upthread is wrong about Python being equivalent to Unicode scalars/well-formed UTF-*.) and attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise.)
If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, though with a different representation.)
(Though that would have been cute name for the UTF-8/latin1/UTF-8 fail.)
[1]: https://simonsapin.github.io/wtf-8/
(On review, it appears that the thread mentions much earlier uses...)
In the last few years, the name has become very popular with Simon Sapin’s definition.
https://en.wikipedia.org/wiki/Mojibake
"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.
Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.)
Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.)
Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:
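(The inline snippets from this comment didn't survive the copy; a rough reconstruction of what they presumably demonstrated, assuming recent CPython:)

    s = "\ud83d\ude00"                   # two lone surrogates stored in a str

    # Errors occur at encoding time, including for an explicit UTF-16 encode:
    # s.encode("utf-8")                  # UnicodeEncodeError: surrogates not allowed
    # s.encode("utf-16")                 # UnicodeEncodeError: surrogates not allowed

    # The check can be overridden...
    data = s.encode("utf-16-le", "surrogatepass")

    # ...which subsequently allows decoding that interprets the surrogate pair:
    print(data.decode("utf-16-le"))      # the single code point U+1F600

    # The 'surrogateescape' handler smuggles arbitrary bytes through a str:
    raw = b"caf\xe9"                                   # not valid UTF-8
    smuggled = raw.decode("utf-8", "surrogateescape")  # 'caf\udce9'
    print(smuggled.encode("utf-8", "surrogateescape")) # b'caf\xe9' round-trips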
They're even disallowed for an explicit encode to UTF-16, but this can be overridden, which subsequently allows decoding that automatically interprets surrogate pairs. Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux). It does this by decoding with the same 'surrogateescape' error handler that the above diagnostic needs when re-encoding.

Which ones, and why? Tim and Paul collectively have around 100,000X the experience with this compared to most people, so it'd be interesting to read substantive criticism.
It seems like you think this standard is JSON-specific?
OK, but where does it get decided what even counts as a character? Should that be in the same layer? Even within a single system, there may be different sensible answers to that.
I was responding to the parent's empty sniping as gently as I could, but the answer to your (good) question has nothing to do with this RFC specifically. It's something that people doing sanitization/validation/serialization have had to learn.
The answer to your question is that you make decisions like this as a policy in your business layer/domain, and then you enforce it (consistently) in multiple places. For example, usernames might be limited to lowercase letters, numbers, and dashes so they're stable for identity and routing, while display names generally have fewer limitations so people can use accented characters or scripts from different languages. The rules live in the business/domain layer, and then you use libraries to enforce them everywhere (your API, your database, your UI, etc.).
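As a hypothetical sketch of that split in Python (the specific rules are invented for illustration, not prescribed by the RFC or by PRECIS):

    import re
    import unicodedata

    # Domain policy: usernames are deliberately boring, stable for identity and routing.
    USERNAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{2,29}")

    def valid_username(name):
        return bool(USERNAME_RE.fullmatch(name))

    # Display names are looser: normalize, then reject only control/surrogate/unassigned junk.
    def clean_display_name(name):
        name = unicodedata.normalize("NFC", name).strip()
        if any(unicodedata.category(c) in ("Cc", "Cs", "Cn") for c in name):
            raise ValueError("display name contains disallowed code points")
        return name

    print(valid_username("tim-bray"))          # True
    print(valid_username("Tim Bray"))          # False
    print(clean_display_name("Tim Bray"))      # 'Tim Bray'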
Yes, I do realize that there are a lot of text markup formats that encode into plain text, for better interoperability.
It is (or, at least, used to be) common to have FF characters in plain text files, as a signal for your (dot matrix) printer to advance to the next page. So I'd add at least FF to that list.
https://terminals-wiki.org/wiki/index.php/DEC_VT240
https://www.1000bit.it/ad/bro/digital/DECVT240.pdf
If I had to make a specific choice I would probably whitelist the most common emojis for some definition and allow those
I mean, I don't even know whether my keyboard software is consistent in UTF-8 for the exact same intended visual representation outside of the ASCII range, let alone across different operating systems and configurations, or over time. Or, vice versa, whether the binary I leave behind will consistently correspond, over time, to future Unicode-interpreting AIs.
... speaking of consistency, neither the article nor RFC 9839 mentions IVS situations or the NFC/NFD/NFKC/NFKD normalization problem as explicitly in or out of scope. Overall it feels like this RFC is missing an entire "Purpose" section, except for a vague notion of there being non-character code points.
Unicode is that bad.
For usernames, I think your point is valid; you might restrict usernames to a subset of ASCII (not arbitrary ASCII; e.g. you might disallow spaces and some punctuation), or use numeric user IDs, while the display name might be less restricted. (In some cases (probably uncommon) you might also use a different character set than ASCII if that is desirable for your application, but Unicode is not a good way to do it.)
(I also think that Unicode is not good; it is helpful for many applications to have i18n (although you should be aware what parts should use it and what shouldn't), but Unicode is not a good way to do it.)
That would be reasonable if there were a strict 1:1 correspondence between intended text and binary representations, but there isn't. Unicode has equivalents of British and American spellings, and users have no control over which gets used. Precomposed vs combining characters, variation selectors, etc. Making it the developer's obligation to ensure it all normalizes into a canonical password string is unreasonable, and just falling back to ASCII is much more reasonable.
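To make the precomposed-vs-combining point concrete, a quick check with Python's standard unicodedata module:

    import unicodedata

    precomposed = "caf\u00e9"      # 'café' using U+00E9
    combining   = "cafe\u0301"     # 'café' using 'e' + U+0301 COMBINING ACUTE ACCENT

    print(precomposed == combining)                   # False: different code points
    print(precomposed.encode("utf-8"))                # b'caf\xc3\xa9'
    print(combining.encode("utf-8"))                  # b'cafe\xcc\x81'
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True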
I guess everyone using alphanumeric sequences for every identifier is somewhat imperialistic in a sense, but it's close to the least controversial of general cultural imperialism problems. It's probably okay to leave it to be solved for a century or two.
This seems like an extremely sheltered person’s point of view. I’m sure the worst case scenario involves a software defect in the parser or whatever and some kind of terrible security breach…
It's not about JSON, or the web, those are just example vehicles for the discussion. The RFC is completely agnostic about what thing the protocols or data formats are intended for, as long as they're text based, and specifically unicode text based.
So it sounds like you misread the blog post, and what you should do now is read the RFC. It's short. You can cruise through https://www.rfc-editor.org/rfc/rfc9839.html in a few minutes and see it's not actually about what you're focusing on.
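For a sense of what's in it, here is a rough Python sketch of a check along the lines of the RFC's most restrictive subset (excluding surrogates, legacy controls other than tab/LF/CR, and noncharacters); treat the RFC's own ABNF as authoritative, not this.

    def is_assignable(cp):
        """Rough approximation of RFC 9839's most restrictive subset; see the RFC's ABNF."""
        if cp in (0x09, 0x0A, 0x0D):              # the useful controls stay
            return True
        if cp < 0x20 or 0x7F <= cp <= 0x9F:       # other C0 controls, DEL, C1 controls
            return False
        if 0xD800 <= cp <= 0xDFFF:                # surrogates
            return False
        if 0xFDD0 <= cp <= 0xFDEF:                # noncharacter block
            return False
        if cp & 0xFFFE == 0xFFFE:                 # U+xxFFFE / U+xxFFFF in every plane
            return False
        return cp <= 0x10FFFF

    def problematic(text):
        return [f"U+{ord(c):04X}" for c in text if not is_assignable(ord(c))]

    print(problematic("ok\ttext"))                 # []
    print(problematic("bad\x00\udead\ufdd0"))      # ['U+0000', 'U+DEAD', 'U+FDD0']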
Usernames are a bad example, because at the point you mention, you may as well only allow a subset of visible ASCII. Which a lot of sites do, and that works perfectly fine.
But for stuff like family names you have to restrict so many things, otherwise you'll have little-bobby-zalgo-with-hangul-modifiers wreaking havoc.
Unicode is the problem. And workarounds are sadly needed due to the clusterfuck that Unicode is.
Like TFA shows. Like any single homographic attack using Unicode characters shows.
If Unicode was good, it wouldn't regularly be frontpage of HN.
I’d also suggest people check out the accompanying RFCs 8265 and 8266:
PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols:
— https://www.rfc-editor.org/rfc/rfc8264
Preparation, Enforcement, and Comparison of Internationalized Strings: Representing Usernames and Passwords
— https://www.rfc-editor.org/rfc/rfc8265
Preparation, Enforcement, and Comparison of Internationalized Strings Representing Nicknames:
— https://www.rfc-editor.org/rfc/rfc8266
Generally speaking, you don’t want usernames being displayed that can change the text direction, or passwords that have different byte representations depending on the device that was used to type it in. These RFCs have specific profiles to avoid that.
I think for these kinds of purposes, failing closed is more secure than failing open. I’d rather disallow whatever the latest emoji to hit the streets is from usernames than potentially allow it to screw up every page that displays usernames.
The problem with failing open can manifest tomorrow, and the outcome can cause your site to become unreadable.
E.g. Annexe.txt (that you might assume would be safely opened by a text editor) could actually be Ann\u202Etxt.exe, a dangerous executable.
Another comment linked to this:
https://trojansource.codes
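The effect is trivial to reproduce; in Python, for instance:

    # U+202E (RIGHT-TO-LEFT OVERRIDE) makes a bidi-aware renderer display the
    # trailing "txt.exe" reversed, so the string tends to show up as "Annexe.txt".
    name = "Ann\u202etxt.exe"
    print(name.endswith(".exe"))             # True: it really is an .exe name
    print([hex(ord(c)) for c in name[:4]])   # the override sits right there in the data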
And some of the codepoints, such as the surrogate codepoints (which MUST come in pairs in properly encoded UTF-16), may not break your code but break poorly-written spaghetti-ridden UTF-16-based hellholes that do not expect unpaired surrogates.
Something like:
1. You send a UTF-8 string containing normal characters and an unpaired surrogate: "Hello \uDEADworld" to FooApp.
2. FooApp converts the UTF-8 string to UTF-16 and saves it in a file. All without validation, so no crashes will actually occur; worst case scenario, the unpaired surrogate is rendered by the frontend as "�".
3. Next time, when it reads the file again, this time it is expecting normal UTF-16, and it crashes because of the unpaired surrogate.
(A more fatal failure mode of (3) is out-of-bounds memory read if the unpaired surrogate happens at the end of string)
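A compressed Python version of that failure mode, assuming a writer sloppy enough to force the string through with 'surrogatepass' and a strict reader later:

    s = "Hello \uDEADworld"                   # unpaired surrogate smuggled into a str

    # Step 2: the writer converts to UTF-16 without validation.
    data = s.encode("utf-16-le", "surrogatepass")

    # Step 3: a strict reader later chokes on the very same bytes.
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError as e:
        print("boom:", e)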
what bad/dangerous things does this code catch that `unicode.IsPrint` is not catching?
or, the other way around: what good/useful things does `unicode.IsPrint` remove that this code keeps?
> IsPrint == .. categories L, M, N, P, S and the ASCII space character.
how does that compare to this standard (RFC 9839)? (don't mind that this is Go; just consider the same Unicode categories).
The list of code points which are problematic (non-printing, etc.) is IMO much more useful and nontrivial. But it'd be useful to treat those as a separate concept from plain-old illegal UTF-8 byte sequences.
No they shouldn't because that's how you get file managers that can't manage files.
If you have a type that says “my contents are valid UTF-8”, then you should reject invalid UTF-8 when populating it, obviously. Why would it work any other way? If you need a type that can hold arbitrary byte sequences, use a type that can hold arbitrary byte sequences.
It's also not what happens in practice. File managers that cannot rename or delete some files, because they are unnecessarily "smart" about interpreting the strings passed to them, are very much how things have worked out in reality.
Defining a subset of unicode to accept does not obviate the need to check that values conform to type definitions.
which does away with control characters and special positioning garbage. and doesn't consider unknown (or missing) surrogates as valid.
why do I want very specialized text tabulation and positioning chars in my string?
it's as if they tried to solve text encoding AND solve CSV and pagination and desktop publishing all in one go.
Also, where are the test vectors? Because when I implement this, that's the first thing I have to write, and you could save me a lot of work here. Bonus points if it's in JSON and UTF-8 already, though the invalid UTF-8 in an RFC might really gum things up: hex encode maybe?
How does this help me check my implementation? I guess I could ask ChatGPT to convert your tests to my code, but that seems the long way around.
I don't know rust at all but I can pretty quickly understand:
I also think that, regardless of the character set, what to include (control characters, graphic characters, maximum length, etc) will have to depend on the specific application anyways, so trying to include/exclude in JSON doesn't work as well.
Giving a name to a specific subset (or sometimes a superset, but usually subset) of Unicode (or any other character sets, such as ASCII or ISO 2022 or TRON code) can be useful, but don't assume it is good for everyone or even is good for most things, because it isn't.
RFC 9839 does give names to some subsets of Unicode, which may sometimes be useful, but you should not automatically assume that it is right for what you will be making. My opinion is to consider not using or requiring Unicode.
https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
e.g. in Python,
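(the snippet didn't survive the copy; presumably something like this, using the standard unicodedata module:)

    import unicodedata

    print(unicodedata.category("\x00"))      # 'Cc' (a C0 control)
    print(unicodedata.category("\udc80"))    # 'Cs' (a surrogate)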
Shows "Cc" (control) and "Cs" (surrogate).

Last time I checked (a couple of years ago admittedly) there was no such restriction in the standard. There was however a recommendation to restrict a graphical unit to 128 bytes for "streaming applications".
Bringing this or at least a limit on the scalar units into the standard would make implementation and processing so much easier without restricting sensible applications.
This part encourages more active usage of U+0000, so that programmers of certain programming languages get the message that they are not welcome.
If you aren't doing something useful with the text, you're best off passing a byte-sequence through unchanged. Unfortunately, Microsoft Windows exists, so sometimes you have to pass `char16_t` sequences through instead.
The worst part about UTF-16 is that invalid UTF-16 is fundamentally different than invalid UTF-8. When translating between them (really: when transforming external data into an internal form for processing), the former can use WTF-8 whereas the latter can use Python-style surrogateescape, but you can't mix these.
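A small Python illustration of why the two can't be mixed: surrogateescape smuggles raw bytes as low surrogates, which are exactly the code points WTF-8 needs in order to represent ill-formed UTF-16, so once both are in play you can no longer tell which was which.

    # An undecodable byte coming from "invalid UTF-8" land...
    from_bad_utf8 = b"\xff".decode("utf-8", "surrogateescape")   # '\udcff'

    # ...and a genuine unpaired surrogate coming from "invalid UTF-16" land.
    from_bad_utf16 = "\udcff"

    print(from_bad_utf8 == from_bad_utf16)   # True: the distinction is already gone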