Rfc 9839 and Bad Unicode

Posted5 months agoActive5 months ago

Bogdanp

268 points

148 comments

tbray.orgTechstoryHigh profile

calmmixed

Debate

80/100

UnicodeCharacter EncodingString Validation

Key topics

Unicode

Character Encoding

String Validation

The article discusses RFC 9839 and the challenges of handling 'bad Unicode' characters, sparking a discussion on the complexities of Unicode and the need for careful string validation.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

0-6h

Avg / period

14.8

Comment distribution148 data points

Loading chart...

Based on 148 loaded comments

Key moments

01Story posted
Aug 23, 2025 at 8:54 AM EDT
5 months ago
Step 01
02First comment
Aug 23, 2025 at 10:00 AM EDT
1h after posting
Step 02
03Peak activity
81 comments in 0-6h
Hottest window of the conversation
Step 03
04Latest activity
Aug 26, 2025 at 3:39 AM EDT
5 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (148 comments)

Showing 148 comments

ape4

5 months ago

4 replies

Seems like libraries that serialize to JSON should have an option to filter out these bad characters.

CharlesW

5 months ago

1 reply

This RFC and Go-language reference library is designed to be used by existing libraries that do serialization/sanitation/validation. This is hot off the press, so I'm sure Tim would appreciate it if you'd let your favorite library know it exists.

nikolayasdf123

5 months ago

interesting. isn't in Go it is just `unicode.IsPrint(r rune)`? https://pkg.go.dev/unicode#IsPrint

xdennis

5 months ago

3 replies

How is Unicode in any way related to JSON? JSON should just encode whatever dumb data someone wants to transport.

Unicode validation/cleanup should be done separately because it's needed in multiple places, not just JSON.

recursive

5 months ago

2 replies

JSON is text. If you're not going to use unicode in the representation of your text, you'll need some other way.

ninkendo

5 months ago

1 reply

So?

All the letters in this string are “just text”:

    "\u0000\u0089\uDEAD\uD9BF\uDFFF"

JSON itself allows putting sequences of escape characters in the string that don’t unescape to valid Unicode. That’s fine, because the strings aren’t required to represent any particular encoding: it’s up to a layer higher than JSON to be opinionated about that.

I wouldn’t want my shell’s pipeline buffers to reject data it doesn’t like, why should a JSON serializer?

recursive

5 months ago

I actually agree, now that I understand what you're talking about.

dcrazy

5 months ago

The current JSON spec mandates UTF-8, but practically speaking encoding is a higher-level concept. I suspect there are many server implementations that will respect the Content-Encoding header in a POST request containing JSON.

zzo38computer

5 months ago

JSON (unfortunately) requires strings to be Unicode. (JSON has other problems too, but Unicode is one of them.)

layer8

5 months ago

The contents of JSON strings doesn’t admit random binary data. You need to use an encoding like Base64 for that purpose.

Manfred

5 months ago

My experience writing Unicode related libraries is that people don't use features when you have to explain why and when to use them. I assume that's why Tim puts the emphasis on "working on something new".

layer8

5 months ago

No. As the RFC notes: “Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, Section 3.2 of [UNICODE] recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).”

I would almost always go for “signaling an error”.

Waterluvian

5 months ago

4 replies

I’m frustrated by things like Unicode where it’s “good” except… you need to know to exclude some of them. Unicode feels like a wild jungle of complexity. An understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others.

The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.

Etheryte

5 months ago

1 reply

As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff.

kps

5 months ago

2 replies

Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.

layer8

5 months ago

1 reply

One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters.

Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”

kps

5 months ago

2 replies

Sorting is a good point.

On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.

layer8

5 months ago

1 reply

Keyboard input handling at that level generally isn’t character-based, and instead requires looking at scancodes and modifier keys, and sometimes also distinguishing between keyup and keydown events.

You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “’”.

kps

5 months ago

1 reply

Depends on the system. X11/Wayland do it at a higher level where you have `<dead_acute> <e> : eacute` and keysyms are effectively a superset of Unicode with prefix combiners. (This can lead to weirdness since the choice of Compose rules is orthogonal to the choice of keyboard layout.)

layer8

5 months ago

1 reply

I guess your conception is that one could then define

    <dead_acute> : <combining_acute_accent>

instead and use it for arbitrary letters. However, that would fail in locales using a non-Unicode encoding such as iso-8859-1 that only contain the combined character. Unless you have the input system post-process the mapped input again to normalize it to e.g. NFC before passing it on to the application, in which case the combination has to be reparsed anyway. So I don’t see what would be gained with regard to ease of parsing.

If you want to define such a key, you can probably still do it, you’ll just have to press it in the opposite order and use backspace if you want to cancel it.

The fact that dead keys happen to be prefix is in principle arbitrary, they could as well be suffix. On physical typewriters, suffix was more customary I think, i.e. you’d backspace over the character you want to accent and type the accent on top of it. To obtain just the accent, you combine it with Space either way.

moefh

5 months ago

1 reply

> On physical typewriters, suffix was more customary I think

Why would anyone type like that? Instead of pressing two keys (the accent key followed by the letter key), you'd need to press four (letter, backspace, accent, space bar) for no reason.

kps

5 months ago

The mechanical typewriter dead key worked by omitting the linkage that advances the carriage. That established the method of pressing the dead key and then the accompanying letter.

dcrazy

5 months ago

Not all input methods use dead keys to emit combining characters.

zahlman

5 months ago

They should have at least all used a single system. Instead, we have:

* European-style combining characters, as well as precomposed versions for some arbitrary subset of legal combinations, and nothing preventing you from stacking them arbitrarily (as in Zalgo text) or on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)

* Jamo for Hangul that are three pseudo-characters representing the parts of a larger character, that have to be in order (and who knows what you're supposed to do with an invalid jamo sequence)

* Emoji that are produced by applying a "variation selector" to a normal character

* Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance

* Some subset of emoji that can have a skin-tone modifier applied as a direct suffix

* Some other subset of emoji that are formed by combining other emoji, which requires a zero-width-joiner in between (because they'd also be valid separately), which might be rendered as the base components anyway if no joined glyph is available

* National flags that use a pair of abstract characters used to spell a country code; neither can be said to be the base vs the modifier (this lets them say that they never removed or changed the meaning of a "character" while still allowing for countries to change their country codes, national flags or existence status)

* Other flags that use a base flag character, followed by "tag letter" characters that were originally intended for a completely different purpose that never panned out; and also there was temporary disagreement about which base character should be used

* Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences

And surely more that I've forgotten about or not learned about yet.

csande17

5 months ago

3 replies

Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".

estebank

5 months ago

1 reply

The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1]

I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with existing text corpus out there, but a fair observation.

1: https://trojansource.codes/

yencabulator

5 months ago

1 reply

The enforcement is an app-level issue, depending on the semantics of the field. I agree it doesn't belong in the low-level transport protocol.

The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same.

Waterluvian

5 months ago

1 reply

Can a regular expression be used to restrict Unicode chars like the ones described?

I’m imagining a listing of regex rules for the various gotchas, and then a validation-level use that unions the ones you want.

fluoridation

5 months ago

1 reply

Why would you need a regular expression for that? It's just a list of characters.

Waterluvian

5 months ago

There’s cases where certain characters coming before or after others is what creates the issue.

arp242

5 months ago

2 replies

I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse).

dcrazy

5 months ago

1 reply

This is the job of the Bidi Algorithm: https://www.unicode.org/reports/tr9/

Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters.

acdha

5 months ago

Over the years I’ve run into a few situations where the rules around neutral characters didn’t produce the right result and so we had to use the override characters to force the correct display. It’s completely a niche but very handy when you are mixing quotes within a complex text.

layer8

5 months ago

Read the parent’s link. The characters “to be avoided” are a particular special-purpose subset, not directional control characters in general.

weinzierl

5 months ago

3 replies

I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so.

I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork - good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. That OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help.

It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.

socalgal2

5 months ago

2 replies

What could be better? Human languages are complex

weinzierl

5 months ago

1 reply

Yes, exactly, human languages are complex and in my opinion Unicode used to be on a good track to tackle these complexities. I just think that nowadays they are not doing enough to help people around the world solving these problems.

pas

5 months ago

1 reply

can you describe a few examples? what are you missing? or maybe are you aware of something they rejected that would be useful?

weinzierl

5 months ago

1 reply

The elephant in the room is Han Unification but there are plenty of other issues. Here is one of my favourites from another thread just two days ago.

https://news.ycombinator.com/item?id=44971254

This is the rejected proposal.

https://www.unicode.org/L2/L2003/03215-n2593-umlaut-trema.pd...

If you read thread from above you will find more examples from other people.

pas

5 months ago

thanks! very interesting!

ah, and now I understand what the hell people mean when they put dots on coordinate! (but they are obviously wrong they should use the flying point from Catalan :)

... hm, so this issue is easily more than 20 years old. and since then there's no solution (or the German libraries consider the problem "solved" and ... no one else is making proposals to the WG about this nowadays)?

also, technically - since there are already more than 150K allocated code points - adding a different combining mark seems the correct way to do, right?

or it's now universally accepted that people who want to type ambigüité need to remember to type U+034F before the ü? (... or, of course it's up to their editor/typesetter software to offer this distinction)

regarding the Han unification, is there some kind of effort to "fix" that? (adding language-start language-end markers perhaps? or virtual code points for languages to avoid the need for searching strings for the being-end markers?)

pas

5 months ago

sure, but they have both human and machine stuff in the same "universe" - again, sure, it made sense, but maybe it would make sense to have a parser that helps to recover "human stuff" from "machine gibberish" (ie. filter out the presentation and control stuff), but, but, of course some in-band logic makes sense, after all, for the combinations (diacritics, emoji skin color, and so on).

zahlman

5 months ago

A big part of the problem is that the reaction to early updates was so bad that they promised they would never un-assign or re-assign a code point ever again, making it impossible for them to actually correct any mistakes (not even typos in the official standard names given to characters).

The versioning is actually almost completely backwards by semver reasoning; 1.1 should have been 2.0, 2.0 should have been 3.0 and we should still be on 3.n now (since they have since kept the promise not to remove anything).

5 months ago

I would. The original sin of Unicode is really their manifold idea, at that point they stopped trying to write a string standard and started to become a kinda general description of how string standards should look like and hopefully string standards that more or less conform to this description are interoperable if you remember which direction "string".decode() and "string".encode() is.

ivanjermakov

5 months ago

Unicode sucks, but it sucks less than every other encoding standard.

eviks

5 months ago

Indeed, though a lot of that complexity like surrogates and control codes aren't due to attempts to write language, that's just awful designs preserved for posterity

integralid

5 months ago

8 replies

I'm not certain... On one hand I agree that some characters are problematic (or invalid) - like unpaired surrogates. But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.

In the example, username validation is a job of another layer. For example I want to make sure username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail on completely different layer pre-validation.

And for username some classes are obviously bad - like explained. But what if I send text files that actually use those weird tabs. I expect things that work in my language utf8 "string" type to be encodable. Even more importantly, I see plenty of use cases for null byte, and it is in fact often seen in JSON in the wild.

On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.

csande17

5 months ago

6 replies

Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:

- "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

- "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"

- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were

stuartjohnson12

5 months ago

1 reply

> "WTF-8", aka "the JavaScript string type"

This sequence of characters is a work of art.

wging

5 months ago

For more details: https://simonsapin.github.io/wtf-8/

dcrazy

5 months ago

2 replies

Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”

Either way, I think the bitter lesson is a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by same parser).

csande17

5 months ago

1 reply

IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceputally -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.

dcrazy

5 months ago

1 reply

The Unicode spec itself is designed around UTF-16: the block of code points that surrogate pairs would map to are reserved for that purpose and explicitly given “no interpretation” by the spec. [1] An implementation has to choose how to behave if it encounters one of these reserved code points in e.g. a UTF-8 string: Throw an encoding error? Silently drop the character? Convert it to an Object Replacement character?

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

duckerude

5 months ago

1 reply

RFC 3629 says surrogate codepoints are not valid in UTF-8. So if you're decoding/validating UTF-8 it's just another kind of invalid byte sequence like a 0xFF byte or an overlong encoding. AFAIK implementations tend to follow this. (You have to make a choice but you'd have to make that choice regardless for the other kinds of error.)

If you run into this when encoding to UTF-8 then your source data isn't valid Unicode and it depends on what it really is if not proper Unicode. If you can validate at other boundaries then you won't have to deal with it there.

account42

5 months ago

> You have to make a choice but you'd have to make that choice regardless for the other kinds of error.

If you don't actively make a choice then decoding al la WTF-8 comes natural. Anything else is going to need additional branches.

layer8

5 months ago

There is no disagreement that what you can receive over the wire can be ill-formed. There is disagreement about what to reject when it is first parsed at a point where it is known that it should be representing a Unicode string.

alright2565

5 months ago

2 replies

> "Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.

csande17

5 months ago

1 reply

I could be mistaken, but I think Python cares about making sure strings don't include any surrogate code points that can't be represented in UTF-16 -- even if you're encoding/decoding the string using some other encoding. (Possibly it still lets you construct such a string in memory, though? So there might be a philosophical dispute there.)

Like, the basic code points -> bytes in memory logic that underlies UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]. But UTF-16 can't because the first sequence is a surrogate pair. So if your language applies the restriction that strings can't contain surrogate code points, it's basically emulating the UTF-16 worldview on top of whatever encoding it uses internally. The set of strings it supports is the same as the set of strings a language that does use well-formed UTF-16 supports, for the purposes of deciding what's allowed to be represented in a wire protocol.

MyOutfitIsVague

5 months ago

1 reply

You're somewhat mistaken, in that "UTF-32, or UTF-8 for that matter, is perfectly capable of representing [U+D83D U+DE00] as a sequence distinct from [U+1F600]." You're right that the encoding on a raw level is technically capable of this, but it is actually forbidden in Unicode. Those are invalid codepoints.

Using those codepoints makes for invalid Unicode, not just invalid UTF-16. Rust, which uses utf-8 for its String type, also forbids unpaired surrogates. `let illegal: char = 0xDEADu32.try_into().unwrap();` panics.

It's not that these languages emulate the UTF-16 worldview, it's that UTF-16 has infected and shaped all of Unicode. No code points are allowed that can't be unambiguously represented in UTF-16.

edit: This cousin comment has some really good detail on Python in particular: https://news.ycombinator.com/item?id=44997146

csande17

5 months ago

The Unicode Consortium has indeed published documents recommending that people adopt the UTF-16 worldview when working with strings, but it is not always a good idea to follow their recommendations.

zahlman

5 months ago

You're not wrong; I gave more detail in a direct reply https://news.ycombinator.com/item?id=44997146 .

mort96

5 months ago

4 replies

> - "Potentially ill-formed UTF-16", aka "WTF-8", aka "the JavaScript string type"

I thought WTF-8 was just, "UTf-8, but without the restriction to not encode unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8.

layer8

5 months ago

2 replies

Also known as UCS-2: https://www.unicode.org/faq/utf_bom.html#utf16-11

Surrogate pairs were only added with Unicode 2.0 in 1996, at which point Windows NT and Java already existed. The fact that those continue to allow unpaired surrogate characters is in parts due to backwards compatibility.

account42

5 months ago

No, UCS-2 decoding would convert all surrogates into individual code points but this isn't how "WTF-16" systems like Windows behave - paired surrogates get decoded into a combined code point.

da_chicken

5 months ago

Yeah, people forget that Windows and Java appear to be less compliant, but the reality is that they moved on i18n before anybody else did so their standard is older.

Linux got to adopt UTF-8 because the just stuck their head in the sand and stayed on ASCII well past the time they needed to change. Even now, a lot of programs only support ASCII character streams.

mananaysiempre

5 months ago

1 reply

WTF-8 is more or less the obvious thing to use when NT/Java/JavaScript-style WTF-16 needs to fit into a UTF-8-shaped hole. And yes, it’s UTF-8 except you can encode surrogates except those surrogates can’t form a valid pair (use the normal UTF-8 encoding of the codepoint designated by that pair in that case).

(Some people instead encode each WTF-16 surrogate independently regardless of whether it participates in a valid pair or not, yielding an UTF-8-like but UTF-8-incompatible-beyond-U+FFFF thing usually called CESU-8. We don’t talk about those people.)

layer8

5 months ago

1 reply

The parent’s point was that “potentially ill-formed UTF-16" and "WTF-8" are inherently different encodings (16-bit word sequence vs. byte sequence), and thus not “aka”.

csande17

5 months ago

1 reply

Although they're different encodings, the thing that they are encoding is exactly the same. I kinda wish I could edit "string representation" to "modeling valid strings" or something in my original comment for clarity...

layer8

5 months ago

1 reply

By that logic, you could say ‘“UTF-8” aka “UTF-32”’, since they are encoding the same value space. But that’s just wrong.

deathanatos

5 months ago

The type is the same, i.e., if you look at a type as an infinite set of values, they are the same infinite set. Yes, their in-memory representations might differ, but it means all values in one exist in the other, and only those, so conversion between them are infallible.

So in your last example, UTF-8 & UTF-32 are the same type, containing the same infinite set of values, and — of course — one can convert between them infallibly.

But you can't encode arbitrary Go strings in WTF-8 (some are not representable), you can't encode arbitrary Python strings in UTF-8 or WTF-8 (n.b. that upthread is wrong about Python being equivalent to Unicode scalars/well-formed UTF-*.) and attempts to do so might error. (E.g., `.encode('utf-8')` in Python on a `str` can raise.)

zahlman

5 months ago

2 replies

I've always taken "WTF-8" to mean that someone had mistakenly interpreted UTF-8 data as being in Latin-1 (or some other code page) and UTF-8 encoded it again.

deathanatos

5 months ago

1 reply

No, WTF-8[1] is a precisely defined format (that isn't that).

If you imagine a format that can encode JavaScript strings containing unpaired surrogates, that's WTF-8. (Well-formed WTF-8 is the same type as a JS string, through with a different representation.)

(Though that would have been cute name for the UTF-8/latin1/UTF-8 fail.)

[1]: https://simonsapin.github.io/wtf-8/

Izkata

5 months ago

1 reply

GP is right about the original meaning, author of that page acknowledges hijacking it here: https://news.ycombinator.com/item?id=9611710

zahlman

5 months ago

1 reply

When I posted that, I was honestly projecting from my own use. I think I may have independently thought of the term on Stack Overflow prior to koalie's tweet, but it's not the easiest thing (by design) to search for comments there (and that's assuming they don't get deleted, which they usually should).

(On review, it appears that the thread mentions much earlier uses...)

Izkata

5 months ago

I did the search because I have a similar memory, I'd place it in the early 2000s before StackOverflow existed, around when people were first switching from latin1 and Windows-1251 and others to UTF-8 on the web and browsers would often pick the wrong encoding, and IE had a submenu where you could tell it which one to use on the page. WTF-8 was a thing because occasionally none of these options would work, because the layers server-side would be misconfigured and cause the double (or more, if it involved user input) encoding. It was also used just in general to complain about UTF-8 breaking everything as it was slowly being introduced.

chrismorgan

5 months ago

2 replies

That thing was occasionally called WTF-8, but not often—it was normally called “double UTF-8” (if given a name at all).

In the last few years, the name has become very popular with Simon Sapin’s definition.

jibal

5 months ago

1 reply

"if given a name at all"

https://en.wikipedia.org/wiki/Mojibake

zahlman

5 months ago

This describes a broader concept.

LocalH

5 months ago

Say "double UTF-8" out loud ;)

account42

5 months ago

Yes they use WTF-16 not WTF-8 but WTF-8 is a compatible encoding.

OCTAGRAM

5 months ago

Seed7 uses UTF-32. Ada standard library has got UTF-32 for I/O data and file names. Ada is such a language where almost nothing disappears in standard library, so 8-bit and UTF-16 I/O and/or file names are all still there.

zahlman

5 months ago

>"Unicode Scalars", aka "well-formed UTF-16", aka "the Python string type"

"the Python string type" is neither "UTF-16" nor "well-formed", and there are very deliberate design decisions behind this.

Since Python 3.3 with the introduction of https://peps.python.org/pep-0393/ , Python does not use anything that can be called "UTF-16" regardless of compilation options. (Before that, in Python 2.2 and up the behaviour was as in https://peps.python.org/pep-0261/ ; you could compile either a "narrow" version using proper UTF-16 with surrogate pairs, or a "wide" version using UTF-32.)

Instead, now every code point is represented as a separate storage element (as they would be in UTF-32) except that the allocated memory is dynamically chosen from 1/2/4 bytes per element as needed. (It furthermore sets a flag for 1-byte-per-element strings according to whether they are pure ASCII or if they have code points in the 128..255 range.)

Meanwhile, `str` can store surrogates even though Python doesn't use them normally; errors will occur at encoding time:

  >>> x = '\ud800\udc00'
  >>> x
  '\ud800\udc00'
  >>> print(x)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

They're even disallowed for an explicit encode to utf-16:

  >>> x.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\ud800' in position 0: surrogates not allowed

But this can be overridden:

  >>> x.encode('utf-16-le', 'surrogatepass')
  b'\x00\xd8\x00\xdc'

Which subsequently allows for decoding that automatically interprets surrogate pairs:

  >>> y = x.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
  >>> y
  '𐀀'
  >>> len(y)
  1
  >>> ord(y)
  65536

Storing surrogates in `str` is used for smuggling in binary data. For example, the runtime does it so that it can try to interpret command line arguments as UTF-8 by default, but still allow arbitrary (non-null) bytes to be passed (since that's a thing on Linux):

  $ cat cmdline.py 
  #!/usr/bin/python
  
  import binascii, sys
  for arg in sys.argv[1:]:
      abytes = arg.encode(sys.stdin.encoding, 'surrogateescape')
      ahex = binascii.hexlify(abytes)
      print(ahex.decode('ascii'))
  $ ./cmdline.py foo
  666f6f
  $ ./cmdline.py 日本語
  e697a5e69cace8aa9e
  $ ./cmdline.py $'\x01\x00\x02'
  01
  $ ./cmdline.py $'\xff'
  ff
  $ ./cmdline.py ÿ
  c3bf

It does this by decoding with the same 'surrogateescape' error handler that the above diagnostic needs when re-encoding:

  >>> b'\xff'.decode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
  >>> b'\xff'.decode('utf-8', 'surrogateescape')
  '\udcff'

CharlesW

5 months ago

1 reply

> I like the idea, just don't buy the argumentation or examples in the blog post.

Which ones, and why? Tim and Paul collectively have around 100,000X the experience with this than most people do, so it'd be interesting to read substantive criticism.

It seems like you think this standard is JSON-specific?

doug_durham

5 months ago

2 replies

I thought the question was pretty substantive. What layer in the code stack should make the decisions about what characters to allow? I had exactly the same question. If the library declares that it will filter out certain subsets then that allows me to choose a different library if needed. I would hate to have this RFC blindly implemented randomly just because it's a standard.

vintermann

5 months ago

> What layer in the code stack should make the decisions about what characters to allow?

OK, but where does it get decided what even counts a character? Should that be in the same layer? Even within a single system, there may be different sensible answers to that.

CharlesW

5 months ago

> What layer in the code stack should make the decisions about what characters to allow?

I was responding to the parent's empty sniping as gently as I could, but the answer to your (good) question has nothing to do with this RFC specifically. It's something that people doing sanitation/validation/serialization have had to learn.

The answer to your question is that you make decisions like this as a policy in your business layer/domain, and then you enforce it (consistently) in multiple places. For example, usernames might be limited to lowercase letters, numbers, and dashes so they're stable for identity and routing, while display names generally have fewer limitations so people can use accented characters or scripts from different languages. The rules live in the business/domain layer, and then you use libraries to enforce them everywhere (your API, your database, your UI, etc.).

Joker_vD

5 months ago

2 replies

Seriously, please don't use C0 (except for LF and, I cede grudgingly, HT) and C1 characters in your plain text files. I understand that you may want to store some "ANSI coloring markup" (it's not "VT100 colors" — the VT series was monochrome until VT525 of 1994), sure, but it's then, arguably, not a plain text anymore, is it? It's in a text markup format of sorts, not unlike Markdown, only the one that uses a different encoding that dips into the C0 range. Just because your favourite output device can display it prettily when you cat your data into it doesn't really mean it's a plain text.

Yes, I do realize that there is a lot of text markup formats that encode into plain text, for better interoperability.

cesarb

5 months ago

1 reply

> Seriously, please don't use C0 (except for LF and, I cede grudgingly, HT) and C1 characters in your plain text files.

It is (or, at least, used to be) common to have FF characters on plain text files, as a signal for your (dot matrix) printer to advance to the next page. So I'd add at least FF to that list.

afiori

5 months ago

I think we can deprecate dot printer specific advice wrt what we consider plain text today

Aaron2222

5 months ago

The DEC VT241 from 1984 had colour.

https://terminals-wiki.org/wiki/index.php/DEC_VT240

https://www.1000bit.it/ad/bro/digital/DECVT240.pdf

singpolyma3

5 months ago

4 replies

Why ban emoji in username?

numpad0

5 months ago

1 reply

Because username and password MUST be in ASCII range?

Timwi

5 months ago

1 reply

No.

numpad0

5 months ago

How no? UTF-8 strings has no singular canonical binary representations and typing sequences that correspond to intended texts. Which means it can't be hashed and compared for authentication purposes. No?

account42

5 months ago

You typically want a bijection from rendered glyphs to binary representation and restricting to ASCII is the most straight forward way to achieve that.

pas

5 months ago

I think for username it's fine, where a bit of restraint makes sense is for billing/shipping/legal-ish data.

afiori

5 months ago

My reasoning for it would be that they can be very keyboard specific and might require more normalisation than other character classes.

If I had to make a specific choice I would probably whitelist the most common emojis for some definition and allow those

numpad0

5 months ago

2 replies

What system take UTF-8 for usernames? Everyone knows that all programmatically manipulated and/or evaluated identifiers including login usernames and passwords need to be in ASCII - not even ISO-8859-1, just plain old ASCII. Unicode generally don't work for those purposes. Username as in friendly display strings is fine, but for username as in system login, the entire non-ASCII encoding is a no go.

I mean, I don't even know my keyboard software is consistent in UTF-8 for the exact same intended visual representation outside of ASCII range, let alone across different operating systems and configurations, or over time. Or vice versa; the binary I would leave behind in time to consistently correspond to future Unicode interpretation AIs.

... speaking of consistency, neither the article nor RFC 9839 don't mention IVS situations or NFC/NFD/NFKC/NFKD regularizations problem as explicitly in or out of scope. Overall it feels like this RFC is missing the entire "Purpose" section except there is vague notion of there being non-character code points.

Timwi

5 months ago

2 replies

This is such a provincial attitude, wanting to prohibit people from using perfectly normal names like Amélie or Jürgen or Ольга to log in just because you as a programmer can't be bothered to deal with a numerical ID instead.

account42

5 months ago

No, it's an entirely practical attitude, unlike outrage culture.

numpad0

5 months ago

You can have such names as Amélie or Jürgen or Ольга or 𠮷野 or 鎮󠄁 and have them displayed on account management screens, you just can't use it for login IDs because there are no guarantees that those blobs can be reproduced in the future or be consistent with what it was at time of entry.

Unicode is that bad.

zzo38computer

5 months ago

1 reply

For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways. You should allow any password, and the maximum length should not be too short.

For usernames, I think your point is valid; you might restrict usernames to a subset of ASCII (not arbitrary ASCII; e.g. you might disallow spaces and some punctuations), or use numeric user IDs, while the display name might be less restricted. (In some cases (probably uncommon) you might also use a different character set than ASCII if that is desirable for your application, but Unicode is not a good way to do it.)

(I also think that Unicode is not good; it is helpful for many applications to have i18n (although you should be aware what parts should use it and what shouldn't), but Unicode is not a good way to do it.)

numpad0

5 months ago

> For passwords, you might not need to care about the character encoding, since they are not going to be displayed anyways.

That would be reasonable if there were strict 1:1 correspondence between intended text and binary representations, but there isn't. Unicode has equivalents of British and American spellings, and users has no control over which to use. Precomposed vs Combining characters, Variant Selectors, etc. Ensuring it all regularize into canonical password string as developer obligation is unreasonable, and just falling back to ASCII is much more reasonable.

I guess everyone using alphanumeric sequences for every identifiers is somewhat imperialistic in a sense, but it's close to the least controversial of general cultural imperialism problems. It's probably okay to leave it to be solved for a century or two.

justin66

5 months ago

> But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.

This seems like an extremely sheltered person’s point of view. I’m sure the worst case scenario involves a software defect in the parser or whatever and some kind of terrible security breach…

TheRealPomax

5 months ago

I think you missed the part where the RFC is about which Unicode is bad for protocols and data formats, and so which Unicode you should avoid when designing those from now on, with an RFC to consult to know which ones those are. It has nothing to do with "what if I have a file with X" or "what if I want Y in usernames", it's about "what should I do if I want a normal, well behaved, unicode-text-based protocol or data format".

It's not about JSON, or the web, those are just example vehicles for the discussion. The RFC is completely agnostic about what thing the protocols or data formats are intended for, as long as they're text based, and specifically unicode text based.

So it sounds like you like misread the blog post, and what you should be doing is now read the RFC. It's short. You can cruise through https://www.rfc-editor.org/rfc/rfc9839.html in a few minutes and see it's not actually about what you're focussing on.

TacticalCoder

5 months ago

> In the example, username validation is a job of another layer. For example I want to make sure username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail on completely different layer pre-validation.

Usernames are a bad examples. Because at the point you mention, you may as well only allow a subset of visible ASCII. Which a lot of sites do and that works perfectly fine.

But for stuff like family names you have to restrict so many thing otherwise you'll have little-bobby-zalgo-with-hangul-modifiers breaking havoc.

Unicode is the problem. And workarounds are sadly needed due to the clusterfuck that Unicode is.

Like TFA shows. Like any single homographic attack using Unicode characters shows.

If Unicode was good, it wouldn't regularly be frontpage of HN.

JimDabell

5 months ago

1 reply

> PRECISion · You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode. It didn’t; here’s RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002. 8264 is 43 pages long, containing a very thorough discussion of many more potential Bad Unicode issues than 9839 does.

I’d also suggest people check out the accompanying RFCs 8265 and 8266:

PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols:

— https://www.rfc-editor.org/rfc/rfc8264

Preparation, Enforcement, and Comparison of Internationalized Strings: Representing Usernames and Passwords

— https://www.rfc-editor.org/rfc/rfc8265

Preparation, Enforcement, and Comparison of Internationalized Strings Representing Nicknames:

— https://www.rfc-editor.org/rfc/rfc8266

Generally speaking, you don’t want usernames being displayed that can change the text direction, or passwords that have different byte representations depending on the device that was used to type it in. These RFCs have specific profiles to avoid that.

I think for these kinds of purposes, failing closed is more secure than failing open. I’d rather disallow whatever the latest emoji to hit the streets is from usernames than potentially allow it to screw up every page that displays usernames.

singpolyma3

5 months ago

2 replies

The problem with failing closed is that you end up 20 years later still not supporting emoji from 20 years ago and users get annoyed...

JimDabell

5 months ago

That’s a problem that can happen if you don’t perform any maintenance for 20 years, and the outcome is mild annoyance from a small fraction of users.

The problem with failing open can manifest tomorrow, and the outcome can cause your site to become unreadable.

afiori

5 months ago

That is a problem for future us, I sure don't envy them

develatio

5 months ago

2 replies

I was not able to understand why these code points are bad. The post states that they are bad, but why? Any examples? Any actual situations and PoC that might help me understand how will that break "my code"?

JimDabell

5 months ago

1 reply

Suppose, when you were registering your username `develatio`, you decided to put U+202E RIGHT-TO-LEFT OVERRIDE in there as well. Now when somebody is reading this page and their browser gets to your username, it switches the text direction to render it right-to-left.

develatio

5 months ago

4 replies

and "that's it"? I mean, it does sound like it might introduce unexpected UI behaviour, but are there any other more serious / dangerous consequences?

LikesPwsh

5 months ago

RTL lets you obfuscate file extensions.

E.g. Annexe.txt (that you might assume would be safely opened by a text editor) could actually be Ann\u202Etxt.exe, a dangerous executable.

yencabulator

5 months ago

One of my pet peeves is when UIs don't clearly constrain and delineate the extent of user-controlled text. Plenty of phishing attacks have relied on having attacker-controlled input seem authoritative, e.g. getting gmail to repeat back something to the victim.

JimDabell

5 months ago

Making any page that mentions you – including admin pages that might be used to disable your account – become unreadable is bad enough.

Another comment linked to this:

https://trojansource.codes

immibis

5 months ago

Yes, dangerous consequences of unexpected UI behaviour: imagine writing a URL backwards with a right-to-left override, and it clearly says www.yourbank.example but it goes to www.evilsite.example/example.yourbank.www

orangeboats

5 months ago

1 reply

Sometimes it's not just "your code". Strings are often interchanged and sent to many other parties.

And some of the codepoints, such as the surrogate codepoints (which MUST come in pairs in properly encoded UTF-16), may not break your code but break poorly-written spaghetti-ridden UTF-16-based hellholes that do not expect unpaired surrogates.

Something like:

1. You send a UTF-8 string containing normal characters and an unpaired surrogate: "Hello /uDEADworld" to FooApp.

2. FooApp converts the UTF-8 string to UTF-16 and saves it in a file. All without validation, so no crashes will actually occur; worst case scenario, the unpaired surrogate is rendered by the frontend as "�".

3. Next time, when it reads the file again, this time it is expecting normal UTF-16, and it crashes because of the unpaired surrogate.

(A more fatal failure mode of (3) is out-of-bounds memory read if the unpaired surrogate happens at the end of string)

pixl97

5 months ago

I had a github action with a phrase 'filter: \directory\u02filename.txt' or something close to this and the the filename got interpreted as a utf-8 character rather than a string literal causing the application to throw an error about invalid utf 8 in the path. Had to go about setting it up to quote the strings differently, but you get to see a lot of these issues in the wild.

arp242

5 months ago

2 replies

Excluding all of "legacy controls" not just as literals but also escaped strings (e.g. "\u0027") seems too much. C1 is essentially unused AFAIK and that's okay, but a number of C0 characters do see real-world use (escape, EOF, NUL). IMHO there are valid and reasonable use cases to use some of them.

NelsonMinar

5 months ago

1 reply

I've made good use of unusual C0 characters like U+001E (Record Separator). I think it makes sense to exclude them from documents but they can be useful in text data streams.

senderista

5 months ago

Agreed, I would be very annoyed to see separator characters arbitrarily rejected by software I don't control. I think these characters are seriously underused, considering all the issues with in-band separators.

msgodel

5 months ago

I've seen program source code with form feeds (U+000C) in it. Apparently Emacs has built in support for using them for navigation so adjacent things occasionally contain them.

nikolayasdf123

5 months ago

1 reply

how does this compare to Go `unicode.IsPrint(r rune)`? https://pkg.go.dev/unicode#IsPrint

what does bad/dangerous this code catch that `unicode.IsPrint` is not catching?

or other way, what good/useful does `unicode.IsPrint`removing, that this code keeps?

mort96

5 months ago

1 reply

I don't know all the details of `unicode.IsPrint` function, but one major issue is: it's Go-specific. If you're defining a protocol, you probably don't want the spec to include text such as, "the username field must only contain Unicode code points which are conidered printable by the Go programming language's 'unicode.IsPrint' function". You would rather want to write, "the username field must not contain Unicode code points which are considered problematic by RFC 9839".

nikolayasdf123

5 months ago

interesting. so Go seem to be using unicode categories, which is part of unicode spec/standard. so it is fairly language agnostic.

> IsPrint == .. categories L, M, N, P, S and the ASCII space character.

how does that compare to this standard (RFC 9839)? (don't mind that this is Go. just consider same unicode categories).

ninkendo

5 months ago

2 replies

It seems like most of these are handled by just rejecting invalid UTF-8 byte sequences (ideally, erroring out altogether) when interpreting a string as UTF-8. I mean, unpaired surrogates, or any surrogate for that matter, is already illegal as a UTF-8 byte sequence. Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.

The list of code points which are problematic (non-printing, etc) are IMO much more useful and nontrivial. But it’d be useful to treat those as a separate concept from plain-old illegal UTF-8 byte sequences.

account42

5 months ago

1 reply

> Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.

No they shouldn't because that's how you get file managers that can't manage files.

ninkendo

5 months ago

1 reply

The file manager wouldn’t use the “string” type to hold file names, if it’s written properly. Languages like Rust have things like OsString as separate from String for just this reason.

If you have a type that says “my contents are valid UTF-8”, then you should reject invalid UTF-8 when populating it, obviously. Why would it work any other way? If you need a type that can hold arbitrary byte sequences, use a type that can hold arbitrary byte sequences.

account42

5 months ago

This is an unrealistic expectation. Local file names are just one example of many where you need to deal with UTF-8ish data that you should interpret as UTF-8 for display but pass along unmangled to other systems. Storing all that data twice and duplicating all relevant operations is both inefficient and will introduce more bugs as the two strings get out of sync. The gains from enforcing strict UTF-8 validation are minimal while the downsides are many - not the least of which is intentionally breaking forward compatibility with future Unicode versions that may extend what is valid.

It's is also not what happens in practice. File managers that cannot rename or delete some files because they are unnecessarily "smart" about interpreting strings passed to them is very much how things have worked out in reality.

doug_durham

5 months ago

That seems reasonable. It should be up to the application implementer to make that choice and not a lower level more general purpose library. I haven't run into any JSON parsers for usernames only code.

djoldman

5 months ago

1 reply

I don't understand how this helps.

Defining a subset of unicode to accept does not obviate the need to check that values conform to type definitions.

skybrian

5 months ago

Yes, that's true. It's for generic, low-level parsing code that doesn't know what a username is. There will also need to be field-specific validation.

1oooqooq

5 months ago

1 reply

it's almost like every language need another standards: safeunicode.

which does away with control flow and specially positioning garbage. and doesn't consider as valid unknown (or missing) surrogates.

why do i want very specialized text tabulation and positioning chars in my string?

it's as if they tried to solve text encoding AND solve CSV and pagination and desktop publishing all in one go.

Dwedit

5 months ago

But then what would safe Unicode allow and reject? Carriage Returns? Horizontal Tabs? Form Feeds?

RustyRussell

5 months ago

1 reply

Did anyone else find the use if ABNF annoying?

  unicode-assignable =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD

I mean, just define ranges.

Also, where are the test vectors? Because when I implement this, that's the first thing I have to write, and you could save me a lot of work here. Bonus points if it's in JSON and UTF-8 already, though the invalid UTF-8 in an RFC might really gum things up: hex encode maybe?

timbray

5 months ago

1 reply

The tests for the go code at https://github.com/timbray/RFC9839 are in effect test vectors.

RustyRussell

5 months ago

1 reply

I want to implement this. My code is in C.

How does this help me check my implementation? I guess I could ask ChatGPT to convert your tests to my code, but that seems the long way around.

djoldman

5 months ago

https://github.com/timbray/RFC9839/blob/main/unichars.go

I don't know rust at all but I can pretty quickly understand:

    var unicodeAssignables = []runePair{
     {0x20, 0x7E},       // ASCII
     {0xA, 0xA},         // newline
     {0xA0, 0xD7FF},     // most of the BMP
     {0xE000, 0xFDCF},   // BMP after surrogates
     {0xFDF0, 0xFFFD},   // BMP after noncharacters block
     {0x9, 0x9},         // Tab
     {0xD, 0xD},         // CR
     {0x10000, 0x1FFFD}, // astral planes from here down
     {0x20000, 0x2FFFD},
     {0x30000, 0x3FFFD},
     {0x40000, 0x4FFFD},
     {0x50000, 0x5FFFD},
     {0x60000, 0x6FFFD},
     {0x70000, 0x7FFFD},
     {0x80000, 0x8FFFD},
     {0x90000, 0x9FFFD},
     {0xA0000, 0xAFFFD},
     {0xB0000, 0xBFFFD},
     {0xC0000, 0xCFFFD},
     {0xD0000, 0xDFFFD},
     {0xE0000, 0xEFFFD},
     {0xF0000, 0xFFFFD},
     {0x100000, 0x10FFFD},
    }

zzo38computer

5 months ago

1 reply

I do not agree that Unicode is good. I think that Unicode is not good.

I also think that, regardless of the character set, what to include (control characters, graphic characters, maximum length, etc) will have to depend on the specific application anyways, so trying to include/exclude in JSON doesn't work as well.

Giving a name to a specific subset (or sometimes a superset, but usually subset) of Unicode (or any other character sets, such as ASCII or ISO 2022 or TRON code) can be useful, but don't assume it is good for everyone or even is good for most things, because it isn't.

RFC 9839 does give names to some subsets of Unicode, which may sometimes be useful, but should not automatically assume that is right for what you will be making. My opinion is to consider to not use or require Unicode.

msgodel

5 months ago

It's really the combining characters that are the problem. That changed it from a character set to a DSL for describing characters.

yencabulator

5 months ago

I am torn on one decision: Whether to control inputs, or to wrap untrusted input in a datatype that displays it safely (web+log+debug).

Grimeton

5 months ago

Imagine the possibilities:

  PS C:\> mkdir x
  
      Directory: C:\
  
  Mode                 LastWriteTime         Length Name
  ----                 -------------         ------ ----
  d----          23.08.2025    21:43                x
  
  PS C:\> cd x
  PS C:\x> new-item -ItemType File 'C꞉⧵Windows⧵System32⧵rundll32.exe'
  
      Directory: C:\x
  
  Mode                 LastWriteTime         Length Name
  ----                 -------------         ------ ----
  -a---          23.08.2025    21:43              0 C꞉⧵Windows⧵System32⧵rundll32.exe
  
  PS C:\x> $f=gi .\C꞉⧵Windows⧵System32⧵rundll32.exe
  PS C:\x> $f | fl -Property * -Force
  
  PSPath              : Microsoft.PowerShell.Core\FileSystem::C:\x\C꞉⧵Windows⧵System32⧵rundll32.exe
  PSParentPath        : Microsoft.PowerShell.Core\FileSystem::C:\x
  PSChildName         : C꞉⧵Windows⧵System32⧵rundll32.exe
  PSDrive             : C
  PSProvider          : Microsoft.PowerShell.Core\FileSystem
  PSIsContainer       : False
  Mode                : -a---
  ModeWithoutHardLink : -a---
  VersionInfo         : File:             C:\x\C꞉⧵Windows⧵System32⧵rundll32.exe
                        InternalName:
                        OriginalFilename:
                        FileVersion:
                        FileDescription:
                        Product:
                        ProductVersion:
                        Debug:            False
                        Patched:          False
                        PreRelease:       False
                        PrivateBuild:     False
                        SpecialBuild:     False
                        Language:
  
  BaseName            : C꞉⧵Windows⧵System32⧵rundll32
  ResolvedTarget      : C:\x\C꞉⧵Windows⧵System32⧵rundll32.exe
  Target              :
  LinkType            :
  Name                : C꞉⧵Windows⧵System32⧵rundll32.exe
  Length              : 0
  DirectoryName       : C:\x
  Directory           : C:\x
  IsReadOnly          : False
  Exists              : True
  FullName            : C:\x\C꞉⧵Windows⧵System32⧵rundll32.exe
  Extension           : .exe
  CreationTime        : 23.08.2025 21:43:35
  CreationTimeUtc     : 23.08.2025 19:43:35
  LastAccessTime      : 23.08.2025 21:43:35
  LastAccessTimeUtc   : 23.08.2025 19:43:35
  LastWriteTime       : 23.08.2025 21:43:35
  LastWriteTimeUtc    : 23.08.2025 19:43:35
  LinkTarget          :
  UnixFileMode        : -1
  Attributes          : Archive
  
  PS C:\x>

account42

5 months ago

No, please do in fact allow unpaired surrogates and other weird characters so people can round trip "bad" data like file names through your protocol.

msgodel

5 months ago

The form feed character (U+000C) should probably have been included in useful controls. That's used in a fair amount of program source code.

ks2048

5 months ago

It's worth noting that Unicode already defines a "General Category" for all code points that categorizes some of these types of "weird" characters.

https://en.wikipedia.org/wiki/Unicode_character_property#Gen...

e.g. in Python,

   import unicodedata
   print(unicodedata.category(chr(0)))
   print(unicodedata.category(chr(0xdead)))

Shows "Cc" (control) and "Cs" (surrogate).

weinzierl

5 months ago

I think there should be a restriction in the standard on how many Unicode scalar values a graphical unit can have.

Last time I checked (a couple of years ago admittedly) there was no such restriction in the standard. There was however a recommendation to restrict a graphical unit to 128 bytes for "streaming applications".

Bringing this or at least a limit on the scalar units into the standard would make implementation and processing so much easier without restricting sensible applications.

OCTAGRAM

5 months ago

> The first code point is zero, in Unicode jargon U+0000. In human-readable text it has no meaning, but it will interfere with the operation of certain programming languages.

This part encourages more active usage of U+0000, so that programmers of certain programming languages get a message that they are not welcome

o11c

5 months ago

I have had real-world programs broken by blind assumption of "does not deliberately contain controls" (form feed is particularly common for things intended to be paginated, escape is common for things designed for a terminal, etc.) and even "is fully UTF-8" (there are lots of old data files and logs that are never going away).

If you aren't doing something useful with the text, you're best off passing a byte-sequence through unchanged. Unfortunately, Microsoft Windows exists, so sometimes you have to pass `char16_t` sequences through instead.

The worst part about UTF-16 is that invalid UTF-16 is fundamentally different than invalid UTF-8. When translating between them (really: when transforming external data into an internal form for processing), the former can use WTF-8 whereas the latter can use Python-style surrogateescape, but you can't mix these.

View full discussion on Hacker News

ID: 44995640Type: storyLast synced: 11/20/2025, 6:24:41 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN