Toon – Token Oriented Object Notation
Posted 2 months ago · Active 2 months ago
Source: github.com · Tech · Story · High profile
calm · mixed · Debate · 60/100
Key topics
- Data Serialization
- LLMs
- JSON Alternatives
TOON is a new data serialization format designed to be more efficient for Large Language Models (LLMs), sparking discussion on its value and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 3h after posting
- Peak period: 24 comments in the 24-30h window
- Avg / period: 6.4 comments
- Comment distribution: 58 data points
Based on 58 loaded comments
Key moments
- 01 Story posted: Oct 26, 2025 at 6:19 PM EDT (2 months ago)
- 02 First comment: Oct 26, 2025 at 9:10 PM EDT (3h after posting)
- 03 Peak activity: 24 comments in 24-30h (the hottest window of the conversation)
- 04 Latest activity: Oct 29, 2025 at 7:50 AM EDT (2 months ago)
ID: 45715632 · Type: story · Last synced: 11/20/2025, 4:53:34 PM
Your LLM, however, may experience cross-format feature superposition and consequential spurious activation.
For LLM consumption this might not matter, but don't use this for anything else.
That said: I like the idea!
Which doesn't address the question: do LLMs understand TOON the same way they understand JSON? It's quite likely that most LLMs don't interpret this notation the way they would JSON. So benchmarks on, say, data processing tasks would be warranted.
[0] https://github.com/johannschopplich/toon?tab=readme-ov-file#...
1. Retrieval Accuracy - https://github.com/johannschopplich/toon?tab=readme-ov-file#...
2. Performance by dataset - https://github.com/johannschopplich/toon?tab=readme-ov-file#...
I've only looked at one model (gpt-4.1-nano) so far. I'm hoping to run similar tests on some other models but it gets challenging to discern statistically significant differences with better models as their accuracy tends to be a lot better across the board.
The current models unfortunately do not have TOON in their training set, so they would probably require additional input tokens to grok the notation, and even then probably won’t have the same accuracy as they do for JSON.
I guess it's about LLMs, so the idea is it has to be plaintext? But if you can train it on TOON, can't you train it on BSON?
Supporting this use case doesn’t require perfectly marshaling every data structure ever.
But to your point the tool could have wider use cases without the limitations.
It seems like a nice idea to me if restricted to that. Although I guess I am not sure if it's really intended that way - the array count for example is probably pretty bad for LLM output.
CUE can emit the other formats (minus XML, because it's a beast of ambiguity, but there are other tools for e.g. json->xml)
It also has modules and imports, a very underrated feature for config languages if you haven't experienced it before
Let's take the example:
We can keep it JSON, but use more compact list expressions, as tuples when pragmatic. The thing is, the game with LLMs is not what's shortest, but what's:
1. Mainstream, so they understand it.
2. What they're tuned for, and they're tuned for what's mainstream (JSON).
If you want to go extreme compression you can shove it all in JSON strings too and keep the larger structure JSON:
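A minimal sketch of that packed-string approach (the `users` key, the sample rows, and the helper code are illustrative; the `id:role:name` packing follows the comment below):

```python
import json

# Hypothetical example: pack each record into a single "id:role:name"
# string while keeping the outer structure as plain JSON.
users = [
    {"id": 1, "role": "admin", "name": "Alice"},
    {"id": 2, "role": "user", "name": "Bob"},
]

packed = {"users": [f'{u["id"]}:{u["role"]}:{u["name"]}' for u in users]}
compact = json.dumps(packed, separators=(",", ":"))
print(compact)  # {"users":["1:admin:Alice","2:user:Bob"]}

# Unpacking is a one-liner per record (values come back as strings):
unpacked = [dict(zip(("id", "role", "name"), s.split(":"))) for s in packed["users"]]
```

The point is that the container stays vanilla JSON; only the leaf encoding needs explaining.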
You may say "how is this better?" Well, it's better because it's still JSON: there's less to explain to the LLM, and to your other devs. Even if we use a weird compact format like "id:role:name", it's still shorter to explain than a completely different syntax with its whole world of rules.
Not sure LLMs are more "tuned" to JSON.
That said, your general point holds that TOON may be unnecessary, especially in the examples given. But perhaps plain text would suffice. TOON could be useful when automating inputs with many different shapes.
Starting with already-compressed data doesn't necessarily mean fewer tokens; you can probably assume similar (or worse) entropy when expanding "dictionary words" from a compressed stream versus tokens from a plaintext stream.
You also wouldn't need indentation levels to be syntactically meaningful.
You could also get rid of LLM tokens like square brackets, curly braces, colons, and commas.
And you could have objects nested to arbitrary depth.
At nearly the same character count as TOON (sometimes more, sometimes less).
(I was telling someone over the weekend that there are only a few small wins for Lisps in most AI work right now. I hadn't considered that the printed syntax itself might have a use with these LLM huge black boxes.)
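As a sketch of that idea, tabular data like the table discussed in this thread could print as an S-expression: no braces, colons, or commas, no syntactically meaningful indentation, and arbitrary nesting. The tiny printer and sample data here are illustrative, not from the thread:

```python
# Minimal S-expression printer: tuples/lists become parenthesized
# groups, everything else is printed as-is.
def sexp(x):
    if isinstance(x, (list, tuple)):
        return "(" + " ".join(sexp(e) for e in x) + ")"
    return str(x)

users = ("users", (1, "Alice", "admin"), (2, "Bob", "user"))
print(sexp(users))  # (users (1 Alice admin) (2 Bob user))
```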
> users[2]{id,name,role}: 1,Alice,admin 2,Bob,user
Differently than me, I guess. I would read that as: at index value two, i.e. the third element of an array, the values 1,Alice,admin and 2,Bob,user are stored. Or not, since we want to destructure these values and only a pair is given for a tuple of three. I'd be confused and think: wtf is that, dear user, did you omit or misformat values?
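Per the TOON README, the `[2]` declares the array's length rather than an index position. Read that way, the quoted line decodes to ordinary JSON, sketched here in Python:

```python
import json

# The TOON line `users[2]{id,name,role}: 1,Alice,admin 2,Bob,user`
# declares an array of length 2, with fields id, name, role per row.
equivalent = {
    "users": [
        {"id": 1, "name": "Alice", "role": "admin"},
        {"id": 2, "name": "Bob", "role": "user"},
    ]
}
print(json.dumps(equivalent, separators=(",", ":")))
```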
I've seen a ton of people who just paste a CSV into a prompt and expect it to work well because they don't know any better, but the results are typically hot garbage. It's too repetitive, it can't memorize and/or process such a big chunk of data. Asking an LLM to use pandas to iteratively analyze some CSV works great, though.
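A sketch of that pattern, with a hypothetical CSV inlined so the example is self-contained: rather than pasting raw rows into the prompt, the LLM emits pandas code that aggregates them, and only the small summary goes back into the context window:

```python
import io
import pandas as pd

# Hypothetical CSV that would otherwise be pasted into the prompt.
csv_text = "user,role,logins\nAlice,admin,5\nBob,user,3\nCarol,user,7\n"
df = pd.read_csv(io.StringIO(csv_text))

# A compact summary fits in a prompt far better than the raw rows do.
summary = df.groupby("role")["logins"].agg(["count", "sum"])
print(summary)
```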