Karpathy on DeepSeek-OCR Paper: Are Pixels Better Inputs to LLMs Than Text?
Key topics
The HN discussion revolves around Andrej Karpathy's tweet, prompted by the DeepSeek-OCR paper, questioning whether pixels are better inputs to LLMs than text, with commenters exploring the implications and potential benefits of image inputs for LLMs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 2h after posting
- Peak period: 83 comments (36-48h window)
- Avg / period: 22.9 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Oct 21, 2025 at 1:43 PM EDT (3 months ago)
- 02 First comment: Oct 21, 2025 at 4:11 PM EDT (2h after posting)
- 03 Peak activity: 83 comments in the 36-48h window (hottest window of the conversation)
- 04 Latest activity: Oct 27, 2025 at 1:50 PM EDT (2 months ago)
> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting, or at least wondering, whether the vision encoder should be the only input path to the LLM, with the model reading text through it. There would be a rasterization step on the text input to generate the image.
Thus, you don't need to draw a picture; you just generate a raster of the text and feed it to the vision model.
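A minimal sketch of that rasterization step, assuming Pillow and NumPy and an illustrative 16x16 patch size (not the DeepSeek-OCR configuration): render the text to a grayscale bitmap, then cut it into the fixed-size patches a vision encoder would consume.

```python
# Illustrative only: rasterize plain text, then split into patch vectors.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def rasterize(text: str, width: int = 512, height: int = 64) -> np.ndarray:
    img = Image.new("L", (width, height), color=255)       # white canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img)                                  # H x W uint8

def to_patches(pixels: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w = pixels.shape
    h, w = h - h % patch, w - w % patch                     # crop to a multiple
    return (pixels[:h, :w]
            .reshape(h // patch, patch, w // patch, patch)
            .transpose(0, 2, 1, 3)
            .reshape(-1, patch * patch))                    # N x 256 "vision tokens"

patches = to_patches(rasterize("Maybe all inputs to LLMs should be images."))
print(patches.shape)  # (128, 256): 128 patch vectors for the encoder
```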
It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every computer on the planet, does when showing me text).
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; it says there on page 12, in a throwaway line, that they used 3 times as much compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
Also, saying it needs 20x compute is exactly that: it's something we could do eventually, just not now.
If we had a million times the compute? We might have brute forced our way to AGI by now.
Removing the tokenizer would quarter the context and quadruple the compute and memory, assuming an average token length of 4 characters.
Also, you would probably need to 4x the parameters, since the model has to learn relationships between individual characters as well as words, sentences, etc.
There have been a few studies on small models; even those only show a tiny percentage gain over tokenized models.
So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.
And that falls apart when you use more than 1/4 of the context. So realistically you need to support the same context, which pushes compute up another 4x, to 16x.
That's why.
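For concreteness, a rough back-of-envelope of that argument as a Python sketch; the 4-characters-per-token average is the commenter's assumption above, and attention is treated as the quadratic term.

```python
# Back-of-envelope only: what dropping the tokenizer does to context and compute.
chars_per_token = 4          # assumed average, not a measured constant

# Same sequence-length budget, but now one position per character:
effective_context = 1 / chars_per_token          # 1/4 of the original context

# To cover the *same* amount of text, the sequence gets 4x longer:
linear_cost    = chars_per_token                 # FFN / per-position work: ~4x
quadratic_cost = chars_per_token ** 2            # attention: ~16x

print(f"effective context: {effective_context:.2f}x")
print(f"per-position compute: ~{linear_cost}x, attention: ~{quadratic_cost}x")
```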
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV cache-compression methods might deliver better speedups without degrading the output as much.
Many Chinese students have a good enough memory to recall a particular paragraph and understand its meaning, but no idea how those words are pronounced.
Ideograms could help you attach meanings to glyphs directly, skipping the "vocal serialization" single-thread part.
There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.
Sequential reading of text is very inefficient.
Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
Is that crazy? I'm not buying that it is.
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
https://eyewiki.org/Saccade
Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
Reading is definitely not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.
My initial reaction was to say speak for yourself about what reading is or isn’t, and that text is written linearly, but the more I think about it, the more I think you have a very good point. I think I read mostly linear and don’t often look ahead for punctuation. But sentence punctuation changes both the meaning and presumed tone of words that preceded it, and it’s useful to know that while reading the words. Same goes for something like “, Barry said.” So meaning in written text is definitely not 100% linear, and that justifies reading in non-linear ways. This, I’m sure, is one reason that Spanish has the pre-sentence question mark “¿”. And I think there are some authors who try to put who’s talking in front most of the time, though I can’t name any off the top of my head.
I can give you an analogy that should hopefully help. If you look at a house, you don't look at the doors, windows, facade, roof individually, then ponder how they are related together to come to a conclusion that it is a house. You immediately know. This is similar with reading. It might require practice though (and a lot of reading!).
It's much, much faster. At first there's a loss of understanding, of course, but once you've practiced enough you will be much faster.
You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...
One thing I like about text tokens though is that it learns some understanding of the text input method (particularly the QWERTY keyboard).
"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.
This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string-distance metric.
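A hand-rolled sketch of that idea, with an illustrative QWERTY coordinate table and cost scaling (not from any particular library): a Levenshtein distance whose substitution cost shrinks when the two keys are physically close, so "hwllo" ends up nearer to "hello" than "hzllo" does.

```python
# Illustrative keyboard-aware edit distance.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {c: (r, col) for r, row in enumerate(ROWS) for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = KEY_POS.get(a, (5, 5)), KEY_POS.get(b, (5, 5))
    return min(1.0, ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5 / 3)  # cap at 1

def keyboard_edit_distance(s: str, t: str) -> float:
    # Standard Levenshtein DP, with substitution cost weighted by key distance.
    d = [[float(i + j) if i * j == 0 else 0.0 for j in range(len(t) + 1)]
         for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + key_distance(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]

print(keyboard_edit_distance("hello", "hwllo"))  # small: w and e are neighbours
print(keyboard_edit_distance("hello", "hzllo"))  # larger: z and e are far apart
```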
But I think in this case you can still generate typos in images and it'd be learnable; not a hard issue relevant to the OP.
Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.
It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something better than the tokenizer.
Actually, thinking more about that, I consume “text” as images and also as sounds… I kinda wonder: instead of the render-and-OCR this suggests, what if we did TTS and just encoded, say, the MP3 sample of the vocalization of the word? Would that be fewer bytes than the rendered-pixels version? Probably depends on the resolution / sample rate.
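For a rough sense of scale, a back-of-envelope sketch under loudly stated assumptions (1-bit glyph cells of 8x16 px, raw 16 kHz 16-bit mono speech, an assumed speaking rate of 15 characters per second); none of these numbers are measurements.

```python
# Back-of-envelope comparison only; all constants are assumptions.
chars = 1000

pixels_per_char = 8 * 16                     # assumed glyph cell
image_bytes = chars * pixels_per_char // 8   # 1 bit per pixel, uncompressed

speech_rate = 15                             # chars spoken per second (assumed)
audio_bytes = (chars / speech_rate) * 16_000 * 2   # samples/sec * bytes/sample

print(f"raster: ~{image_bytes/1024:.0f} KiB, speech: ~{audio_bytes/1024:.0f} KiB")
# Raw audio comes out orders of magnitude larger; lossy codecs (MP3/Opus)
# narrow the gap, but rendered text still tends to win on raw bytes.
```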
What I would love is an easy way to just convert the page to a mp3 that queues into my podcast app to listen to while taking a walk or driving. It probably exists, but I haven't spent a lot of time looking into it.
I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan then with my eyes
> Ok but what are you going to decode into at generation time, a jpeg of text?
Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated into a suitable input space.
> we process text in many more ways than just reading it
As a token stream is a straightforward function of textual input, then in the case of textual input we should expect to handle the conversion of the character stream to semantic/syntactic units to happen in the LLM.
Moreover, in the case of OCR, graphical information carries (and degrades) information in the way that humans expect; what comes to mind is the eggplant/dick emoji symbolism, or smiling emoji sharing a graphical similarity that can't be deduced from their proximity in Unicode codepoints.
But the tweet itself is kinda an answer to the question you're asking.
Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.
If the neural network can distill a model out of complex input data, especially when many models are routinely trained with data-augmentation practices that actively degrade the input to achieve generalisation, then why are we stuck handling text with silk-glove tokenizers?
https://en.wikipedia.org/wiki/Byte-pair_encoding
It's also not lossy compression at all, it's lossless compression if anything, unlike what some people have claimed here.
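A toy sketch of the byte-pair-encoding loop that link describes, illustrative only: repeatedly merge the most frequent adjacent pair into a new symbol. Because the merge table can be replayed in reverse, nothing is lost, which is the point about it being lossless.

```python
# Toy BPE trainer: merges are reversible, so the encoding is lossless.
from collections import Counter

def bpe_train(text: str, num_merges: int = 10):
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

tokens, merges = bpe_train("low lower lowest low low", num_merges=5)
print(tokens)   # fewer, longer symbols
print(merges)   # applying these merges in reverse recovers the exact original text
```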
Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf
Of course, NNs like LLM never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of the percels.
(IME, often my comments which I think are deep get ignored but silly things, where I was thinking "this is too much trolling or obvious", get upvoted; but don't take it the wrong way, I am flattered you like it.)
You still need to map percels to a latent space. But perhaps with some number of dimensions devoted to modes of perception? E.g. audio, visual, etc
However, I believe the percel's components together as a whole would capture the state of the audio+visual+time. I don't think the state of one particular mode (e.g. audio or visual or time) is encoded by a specific subset of the percel's components. Rather, each component of the percel would represent a mixture (or a portion of a mixture) of the audio+visual+time, so you couldn't isolate just the audio, visual, or time state by looking at some specific subset of the percel's components, because each component is itself a mixture of the audio+visual+time state.
I think the classic analogy is that if river 1 and river 2 combine to form river 3, you cannot take a cup of water from river 3 and separate out the portions from river 1 and river 2; they're irreversibly mixed.
It got reviewed by 2 ML scientists and one neuroscientist.
Got totally slammed (and thus rejected) by the ML scientists due to "lack of practical application" and highly endorsed by the neuroscientist.
There's so much unused potential in interdisciplinary research but nobody wants to fund it because it doesn't "fit" into one of the boxes.
While the current AI boom is a bubble, I actually think that AGI nut could get cracked quietly by a company with even modest resources if they get lucky on the right fundamental architectural changes.
This is also "science"
Instead you’d need to resubmit and start the entire process from scratch. What a waste of resources …
It's the final nail that made me quit pursuing a scientific career path, despite having good pubs and a PhD with honours.
Unfortunately it’s what I enjoy the most.
Of course there is an interesting paradox - each layer of the NN doesn't know whether it's connected to the sensors directly, or what kind of abstractions it works with in the latent space. So the boundary between the mind and the sensor is blurred and to some extent a subjective choice.
ChatGPT claims it's possible, but not allowed due to OpenAI safety rules: https://chatgpt.com/share/68fb0a76-6bf8-800c-82f7-605ff9ca22...
At some level you do need something approximating a token. BPE is very compelling for UTF8 sequences. It might be nearly the most ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that. Something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy is at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.
I'd extend this thinking to techniques like MPEG for video. All frame types also use something like the DCT too. The P and B frames are basically the same ideas as the I frame (jpeg), the difference is they take the DCT of the residual between adjacent frames. This is where the compression gets to be insane with video. It's block transforms all the way down.
An 8x8 DCT block for a channel of SDR content is 512 bits of raw information. After quantization and RLE (for typical quality settings), we can get this down to 50-100 bits of information. I feel like this is an extremely reasonable grain to work with.
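For concreteness, an illustrative sketch of that 8x8 block transform using the standard JPEG luminance quantization table; the nonzero-coefficient count is only a crude stand-in for the RLE/entropy-coding stage, not a real codec.

```python
# Illustrative 8x8 DCT + quantization; the input block is a synthetic ramp.
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix.
k = np.arange(N)
C = np.sqrt(2 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0] /= np.sqrt(2)

Q = np.array([  # standard JPEG luminance quantization table (~quality 50)
    [16, 11, 10, 16, 24, 40, 51, 61], [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56], [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77], [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101], [72, 92, 95, 98, 112, 100, 103, 99]])

block = np.add.outer(np.arange(N), np.arange(N)) * 8.0   # smooth 8-bit-ish ramp
coeffs = C @ (block - 128) @ C.T                          # 2-D DCT
quantized = np.round(coeffs / Q).astype(int)

raw_bits = N * N * 8                                      # 512 bits in
nonzero = np.count_nonzero(quantized)
print(f"raw: {raw_bits} bits, nonzero coefficients after quantization: {nonzero}")
```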
It seems to work for vocals as well, not just short samples but entire works. Of course, that's just my impression; there's a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts, and if I were a good enough musician I could replicate what I remember.
Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.
Another interesting thing is that it is possible to search through it fairly rapidly to match a fragment heard to one that I've heard and stored before.
It's so weird that I don't know this. It's like I'm stuck in userland.
https://www.uber.com/blog/neural-networks-jpeg/
It seems crazy to me that image inputs (of text) are smaller and more information dense than text - is that really true? Can somebody help my intuition?
As I understood the responses, the benefit comes from making better use of the embedding space. BPE tokenization is basically a fixed lookup table, whereas when you form "image tokens" you just throw each 16x16 patch into a neural net and (handwave) out comes your embedding. From that, it should be fairly intuitive that since current text-tokenization embedding vectors won't even form a subspace (there can only ever be ~$VOCAB_SIZE distinct points), image tokens have the capacity to be more information dense. And you might hope that the neural network can somehow make use of that extra capacity, since you're not encoding one subword at a time.
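A minimal sketch of that contrast, with deliberately small illustrative sizes and an untrained random projection standing in for the learned patch embedding.

```python
# Lookup-table embedding (text tokens) vs. learned projection (image patches).
import numpy as np

d_model, vocab_size, patch = 64, 10_000, 16   # toy sizes, much smaller than real
rng = np.random.default_rng(0)

# Text path: embedding is a lookup, so at most vocab_size distinct vectors exist.
token_table = rng.normal(size=(vocab_size, d_model))
def embed_token(token_id: int) -> np.ndarray:
    return token_table[token_id]

# Image path: embedding is a function of the pixels, a continuum of possible vectors.
W = rng.normal(size=(patch * patch, d_model)) / np.sqrt(patch * patch)
def embed_patch(pixels: np.ndarray) -> np.ndarray:   # pixels: 16x16 grayscale
    return (pixels.reshape(-1) / 255.0) @ W

print(embed_token(42).shape, embed_patch(rng.integers(0, 256, (16, 16))).shape)
```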
I always found it strange that tokens can't just be symbols, but instead there's an alphabet of 500k tokens, completely removing low-level information from language (rhythm, syllables, etc.), a side effect being simple edge cases like counting the r's in "strawberry", or no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.
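A quick way to see what's being described, assuming the third-party tiktoken library and its cl100k_base vocabulary (just one tokenizer among many): once text is chopped into BPE tokens, character-level structure lives inside opaque IDs.

```python
# Requires the third-party tiktoken package; vocabulary choice is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(ids)      # a handful of integers
print(pieces)   # multi-character chunks (exact split depends on the vocabulary)
print(sum(p.count("r") for p in pieces))  # a count the model never sees directly
```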
Why can't it be a sequence of audio waveforms from human speech?
https://portland.aitinkerers.org/talks/rsvp_fGAlJQAvWUA
There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.
https://academic.oup.com/mbe/article/36/2/220/5229930
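A minimal sketch of the encoding idea behind that kind of pipeline, assuming the usual individuals-by-sites 0/1 alignment representation; the random matrix here is only a stand-in for real genotype data.

```python
# Treat a genotype alignment (individuals x variant sites, 0/1 alleles) as a
# one-channel image for a CNN; random data stands in for a real alignment.
import numpy as np

n_individuals, n_sites = 64, 128
rng = np.random.default_rng(1)
alignment = rng.integers(0, 2, size=(n_individuals, n_sites), dtype=np.uint8)

image = alignment[None, :, :].astype(np.float32)   # channel x height x width
print(image.shape)  # (1, 64, 128): ready to feed to a 2-D convolutional stack
```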
Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
13 more comments available on Hacker News