Karpathy on DeepSeek-OCR Paper: Are Pixels Better Inputs to LLMs Than Text?
Key topics
The HN discussion revolves around Andrej Karpathy's tweet, prompted by the DeepSeek-OCR paper, questioning whether pixels are better inputs to LLMs than text, with commenters exploring the implications and potential benefits of image inputs for LLMs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 2h after posting
- Peak period: 83 comments (36-48h window)
- Avg / period: 22.9 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Oct 21, 2025 at 1:43 PM EDT (3 months ago)
- 02 First comment: Oct 21, 2025 at 4:11 PM EDT (2h after posting)
- 03 Peak activity: 83 comments in the 36-48h window (hottest window of the conversation)
- 04 Latest activity: Oct 27, 2025 at 1:50 PM EDT (2 months ago)
> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting, or at least wondering, whether the vision encoder should be the only input path to the LLM, with the model reading text through it. There would be a rasterization step on the text input to generate the image.
Thus, you don't need to draw a picture; you just generate a raster of the text and feed it to the vision model.
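A minimal sketch of that rasterization step, assuming Pillow and NumPy and an illustrative 16x16 patch size (not the DeepSeek-OCR configuration): render the text to a grayscale bitmap, then cut it into the fixed-size patches a vision encoder would consume.

```python
# Illustrative only: rasterize plain text, then split into patch vectors.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def rasterize(text: str, width: int = 512, height: int = 64) -> np.ndarray:
    img = Image.new("L", (width, height), color=255)       # white canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=0, font=ImageFont.load_default())
    return np.asarray(img)                                  # H x W uint8

def to_patches(pixels: np.ndarray, patch: int = 16) -> np.ndarray:
    h, w = pixels.shape
    h, w = h - h % patch, w - w % patch                     # crop to a multiple
    return (pixels[:h, :w]
            .reshape(h // patch, patch, w // patch, patch)
            .transpose(0, 2, 1, 3)
            .reshape(-1, patch * patch))                    # N x 256 "vision tokens"

patches = to_patches(rasterize("Maybe all inputs to LLMs should be images."))
print(patches.shape)  # (128, 256): 128 patch vectors for the encoder
```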
It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every computer on the planet, does when showing me text).
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; it says there on page 12, in a throwaway line, that they used 3 times as much compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
Also, saying it needs 20x compute is exactly that: it's something we could do eventually, just not now.
If we had a million times the compute? We might have brute forced our way to AGI by now.
Removing the tokenizer would quarter the context and quadruple the compute and memory, assuming an average token length of 4 characters.
Also, you would probably need to 4x the parameters, since the model has to learn relationships between individual characters as well as words, sentences, etc.
There have been a few studies on small models; even those only show a tiny percentage gain over tokenized models.
So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.
And that falls apart when you use more than 1/4 of the context. So realistically you need to support the same context, which pushes compute up another 4x, to 16x.
That's why.
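For concreteness, a rough back-of-envelope of that argument as a Python sketch; the 4-characters-per-token average is the commenter's assumption above, and attention is treated as the quadratic term.

```python
# Back-of-envelope only: what dropping the tokenizer does to context and compute.
chars_per_token = 4          # assumed average, not a measured constant

# Same sequence-length budget, but now one position per character:
effective_context = 1 / chars_per_token          # 1/4 of the original context

# To cover the *same* amount of text, the sequence gets 4x longer:
linear_cost    = chars_per_token                 # FFN / per-position work: ~4x
quadratic_cost = chars_per_token ** 2            # attention: ~16x

print(f"effective context: {effective_context:.2f}x")
print(f"per-position compute: ~{linear_cost}x, attention: ~{quadratic_cost}x")
```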
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV cache-compression methods might deliver better speedups without degrading the output as much.
Many Chinese students have a good enough memory to recall a particular paragraph and understand its meaning, but no idea how those words are pronounced.
Ideograms could help you attach meanings to glyphs directly, skipping the "vocal serialization" single-thread part.
There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.
Sequential reading of text is very inefficient.
Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
Is that crazy? I'm not buying that it is.
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
https://eyewiki.org/Saccade
Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
Reading is definitely not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).
There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.
My initial reaction was to say speak for yourself about what reading is or isn’t, and that text is written linearly, but the more I think about it, the more I think you have a very good point. I think I read mostly linear and don’t often look ahead for punctuation. But sentence punctuation changes both the meaning and presumed tone of words that preceded it, and it’s useful to know that while reading the words. Same goes for something like “, Barry said.” So meaning in written text is definitely not 100% linear, and that justifies reading in non-linear ways. This, I’m sure, is one reason that Spanish has the pre-sentence question mark “¿”. And I think there are some authors who try to put who’s talking in front most of the time, though I can’t name any off the top of my head.
I can give you an analogy that should hopefully help. If you look at a house, you don't look at the doors, windows, facade, roof individually, then ponder how they are related together to come to a conclusion that it is a house. You immediately know. This is similar with reading. It might require practice though (and a lot of reading!).
It's much, much faster. At first there's a loss of understanding, of course, but once you've practiced enough you will be much faster.
You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...
One thing I like about text tokens though is that it learns some understanding of the text input method (particularly the QWERTY keyboard).
"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.
This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string-distance metric.
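A hand-rolled sketch of that idea, with an illustrative QWERTY coordinate table and cost scaling (not from any particular library): a Levenshtein distance whose substitution cost shrinks when the two keys are physically close, so "hwllo" ends up nearer to "hello" than "hzllo" does.

```python
# Illustrative keyboard-aware edit distance.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {c: (r, col) for r, row in enumerate(ROWS) for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = KEY_POS.get(a, (5, 5)), KEY_POS.get(b, (5, 5))
    return min(1.0, ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5 / 3)  # cap at 1

def keyboard_edit_distance(s: str, t: str) -> float:
    # Standard Levenshtein DP, with substitution cost weighted by key distance.
    d = [[float(i + j) if i * j == 0 else 0.0 for j in range(len(t) + 1)]
         for i in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + key_distance(s[i - 1], t[j - 1]))
    return d[len(s)][len(t)]

print(keyboard_edit_distance("hello", "hwllo"))  # small: w and e are neighbours
print(keyboard_edit_distance("hello", "hzllo"))  # larger: z and e are far apart
```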
But I think in this case you can still generate typos in images and it'd be learnable; not a hard issue relevant to the OP.
Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.
It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something better than the tokenizer.
Actually, thinking more about that, I consume “text” as images and also as sounds… I kinda wonder: instead of the render-and-OCR this suggests, what if we did TTS and just encoded, say, the MP3 sample of the vocalization of the word? Would that be fewer bytes than the rendered-pixels version? Probably depends on the resolution / sample rate.
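For a rough sense of scale, a back-of-envelope sketch under loudly stated assumptions (1-bit glyph cells of 8x16 px, raw 16 kHz 16-bit mono speech, an assumed speaking rate of 15 characters per second); none of these numbers are measurements.

```python
# Back-of-envelope comparison only; all constants are assumptions.
chars = 1000

pixels_per_char = 8 * 16                     # assumed glyph cell
image_bytes = chars * pixels_per_char // 8   # 1 bit per pixel, uncompressed

speech_rate = 15                             # chars spoken per second (assumed)
audio_bytes = (chars / speech_rate) * 16_000 * 2   # samples/sec * bytes/sample

print(f"raster: ~{image_bytes/1024:.0f} KiB, speech: ~{audio_bytes/1024:.0f} KiB")
# Raw audio comes out orders of magnitude larger; lossy codecs (MP3/Opus)
# narrow the gap, but rendered text still tends to win on raw bytes.
```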
What I would love is an easy way to just convert the page to a mp3 that queues into my podcast app to listen to while taking a walk or driving. It probably exists, but I haven't spent a lot of time looking into it.
I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan then with my eyes
> Ok but what are you going to decode into at generation time, a jpeg of text?
Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated into a suitable input space.
> we process text in many more ways than just reading it
As a token stream is a straightforward function of textual input, then in the case of textual input we should expect to handle the conversion of the character stream to semantic/syntactic units to happen in the LLM.
Moreover, in the case of OCR, graphical information carries (and degrades) information in the way that humans expect; what comes to mind is the eggplant/dick emoji symbolism, or smiling emoji sharing a graphical similarity that can't be deduced from their proximity in Unicode codepoints.
But the tweet itself is kinda an answer to the question you're asking.
Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.
If the neural network can distill a model out of complex input data, especially when many models are routinely trained with data-augmentation practices that actively degrade the input to achieve generalisation, then why are we stuck handling text with silk-glove tokenizers?
https://en.wikipedia.org/wiki/Byte-pair_encoding
It's also not lossy compression at all, it's lossless compression if anything, unlike what some people have claimed here.
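A toy sketch of the byte-pair-encoding loop that link describes, illustrative only: repeatedly merge the most frequent adjacent pair into a new symbol. Because the merge table can be replayed in reverse, nothing is lost, which is the point about it being lossless.

```python
# Toy BPE trainer: merges are reversible, so the encoding is lossless.
from collections import Counter

def bpe_train(text: str, num_merges: int = 10):
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges

tokens, merges = bpe_train("low lower lowest low low", num_merges=5)
print(tokens)   # fewer, longer symbols
print(merges)   # applying these merges in reverse recovers the exact original text
```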
Shocking comments here, what happened to HN? People are so clueless it reads like reddit wtf
Of course, NNs like LLM never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of the percels.
(IME, often my comments which I think are deep get ignored but silly things, where I was thinking "this is too much trolling or obvious", get upvoted; but don't take it the wrong way, I am flattered you like it.)
You still need to map percels to a latent space. But perhaps with some number of dimensions devoted to modes of perception? E.g. audio, visual, etc
However, I believe the percel's components together as a whole would capture the state of the audio+visual+time. I don't think the state of one particular mode (e.g. audio or visual or time) is encoded by a specific subset of the percel's components. Rather, each component of the percel would represent a mixture (or a portion of a mixture) of the audio+visual+time, so you couldn't isolate just the audio, visual, or time state by looking at some specific subset of the percel's components, because each component is itself a mixture of the audio+visual+time state.
I think the classic analogy is that if river 1 and river 2 combine to form river 3, you cannot take a cup of water from river 3 and separate out the portions from river 1 and river 2; they're irreversibly mixed.
It got reviewed by 2 ML scientists and one neuroscientist.
Got totally slammed (and thus rejected) by the ML scientists due to "lack of practical application" and highly endorsed by the neuroscientist.
There's so much unused potential in interdisciplinary research but nobody wants to fund it because it doesn't "fit" into one of the boxes.
While the current AI boom is a bubble, I actually think that AGI nut could get cracked quietly by a company with even modest resources if they get lucky on the right fundamental architectural changes.
This is also "science"
Instead you’d need to resubmit and start the entire process from scratch. What a waste of resources …
It's the final nail that made me quit pursuing a scientific career path, despite having good pubs and a PhD with honours.
Unfortunately it’s what I enjoy the most.
Of course there is an interesting paradox - each layer of the NN doesn't know whether it's connected to the sensors directly, or what kind of abstractions it works with in the latent space. So the boundary between the mind and the sensor is blurred and to some extent a subjective choice.
ChatGPT claims it's possible, but not allowed due to OpenAI safety rules: https://chatgpt.com/share/68fb0a76-6bf8-800c-82f7-605ff9ca22...
At some level you do need something approximating a token. BPE is very compelling for UTF8 sequences. It might be nearly the most ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that. Something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy is at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.
I'd extend this thinking to techniques like MPEG for video. All frame types also use something like the DCT too. The P and B frames are basically the same ideas as the I frame (jpeg), the difference is they take the DCT of the residual between adjacent frames. This is where the compression gets to be insane with video. It's block transforms all the way down.
An 8x8 DCT block for a channel of SDR content is 512 bits of raw information. After quantization and RLE (for typical quality settings), we can get this down to 50-100 bits of information. I feel like this is an extremely reasonable grain to work with.
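For concreteness, an illustrative sketch of that 8x8 block transform using the standard JPEG luminance quantization table; the nonzero-coefficient count is only a crude stand-in for the RLE/entropy-coding stage, not a real codec.

```python
# Illustrative 8x8 DCT + quantization; the input block is a synthetic ramp.
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix.
k = np.arange(N)
C = np.sqrt(2 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0] /= np.sqrt(2)

Q = np.array([  # standard JPEG luminance quantization table (~quality 50)
    [16, 11, 10, 16, 24, 40, 51, 61], [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56], [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77], [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101], [72, 92, 95, 98, 112, 100, 103, 99]])

block = np.add.outer(np.arange(N), np.arange(N)) * 8.0   # smooth 8-bit-ish ramp
coeffs = C @ (block - 128) @ C.T                          # 2-D DCT
quantized = np.round(coeffs / Q).astype(int)

raw_bits = N * N * 8                                      # 512 bits in
nonzero = np.count_nonzero(quantized)
print(f"raw: {raw_bits} bits, nonzero coefficients after quantization: {nonzero}")
```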
It seems to work for vocals as well, not just short samples but entire works. Of course, that's just my impression; there's a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts, and if I were a good enough musician I could replicate what I remember.
Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.
Another interesting thing is that it is possible to search through it fairly rapidly to match a fragment heard to one that I've heard and stored before.
It's so weird that I don't know this. It's like I'm stuck in userland.
https://www.uber.com/blog/neural-networks-jpeg/
It seems crazy to me that image inputs (of text) are smaller and more information dense than text - is that really true? Can somebody help my intuition?
As I understood the responses, the benefit comes from making better use of the embedding space. BPE tokenization is basically a fixed lookup table, whereas when you form "image tokens" you just throw each 16x16 patch into a neural net and (handwave) out comes your embedding. From that, it should be fairly intuitive that since current text-tokenization embedding vectors won't even form a subspace (there can only ever be ~$VOCAB_SIZE distinct points), image tokens have the capacity to be more information dense. And you might hope that the neural network can somehow make use of that extra capacity, since you're not encoding one subword at a time.
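A minimal sketch of that contrast, with deliberately small illustrative sizes and an untrained random projection standing in for the learned patch embedding.

```python
# Lookup-table embedding (text tokens) vs. learned projection (image patches).
import numpy as np

d_model, vocab_size, patch = 64, 10_000, 16   # toy sizes, much smaller than real
rng = np.random.default_rng(0)

# Text path: embedding is a lookup, so at most vocab_size distinct vectors exist.
token_table = rng.normal(size=(vocab_size, d_model))
def embed_token(token_id: int) -> np.ndarray:
    return token_table[token_id]

# Image path: embedding is a function of the pixels, a continuum of possible vectors.
W = rng.normal(size=(patch * patch, d_model)) / np.sqrt(patch * patch)
def embed_patch(pixels: np.ndarray) -> np.ndarray:   # pixels: 16x16 grayscale
    return (pixels.reshape(-1) / 255.0) @ W

print(embed_token(42).shape, embed_patch(rng.integers(0, 256, (16, 16))).shape)
```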
I always found it strange that tokens can't just be symbols, but instead there's an alphabet of 500k tokens, completely removing low-level information from language (rhythm, syllables, etc.), a side effect being simple edge cases like counting the r's in "strawberry", or no way to generate predefined rhyming patterns (without constrained sampling). There's an understandable reason for these big token dictionaries, but it feels like a hack.
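A quick way to see what's being described, assuming the third-party tiktoken library and its cl100k_base vocabulary (just one tokenizer among many): once text is chopped into BPE tokens, character-level structure lives inside opaque IDs.

```python
# Requires the third-party tiktoken package; vocabulary choice is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(ids)      # a handful of integers
print(pieces)   # multi-character chunks (exact split depends on the vocabulary)
print(sum(p.count("r") for p in pieces))  # a count the model never sees directly
```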
Why can't it be a sequence of audio waveforms from human speech?
https://portland.aitinkerers.org/talks/rsvp_fGAlJQAvWUA
There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.
https://academic.oup.com/mbe/article/36/2/220/5229930
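A minimal sketch of the encoding idea behind that kind of pipeline, assuming the usual individuals-by-sites 0/1 alignment representation; the random matrix here is only a stand-in for real genotype data.

```python
# Treat a genotype alignment (individuals x variant sites, 0/1 alleles) as a
# one-channel image for a CNN; random data stands in for a real alignment.
import numpy as np

n_individuals, n_sites = 64, 128
rng = np.random.default_rng(1)
alignment = rng.integers(0, 2, size=(n_individuals, n_sites), dtype=np.uint8)

image = alignment[None, :, :].astype(np.float32)   # channel x height x width
print(image.shape)  # (1, 64, 128): ready to feed to a 2-D convolutional stack
```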
Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
13 more comments available on Hacker News