Should LLMs Just Treat Text Content as an Image?
Key topics
The article proposes treating text content as images for LLMs. Commenters debate whether the approach is effective and what it would imply: some highlight potential benefits, while others point out limitations and counterintuitive aspects.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 6d after posting
- Peak period: 93 comments (Day 7)
- Avg / period: 33 comments
- Based on 99 loaded comments
Key moments
- Story posted: Oct 21, 2025 at 2:10 AM EDT (2 months ago)
- First comment: Oct 27, 2025 at 6:49 AM EDT (6d after posting)
- Peak activity: 93 comments in Day 7, the hottest window of the conversation
- Latest activity: Oct 30, 2025 at 12:41 AM EDT (2 months ago)
Even if true, though, I'm not sure what we'd do with it. The bulk of knowledge available on the internet is text. Aside from maybe YouTube, so I guess it could work for world-model-type things? Understanding physical interactions of objects, etc.
I would also argue tokens are outside the embedding space, and a large part of the magic of LLMs (and many other neural network types) is the ability to map sequences of rather crude inputs (tokens) into a more meaningful embedding space, and then map from that embedding space back to tokens we humans understand.
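As a rough illustration of that roundtrip, here is a minimal sketch with a made-up vocabulary and toy dimensions (not any real model's internals):

```python
import numpy as np

# Toy setup: a 5-word vocabulary and a 4-dimensional embedding space.
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # one row per token

def tokens_to_embeddings(tokens):
    """Map discrete tokens into the continuous embedding space."""
    ids = [vocab.index(t) for t in tokens]
    return embedding_matrix[ids]

def embedding_to_token(vec):
    """Map a point in embedding space back to the nearest discrete token."""
    scores = embedding_matrix @ vec  # similarity to every vocabulary row
    return vocab[int(np.argmax(scores))]

hidden = tokens_to_embeddings(["the", "cat"]).mean(axis=0)  # stand-in for model computation
print(embedding_to_token(hidden))
```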
https://google.github.io/deepvariant/posts/2020-02-20-lookin...
Tabulating data into tables similarly converts it into a visual form so that mistakes or outliers can be spotted.
There's a transformation of the data that is analogous to how a human would use the data to find a variant. It's closer to inspecting a stack of widgets to find a defective one than it is to listing numbers in a table.
The work is not trivial even if the intuition is easy.
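DeepVariant's actual pileup encoding is far more elaborate, but the "render the data so a pattern-recognizer can see it" idea can be sketched with a hypothetical mini-pileup (numpy/matplotlib only):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical pileup: 6 reads covering 12 reference positions.
# Encode bases as small integers so the array can be shown as an image.
base_to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
reads = [
    "ACGTACGTACGT",
    "ACGTACGTACGT",
    "ACGTACCTACGT",  # possible variant at position 6 (G -> C)
    "ACGTACCTACGT",
    "ACGTACGTACGT",
    "ACGTACCTACGT",
]
pileup = np.array([[base_to_int[b] for b in read] for read in reads])

# A human (or a CNN) can spot the column that disagrees with the others.
plt.imshow(pileup, cmap="viridis", interpolation="nearest")
plt.xlabel("reference position")
plt.ylabel("read")
plt.title("Toy pileup: the odd column stands out visually")
plt.show()
```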
This goes against my limited understanding of how LLMs work — and computers generally for that matter. Isn’t that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The similar looking characters point makes more sense to me though assuming it’s something along the lines of recognizing that “S” and “$” are roughly the same thing except for the line down the middle. Still that seems like something that doesn’t come up much and is probably covered by observations made in the training corpus.
All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.
User: cna uto nfeir waht tihs si msieplnig?
Google: Here are top web results for exploring this topic [few websites containing or discussing gibberish in general]
User: what was my previous question?
Google: Your previous question was "cna uto nfeir waht tihs si msieplnig?". I was able to understand this through a combination of contextual understanding and sub-word tokenization, which allowed me to infer the intended meaning despite the numerous typos. The corrected question is: "Can you infer what this is misspelling?".
q: c u i t m o t s? a: “Can you infer the meaning of this sentence?”
I can certainly believe that images bring certain advantages over text for LLMs: the image representation does contain useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages need the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
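To see how crude current text tokenization can be, here is a small sketch using the tiktoken library (the exact splits depend on which vocabulary is used):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["unbelievable", "Unbelievable!", "u n b e l i e v a b l e"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    # The same word can fragment very differently depending on case and spacing.
    print(f"{text!r} -> {pieces}")
```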
> We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.
[0] https://erikbern.com/2013/04/12/annoy.html
[1] https://github.com/spotify/annoy?tab=readme-ov-file
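A minimal sketch of the Annoy workflow the quote describes, with toy dimensions and random vectors standing in for the factorized user/item vectors:

```python
import random
from annoy import AnnoyIndex

f = 40  # dimensionality of the item vectors (toy value)
index = AnnoyIndex(f, "angular")  # angular distance ~ cosine similarity

# Pretend these are item vectors produced by matrix factorization.
for item_id in range(1000):
    vector = [random.gauss(0, 1) for _ in range(f)]
    index.add_item(item_id, vector)

index.build(10)  # 10 trees; more trees -> better recall, slower build

# Nearest neighbours of item 0, i.e. "tracks similar to this track".
print(index.get_nns_by_item(0, 10))
```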
I guess what I was getting at is that they do not technically even need to know genres to recommend songs. In practice, though, they probably have to know them anyway for playlists, but I assume they can have the song owners provide that when the songs are uploaded, and artists specify it when they create their profile.
[0]: https://www.music-tomorrow.com/blog/how-spotify-recommendati...
[0] https://arxiv.org/abs/2503.08638 (grep for "audio token")
[1] https://arxiv.org/abs/2306.05284
[0] https://web.archive.org/web/20140402025221/http://m.nautil.u...
The theory is that vowels and voiced consonants have a fundamental frequency and 5-6 frequencies above that. For vowels the first two frequencies are enough to identify the vowel. For rhotic vowels (r-sounding vowels like American stARt) the 3rd frequency is important.
By converting the audio to the Mel-scale spectrum, it is easier to detect these features. Text to speech using the Mel-spectrum works by modelling and generating these values, which is often easier as the number of parameters is lower and the data is easier to work with [1].
[1] There are other approaches to text to speech such as overlapping short audio segments.
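A minimal sketch of that conversion using librosa, on a synthetic vowel-like tone with toy parameters:

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)

# Crude vowel-like signal: a 120 Hz fundamental plus a few harmonics.
f0 = 120.0
y = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 8))
y = y.astype(np.float32)

# Mel-scale spectrogram: the frequency axis is warped to match human pitch
# perception, which makes formant-like energy bands easier to pick out.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, n_frames)
```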
Speech recognition is based around recognizing the frequency correlates of speech generation/articulation, mainly the frequency bands that are attenuated by vocal tract resonances as articulation changes the shape of the vocal tract.
The fundamental frequency, f0, of someone's voice is not important to speech recognition - that is just the frequency with which their vocal cords are opening and closing, corresponding to a high-pitched voice (e.g. typical female or child) vs a low-pitched one (male).
What happens during speech production is that due to the complex waveform generated by the asymmetrically timed opening and closing of the vocal cords (slow open, fast close), not only is the fundamental frequency, f0, generated, but also harmonics of it - 2xf0, 3xf0, 4xf0, etc. The resonances of the vocal tract then attenuate certain frequency ranges within this spectrum of frequencies, and it's these changing attenuated frequency ranges, aka formants, that effectively carry the articulation/speech information.
The frequency ranges of the formants also vary according to the length of the vocal tract, which varies between individuals, so it's not specific frequencies such as f0 or its harmonics that carry the speech information, but rather changing patterns of attenuation (formants).
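A toy numerical illustration of that source-filter idea (numpy/scipy only; the two resonances are simplistic stand-ins for real vocal-tract formants):

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

sr = 16000
t = np.arange(sr) / sr

# Source: glottal-like pulse train at f0 = 120 Hz, rich in harmonics (2*f0, 3*f0, ...).
f0 = 120.0
source = (np.sin(2 * np.pi * f0 * t) > 0.99).astype(float)

# Filter: two crude "formant" resonances; their centre frequencies, not f0,
# are what change as articulation changes the shape of the vocal tract.
speech = source
for formant_hz in (700.0, 1200.0):  # roughly /a/-like formant frequencies
    b, a = iirpeak(formant_hz, Q=10, fs=sr)
    speech = lfilter(b, a, speech)

print(speech.shape)
```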
If it gives better results (something for which no evidence is presented), that'd be interesting, but it wouldn't be because of the larger data size of the uploaded image vs the text.
Even having learnt to do all of this, or perhaps with a page-of-text sequence-of-word extractor pre-processor, the LLM would then need to learn to generalize over different font faces and sizes, and imperfect speckled and/or distorted scans.
Finally, but surely not least, if the goal is to reduce (inference?) computational load by representing multiple words as a single image token, then it seems that more training epochs may be needed, with variations in word grouping, since the same sequence of words would not always be grouped together, so the LLM would have to learn that an image token representing "the cat sat" may also have been split up as "today the cat" and "sat on the".
A better way to reduce the number of tokens being processed might be to have the LLM learn how to combine multiple adjacent tokens into one, perhaps starting with individual letters at the input, although this would of course require a fairly major change to the Transformer architecture.
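A minimal sketch of what "combining multiple adjacent tokens into one" could look like at the embedding level (PyTorch, toy sizes; a real architecture would learn where and how to merge rather than using a fixed group size):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, group = 1, 12, 16, 4  # toy sizes

token_embeddings = torch.randn(batch, seq_len, d_model)

# Concatenate each group of `group` adjacent token embeddings and project
# back down to d_model, so 12 input tokens become 3 merged tokens.
merge = nn.Linear(group * d_model, d_model)

grouped = token_embeddings.reshape(batch, seq_len // group, group * d_model)
merged = merge(grouped)
print(token_embeddings.shape, "->", merged.shape)  # (1, 12, 16) -> (1, 3, 16)
```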
I've had mixed results with LLMs for OCR: sometimes excellent (zero errors on a photo of my credit card bill), but poor if the source wasn't a printed page, sometimes "reusing" the same image section for multiple extracted words!
FWIW, I highly doubt that LLMs have just learnt to scan pages from (page image, page text) training pairs - more likely text-heavy image input is triggering special OCR handling.
Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text? - https://news.ycombinator.com/item?id=45658928 - Oct 2025 (172 comments)
I think I saw a benchmark on HN comparing which text file formats give better results when fed to an LLM. I wish that study had thrown images into the comparison.
Sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs "time flies like an arrow" would be the same.
> The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
Does someone understand the difference he is pointing at?
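One way to read the distinction in code (a minimal sketch; the vocabulary size, dimensions, and patch projection are toy stand-ins, not any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 1000

# Text: a token id can only select one of 50,000 fixed rows -- a discrete,
# finite set of points in the 1000-dimensional embedding space.
text_embedding_table = rng.normal(size=(vocab_size, d_model))
text_token_embedding = text_embedding_table[1234]

# Image: a patch of pixels is projected into the same space, and the result
# can land almost anywhere, not just on one of the 50,000 fixed points.
patch_pixels = rng.random(16 * 16 * 3)  # a 16x16 RGB patch, flattened
patch_projection = rng.normal(size=(16 * 16 * 3, d_model))
image_token_embedding = patch_pixels @ patch_projection

print(text_token_embedding.shape, image_token_embedding.shape)  # both (1000,)
```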
Since the first multimodal LLMs came out, I've been using this approach when I deal with documents. It makes the code much simpler because everything is an image, and it's surprisingly robust.
It also works for embeddings (Cohere Embed v4).
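A minimal sketch of that "everything is an image" pipeline, assuming pdf2image for page rendering; the multimodal or embedding API call itself is left abstract, since it differs per provider:

```python
import base64
import io

from pdf2image import convert_from_path  # requires poppler installed

def document_to_image_payloads(pdf_path: str, dpi: int = 150) -> list[str]:
    """Render each page to a PNG and base64-encode it for a multimodal API."""
    payloads = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        payloads.append(base64.b64encode(buf.getvalue()).decode("ascii"))
    return payloads

# Each payload can then be sent to a vision-capable LLM or an image-capable
# embedding model instead of extracting and cleaning the text first.
```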
Like text in dreams. Interesting coincidence.
fastText can handle words it hasn't seen before by composing them from character n-grams; word2vec can learn a better meaning for whole words, but then misses out on those "unknown words".
Image tokens are "text2vec" here, while text tokens are a proxy for building a text embedding of even previously unseen text.
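A small sketch of that difference using gensim (tiny built-in toy corpus, so the vectors themselves are meaningless; the point is only that fastText returns something for an unseen word while word2vec does not):

```python
from gensim.models import FastText, Word2Vec
from gensim.test.utils import common_texts  # tiny built-in toy corpus

ft = FastText(sentences=common_texts, vector_size=32, min_count=1, min_n=3, max_n=5, epochs=10)
w2v = Word2Vec(sentences=common_texts, vector_size=32, min_count=1, epochs=10)

# fastText builds a vector for an unseen word from its character n-grams...
print(ft.wv["computerized"].shape)

# ...while word2vec only knows whole words it saw during training.
try:
    w2v.wv["computerized"]
except KeyError:
    print("word2vec: unknown word")
```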
To improve your memory recall on any task, like say memorizing the order of an entire deck of cards... convert the number sequences to a set of visual images!
Anyone have an LLM that can take a 300 page PDF magazine (with no OCR) and summarize it? :)
This is an interesting point. But if this were true, couldn't we just put a text convolutional NN below the transformer stack to reduce the token count by a similar factor of 10?
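Something along those lines is easy to sketch (PyTorch, toy sizes; whether a fixed strided convolution preserves enough information is the open question):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 1, 1000, 512  # toy sizes

token_embeddings = torch.randn(batch, seq_len, d_model)

# A strided convolution "below the transformer stack": kernel and stride of 10
# turn 1000 token embeddings into 100 coarser ones before attention sees them.
downsample = nn.Conv1d(d_model, d_model, kernel_size=10, stride=10)

x = token_embeddings.transpose(1, 2)  # Conv1d expects (batch, channels, length)
compressed = downsample(x).transpose(1, 2)
print(token_embeddings.shape, "->", compressed.shape)  # (1, 1000, 512) -> (1, 100, 512)
```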