Should LLMs Just Treat Text Content as an Image?
Key topics
The article proposes treating text content as images for LLMs. Commenters debate whether the approach is effective and what it would imply: some highlight potential benefits, while others point out limitations and counterintuitive aspects.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 6d after posting
- Peak period: 93 comments (Day 7)
- Avg / period: 33 comments
- Based on 99 loaded comments
Key moments
- Story posted: Oct 21, 2025 at 2:10 AM EDT (2 months ago)
- First comment: Oct 27, 2025 at 6:49 AM EDT (6d after posting)
- Peak activity: 93 comments in Day 7, the hottest window of the conversation
- Latest activity: Oct 30, 2025 at 12:41 AM EDT (2 months ago)
Even if true, though, I'm not sure what we'd do with it. The bulk of knowledge available on the internet is text. Aside from maybe YouTube, so I guess it could work for world-model-type things? Understanding physical interactions of objects, etc.
I would also argue tokens are outside the embedding space, and a large part of the magic of LLMs (and many other neural network types) is the ability to map sequences of rather crude inputs (tokens) into a more meaningful embedding space, and then map from that embedding space back to tokens we humans understand.
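As a rough illustration of that roundtrip, here is a minimal sketch with a made-up vocabulary and toy dimensions (not any real model's internals):

```python
import numpy as np

# Toy setup: a 5-word vocabulary and a 4-dimensional embedding space.
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))  # one row per token

def tokens_to_embeddings(tokens):
    """Map discrete tokens into the continuous embedding space."""
    ids = [vocab.index(t) for t in tokens]
    return embedding_matrix[ids]

def embedding_to_token(vec):
    """Map a point in embedding space back to the nearest discrete token."""
    scores = embedding_matrix @ vec  # similarity to every vocabulary row
    return vocab[int(np.argmax(scores))]

hidden = tokens_to_embeddings(["the", "cat"]).mean(axis=0)  # stand-in for model computation
print(embedding_to_token(hidden))
```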
https://google.github.io/deepvariant/posts/2020-02-20-lookin...
Tabulating data into tables similarly converts it into a visual form so that mistakes or outliers can be spotted.
There's a transformation of the data that is analogous to how a human would use the data to find a variant. It's closer to inspecting a stack of widgets to find a defective one than it is to listing numbers in a table.
The work is not trivial even if the intuition is easy.
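DeepVariant's actual pileup encoding is far more elaborate, but the "render the data so a pattern-recognizer can see it" idea can be sketched with a hypothetical mini-pileup (numpy/matplotlib only):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical pileup: 6 reads covering 12 reference positions.
# Encode bases as small integers so the array can be shown as an image.
base_to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
reads = [
    "ACGTACGTACGT",
    "ACGTACGTACGT",
    "ACGTACCTACGT",  # possible variant at position 6 (G -> C)
    "ACGTACCTACGT",
    "ACGTACGTACGT",
    "ACGTACCTACGT",
]
pileup = np.array([[base_to_int[b] for b in read] for read in reads])

# A human (or a CNN) can spot the column that disagrees with the others.
plt.imshow(pileup, cmap="viridis", interpolation="nearest")
plt.xlabel("reference position")
plt.ylabel("read")
plt.title("Toy pileup: the odd column stands out visually")
plt.show()
```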
This goes against my limited understanding of how LLMs work — and computers generally for that matter. Isn’t that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The similar looking characters point makes more sense to me though assuming it’s something along the lines of recognizing that “S” and “$” are roughly the same thing except for the line down the middle. Still that seems like something that doesn’t come up much and is probably covered by observations made in the training corpus.
All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.
User: cna uto nfeir waht tihs si msieplnig?
Google: Here are top web results for exploring this topic [few websites containing or discussing gibberish in general]
User: what was my previous question?
Google: Your previous question was "cna uto nfeir waht tihs si msieplnig?". I was able to understand this through a combination of contextual understanding and sub-word tokenization, which allowed me to infer the intended meaning despite the numerous typos. The corrected question is: "Can you infer what this is misspelling?".
q: c u i t m o t s? a: “Can you infer the meaning of this sentence?”
I can certainly believe that images bring certain advantages over text for LLMs: the image representation does contain useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages need the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
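To see how crude current text tokenization can be, here is a small sketch using the tiktoken library (the exact splits depend on which vocabulary is used):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["unbelievable", "Unbelievable!", "u n b e l i e v a b l e"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    # The same word can fragment very differently depending on case and spacing.
    print(f"{text!r} -> {pieces}")
```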
> We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.
[0] https://erikbern.com/2013/04/12/annoy.html
[1] https://github.com/spotify/annoy?tab=readme-ov-file
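A minimal sketch of the Annoy workflow the quote describes, with toy dimensions and random vectors standing in for the factorized user/item vectors:

```python
import random
from annoy import AnnoyIndex

f = 40  # dimensionality of the item vectors (toy value)
index = AnnoyIndex(f, "angular")  # angular distance ~ cosine similarity

# Pretend these are item vectors produced by matrix factorization.
for item_id in range(1000):
    vector = [random.gauss(0, 1) for _ in range(f)]
    index.add_item(item_id, vector)

index.build(10)  # 10 trees; more trees -> better recall, slower build

# Nearest neighbours of item 0, i.e. "tracks similar to this track".
print(index.get_nns_by_item(0, 10))
```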
I guess what I was getting at is that they do not technically even need to know genres to recommend songs. In practice, though, they probably have to know them anyway for playlists, but I assume they can have the song owners provide that when the songs are uploaded, and artists specify it when they create their profile.
[0]: https://www.music-tomorrow.com/blog/how-spotify-recommendati...
[0] https://arxiv.org/abs/2503.08638 (grep for "audio token")
[1] https://arxiv.org/abs/2306.05284
[0] https://web.archive.org/web/20140402025221/http://m.nautil.u...
The theory is that vowels and voiced consonants have a fundamental frequency and 5-6 frequencies above that. For vowels the first two frequencies are enough to identify the vowel. For rhotic vowels (r-sounding vowels like American stARt) the 3rd frequency is important.
By converting the audio to the Mel-scale spectrum, it is easier to detect these features. Text to speech using the Mel-spectrum works by modelling and generating these values, which is often easier as the number of parameters is lower and the data is easier to work with [1].
[1] There are other approaches to text to speech such as overlapping short audio segments.
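A minimal sketch of that conversion using librosa, on a synthetic vowel-like tone with toy parameters:

```python
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)

# Crude vowel-like signal: a 120 Hz fundamental plus a few harmonics.
f0 = 120.0
y = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 8))
y = y.astype(np.float32)

# Mel-scale spectrogram: the frequency axis is warped to match human pitch
# perception, which makes formant-like energy bands easier to pick out.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, n_frames)
```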
Speech recognition is based around recognizing the frequency correlates of speech generation/articulation, mainly the frequency bands that are attenuated by vocal tract resonances as articulation changes the shape of the vocal tract.
The fundamental frequency, f0, of someone's voice is not important to speech recognition - that is just the frequency with which their vocal cords are opening and closing, corresponding to a high-pitched voice (e.g. typical female or child) vs a low-pitched one (male).
What happens during speech production is that due to the complex waveform generated by the asymmetrically timed opening and closing of the vocal cords (slow open, fast close), not only is the fundamental frequency, f0, generated, but also harmonics of it - 2xf0, 3xf0, 4xf0, etc. The resonances of the vocal tract then attenuate certain frequency ranges within this spectrum of frequencies, and it's these changing attenuated frequency ranges, aka formants, that effectively carry the articulation/speech information.
The frequency ranges of the formants also vary according to the length of the vocal tract, which varies between individuals, so it's not specific frequencies such as f0 or its harmonics that carry the speech information, but rather changing patterns of attenuation (formants).
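A toy numerical illustration of that source-filter idea (numpy/scipy only; the two resonances are simplistic stand-ins for real vocal-tract formants):

```python
import numpy as np
from scipy.signal import iirpeak, lfilter

sr = 16000
t = np.arange(sr) / sr

# Source: glottal-like pulse train at f0 = 120 Hz, rich in harmonics (2*f0, 3*f0, ...).
f0 = 120.0
source = (np.sin(2 * np.pi * f0 * t) > 0.99).astype(float)

# Filter: two crude "formant" resonances; their centre frequencies, not f0,
# are what change as articulation changes the shape of the vocal tract.
speech = source
for formant_hz in (700.0, 1200.0):  # roughly /a/-like formant frequencies
    b, a = iirpeak(formant_hz, Q=10, fs=sr)
    speech = lfilter(b, a, speech)

print(speech.shape)
```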
If it gives better results (something for which no evidence is presented), that'd be interesting, but it wouldn't be because of the larger data size of the uploaded image vs the text.
Even having learnt to do all of this, or perhaps with a page-of-text sequence-of-word extractor pre-processor, the LLM would then need to learn to generalize over different font faces and sizes, and imperfect speckled and/or distorted scans.
Finally, but surely not least, if the goal is to reduce (inference?) computational load by representing multiple words as a single image token, then it seems that more training epochs may be needed, with variations in word grouping, since the same sequence of words would not always be grouped together, so the LLM would have to learn that an image token representing "the cat sat" may also have been split up as "today the cat" and "sat on the".
A better way to reduce the number of tokens being processed might be to have the LLM learn how to combine multiple adjacent tokens into one, perhaps starting with individual letters at the input, although this would of course require a fairly major change to the Transformer architecture.
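A minimal sketch of what "combining multiple adjacent tokens into one" could look like at the embedding level (PyTorch, toy sizes; a real architecture would learn where and how to merge rather than using a fixed group size):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, group = 1, 12, 16, 4  # toy sizes

token_embeddings = torch.randn(batch, seq_len, d_model)

# Concatenate each group of `group` adjacent token embeddings and project
# back down to d_model, so 12 input tokens become 3 merged tokens.
merge = nn.Linear(group * d_model, d_model)

grouped = token_embeddings.reshape(batch, seq_len // group, group * d_model)
merged = merge(grouped)
print(token_embeddings.shape, "->", merged.shape)  # (1, 12, 16) -> (1, 3, 16)
```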
I've had mixed results with LLMs for OCR: sometimes excellent (zero errors on a photo of my credit card bill), but poor if the source wasn't a printed page, sometimes "reusing" the same image section for multiple extracted words!
FWIW, I highly doubt that LLMs have just learnt to scan pages from (page image, page text) training pairs - more likely text-heavy image input is triggering special OCR handling.
Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text? - https://news.ycombinator.com/item?id=45658928 - Oct 2025 (172 comments)
I think I saw a benchmark on HN comparing which text file formats give better results when fed to an LLM. I wish that study had thrown images into the comparison.
Sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs "time flies like an arrow" would be the same.
> The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
Does someone understand the difference he is pointing at?
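One way to read the distinction in code (a minimal sketch; the vocabulary size, dimensions, and patch projection are toy stand-ins, not any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 1000

# Text: a token id can only select one of 50,000 fixed rows -- a discrete,
# finite set of points in the 1000-dimensional embedding space.
text_embedding_table = rng.normal(size=(vocab_size, d_model))
text_token_embedding = text_embedding_table[1234]

# Image: a patch of pixels is projected into the same space, and the result
# can land almost anywhere, not just on one of the 50,000 fixed points.
patch_pixels = rng.random(16 * 16 * 3)  # a 16x16 RGB patch, flattened
patch_projection = rng.normal(size=(16 * 16 * 3, d_model))
image_token_embedding = patch_pixels @ patch_projection

print(text_token_embedding.shape, image_token_embedding.shape)  # both (1000,)
```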
Since the first multimodal LLMs came out, I've been using this approach when I deal with documents. It makes the code much simpler because everything is an image, and it's surprisingly robust.
It also works for embeddings (Cohere Embed v4).
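A minimal sketch of that "everything is an image" pipeline, assuming pdf2image for page rendering; the multimodal or embedding API call itself is left abstract, since it differs per provider:

```python
import base64
import io

from pdf2image import convert_from_path  # requires poppler installed

def document_to_image_payloads(pdf_path: str, dpi: int = 150) -> list[str]:
    """Render each page to a PNG and base64-encode it for a multimodal API."""
    payloads = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        payloads.append(base64.b64encode(buf.getvalue()).decode("ascii"))
    return payloads

# Each payload can then be sent to a vision-capable LLM or an image-capable
# embedding model instead of extracting and cleaning the text first.
```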
Like text in dreams. Interesting coincidence.
fastText can handle words it hasn't seen before by composing them from character n-grams; word2vec can learn a better meaning for whole words, but then misses out on those "unknown words".
Image tokens are "text2vec" here, while text tokens are a proxy for building a text embedding of even previously unseen text.
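A small sketch of that difference using gensim (tiny built-in toy corpus, so the vectors themselves are meaningless; the point is only that fastText returns something for an unseen word while word2vec does not):

```python
from gensim.models import FastText, Word2Vec
from gensim.test.utils import common_texts  # tiny built-in toy corpus

ft = FastText(sentences=common_texts, vector_size=32, min_count=1, min_n=3, max_n=5, epochs=10)
w2v = Word2Vec(sentences=common_texts, vector_size=32, min_count=1, epochs=10)

# fastText builds a vector for an unseen word from its character n-grams...
print(ft.wv["computerized"].shape)

# ...while word2vec only knows whole words it saw during training.
try:
    w2v.wv["computerized"]
except KeyError:
    print("word2vec: unknown word")
```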
To improve your memory recall on any task, like say memorizing the order of an entire deck of cards... convert the number sequences to a set of visual images!
Anyone have an LLM that can take a 300 page PDF magazine (with no OCR) and summarize it? :)
This is an interesting point. But if this were true, couldn't we just put a text convolutional NN below the transformer stack to reduce the token count by a similar factor of 10?
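Something along those lines is easy to sketch (PyTorch, toy sizes; whether a fixed strided convolution preserves enough information is the open question):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 1, 1000, 512  # toy sizes

token_embeddings = torch.randn(batch, seq_len, d_model)

# A strided convolution "below the transformer stack": kernel and stride of 10
# turn 1000 token embeddings into 100 coarser ones before attention sees them.
downsample = nn.Conv1d(d_model, d_model, kernel_size=10, stride=10)

x = token_embeddings.transpose(1, 2)  # Conv1d expects (batch, channels, length)
compressed = downsample(x).transpose(1, 2)
print(token_embeddings.shape, "->", compressed.shape)  # (1, 1000, 512) -> (1, 100, 512)
```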