Can You Save on LLM Tokens Using Images Instead of Text?
Posted 2 months ago · Active about 2 months ago
pagewatch.ai · Tech · story
Sentiment: calm / mixed
Debate: 40/100
Key topics
Large Language Models
Token Optimization
Image Processing
The article explores using images instead of text to save on LLM tokens, sparking discussion on the trade-offs between token count, accuracy, and processing complexity.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 6d
Peak period: 8 comments (144-156h)
Avg / period: 4.8
Comment distribution: 19 data points (based on 19 loaded comments)
Key moments
01. Story posted: Nov 1, 2025 at 6:34 PM EDT (2 months ago)
02. First comment: Nov 7, 2025 at 11:33 PM EST (6d after posting)
03. Peak activity: 8 comments in the 144-156h window, the hottest stretch of the conversation
04. Latest activity: Nov 10, 2025 at 2:51 AM EST (about 2 months ago)
ID: 45786042 · Type: story · Last synced: 11/20/2025, 12:29:33 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...
https://arxiv.org/abs/2010.11929
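For intuition on the trade-off the article weighs, here is a rough sketch of the token arithmetic, assuming tiktoken's cl100k_base encoding for the text side and the commonly cited 85-base-plus-170-per-512px-tile heuristic for the image side; both are assumptions tied to one provider's published rule of thumb, not a universal constant.

```python
import math

import tiktoken

def text_tokens(text: str) -> int:
    # Token count under one common BPE vocabulary; other models differ.
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

def image_tokens(width: int, height: int) -> int:
    # Simplified version of one provider's heuristic: 85 base tokens plus
    # 170 per 512x512 tile. Real APIs rescale the image before tiling, so
    # treat this as a rough sketch, not a billing formula.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

page = "word " * 700                 # stand-in for a dense page of prose
print(text_tokens(page))             # ~700 tokens as text
print(image_tokens(1024, 1024))      # 85 + 170 * 4 = 765 tokens as an image
```

On this naive arithmetic a plain page of prose is roughly a wash, which is why the thread's trade-offs around resolution and accuracy carry the argument.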
"""
"""It's the same as LLMs being able to "decode" Base64, or work with sub-word tokens for that matter, it just learns to predict that:
<compressed representation> will be followed by (or preceded by) <decompressed representation>, or vice versa.
The how is variable. The calm paper seems to have used a MLP to compress from and ND input (N embeddings of size D) into a single D embedding and other for decompress them back
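A minimal sketch of that compress/decompress pair, assuming a PyTorch-style pair of MLPs trained as an autoencoder; the layer sizes, module names, and loss here are illustrative guesses, not the actual CALM architecture.

```python
import torch
import torch.nn as nn

class PatchCompressor(nn.Module):
    """Compress N embeddings of size D into one D-dim vector, and back.
    Hyperparameters are illustrative, not taken from the CALM paper."""
    def __init__(self, n_tokens: int = 4, d_model: int = 768, d_hidden: int = 2048):
        super().__init__()
        # Compressor: flatten the N x D block and project it down to D.
        self.compress = nn.Sequential(
            nn.Linear(n_tokens * d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )
        # Decompressor: project the single D vector back up to N x D.
        self.decompress = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, n_tokens * d_model),
        )
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # x: (batch, N, D) -> z: (batch, D) -> x_hat: (batch, N, D)
        batch = x.shape[0]
        z = self.compress(x.reshape(batch, -1))
        x_hat = self.decompress(z).reshape(batch, self.n_tokens, self.d_model)
        return z, x_hat

# A reconstruction loss is what ties the compressed and decompressed views together.
model = PatchCompressor()
x = torch.randn(2, 4, 768)
z, x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)
```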
Hit this issue while optimizing LLM request times. Ended up lowering the image resolution; lost some accuracy, but that was bearable.
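A rough sketch of that mitigation using Pillow; the 768px cap, JPEG quality, and base64 payload shape are assumptions about a typical vision-API request, not any specific provider's requirements.

```python
import base64
import io

from PIL import Image

def downscale_for_llm(path: str, max_side: int = 768) -> str:
    """Shrink an image so its longest side is at most max_side pixels,
    then return a base64 JPEG payload. Fewer pixels generally means
    fewer image tokens, at some cost in OCR/detail accuracy."""
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1:  # only shrink, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```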