11/14/2025, 6:35:22 PM

AI World Clocks

1330 points
368 comments

Mood

excited

Sentiment

positive

Category

tech

Key topics

AI

art

generative models

Debate intensity: 20/100
"Every minute, a new clock is rendered by nine different AI models."

A website showcasing clocks generated by nine different AI models every minute.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

15m

Peak period

156

Day 1

Avg / period

53.3

Comment distribution: 160 data points

Based on 160 loaded comments

Key moments

  1. Story posted: 11/14/2025, 6:35:22 PM (4d ago)
  2. First comment: 11/14/2025, 6:50:01 PM (15m after posting)
  3. Peak activity: 156 comments in Day 1, the hottest window of the conversation
  4. Latest activity: 11/18/2025, 12:25:44 AM (1d ago)


Discussion (368 comments)
Showing 160 comments of 368
kfarr
4d ago
1 reply
Add some voting and you got yourself an AI World Clock arena! https://artificialanalysis.ai/image/arena
BrandoElFollito
4d ago
Thank you very much.... It was a fun game until I got to the prompt

Place a baby elephant in the green chair

I cannot unsee what I saw and it is 21:30 here so I have an hour or so to eliminate the picture from my mind or I will have nightmares.

syx
4d ago
1 reply
I’m very curious about the monthly bill for such a creative project; surely some of these are pre-rendered?
coffeecoders
4d ago
2 replies
Napkin math:

9 AIs × 43,200 minutes = 388,800 requests/month

388,800 requests × 200 tokens = 77,760,000 tokens/month ≈ 78M tokens

Cost varies from 10 cents to $1 per 1M tokens.

Using the mid-price, the cost is around $50/month.

---

Hopefully, the OP has this endpoint protected - https://clocks.brianmoore.com/api/clocks?time=11:19AM
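
To make the napkin math concrete, here is a minimal Python sketch of the same estimate. All figures (request volume, tokens per response, and the three price points) are assumptions taken from this thread, not measured values:

  # Napkin math for the site's monthly LLM bill (all figures are
  # assumptions from this thread, not measurements).
  MODELS = 9
  MINUTES_PER_MONTH = 60 * 24 * 30         # 43,200
  TOKENS_PER_RESPONSE = 200                # low estimate; the stated cap is 2,000

  requests = MODELS * MINUTES_PER_MONTH    # 388,800 requests/month
  tokens = requests * TOKENS_PER_RESPONSE  # 77,760,000 ≈ 78M tokens/month

  for label, usd_per_million in [("cheap", 0.10), ("mid", 0.55), ("pricey", 1.00)]:
      cost = tokens / 1_000_000 * usd_per_million
      print(f"{label:>6}: ${cost:,.2f}/month")  # "mid" lands near the ~$50 above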

fouc
2d ago
It was limited to 2,000 tokens each, and I assume it usually hit that cap, so it could be closer to 777M tokens/month, assuming they didn't just cache the results and start rotating them after a day or two.
whimsicalism
4d ago
I think it is cached at the minute level; responses cannot be that fast.
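
If it is cached at the minute level, the server logic could be as simple as keying each model's response on the current time string. A hypothetical sketch, not the site's actual code (generate stands in for the per-model LLM call):

  from datetime import datetime

  _cache: dict[tuple[str, str], str] = {}  # (model, "HH:MM AM/PM") -> clock HTML

  def clock_html(model: str, generate) -> str:
      # Serve this minute's cached clock, generating it at most once per
      # model per minute regardless of how many visitors hit the page.
      key = (model, datetime.now().strftime("%I:%M%p"))
      if key not in _cache:
          _cache[key] = generate(key[1])  # the single LLM call for this minute
      return _cache[key]
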
ugh123
4d ago
3 replies
Cool, and marginally informative on the current state of things, but kind of a waste of energy given that everything is re-done every minute to compare. We'd probably only need a handful of runs from each to see the meaningful differences.
whoisjuan
4d ago
4 replies
It's actually quite fascinating if you watch it for 5 minutes. Some models are overall bad, but others nail it in one minute and butcher it in the next.

It's perhaps the best example I have seen of model drift driven by just small, seemingly unimportant changes to the prompt.

alister
4d ago
2 replies
> model drift driven by just small, seemingly unimportant changes to the prompt

What changes to the prompt are you referring to?

According to the comment on the site, the prompt is the following:

Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.

The prompt doesn't seem to change.

sambaumann
4d ago
Presumably the time is replaced with the actual current time at each generation. I wonder if they are actually generated every minute, or if all 6,480 permutations (720 distinct 12-hour times × 9 LLMs) were generated once and just shown on a schedule.
whoisjuan
4d ago
The time given to the model. So the difference between two generations is just something trivially different, like "12:35" vs "12:36".
moffkalast
4d ago
Kimi seems to be the only reliable one, which is a bit surprising, and GPT-4o is consistently better than GPT-5, which on the other hand is unfortunately not surprising at all.
nbaugh1
4d ago
It is really interesting to watch them for a while. QWEN keeps outputting some really abstract interpretations of a clock, KIMI is consistently very good, and GPT-5's results line up exactly with my experience of its code output (overly complex and never working correctly).
bglusman
4d ago
We can't know how much is about the prompt, though, and how much is just stochastic randomness in the behavior of that model on that prompt, right? I mean, even given identical prompts, even at temp 0, models don't always behave identically, at least as far as I know. Some of the reasons why are, I think, still a research question, but I think it's a fact nonetheless.
ascorbic
4d ago
2 replies
The energy usage is minuscule.
jdiff
4d ago
3 replies
It's wasteful. If someone built a clock out of 47 microservices that called out to 193 APIs to check the current time, location, time zone, and preferred display format we'd rightfully criticize it for similar reasons.

In a world where Javascript and Electron are still getting (again, rightfully) skewered for inefficiency despite often exceeding the performance of many compiled languages, we should not dismiss the discussion around efficiency so easily.

saulpw
4d ago
1 reply
Let's do some math.

60 × 24 × 30 ≈ 43k AI calls per month per model. Let's suppose there are 1,000 output tokens per call (might it be 10k tokens? Seems like a lot for this task). So ~43M tokens per model.

The price for 1M output tokens[0] ranges from $0.10 (qwen-2.5) to $60 (GPT-4). So about $4/mo for the cheapest, and about $2.6k/mo for the most expensive.

So this might cost several thousand dollars a month? Something smells funny. But you're right, throttling it to once an hour would achieve a similar goal and likely cost less than $100/mo (which is still more than I would spend on a project like this).

[0] https://pricepertoken.com/

qwe----3
4d ago
They use 4o (maybe a mini version?).
berkes
4d ago
Yes it is wasteful.

But I presume you light up Christmas lights in December, drive to the theater to watch a movie, or fire up a campfire on holiday. That too is "wasteful": it's not needed, and other, far more efficient ways exist to achieve the same thing. And in absolute numbers, those are far more energy-intensive than running an LLM to create 9 clocks every minute. We do things to learn, have fun, be weird, make art, or just spend time.

Now, if Rolex starts building watches by running an LLM to drive its production machines or if we replace millions of wall clocks with ones that "Run an LLM every second", then sure, the waste is an actual problem.

The point I'm trying to make is that it's OK to consider or debate the energy use of LLMs compared to alternatives. But bringing up that debate in a context where someone is being creative, or having fun, is not, IMO. A lot of "fun" activities use a lot of energy, and that isn't automatically "wasteful" either.

Arisaka1
4d ago
What I find amusing about this argument is that no one ever brought up power savings when, e.g., someone used "let me google that for you" instead of giving the answer to a question, because we saw the utility of teaching others how to Google. But apparently we can't see the utility of measuring the oversold competence of current AI models, given a sufficiently large sample size.
ugh123
4d ago
Hmm, curious. How did you come up with that?
energy123
4d ago
I sort of assumed they cached like 30 inferences and just repeat them, but maybe I'm being too cynical.
PeterStuer
4d ago
2 replies
Why? This is diagonal to how LLMs work, and trivially solved by a minimal hybrid front/sub system.
bayindirh
4d ago
1 reply
Because LLMs are touted as the silver bullet of silver bullets. Built upon the world's knowledge, and with the capacity to call upon updated information with agents, they were supposed to rival the top programmers as of 3 days ago.
awkwam
4d ago
1 reply
They might be touted like that, but it seems like you don't understand how they work. The example in the article shows that the prompt is limiting the LLM by giving it access to only 2,000 tokens and also saying "ONLY OUTPUT ...". This is like me asking you to solve the same problem but forcing you to deactivate half of your brain and forget any programming experience you have. It's just stupid.
bayindirh
4d ago
> like you don't understand how they work.

I would not make such assumptions.

> The example in the article shows that the prompt is limiting the LLM by giving it access to only 2000 tokens and also saying "ONLY OUTPUT ..."

The site is pretty simple, method is pretty straightforward. If you believe this is unfair, you can always build one yourself.

> It's just stupid.

No, it's a great way of testing things within constraints.

em3rgent0rdr
4d ago
To gauge.
em3rgent0rdr
4d ago
4 replies
Most look like they were done by a beginner programmer on crack, but every once in a while a correct one appears.
pixl97
4d ago
2 replies
DeepSeek and Kimi seem to have correct ones most of the time I've looked.
em3rgent0rdr
4d ago
1 reply
yes, and sometimes Grok.
pixl97
4d ago
The hour hand commonly seems off on Grok.
BrandoElFollito
4d ago
DeepSeek told me that it cannot generate pictures and suggested code (which is very different)
shafoshaf
4d ago
2 replies
It's interesting how drawing a clock is one of the primary signals for dementia. https://www.verywellhealth.com/the-clock-drawing-test-98619
BrandoElFollito
4d ago
This is very interesting, thank you.

I could not get to the story because of the cookie banner, which does not work (at least on mobile Chrome and Firefox). The Internet Archive page: https://archive.ph/qz4ep

I wonder how this test could be modified for people that have neurological problems - my father's hands shake a lot but I would like to try the test on him (I do not have suspicions, just curious).

I passed it :)

technothrasher
4d ago
"One variation of the test is to provide the person with a blank piece of paper and ask them to draw a clock showing 10 minutes after 11. The word "hands" is not used to avoid giving clues."

Hmm, ambiguity. I would be the smart ass that drew a digital clock for them, or a shaku-dokei.

energy123
4d ago
If they can identify which one is correct, then it's the same as always being correct, just with an expensive compute budget.
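
That idea, best-of-N sampling against a verifier, in a minimal sketch; generate and score_clock are hypothetical placeholders for an LLM call and whatever correctness check you trust:

  def best_of_n(generate, score_clock, n: int = 9) -> str:
      # Sample n candidate clocks and keep the one the verifier scores highest.
      # Only as good as the verifier: if score_clock can't tell a correct face
      # from a broken one, the extra samples buy nothing.
      candidates = [generate() for _ in range(n)]  # n full LLM calls: expensive
      return max(candidates, key=score_clock)
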
morkalork
4d ago
I'd say more like a blind programmer in the early stages of dementia. Able to write code, unable to form a mental image of what it would render as and can't see the final result.
larodi
4d ago
1 reply
Would be great to also see the prompt this was done with.
creade
4d ago
1 reply
The "?" on the site has "Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting."
larodi
1d ago
Hmm, nothing fancy then, but perhaps with tuning the results will vary.

I hate prompt discovery (I won't call this thing engineering!), but it actually matters.

baltimore
4d ago
13 replies
Since the first (good) image generation models became available, I've been trying to get them to generate an image of a clock with 13 instead of the usual 12 hour divisions. I have not been successful. Usually they will just replace the "12" with a "13" and/or mess up the clock face in some other way.

I'd be interested if anyone else is successful. Share how you did it!

snek_case
4d ago
1 reply
From my experience they quickly fail to understand anything beyond a superficial description of the image you want.
atorodius
4d ago
1 reply
dang
4d ago
Related ongoing thread:

Nano Banana can be prompt engineered for nuanced AI image generation - https://news.ycombinator.com/item?id=45917875 - Nov 2025 (214 comments)

Scene_Cast2
4d ago
2 replies
I've noticed that image models are particularly bad at modifying popular concepts in novel ways (way worse "generalization" than what I observe in language models).
emp17344
4d ago
3 replies
Maybe LLMs always fail to generalize outside their data set, and it’s just less noticeable with written language.
cluckindan
4d ago
1 reply
This is it. They're language models which predict next tokens probabilistically, and a sampler picks one according to the desired "temperature". Any generalization outside their data set is an artifact of random sampling: happenstance and circumstance, not genuine substance.
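
For concreteness, "temperature" is just a rescaling of the logits before the softmax at sampling time; a minimal sketch:

  import math
  import random

  def sample_token(logits: list[float], temperature: float = 1.0) -> int:
      # Low temperature sharpens the distribution toward the argmax;
      # high temperature flattens it toward uniform randomness.
      scaled = [l / temperature for l in logits]
      m = max(scaled)  # subtract the max for numerical stability
      weights = [math.exp(l - m) for l in scaled]
      return random.choices(range(len(weights)), weights=weights)[0]
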
cluckindan
3d ago
1 reply
However: do humans have that genuine substance? Is human invention and ingenuity more than trial and error, more than adaptation and application of existing knowledge? Can humans generalize outside their data set?

A yes-answer here implies belief in some sort of gnostic method of knowledge acquisition. Certainly that comes with a high burden of proof!

dawidloubser
3d ago
Yes
IshKebab
4d ago
1 reply
They definitely don't completely fail to generalise. You can easily prove that by asking them something completely novel.

Do you mean that LLMs might display a similar tendency to modify popular concepts? If so that definitely might be the case and would be fairly easy to test.

Something like "tell me the lord's prayer but it's our mother instead of our father", or maybe "write a haiku but with 5 syllables on every line"?

Let me try those ... nah ChatGPT nailed them both. Feels like it's particular to image generation.

immibis
3d ago
They used to do poorly with modified riddles, but I assume those have been added to their training data now (https://huggingface.co/datasets/marcodsn/altered-riddles ?)

Like, the response to "... The surgeon (who is male and is the boy's father) says: I can't operate on this boy! He's my son! How is this possible?" used to be "The surgeon is the boy's mother"

The response to "... At each door is a guard, each of which always lies. What question should I ask to decide which door to choose?" would be an explanation of how asking the guard what the other guard would say would tell you the opposite of which door you should go through.

phire
4d ago
1 reply
Most image models are diffusion models, not LLMs, and have a bunch of other idiosyncrasies.

So I suspect it's more that lessons from diffusion image models don't carry over to text LLMs.

And the Image models which are based on multi-mode LLMs (like Nano Banana) seem to do a lot better at novel concepts.

Gormo
1d ago
1 reply
But the clocks in this demo aren't images.
phire
1d ago
Yes, but they are reasoning within their dataset, which will contain multiple examples of HTML+CSS clocks.

They are just struggling to produce good results because, as language models, they don't have great spatial reasoning skills.

Their output normally has all the elements, just not in the right place/shape/orientation.

CobrastanJorji
4d ago
Also, they're fundamentally bad at math. They can draw a clock because they've seen clocks, but going further requires some calculations they can't do.

For example, try asking Nano Banana to do something simpler, like "draw a picture of 13 circles." It likely will not work.

IAmGraydon
4d ago
6 replies
That's because they literally cannot do that. Doing what you're asking requires an understanding of why the numbers on the clock face are where they are and what it would mean if there was an extra hour on the clock (ie that you would have to divide 360 by 13 to begin to understand where the numbers would go). AI models have no concept of anything that's not included in their training data. Yet people continue to anthropomorphize this technology and are surprised when it becomes obvious that it's not actually thinking.
echelon
4d ago
1 reply
gpt-image-1 and Google Imagen understand prompts, they just don't have training data to cover these use cases.

gpt-image-1 and Imagen are wickedly smart.

The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

phkahler
4d ago
2 replies
>> The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.

That's great, but I bet it can't tie its own shoes.

echelon
4d ago
No, but I can get it to do a lot of work.

It's a part of my daily tool box.

esafak
4d ago
And a submarine can't swim. Big deal.
ryandrake
4d ago
1 reply
I wonder if you would have more success if you painstakingly described the shape and features of a clock in great detail but never used the words clock or time or anything that might give the AI the hint that they were supposed to output something like a clock.
BrandoElFollito
4d ago
And this is a problem for me. I guess that it would work, but as soon as the word "clock" appears, gone is the request because a clock HAS.12.HOURS.

I use this a lot in cybersecurity when I need to do something "illegal". I am refused help, until I say that I am doing research on cybersecurity. In that case no problem.

Workaccount2
4d ago
1 reply
The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.
chemotaxis
4d ago
1 reply
I wonder if it's because we mean different things by generalization.

For text, "generalization" is still "generate text that conforms to all the usual rules of the language". For images of 13-hour clock faces, we're explicitly asking the LLM to violate the inferred rules of the universe.

I think a good analogy would be asking an LLM to write in English, except the word "the" now means "purple". They will struggle to adhere to this prompt in a conversation.

Workaccount2
4d ago
1 reply
That's true, but I think humans would stumble a lot too (try reading old printed text from the 18fh cenfury where fhey used "f" insfead of t in prinf, if's a real frick fo gef frough).

However, humans are pretty adept at discerning images, even ones outside the norm. I really think there is some kind of architectural block hampering transformers' ability to really "see" images. For instance, if you show any model a picture of a dog with 5 legs (a fifth leg photoshopped onto its belly), they all say there are only 4 legs. And will argue with you about it. Hell, GPT-5 even wrote a leg-detection script in Python (impressive) which detected the 5 legs, and then it said the script was bugged and modified the parameters until one of the legs wasn't detected, lol.

onraglanroad
4d ago
An "f" never replaced a "t".

You probably mean the "long s" that looks like an "f".

energy123
4d ago
The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.

Put another way, it was hoped that once the dataset got rich enough, developing this understanding is actually more efficient for the neural network than memorizing the training data.

The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.

Or more realistically, put in the creative and difficult research work required to discover the answer to that question.

bobbylarrybobby
4d ago
It's interesting because if you asked them to write code to generate an SVG of a clock, they'd probably use a loop from 1 to 12, using sin and cos of the angle (given by the loop index over 12 times 2pi) to place the numerals. They know how to do this, and so they basically understand the process that generates a clock face. And extrapolating from that to 13 hours is trivial (for a human). So the fact that they can't do this extrapolation on their own is very odd.
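
The loop described above, as a minimal Python sketch that emits SVG text elements; HOURS, the radius, and the 100×100 coordinate system are arbitrary illustrative choices, not anything the thread specifies:

  import math

  HOURS = 13  # set to 12 for a normal clock face

  def numeral_positions(hours=HOURS, radius=42.0, cx=50.0, cy=50.0):
      # Yield (label, x, y) for each numeral; the top slot gets the
      # highest number, as on a real clock face.
      for i in range(1, hours + 1):
          angle = i / hours * 2 * math.pi - math.pi / 2
          yield i, cx + radius * math.cos(angle), cy + radius * math.sin(angle)

  svg = "\n".join(
      f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle">{n}</text>'
      for n, x, y in numeral_positions()
  )
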
godelski
4d ago
Yes, the problem is that these so-called "world models" do not actually contain a model of the world, or of any world.
echelon
4d ago
2 replies
That's just a patch to the training data.

Once companies see this starting to show up in the evals and criticisms, they'll go out of their way to fix it.

rideontime
4d ago
What would the "patch" be? Manually create some images of 13-hour clocks and add them to the training data? How does that solution scale?
godelski
4d ago
s/13/17/g ;)
coffeecoders
4d ago
4 replies
LLMs are terrible at out-of-distribution (OOD) tasks. You should use chain-of-thought suppression and give constraints explicitly.

My prompt to Grok:

---

Follow these rules exactly:

- There are 13 hours, labeled 1–13.

- There are 13 ticks.

- The center of each number is at angle: index * (360/13)

- Do not infer anything else.

- Do not apply knowledge of normal clocks.

Use the following variables:

HOUR_COUNT = 13

ANGLE_PER_HOUR = 360 / 13 // 27.692307°

Use index i ∈ [0..12] for hour marks:

angle_i = i * ANGLE_PER_HOUR

I want html/css (single file) of a 13-hour analog clock.

---

Output from Grok:

https://jsfiddle.net/y9zukcnx/1/

BrandoElFollito
4d ago
1 reply
Well, that's cheating :) You asked it to generate code, which is OK, but it does not represent a directly generated image of a clock.

Can grok generate images? What would the result be?

I will try your prompt on chatgpt and gemini

BrandoElFollito
4d ago
1 reply
Gemini failed miserably: a standard 12-hour clock.

Same for ChatGPT.

And Perplexity replaced the 12 with a 13.

dwringer
4d ago
> Please create a highly unusual 13-hour analog clock widget, synchronized to system time, with fully animated hands that move in real time, and not 12 but 13 hour markings - each will be spaced at not 5-minute intervals, but at 4-minute-37-second intervals. This makes room for all 13 hour markings. Please pay attention to the correct alignment of the 13 numbers and the 13 hour marks, as well as the alignment of the hands on the face.

This gave me a correct clock face on Gemini, after the model spent a lot of time thinking (and kind of thrashing in a loop for a while). The functionality isn't quite right, not that it entirely makes sense in the first place, but the face, at least in terms of the hour marks, looks OK to me.[0]

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

chemotaxis
4d ago
2 replies
> Follow these rules exactly:

"Here's the line-by-line specification of the program I need you to write. Write that program."

signatoremo
4d ago
2 replies
Can you write this program in any language?
bigfishrunning
4d ago
Yes.
chemotaxis
4d ago
No, do I need to?
serf
4d ago
It's lazy to brush off the major advantages of a pseudocode-to-any-language transpiler as if it's somehow easy or commonplace.
chiwilliams
4d ago
1 reply
I'll also note that the output isn't quite right: the top number should be 13 rather than 1!
layer8
4d ago
I mean, the specification for the hour marks (angle_i) starts with a mark at angle 0. It just followed that spec. ;)
NooneAtAll3
4d ago
Close enough, but the digit at the top should be the highest, not 1 :/
BrandoElFollito
4d ago
2 replies
This is really cool. I tried to prompt Gemini, but every time I got the same picture. I do not know how to share a session (as is possible with ChatGPT), but the prompts were:

If a clock had 13 hours, what would be the angle between two of these 13 hours?

Generate an image of such a clock

No, I want the clock to have 13 distinct hours, with the angle between them as you calculated above

This is the same image. There need to be 13 hour marks around the dial, evenly spaced

... And its last answer was

You are absolutely right, my apologies. It seems I made an error and generated the same image again. I will correct that immediately.

Here is an image of a clock face with 13 distinct hour marks, evenly spaced around the dial, reflecting the angle we calculated.

And the very same clock, with 12 hours, and a 13th above the 12...

ryandrake
4d ago
2 replies
This is probably my biggest problem with AI tools, having played around with them more lately.

"You're absolutely right! I made a mistake. I have now comprehensively solved this problem. Here is the corrected output: [totally incorrect output]."

None of them ever seem to have the ability to say "I cannot seem to do this" or "I am uncertain if this is correct, confidence level 25%" The only time they will give up or refuse to do something is when they are deliberately programmed to censor for often dubious "AI safety" reasons. All other times, they come back again and again with extreme confidence as they totally produce garbage output.

BrandoElFollito
4d ago
2 replies
I agree. I see the same even in simple code, where they will bend over backwards apologizing and then generate very similar crap.

It is like they are sometimes stuck in a local energy minimum and will just wobble around various similar (and incorrect) answers.

What was annoying in my attempt above is that the picture was identical on every attempt.

ryandrake
4d ago
These tools' "attitude" reminds me of an eager but incompetent intern, or a poorly trained administrative assistant who works for a powerful CEO: all sycophancy, confidence, and positive energy, but not really getting much done.
SamBam
4d ago
The issue is that they always say "Here's the final, correct answer" before they've written the answer, so of course the LLM has no idea whether it's going to be right before it starts, because it has no clue what it's going to say.

I wonder how it would do if instead it were told "Do not tell me at the start that the solution is going to be correct. Instead, tell me the solution, and at the end tell me if you think it's correct or not."

I have found that on certain logic puzzles it simply cannot get right, it always tells me that it's going to get it right "this last time", but if asked later it always recognizes its errors.

int_19h
4d ago
Gemini specifically is actually kinda notorious for giving up.

https://www.reddit.com/r/artificial/comments/1mp5mks/this_is...

notatoad
4d ago
1 reply
you can click the share icon (the two-way branch icon, it doesn't look like apple's share icon) under the image it generates to share the conversation.

i'm curious if the clock image it was giving you was the same one it was giving me

https://gemini.google.com/share/780db71cfb73

BrandoElFollito
3d ago
Thanks for the tip about sharing!

No, my clock was an old-style one, the kind you'd put on a shelf. But at least it had a "13" proudly right above the "12" :)

This reminds me of my kids when they were in kindergarten, bringing home art that needed extra explanation before you could tell what it was. But they were very proud!

deathanatos
4d ago
1 reply

  Generate an image of a clock face, but instead of the usual 12 hour numbering, number it with 13 hours. 

Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it these days. https://imgur.com/a/1sSeFX7

A normal (ish) 12h clock. It numbered it twice, in two concentric rings. The outer ring is normal, but the inner ring numbers the 4th hour as "IIII" (fine, and a thing that clocks do) and the 8th hour as "VIIII" (wtf).

bar000n
4d ago
4 replies
It should be pretty clear by now that anything which is based on (limited to?) communicating in words/text can never grasp conceptual thinking.

We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

bayindirh
4d ago
1 reply
> We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.

We have a very comprehensive and precise spec for that [0].

If you don't want to hop through the certificate warning, here's the transcript:

- Some day, we won't even need coders any more. We'll be able to just write the specification and the program will write itself.

- Oh wow, you're right! We'll be able to write a comprehensive and precise spec and bam, we won't need programmers any more.

- Exactly

- And do you know the industry term for a project specification that is comprehensive and precise enough to generate a program?

- Uh... no...

- Code, it's called code.

[0]: https://www.commitstrip.com/en/2016/08/25/a-very-comprehensi...

snickerbockers
4d ago
I've been thinking about that a lot too. Fundamentally it's just a different way of telling the computer what to do, and if it seems like telling an LLM to make a program is less work than writing it yourself, then either your program is extremely trivial or there are dozens of redundant programs in the training set that are nearly identical.

If you're actually doing real work, you have nothing to fear from LLMs, because any prompt specific enough to create a given computer program is going to be comparable in complexity and effort to having written it yourself.

Uehreka
4d ago
1 reply
I don’t think that’s clear at all. In fact the proficiency of LLMs at a wide variety of tasks would seem to indicate that language is a highly efficient encoding of human thought, much moreso than people used to think.
tsunamifury
4d ago
Yeah, it's amazing that the parent post misunderstands the fundamental realities of LLMs. The compression they reveal in linguistics, even if blurry, is incredible.
rideontime
4d ago
Really? I can grasp the concept behind that command just fine.
XenophileJKO
4d ago
andix
4d ago
3 replies
I gave this "riddle" to various models:

> The farmer and the goat are going to the river. They look into the sky and see three clouds shaped like: a wolf, a cabbage and a boat that can carry the farmer and one item. How can they safely cross the river?

Most of them just give the answer to the well-known river-crossing riddle. Some "feel" that something is off, but still have a hard time figuring out that the wolf, boat, and cabbage are just clouds.

jampa
4d ago
1 reply
andix
4d ago
It really shows how LLMs work. It's all about probabilities, not understanding. If something looks very similar to a well-known problem, the LLM has a hard time "seeing" the contradictions, even when they're really easy for humans to notice.
Recursing
3d ago
1 reply
Claude has no problem with this: https://imgur.com/a/ifSNOVU

Maybe older models?

andix
3d ago
Try twisting words and phrases around; at some point it might start to fail.

I tried it again yesterday with GPT. GPT-5 manages quite well in thinking mode, but starts cracking in instant mode. 4o completely failed.

It's not that LLMs are unable to solve things like this at all; it's just really easy to find variations that make them struggle hard.

userbinator
4d ago
usui
4d ago
I've been trying for the longest time and across models to generate pictures or cartoons of people with six fingers and now they won't do it. They always say they accomplished it, but the result always has 5 fingers. I hate being gaslit.
chanux
4d ago
Ah! This is so sad. The manager types won't be able to add an hour (actually, two) to the day even with AI.
edub
4d ago
I was able to have AI generate such an image, not by diffusion/autoregression but by having it write Python code to create the image.

ChatGPT made a nice-looking clock with matplotlib that had some bugs it had to fix (the hours ran counter-clockwise). Gemini made correct code one-shot; it used Pillow instead of matplotlib, but the result didn't look as nice.

giancarlostoro
4d ago
Weird, I never tried that. I tried all the usual tricks that normally work, including swearing at the model (this scarily works surprisingly well with LLMs), and nothing. I even tried going in the opposite direction: I want a 6-hour clock.
nl
4d ago
I do playing-card generation, and almost all of them struggle beyond the "6 of X".

My working theory is that they were trained really hard to generate 5 fingers on hands but their counting drops off quickly.

abathologist
4d ago
1 reply
This is great. If you think the phenomenon of human-like text generation evinces human-like intelligence, then this should be taken to evince that these systems likely have dementia. https://en.wikipedia.org/wiki/Montreal_Cognitive_Assessment
AIorNot
4d ago
Imagine if I asked you to draw pixels and operate a clock via HTML, or to create a JPEG with pencil and paper, and have it be accurate. I suspect your hand-coded work would be off by an order of magnitude in comparison.
kburman
4d ago
5 replies
These types of tests are fundamentally flawed. I was able to create a perfect clock using Gemini 2.5 Pro - https://gemini.google.com/share/136f07a0fa78
sinak
4d ago
1 reply
How are they flawed?
earthnail
4d ago
1 reply
The results are not reproducible, as evidenced by the parent poster.
micromacrofoot
4d ago
1 reply
isn't that kind of the point of non-determinism?
earthnail
4d ago
1 reply
No. Good nondeterministic models reproducibly generate equally desirable output - not identical output, but interchangeable.
micromacrofoot
4d ago
oh I see, thank you for clarifying
Drew_
4d ago
1 reply
The website is regenerating the clocks every minute. When I opened it, Gemini 2.5 was the only working one. Now, they are all broken.

Also, your example is not showing the current time.

system2
4d ago
It wouldn't be hard to tell it to pick up browser time as the default starting point; just a bit of prompt.
allenu
4d ago
I don't think this is a serious test. It's just an art piece to contrast different LLMs taking on the same task, and against themselves since it updates every minute. One minute one of the results was really good for me and the next minute it was very, very bad.
jmdeon
4d ago
Aren't they attempting to also display current time though? Your share is a clock starting at midnight/noon. Kimi K2 seems to be the best on each refresh.
dwringer
4d ago
Even Gemini Flash did really well for me[0] using two prompts - the initial query and one to fix the only error I could identify.

> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face.

Followed by:

> Currently the hands are working perfectly but they're translated incorrectly, making them uncentered. Can you ensure that each one is translated to the correct position on the clock face?

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

S0y
4d ago
1 reply
To be fair, this is a deceptively hard task.
bobbylarrybobby
4d ago
2 replies
Without AI assistance, this should take ~10–15 minutes for a human. Maybe add 5 minutes if you're not allowed to use d3.
postalrat
4d ago
What's your hourly rate? I'll pay you to make as many as you can in a few hours, if you share the video.
alexmorley
4d ago
It's just HTML/CSS, so no JS at all, let alone d3.
zkmon
4d ago
1 reply
Was Claude banned from this Olympics?
giancarlostoro
4d ago
Haiku is the lightweight Claude model; I'm not sure why they picked the weaker one.
system2
4d ago
1 reply
Ask Claude or ChatGPT to write it in Python, and you will see what they are capable of. HTML + CSS has never been the strong suit of any of these models.
camalouu
4d ago
Claude generates some JS/CSS stuff even when I don't ask for it. I think Claude itself, at least, believes it is good at this.
munro
4d ago
4 replies
Amazing. Some people who use LLMs for soft outcomes are so enamored with them that they disagree with me when I say to be careful, they're not perfect. This is such a great non-technical way to explain the reality I see when using them on hard-outcome coding/logic tasks: "Hey, this test is failing." LLM deletes the test. "FIXED!"
worldsayshi
4d ago
1 reply
Yeah, it seems crazy to use an LLM on any task where the output can't be easily verified.
palmotea
4d ago
> Yeah, it seems crazy to use an LLM on any task where the output can't be easily verified.

I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.

derbOac
4d ago
1 reply
Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.

What about when we don't know what it's supposed to look like?

Lately I've been wrestling with the fact that unlike, say, a generalized linear model fit to data with some inferential theory, we don't have a theory or model for the uncertainty about LLM products. We recognize when it's off about things we know are off, but don't have a way to estimate when it's off other than to check it against reality, which is probably the exception to how it's used rather than the rule.

ehnto
4d ago
I need to be delicate with wording here, but this is why it's a worry that all the least intelligent people you know could be using AI.

It's why non-coders think it's doing an amazing job at software.

But, more worryingly, it's why using it for research, where you necessarily don't know what you don't know, is going to trip up even smarter people.

markatkinson
4d ago
To be fair I'd probably also delete the test.
mopsi
4d ago

  > "Hey this test is failing", LLM deletes test, "FIXED!"
A nice continuation of the tradition of folk stories about supernatural entities like teapots or lamps that grant wishes and take them literally. "And that's why, kids, you should always review your AI-assisted commits."
otterley
4d ago
1 reply
Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!

Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.

bArray
4d ago
3 replies
It could be that the prompt is accidentally (or purposefully) more optimised for Kimi K2, or that Kimi K2 is better trained on this particular kind of data. LLMs need "prompt engineers" for a reason: to get the most out of a particular model.
observationist
4d ago
1 reply
It's not fair to use prompts tailored to a particular model when doing comparisons like this: one-shot results that generalize across a domain demonstrate solid knowledge of the domain. You can use prompting and context hacking to get any particular model to behave pseudo-competently in almost any domain, even the tiny <1B models, for some set of questions. You could include an entire framework and model for rendering clocks and times that allowed all 9 models to perform fairly well.

This experiment, however, clearly states the goal with this prompt: `Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.`

An LLM should be able to interpret that, and should be able to perform a wide range of tasks in the same style: countdown timers, clocks, calendars, a floating quote bubble cycling through a list of 100 pithy quotations, etc. Individual, clearly defined elements should have complex representations in latent space that correspond to the human understanding of those elements. Tasks and operations and goals should likewise align with our understanding. Qwen 2.5 and some others clearly aren't modeling clocks very well, or maybe the HTML/CSS rendering latents are broken. If you pick a semantic axis (like analog clocks), you can run a suite of tests to demonstrate their understanding by using limited one-shot interactions.

Reasoning models can adapt on the fly, and are capable of cheating: one-shots might have crappy representations for some contexts, but after a lot of repetition and refinement, as long as there's a stable, well-represented proxy for quality somewhere in the semantics it understands, a model can deconstruct a task to fundamentals and eventually reach high-quality output.

These types of tests also let us identify mode collapse: you can use complex, sophisticated prompting to get most image models to produce accurate analog clocks displaying any time, but in simple one-shot tests the models tend to only be able to produce the time 10:10, and you'll get wild artifacts and distortions if you try to force any other configuration of hands.

Image models are so bad at hands that they couldn't even get clock hands right, until recently anyway. Nano banana and some other models are much better at avoiding mode collapses, and can traverse complex and sophisticated compositions smoothly. You want that same sort of semantic generalization in text generating models, so hopefully some of the techniques cross over to other modalities.

I keep hoping they'll be able to use SAE or some form of analysis on static weight distributions in order to uncover some sort of structural feature of mode collapse, with a taxonomy of different failure modes and causes, like limited data, or corrupt/poisoned data, and so on. Seems like if you had that, you could deliberately iterate on, correct issues, or generate supporting training material to offset big distortions in a model.

jquery
4d ago
Qwen 2.5 is so bad it’s good. Some really insane results if you watch it for a while. Almost like it’s taking the piss.
bigfishrunning
4d ago
1 reply
How much engineering do prompt engineers do? Is it engineering when you add "photorealistic. correct number of fingers and teeth. High quality." to the end of a prompt?

we should call them "prompt witch doctors" or maybe "prompt alchemists".

Dilettante_
4d ago
"How is engineering a real science? You just build the bridge so it doesn't fall down."
energy123
4d ago
Goes to show the "frontier" is not really one frontier. It's a social/mathematical construct that's useful for a broad comparison, but if you have a niche task, there's no substitute for trying the different models.
awkwam
4d ago
Limiting the model to only 2,000 tokens while also asking it to output ONLY HTML/CSS is just stupid. It's like asking a programmer to perform the same task with half their brain removed and their programming experience forgotten. This is a stupid and meaningless benchmark.
jonplackett
4d ago
kimi is kicking ass
busymom0
4d ago
Because a new clock is generated every minute, it looks like simply changing the time by a digit causes the result to differ significantly from the previous iteration.
lxe
4d ago
Honestly, I think if you track the performance of each over time, since these get regenerated once in a while, you can then have a very, very useful and cohesive benchmark.
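
A hedged sketch of what that tracking could look like: poll the endpoint mentioned upthread once a minute and archive the raw responses for offline scoring. The response format is an assumption here, so the body is stored verbatim:

  import json
  import time
  import urllib.request
  from datetime import datetime

  API = "https://clocks.brianmoore.com/api/clocks"  # endpoint mentioned upthread

  def snapshot_forever(path="clocks.jsonl", period_s=60):
      # Append one raw snapshot per minute; score the archive later.
      while True:
          stamp = datetime.now().strftime("%I:%M%p").lstrip("0")  # e.g. "11:19AM"
          with urllib.request.urlopen(f"{API}?time={stamp}") as resp:
              body = resp.read().decode()
          with open(path, "a") as f:
              f.write(json.dumps({"time": stamp, "body": body}) + "\n")
          time.sleep(period_s)
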
mstipetic
4d ago
GPT-5 is embarrassing itself. Kimi and DeepSeek are very consistently good. Wild that you can just download these models.
novemp
4d ago
Oh cool, it's the schizophrenia clock-drawing test but for AI.
AlfredBarnes
4d ago
It's cool to see them get it right... sometimes.
zkmon
4d ago
Why are DeepSeek and Kimi beating the other models by such a margin? Does this have to do with specialization for this task?
collimarco
4d ago
In any case, those clocks are all extremely inaccurate, even if AI could build a decent UI (which is not the case).

Some months ago I published this site for fun: https://timeutc.com There's a lot of code involved in making it precise to the millisecond, including adjusting for network delay, using the frame refresh rate instead of setTimeout, and much more. If you are curious, take a look at the source code.

1yvino
4d ago
I wonder, would Qwen's output look like a hallucination?
shubham_zingle
4d ago
Not sure about the accuracy, though; I'm shooting in the dark.
fschuett
4d ago
shevy-java
4d ago
Now that is actually creative.

Granted, it is not a clock - but it could be art. It looks like a Picasso. When he was drunk. And took some LSD.

bananatron
4d ago
grok's looks like one of those clocks you'd find at a novelty shop

208 more comments available on Hacker News

ID: 45930151 · Type: story · Last synced: 11/16/2025, 9:42:57 PM
