Agentic Pelican on a Bicycle
robert-glaser.de · Tech story · High profile · Posted about 2 months ago · Active about 2 months ago
Key topics
- LLMs
- AI Image Generation
- Agentic Systems
The article explores the capabilities of LLMs in generating images through an 'agentic pelican on a bicycle' experiment, sparking discussion on their limitations and potential improvements.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 2h after posting
- Peak period: 50 comments in 0-12h
- Average per period: 12.3 comments
- Comment distribution: 74 data points (based on 74 loaded comments)
Key moments
- 01 Story posted: Nov 11, 2025 at 2:40 PM EST (about 2 months ago)
- 02 First comment: Nov 11, 2025 at 4:34 PM EST (2h after posting)
- 03 Peak activity: 50 comments in 0-12h, the hottest window of the conversation
- 04 Latest activity: Nov 17, 2025 at 10:36 PM EST (about 2 months ago)
ID: 45891817 · Type: story · Last synced: 11/20/2025, 3:50:08 PM
I remember one case where a stealth model was running in preview via OpenRouter; I asked it for an SVG of a pelican riding a bicycle and correctly guessed the model vendor from the response!
That's what working with GPT-5-Codex on actual code also feels like.
If Sonnet doesn't solve my problem, sometimes Codex actually does.
So it isn't like Codex is always worse. I just prefer to try Sonnet 4.5 first.
I wonder if there is a consistent way to force structural revisions. I have found Nano Banana particularly terrible at revisions; even for something like "change the image dimensions to...", it will confidently claim success but do nothing.
And as you say, they cheerfully assert that they've done the job, for real this time, every time.
The naive approach, which gets you results like ChatGPT's, is to produce output tokens based on the prompt and generate a new image from that output. It is really difficult to maintain details from the input image this way.
A more advanced approach is to generate a stream of "edits" to the input image instead. You see this with Gemini, which sometimes maintains original image details to a fault; e.g. it will preserve human faces at all costs, probably as a result of training.
I think the round-trip through SVG is an extreme challenge to train through and essentially forces the LLM to progressively edit the SVG source, which can result in something like the Gemini approach above.
[0]: https://www.groundlight.ai/blog/how-vlm-works-tokens
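The difference between the two loops, as a minimal sketch (every function here is a hypothetical stand-in for a model API, not a real library call):

```python
def generate_image(prompt: str) -> bytes:
    raise NotImplementedError("hypothetical text-to-image call")

def propose_edits(image: bytes, feedback: str) -> list:
    raise NotImplementedError("hypothetical: model emits edit operations")

def apply_edits(image: bytes, edits: list) -> bytes:
    raise NotImplementedError("hypothetical: apply edit ops to the raster")

def regenerate(prompt: str, feedback: str) -> bytes:
    # Naive approach: fold the feedback into the prompt and render a
    # brand-new image. Nothing ties the new sample to the old pixels,
    # so details from the input image are easily lost.
    return generate_image(prompt + "\n" + feedback)

def edit_in_place(image: bytes, feedback: str) -> bytes:
    # Edit-stream approach: the model emits edits against the existing
    # image, so untouched regions survive verbatim, sometimes to a
    # fault (e.g. faces preserved at all costs).
    return apply_edits(image, propose_edits(image, feedback))
```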
I think the problem here is that SVG is structured information while an image is an unstructured blob, and translating between them requires planning and understanding. Maybe treating an SVG like a raster image in the prompt is the wrong approach; I think prompting with the image as code (which SVG basically is) would produce better outputs.
This is just my uninformed opinion.
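For what it's worth, a sketch of what "prompt the image like code" might look like (the SVG and the wording are illustrative, not from the article):

```python
# Treat the SVG as source code: show the model the markup and ask for a
# targeted, reviewable change instead of describing pixels.
svg_source = """\
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <circle cx="60" cy="90" r="20" fill="none" stroke="black"/>
  <circle cx="140" cy="90" r="20" fill="none" stroke="black"/>
  <ellipse cx="100" cy="55" rx="25" ry="15"/>
</svg>"""

prompt = (
    "Here is the current SVG source. Review it like code: the bicycle "
    "has no frame or chain connecting the wheels. Return the complete "
    "corrected SVG, changing only the elements needed for the fix.\n\n"
    + svg_source
)
```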
Nano Banana is rather terrible at multi-turn chats, just like any other model, despite the claim that it's been trained for them. Scattered context and irrelevant distractors are always bad; compressing the conversation into a single turn fixes this.
I suspect this is either a training-data issue or an issue with the people building these things not recognizing the problem, but it's weird how persistent and cross-model the issue is, even in model releases that specifically call out better or more steerable composition behavior.
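A minimal sketch of that compression step, assuming a plain list of prior requests (illustrative, not Nano Banana's actual interface):

```python
def compress_to_single_turn(history: list[str]) -> str:
    # Collapse the scattered edit requests into one self-contained
    # specification, so each generation call sees a single clean turn
    # instead of a chat full of distractors.
    requirements = "\n".join(f"- {turn}" for turn in history)
    return "Produce one image satisfying ALL of the following at once:\n" + requirements

history = [
    "Draw a pelican riding a bicycle.",
    "Make both wheels the same size.",
    "Add a chain between the pedals and the rear wheel.",
]
print(compress_to_single_turn(history))
```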
Ask for multiple solutions?
See also the "warmth" that certain vinyl enthusiasts sought from their analog recordings, which was most certainly mainly dust and defects in the grooves rather than any tangible quality of the audio itself.
Pixar films were set up with the idea of being put on film, so the color in the DVD digital transfers is all wrong.
Good vinyl is "wait, did we have this back in the 1970s?" good (the recorder, yes; the player, not exactly, hence the prevalence of vinyl sound effects).
Take a look at pixel art on CRTs vs LCDs [0] and Toy Story on film vs. digital [1].
[0]: https://wackoid.com/game/10-pictures-that-show-why-crt-tvs-a...
[1]: https://animationobsessive.substack.com/p/the-toy-story-you-...
This is a better version of what I tried, but it suffers from the same problem: the models seem to stick close to their original shapes and add new details, rather than creating an image from scratch that's a significantly better variant of what they tried originally.
For the SVG generation, it would be an interesting experiment to seed it with increasingly poor initial images and see at what point, if any, the models stop anchoring on the initial image and just try something else.
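Something like the following, where the loop and the scoring function are hypothetical stand-ins for the article's setup:

```python
def run_agentic_loop(seed_svg: str, prompt: str) -> str:
    raise NotImplementedError("stand-in for the article's revision loop")

def similarity(svg_a: str, svg_b: str) -> float:
    raise NotImplementedError("e.g. rendered-pixel or structural overlap")

# Progressively worse seeds, from plausible down to pure noise.
seeds = {
    "decent": "<svg><!-- recognizable pelican and bike --></svg>",
    "rough": "<svg><!-- wrong proportions, missing parts --></svg>",
    "garbage": "<svg><!-- random blobs, no resemblance --></svg>",
}

for label, seed in seeds.items():
    final = run_agentic_loop(seed, "a pelican riding a bicycle")
    # If similarity stays high even for the garbage seed, the model is
    # anchoring on the input rather than redrawing from scratch.
    print(label, similarity(seed, final))
```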
Do you use chatbot UIs like chatgpt.com, claude.ai, or LibreChat for coding, instead of Cursor, Windsurf, Kiro, etc.?
If that is the case, I am really curious about this.
In either case I'll often reset the context by starting a new session.
The whole "context engineering" concept is certainly a thing, though I do dislike throwing around the word "engineer" all willy-nilly like that. :)
In any case, thanks for the response. I just wanted to make sure that I was not missing something.
[0] https://github.com/adobe-research/NoLiMa
I'd be curious whether the approach would be improved by having the model generate a full pelican from scratch each time and having it judge which variation is an improvement. Or, if something should be altered in each loop, perhaps it should be the prompt instead.
"creating an svg is surprisingly revealing" No it is not. They all do the same thing, they add suns and movement lines, and some more details. Like they were all trained on the same thing.
he makes up his own definition of "agent" there are at least 6 different definitions of this word now in this space. And his is again new for no reason.
The core idea here is "being vague and letting the models make weird random choice" This is the exact opposite of ALL direct instructions from the major model and coding agent programs at this time.
Actual interesting methodology would have been to create all combinations of the variables: let them use different svg to image tools and compare them, try many many different prompts with more specific instructions like "try to be more mechanically accurate"
Analysis is baseless assumptions: It is not "adding realism" all the models just had more pictures of roads trees suns and clouds... so it kept going back to the training data to add more like you keep telling it to do. It certainly wasn't understanding "more mechanically coherent" If it started focusing on the bike, it had more detailed bike pictures in the training data with chains.
This is why all the ai stuff is infuriating, people are mistaking so much for "good" or "useful" . At best this is a laugh once joke about how bad it is.
I admit that the first time I saw that big George Carlin generated stand-up video, there was a special new feeling of "what on earth did they prompt to get this combination of visual and audio?" But that was such a fleeting thing, I never need it again.
> Danielle Del, a spokeswoman for Sasso, said Dudesy is not actually an A.I.
> “It’s a fictional podcast character created by two human beings, Will Sasso and Chad Kultgen,” Del wrote in an email. “The YouTube video ‘I’m Glad I’m Dead’ was completely written by Chad Kultgen.”
Now... was this article LLM-written?
This part triggered all my LLM flags: "Adding a bicycle chain isn't just decoration—it shows understanding of mechanical relationships. The wheel spokes, the adjusted proportions—these are signs of vision-driven refinement working as intended."
Does being bad at drawing bikes make a machine more intelligent/human?
Ignoring the wording, em-dashes, etc., I'd assume an LLM not only wrote the article but also judged the pictures. That, or the author has a much more relaxed opinion of what a pelican on a bicycle should actually look like. I don't think I would call Sonnet's arms and handlebars improved, nor would I call Haiku's legs and feet "proper." And if you overlay GPT-5 Medium's two images, the shapes and proportions are nearly identical.
It was a fun little post that felt accurate (i.e., it confirmed my own biases ;)) about the current state of LLMs in a silly, but real, use case.
The continual drive to out "llm written" articles feels a bit silly to me at this point. They are now part of the tools and tech we use, for better or worse. And to be clear, I think in a lot of cases it leans towards 'worse'.
But do you question whether a video or photo was made with digital editing, filters, or "AI" tools (many of which we've had for years, just under different names)? Do you worry about what tech was used in making your favorite album or song?
I get it, LLMs make it easy to produce trash content, but this is not a new problem. If you see trash, call it out as trash for its flaws, not on a presumption of how it was made.
I already spend too much time reading LLM outputs in my own interactions, and I get sick of their style because of it. So when I read it during leisure time, it just triggers a gut rejection.
Especially because they are so formulaic / template-y.
I wish it had the Wikipedia style of writing as a default, as in, much more matter-of-fact writing (even if not everything is a fact).
I think part of the problem is that people overwhelmingly vote for this style with up votes and revealed preferences.
Maybe there should be a more meticulous feedback/prompt system where I can highlight a paragraph or sentence and annotate my feedback so that it doesn't go for that style.
I agree about the silliness. God forbid I am a non-native English speaker with a bit of an odd writing style to a real Brit's eye. Or that I use '—' instead of '-' because typing two dashes usually converts to the long one on a Mac (try even four; technology is crazy these days), and it just feels a bit nicer. Or that I adopt the occasional ';' because I feel like it (yes, English is supposed to have short sentences. Unlike other languages. Beautiful. Sue me.)
I don't care if they helped themselves with AI to improve the writing or turn a bullet point into a sentence. It's when the volume of text isn't justified by the content or value that I call BS and move on to the next one. At that point it might as well be human-generated content; I don't care, the outcome's the same.
Regarding the post: it's a cute little article, and the pelicans do seem to be making a point with their funky shapes.
"Create a drawing of a cliff in the desert."
Get something passing.
"Add a waterfall."
Get a waterfall that has no connection or outlet.
"Make the waterfall connect to a creek that runs off to the left."
Get a waterway that goes by the waterfall without connecting and goes straight through the center of the image.
Give up on that and notice that the shadows go to the left but the sun is straight behind.
"Move the sun to the right so that it matches the shadows more accurately."
The sun stays in the same spot but grows, while exaggerated and inaccurate shadows show up that seem to imply the back side of the cliff doesn't block light.
...
it uses AGENTS
IT'S AGENTIC
AGENTIC.
Poor feet.
What about structuring the agentic loop to do a simple genetic algorithm -- generate N children (probably 2 or 3), choose the best of the N+1 options (original vs. child A vs. child B vs. child C, and so on), and then iterate?
If the first output is crappy, the next three iterations will just improve on the same crap.
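A minimal sketch of that selection loop (`draw_svg`, `mutate`, and `judge` are hypothetical stand-ins for model calls, not the article's actual code):

```python
def draw_svg(prompt: str) -> str:
    raise NotImplementedError("hypothetical initial text-to-SVG call")

def mutate(svg: str, prompt: str) -> str:
    raise NotImplementedError("hypothetical: ask the model for a revised variant")

def judge(svg: str, prompt: str) -> float:
    raise NotImplementedError("hypothetical: vision-model score against the prompt")

def evolve(prompt: str, generations: int = 4, n_children: int = 3) -> str:
    best = draw_svg(prompt)
    for _ in range(generations):
        candidates = [best] + [mutate(best, prompt) for _ in range(n_children)]
        # The parent competes against its own children, so the loop only
        # moves when a child is a genuine improvement -- a bad first
        # draft can't drag every later iteration down with it.
        best = max(candidates, key=lambda s: judge(s, prompt))
    return best
```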
This was not a good test.