Nano Banana can be prompt engineered for nuanced AI image generation
Mood: excited
Sentiment: positive
Category: tech
Key topics: AI image generation, prompt engineering, Nano Banana
Nano Banana (Gemini 2.5 Flash Image) can be prompt engineered for nuanced image generation, allowing for more specific and detailed outputs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 38m after posting
Peak period: 151 comments (Day 1)
Avg / period: 80
Based on 160 loaded comments
Key moments
- 01 Story posted: 11/13/2025, 5:39:13 PM (5d ago)
- 02 First comment: 11/13/2025, 6:16:56 PM (38m after posting)
- 03 Peak activity: 151 comments in Day 1 (hottest window of the conversation)
- 04 Latest activity: 11/15/2025, 12:26:08 AM (4d ago)
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is instructed to render the verbatim prompt “the result of 4+5” (which shows that text) and then not instructed (which renders “4+5=9”).
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It seemingly cannot be prompted to show the verbatim text “the result of 4+5” without also showing the answer 4+5=9. Of course it can show whatever exact text you want; the question is, does it prompt rewrite (no) or do something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.
(Do we say we software engineered something?)
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details on the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as it regenerates them, no matter how you prompt things.
Even if you
I figured that if you write the text in Google Docs and share the screenshot with Banana, it will not make any spelling mistakes.
So using something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
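A minimal sketch of that approach, assuming Pillow for the text rendering and the google-genai SDK for the model call (the model id, file names, and prompt below are illustrative, not from the thread):

from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from google import genai

# Render the text deterministically in Python so the model never has to "spell" it.
text_img = Image.new("RGB", (800, 200), "white")
draw = ImageDraw.Draw(text_img)
draw.text((20, 80), "Jane Doe", fill="black", font=ImageFont.load_default())

trophy_img = Image.open("wimbledon_trophy.png")  # hypothetical input photo

# Pass both images plus an instruction and let the model do the compositing.
client = genai.Client()  # picks up the API key from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-image",  # assumed model id; check the current docs
    contents=[
        "Engrave the text from the first image onto the trophy in the second image.",
        text_img,
        trophy_img,
    ],
)

# Save the first returned image part, if any.
for part in response.candidates[0].content.parts:
    if part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save("composited.png")
        break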
But it is still generating it with a prompt
> Logo: "A simple, modern logo with the letters 'G' and 'A' in a white circle."
My idea was to do it manually so that there are no probabilities involved.
Though your idea of using Python is the same.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
For anything, even back in the "classical" search days:
- "This got searched verbatim, every time"
- W*ldcards were handy
- and so on...
Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).
Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.
I give it,
Reposition the text bubble to be coming from the middle character.
DO NOT modify the poses or features of the actual characters.
Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again: Place a speech bubble on the image. The "tail" of the bubble should make it appear that the middle (red-headed) girl is talking. The speech bubble should read "Hide the vodka." Use a Comic Sans like font. DO NOT place the bubble on the right.
DO NOT modify the characters in the image.
There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed. Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, an absolute struggle to get basic directions followed.
You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want" but … that's not a good thing, that's just an undocumented API with a terrible UX.
(¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)
Which is exactly why the current discourse is about 'who does it best' (IMO, the flux series is top dog here. No one else currently strikes the proper balance between following style / composition / text rendering quite as well). That said, even flux is pretty tricky to prompt - it's really, really easy to step on your own toes here - for example, by giving conflicting(ish) prompts "The scene is shot from a high angle. We see the bottom of a passenger jet".
Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". For the person defining the spec, they actually do have a vision that fits each criteria in some way, but it's unclear which parts apply to what.
Discounting the testing around the character JSON which became extremely expensive due to extreme iteration/my own stupidity, I'd wager it took about $5 total including iteration.
That is why I always call technical writers "documentation engineers," why I call diplomats "international engineers," why I call managers "team engineers," and why I call historians "hindsight engineers."
So Prompt Philosopher/Communicator?
Despite needing much knowledge of how a plane's inner workings function, a pilot is still a pilot and not an aircraft engineer.
Just because you know how human psychology works when it comes to making purchase decisions and you are good at applying that to sell things, you're not a sales engineer.
Giving something a fake name, to make it seem more complicated or aspirational than it actually is makes you a bullshit engineer in my opinion.
This is a very different fuzzy interface compared to programming languages.
There will be techniques better or worse at interfacing.
This is what the term prompt engineering is alluding to since we don’t have the full suite of language to describe this yet.
now you can really use natural language, and people want to debate you about how poor they are at articulating shared concepts, amazing
it's like the people are regressing and the AI is improving
> Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
GEMINI_API_KEY="..." \
uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...

I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
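A rough sketch of what harvesting those tuples could look like (the edit_image helper is a stand-in for whatever teacher-model call you use, not a real library function):

import json
from pathlib import Path

def edit_image(image_path: str, instruction: str) -> bytes:
    # Hypothetical: call the teacher model (e.g. Nano Banana) and return PNG bytes.
    raise NotImplementedError("call your image-editing model here")

# Any pile of (image, instruction) pairs becomes distillation data:
# the teacher's completion is the training target for a student model.
pairs = [
    ("photos/kitchen.jpg", "make this look like a watercolor painting"),
    ("photos/dog.jpg", "add a red bandana around the dog's neck"),
]

Path("outputs").mkdir(exist_ok=True)
with open("distill_dataset.jsonl", "w") as f:
    for image_path, instruction in pairs:
        target_path = Path("outputs") / f"{Path(image_path).stem}_edited.png"
        target_path.write_bytes(edit_image(image_path, instruction))
        f.write(json.dumps({"source": image_path,
                            "instruction": instruction,
                            "target": str(target_path)}) + "\n")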
If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.
Adobe's conference last week points to the future of image gen. Visual tools where you mold images like clay. Hands on.
Comfy appeals to the 0.01% that like toolkits like TouchDesigner, Nannou, and ShaderToy.
https://www.youtube.com/watch?v=YqAAFX1XXY8 - dynamic 3D scene relighting is insane, check out the 3:45 mark.
https://www.youtube.com/watch?v=BLxFn_BFB5c - molding photos like clay in 3D is absolutely wild at the 3:58 mark.
I don't have links to everything. They presented a deluge of really smart editing tools and gave their vision for the future of media creation.
Tangible, moldable, visual, fast, and easy.
For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.
It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.
I'll keep using Claude a lot for multimodality and artifacts, but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.
Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.
They're really torturing their poor models over there.
I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!
Is this just a manual copy/paste into a gist with some HTML/CSS styling, or do you have a custom tool à la amp-code that does this more easily?
I made a video about building that here: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
It works much better with Claude Code and Codex CLI because they don't mess around with scrolling in the same way as Gemini CLI does.
- make massive, seemingly random edits to images
- adjust image scale
- make very fine grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
Has anyone had a better experience?
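On the "image diff" point above, a quick way to check what actually changed is a pixel difference, for example with Pillow (a minimal sketch; file names are illustrative):

from PIL import Image, ImageChops, ImageStat

before = Image.open("before.png").convert("RGB")
after = Image.open("after.png").convert("RGB")

# difference() needs same-sized images; needing a resize is itself a sign
# that the model rescaled the output.
if before.size != after.size:
    print(f"size changed: {before.size} -> {after.size}")
    after = after.resize(before.size)

diff = ImageChops.difference(before, after)
print("bounding box of changed pixels:", diff.getbbox())  # None means identical
print("mean per-channel difference:", ImageStat.Stat(diff).mean)

# Amplify and save the diff to make the pervasive fine-grained changes visible.
diff.point(lambda p: min(255, p * 8)).save("diff_amplified.png")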
This looks like it's caused by 99% of the relative directions in image descriptions describing them from the looker's point of view, and by the fact that 99% of the ones that aren't refer to a human and not to a skull-shaped pancake.
See the link below, which demonstrates this weakness with the same prompts as the article and shows that it is a model weakness and not just a language ambiguity:
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
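A sketch of that addition, assuming a character prompt structured as JSON with a rules list (the field names are illustrative, not the article's exact schema):

import json

character_prompt = {
    "character": "a paladin with a ponytail",
    "pose": "raising a sword with the right hand",
    "rules": [
        "render in a photorealistic style",
        # The extra rule that seemed to improve success:
        "any mentions of left and right are from the character's perspective, NOT the camera's perspective",
    ],
}

# The JSON blob is passed as the text prompt alongside any input image.
print(json.dumps(character_prompt, indent=2))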
Looks like specific f-stops don't actually make a difference for stable diffusion at least: https://old.reddit.com/r/StableDiffusion/comments/1adgcf3/co...
[0] https://www.lux.camera/content/images/size/w1600/2024/09/IMG...
I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed. Either just with a prompt or feeding them to Claude as image and then having it write the prompt to fix the issue for me (as a workflow on the api). It’s been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene) to bring them to life and build your own small visual animated universes.
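A minimal sketch of the bounding-box workflow described above, using Pillow for the crop-and-paste; the model call is a hypothetical placeholder, not a real API:

from PIL import Image

def edit_with_model(region: Image.Image, prompt: str) -> Image.Image:
    # Hypothetical: send the crop (or a Claude-written fix prompt) to the image model.
    raise NotImplementedError

def fix_region(image_path: str, box: tuple[int, int, int, int], fix_prompt: str) -> Image.Image:
    # Crop the user-drawn bounding box, have the model fix it, paste it back.
    img = Image.open(image_path).convert("RGB")
    crop = img.crop(box)
    fixed = edit_with_model(crop, fix_prompt).resize(crop.size)
    img.paste(fixed, (box[0], box[1]))
    return img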
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
I see where you are coming from...
Which is not to say don’t be creative, I applaud all creativity, but also to be very critical of what you are doing.
It's pretty easy to get something decent. It's really hard to get something good. I share my creations with some close friends and some are like "that's hot!" but are too fixated on breasts to realize that the lighting or shadow is off. Other friends do call out the bad lighting.
You may be like "it's just porn, why care about consistent lighting?" and the answer for me is that I'm doing all this to learn how everything works. How to fine tune weights, prompts, using IP Adapter, etc. Once I have a firm understanding of this stuff, then I will probably be able to make stuff that's actually useful to society. Unlike that coke commercial.
But what I understood from parent comment is that they just do it for fun, not necessarily to be a boon to society. And then if it comes with new skills that actually can benefit society, then that's a win.
Granted, the commenter COULD play around with SFW stuff but if they're just doing it for fun then that's still not benefiting society either, so either way it's a wash. We all have fun in our own ways.
But it's impressive that this billion dollar company didn't have one single person say "hey it's shitty, make it better."
AI is shitty in its own new, unique ways. And people don't like new. They want the old, polished shittiness they are used to.
It's only a matter of time before we get experienced AI filmmakers. I think we already have them, actually. It's clear that Coke does not employ them though.
Also, since it's new media, nobody knows how to budget time or money to fix the flaws. It could be infinitely expensive.
That's my entire point. Artists were fine with everybody making "art" as long as everybody except them (with their hard fought skill and dedication) achieved toddler level of output quality. As soon as everybody could truly get even close to the level of actual art, not toddler art, suddenly there's a horrible problem with all the amateur artists using the tools that are available to them to make their "toddler" art.
Folks in tech generally have very limited exposure to the art world — fan art communities online, Reddit subs, YouTubers, etc. It’s more representative of internet culture than the art world — no more representative of artists than X politics is representative of voters. People have real grievances here and you are not a victim of the world’s artists. Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
I will be if they manage to slow down development of AI even by a smidgen.
> Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
Fully agree. They care about whether there's going to be anyone willing to buy their stuff from them. And not-toddler art is a real competition for them. So they are super against everybody making it.
Imagine if you gave everyone a free guitar and people just started posting their electric guitar noodlings on social media after playing for 5 minutes.
It is not a judgement on the guitar. If anything it is a judgement on social media and the stupidity of the social media user who get worked up about someone creating "slop" after playing guitar for 5 minutes.
What did you expect them to sound like, Steve Vai?
It's intentionally hostile and inconsiderate.
But it would be _much_ better if when you hit reply, it gave you a message that you're "posting too fast" before you spend the time to write it up.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.
73 more comments available on Hacker News