Pico-Banana-400k
Posted 2 months ago · Active 2 months ago
github.com · Tech · Story · High profile
Sentiment: calm, mixed
Debate: 60/100
Key topics
Artificial Intelligence
Machine Learning
Dataset
Apple releases the Pico-Banana-400k dataset, a collection of image editing data generated using Gemini-2.5-Pro, sparking discussion on its potential uses, limitations, and the broader implications for AI research.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 24m after posting
Peak period: 13 comments in the 12-15h window
Average per period: 6.9 comments
Comment distribution: 62 data points (based on 62 loaded comments)
Key moments
1. Story posted: Oct 25, 2025 at 10:01 PM EDT (2 months ago)
2. First comment: Oct 25, 2025 at 10:25 PM EDT (24m after posting)
3. Peak activity: 13 comments in the 12-15h window, the hottest period of the conversation
4. Latest activity: Oct 27, 2025 at 7:19 PM EDT (2 months ago)
ID: 45708524 · Type: story · Last synced: 11/20/2025, 5:51:32 PM
> The pipeline (bottom) shows how diverse OpenImages inputs are edited using Nano-Banana and quality-filtered by Gemini-2.5-Pro, with failed attempts automatically retried.
Pretty interesting. I run a fairly comprehensive image-comparison site for SOTA generative AI in text-to-image and editing. Managing it manually got pretty tiring, so a while back I put together a small program that does something similar: it takes a starting prompt, a list of GenAI models, and a maximum number of retries.
It generates images, evaluates them with a separate multimodal AI, and then rewrites failed prompts automatically, repeating up to the set limit.
It's not perfect (the nine-pointed star example in particular), but oftentimes the recognition ability of a multimodal model is superior to its generative capabilities, so you can run it in a sort of REPL until you get the desired outcome.
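Roughly, the loop looks something like this (a hypothetical sketch, not the site's actual code; generate_image, critique_image, and rewrite_prompt stand in for whatever model clients you wire up):

```python
# Hypothetical sketch of the generate/evaluate/retry loop described above.
# generate_image, critique_image and rewrite_prompt are placeholders for real
# model clients, not the site's actual implementation.
def run_until_pass(prompt, models, max_retries=3):
    results = {}
    for model in models:
        attempt_prompt = prompt
        for attempt in range(max_retries + 1):
            image = generate_image(model, attempt_prompt)   # text-to-image / edit call
            verdict = critique_image(image, prompt)         # separate multimodal judge
            if verdict["passed"]:
                results[model] = {"image": image, "attempts": attempt + 1}
                break
            # Feed the judge's critique back in and retry with a rewritten prompt.
            attempt_prompt = rewrite_prompt(prompt, verdict["feedback"])
        else:
            results[model] = {"image": image, "attempts": max_retries + 1, "failed": True}
    return results
```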
https://genai-showdown.specr.net/image-editing
Or there's another very similar site. But I'm pretty sure it's yours
How often do you update it? It seems like something new every time I check. Or I forget everything..
> Pico-Banana-400K serves as a versatile resource for advancing controllable and instruction-aware image editing. Beyond single-step editing, the dataset enables multi-turn, conversational editing and reward-based training paradigms.
I'm happy to see something from Apple but this seems so low-tech that it could be one of my own local ComfyUI workflows.
"You wouldn't steal a car," but anyone can distill an expensive, fully trained model in order to build their own.
This is going to be one of the most important categories of image model. It's good that we have more than Google and the Chinese (ByteDance, et al) with competent editing models. I don't think Flux Kontext is keeping up.
It'd be really nice if we had a Nano Banana-caliber model as open source.
And clearly, if training on copyrighted material is fair use, as every LLM maker claims, then this license has literally no weight.
Also, NAL but IIRC an automatically generated dataset isn't copyrightable in the first place.
Even in the US, I think the situation is complex. If I prompt an LLM to edit a copyrighted human-written text, the LLM output is going to be copyrighted, because even if the LLM’s changes aren’t copyrightable, the underlying text is. And what happens if an LLM proposes edits, and then a human uses their own judgement to decide which LLM edits to accept and which not to? That act of human judgement might provide grounds for copyrightability which weren’t present in the raw LLM output.
Definitely very useful, but I’m so curious how the original datasets for these image-editing models were created. I’m guessing a lot of it is synthetic data, with scenes constructed programmatically from layers.
Personally it makes me less likely to read it but the content might be useful. I have some general tech interest but am not overwhelmingly interested in the subject. Sometimes good things crop up on HN too.
Now, if an author were writing for an audience with the intention of attracting people who are not yet enthusiasts and turning them into enthusiasts of their product, they would create something readable and attractive. The LLM hasn't done that here.
Together, this leads me to think that the readme is not for me but is just for dedicated enthusiasts.
I guess that makes me an LLM
As an aside, perhaps they're using GPT/Codex for coding. Did anyone else notice the use of emojis and → in their code?
https://lmarena.ai/leaderboard/image-edit
https://genai-showdown.specr.net/image-editing
Don’t get me started on how “agent” is a term of art that means absolutely nothing, encompassing everything from a plain old shell script to a full language model.
> Dataset Statistics
> Nano-Banana-400K contains ~400K image editing data, covering a wide visual and semantic range drawn from real-world imagery.
For example, you might ask the second model to cover the person’s face with a black square; a VLM notes that the person is a man with brown hair and round glasses. Then, during training, the resulting image is presented along with the prompt, “Remove the black square from the man’s face. He has brown hair and round glasses.”
The model now learns how to remove black squares and replace them with a man’s face with brown hair and round glasses.
Since the training data is easily synthesized using existing models, you can generate enormous amounts of it, often very cheaply. For specialized editing tasks, this technique is really powerful. Build a training set for your special-purpose task, fine-tune an existing image-editing model such as Qwen Image Edit to produce a new checkpoint or LoRA (often a LoRA is more than good enough), and then you have a special-purpose model that performs whatever narrow editing task you need on your image data.
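As a rough illustration of that inverse-edit trick, here's a minimal sketch; the face box is assumed to come from some detector, and describe_face is a stand-in for a VLM captioning call, not part of the dataset's actual pipeline:

```python
# Minimal sketch of the inverse-edit data synthesis described above.
# Assumptions: face_box comes from some external detector, and describe_face()
# is a placeholder for a VLM captioning call (not a real library function).
from PIL import Image, ImageDraw

def make_training_pair(original_path, face_box):
    """Return (source_image, target_image, instruction) for an 'uncover the face' edit."""
    target = Image.open(original_path).convert("RGB")          # ground-truth output
    source = target.copy()
    ImageDraw.Draw(source).rectangle(face_box, fill="black")   # degraded input with the square

    description = describe_face(target, face_box)              # e.g. "a man with brown hair and round glasses"
    instruction = f"Remove the black square from the face. It belongs to {description}."
    return source, target, instruction
```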
If the commands all follow the same syntax, it's easy to imagine how you can generate a good training set.
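For instance, a handful of fill-in-the-blank templates already yields a large set of uniformly worded instructions; a toy sketch (the templates and vocabulary are invented here, not taken from the dataset):

```python
# Toy sketch of templated edit-instruction generation; templates and word lists
# are made up for illustration, not drawn from Pico-Banana-400K.
import random

TEMPLATES = [
    "Change the {obj} to {color}.",
    "Remove the {obj} from the image.",
    "Replace the {obj} with a {other}.",
]
OBJECTS = ["car", "hat", "lamp", "dog"]
COLORS = ["red", "blue", "green"]

def sample_instruction():
    # Extra keyword arguments are ignored by templates that don't use them.
    return random.choice(TEMPLATES).format(
        obj=random.choice(OBJECTS),
        color=random.choice(COLORS),
        other=random.choice(OBJECTS),
    )

print(sample_instruction())  # e.g. "Change the hat to blue."
```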
But how do they fully grasp natural language well enough to perform tasks worded unexpectedly, which would only be easy to parse if they understood natural language?
A Large Language Model. Pardon me for spelling out the full acronym, but it is what it is for a reason.
I think a lot of the whiz-bang applications of LLMs have drowned it out, but LLMs are effectively the solution to the long-standing problem of natural language understanding, and that alone would be enough to make them a ground-breaking technology. Taking English text and translating it with very high fidelity into the vector space these models understand is amazing and I think somewhat underappreciated.
2 more comments available on Hacker News