Gemini 3 Pro Vs. 2.5 Pro in Pokemon Crystal
Key topics
An experiment pitting Gemini 3 Pro against 2.5 Pro in the notoriously tricky Goldenrod Underground puzzle in Pokémon Crystal has sparked a lively debate about the capabilities and limitations of large language models (LLMs). Commenters weighed in on the models' performance: some attributed their struggles to the puzzle's baffling design, while others wondered whether familiarity with the game or online walkthroughs influenced the results. As participants dissected the models' strengths and weaknesses, disagreements also emerged over the quality of recent model updates, with some praising improvements and others claiming a decline. The thread's relevance lies in its timely examination of LLMs' abilities and the potential for benchmark-chasing to shape model development.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 4d after posting
- Peak period: 71 comments in 96-108h
- Avg / period: 19 comments
- Based on 95 loaded comments
Key moments
1. Story posted: Dec 16, 2025 at 7:48 AM EST (17 days ago)
2. First comment: Dec 20, 2025 at 10:14 AM EST (4d after posting)
3. Peak activity: 71 comments in 96-108h (hottest window of the conversation)
4. Latest activity: Dec 23, 2025 at 3:20 PM EST (10 days ago)
That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.
Ummm... what? It's currently my best coding model.
Citation?
https://x.com/simonw/status/1924909405906338033
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
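A minimal sketch of this probe, assuming a placeholder generate_image function standing in for whatever generation API is under test; the idea is simply to render both familiar and deliberately novel prompts and compare quality by eye.

```python
import os

# Mix of prompts: the first resembles common benchmark/training material,
# the rest are deliberately unusual scenes a benchmark-tuned model is
# unlikely to have been optimized for.
PROBE_PROMPTS = [
    "a golden retriever catching a frisbee in a park",
    "a giraffe walking a tightrope",
    "a car sitting at a cafe eating a pizza",
]

def run_probe(generate_image, out_dir="probe_outputs"):
    """Render one image per prompt so a human can compare quality side by side.

    generate_image is a hypothetical callable: prompt -> raw image bytes.
    A large quality gap between familiar and novel prompts suggests
    benchmark-specific tuning; similar quality suggests there isn't any.
    """
    os.makedirs(out_dir, exist_ok=True)
    for i, prompt in enumerate(PROBE_PROMPTS):
        image_bytes = generate_image(prompt)
        with open(os.path.join(out_dir, f"probe_{i}.png"), "wb") as f:
            f.write(image_bytes)
```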
Does this even have any effect?
It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.
"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."
Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. It's the whole point of the ARC-AGI tests.
Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and unsurprisingly it did much better in Pokémon.
The benchmarks, by design, indicate you can mix up the switch puzzle pattern and it will still solve it.
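One way to test that claim outside a fixed benchmark is to randomize the puzzle itself, so a memorized answer no longer applies. A toy sketch of such a generator, not tied to the actual Crystal switch layout:

```python
import random

def make_switch_puzzle(n_switches=3, seed=None):
    """Create a randomized switch puzzle: each switch toggles a random
    subset of doors. (Toy model, not the real Crystal layout.)"""
    rng = random.Random(seed)
    n_doors = n_switches
    # effects[s] = set of doors toggled by switch s
    return [set(rng.sample(range(n_doors), rng.randint(1, n_doors)))
            for _ in range(n_switches)]

def solve(effects):
    """Brute-force the subset of switches that leaves every door open."""
    n = len(effects)
    for mask in range(1 << n):
        doors = [0] * n  # all doors start closed
        for s in range(n):
            if mask & (1 << s):
                for d in effects[s]:
                    doors[d] ^= 1
        if all(doors):
            return [s for s in range(n) if mask & (1 << s)]
    return None  # some random instances have no solution

# Each seed yields a different puzzle instance; a model that has only
# memorized one fixed pattern should fail on most of these variants.
effects = make_switch_puzzle(seed=42)
print(effects, solve(effects))
```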
It's a big part of why search overview summaries are so awful. Many times the answers are not grounded in the material.
Instead, what can happen is that, like a human, the model (hopefully) disregards the instruction, making it carry (close to) zero weight.
Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.
If you looked inside, they would be spinning on something like: "Oh, I know this is the tile to walk on, but I have to rely only on what I observe! I will do another task instead to satisfy my conditions and not reveal that I have pre-knowledge."
LLMs are literal douche genies. The less you say, generally, the better
This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.
But it runs in the browser and works with any supplied ROM; none of it is Pokémon-specific, so I should set aside time to serve it and make the code available.
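For anyone curious what such a harness looks like in outline, here is a minimal Python sketch of the loop (the commenter's version runs in the browser, so this is only a stand-in). It assumes the PyBoy Game Boy emulator's 2.x API, which should be checked against the installed version, and choose_button is a placeholder for the actual model call.

```python
from pyboy import PyBoy  # pip install pyboy; method names assumed from PyBoy 2.x

VALID_BUTTONS = {"a", "b", "start", "select", "up", "down", "left", "right"}

def choose_button(screenshot):
    """Placeholder: send the screenshot to an LLM and parse one button name."""
    return "a"

def play(rom_path, steps=1000):
    pyboy = PyBoy(rom_path)  # any supplied ROM, nothing Pokemon-specific
    try:
        for _ in range(steps):
            pyboy.tick()                 # advance one frame
            frame = pyboy.screen.image   # current screen as a PIL image (assumed API)
            button = choose_button(frame)
            if button in VALID_BUTTONS:
                pyboy.button(button)     # press-and-release the chosen button
    finally:
        pyboy.stop()

# play("crystal.gbc")
```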
And yeah, it's not the insanely priced AI Ultra plan, but if there are any hard limits on Gemini Pro usage I haven't found them. I have played a lot with really long Antigravity sessions to try to figure out what this thing is good for, and it seems like it will pretty much sit there and run all day. (And I can't really blame anyone for still remaining mad about AI to be completely honest, but the technology is too neat by this point to just completely ignore it.)
Seeing as Google is still giving away a bunch of free access, I'm guessing they're still in the ultra-cash-burning phase of things. My hope (hopium, realistically) is that by the time all of the cash burning is over, there will be open-weight local models that are striking near where Gemini 3 Pro strikes today. It doesn't have to be as good, getting nearby on hardware consumers can afford would be awesome.
But I'm not holding my breath, so let's hope the cash burning continues for a few years.
(There is, of course, the other way to look at it, which is that looking at the pricing per token may not tell the whole story. Given that Google is running their own data centers, it's possible the economic proposition isn't as bad as it looks. OTOH, it's also possible it is worse than it looks, if they happen to be selling tokens at a loss... but I quite doubt it, given they are currently SOTA and can charge a premium.)
I was unclear if this meant that the API was overloaded or if he was on a subscription plan and had hit his maximum.
If you limit your token count to a fraction of 2 billion tokens, you can try it on your own game, and of course have it complete a shorter fraction of the game.
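A minimal sketch of enforcing that kind of budget around an agent loop; call_model and count_tokens are placeholders for whatever SDK is in use.

```python
class TokenBudget:
    """Stop an agent run once a cumulative token budget has been spent."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, tokens_this_call):
        self.used += tokens_this_call

    @property
    def exhausted(self):
        return self.used >= self.max_tokens

# Usage inside an agent loop (placeholders, not a real SDK):
# budget = TokenBudget(max_tokens=20_000_000)  # e.g. 1% of ~2B tokens
# while not budget.exhausted and not game_finished():
#     response = call_model(current_screenshot())
#     budget.record(count_tokens(response))
```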
Did the streamer get subsidized by Google?
(The stream isn't run by Google themselves, is it?)
In other words, how much of this improvement is true generalization vs memorization?
Just don’t confuse it with a random benchmark!
That said, this writeup itself will probably be scraped and influence Gemini 4.
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
Haven't tried a Figma design, but I built an internal tool entirely via instructions to an agent. The kind of work I would previously have quoted at 3 weeks.
It's hard to generalize but modern frontend is very good at isolating you from dealing with complex state machine states and you're dealing with single user/limited concurrency. It's usually easy to find all references/usecases for something.
Most modern backend is building consistent distributed state machines, you need to cover all the edge cases, deal with concurrency, different clients/contracts etc. I would say getting BE right (beyond simple CRUD) is going to be hard for LLM simply because the context is usually wider and hard to compress/isolate.
Seeing the kind of complexity that agents (not standalone LLMs) are able to navigate, I can only start to believe it's just a matter of time before they can do all kinds of programming, including state-of-the-art backend programming, even writing a database on their own. The good thing about backend is that it's easily testable, and if there is documentation a developer can read and comprehend, an LLM/agent will be able to do the same, not very far from today.
AI is the best at adding standard things into standard boilerplate situations, all those frameworks just makes it easier for AI. They also make it easier for humans once you know them and have seen examples, that is why they exist, once you know those frontend is not hard.
Testing is one of the things that's generally tedious in front end applications, but not inherently complex. There may be lots of config needed (e.g. for setting up and controlling a headless browser), and long turnarounds because tests are slow and shaky. But they are also boilerplatey.
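As an illustration of that boilerplate, a minimal headless-browser test using Playwright's Python API; the URL and selectors are made up.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def test_login_form():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless browser setup
        page = browser.new_page()
        page.goto("http://localhost:3000/login")    # hypothetical local app
        page.fill("#email", "user@example.com")     # made-up selectors
        page.fill("#password", "hunter2")
        page.click("button[type=submit]")
        page.wait_for_selector("text=Welcome")      # assert a visible outcome
        browser.close()

if __name__ == "__main__":
    test_login_form()
```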
That said, I still get surprising results from time to time, it just takes a lot more curation and handholding.
- History, most likely
It did ultimately decide Ozzy was alive. I pushed back on that, and it instantly corrected itself and partially blamed my query "what is he up to" for being formulated as if he was alive.
Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
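Mechanically, yes: in a chat loop, everything the model has already emitted is appended to the context and fed back in, so an assumption stated earlier becomes part of the "state" that later turns build on. A toy sketch of that loop with a placeholder generate function, including the kind of "Wait..." nudge mentioned above (shown here as an assistant-prefill trick, which only some APIs support):

```python
def generate(messages):
    """Placeholder for a real LLM call: takes a message list, returns text."""
    raise NotImplementedError

def chat_turn(history, user_message, reconsider=False):
    """Append a user turn, generate a reply, and feed everything back next time.

    Because the model's own earlier output is part of `history`, any assumption
    it stated earlier is treated as established context on later turns.
    Prefixing the new assistant turn with "Wait," is one crude way to prompt it
    to re-examine that context instead of building on it.
    """
    history = history + [{"role": "user", "content": user_message}]
    if reconsider:
        history = history + [{"role": "assistant",
                              "content": "Wait, let me re-check that assumption."}]
    reply = generate(history)
    return history + [{"role": "assistant", "content": reply}], reply
```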