Gemini 3 Pro Vs. 2.5 Pro in Pokemon Crystal
Key topics
An experiment pitting Gemini 3 Pro against 2.5 Pro in the notoriously tricky Goldenrod Underground puzzle in Pokémon Crystal has sparked a lively debate about the capabilities and limitations of large language models (LLMs). Commenters weighed in on the models' performance: some attributed their struggles to the puzzle's baffling design, while others wondered whether familiarity with the game or online walkthroughs influenced the results. As participants dissected the models' strengths and weaknesses, disagreements also emerged over the quality of recent model updates, with some praising improvements and others claiming a decline. The thread's relevance lies in its timely examination of LLMs' abilities and the potential for benchmark-chasing to shape model development.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 4d after posting
- Peak period: 71 comments in 96-108h
- Avg / period: 19 comments
- Based on 95 loaded comments
Key moments
1. Story posted: Dec 16, 2025 at 7:48 AM EST (17 days ago)
2. First comment: Dec 20, 2025 at 10:14 AM EST (4d after posting)
3. Peak activity: 71 comments in 96-108h (hottest window of the conversation)
4. Latest activity: Dec 23, 2025 at 3:20 PM EST (10 days ago)
That said, it's definitely Gem's fault that it struggled so long, considering it ignored the NPCs that give clues.
Ummm... what? It's currently my best coding model.
Citation?
https://x.com/simonw/status/1924909405906338033
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.
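A minimal sketch of this probe, assuming a placeholder generate_image function standing in for whatever generation API is under test; the idea is simply to render both familiar and deliberately novel prompts and compare quality by eye.

```python
import os

# Mix of prompts: the first resembles common benchmark/training material,
# the rest are deliberately unusual scenes a benchmark-tuned model is
# unlikely to have been optimized for.
PROBE_PROMPTS = [
    "a golden retriever catching a frisbee in a park",
    "a giraffe walking a tightrope",
    "a car sitting at a cafe eating a pizza",
]

def run_probe(generate_image, out_dir="probe_outputs"):
    """Render one image per prompt so a human can compare quality side by side.

    generate_image is a hypothetical callable: prompt -> raw image bytes.
    A large quality gap between familiar and novel prompts suggests
    benchmark-specific tuning; similar quality suggests there isn't any.
    """
    os.makedirs(out_dir, exist_ok=True)
    for i, prompt in enumerate(PROBE_PROMPTS):
        image_bytes = generate_image(prompt)
        with open(os.path.join(out_dir, f"probe_{i}.png"), "wb") as f:
            f.write(image_bytes)
```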
Does this even have any effect?
It definitely doesn't wipe its internal knowledge of Crystal clean (that's not how LLMs work). My guess is that it slightly encourages the model to explore more and second-guess its likely very strong Crystal game knowledge, but that's about it.
"It is worth noting that the instruction to "ignore internal knowledge" played a role here. In cases like the shutters puzzle, the model did seem to suppress its training data. I verified this by chatting with the model separately on AI Studio; when asked directly multiple times, it gave the correct solution significantly more often than not. This suggests that the system prompt can indeed mask pre-trained knowledge to facilitate genuine discovery."
Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. It's the whole point of the ARC-AGI tests.
Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and unsurprisingly it did much better in Pokémon.
The benchmarks, by design, indicate you can mix up the switch puzzle pattern and it will still solve it.
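One way to test that claim outside a fixed benchmark is to randomize the puzzle itself, so a memorized answer no longer applies. A toy sketch of such a generator, not tied to the actual Crystal switch layout:

```python
import random

def make_switch_puzzle(n_switches=3, seed=None):
    """Create a randomized switch puzzle: each switch toggles a random
    subset of doors. (Toy model, not the real Crystal layout.)"""
    rng = random.Random(seed)
    n_doors = n_switches
    # effects[s] = set of doors toggled by switch s
    return [set(rng.sample(range(n_doors), rng.randint(1, n_doors)))
            for _ in range(n_switches)]

def solve(effects):
    """Brute-force the subset of switches that leaves every door open."""
    n = len(effects)
    for mask in range(1 << n):
        doors = [0] * n  # all doors start closed
        for s in range(n):
            if mask & (1 << s):
                for d in effects[s]:
                    doors[d] ^= 1
        if all(doors):
            return [s for s in range(n) if mask & (1 << s)]
    return None  # some random instances have no solution

# Each seed yields a different puzzle instance; a model that has only
# memorized one fixed pattern should fail on most of these variants.
effects = make_switch_puzzle(seed=42)
print(effects, solve(effects))
```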
It's a big part of why search overview summaries are so awful. Many times the answers are not grounded in the material.
Instead, what can happen is that, like a human, the model (hopefully) disregards the instruction, making it carry (close to) zero weight.
Whether the 'effect' is something implied by the prompt, or even something we can understand, is a totally different question.
If you looked inside, they would be spinning on something like: "Oh, I know this is the tile to walk on, but I have to rely only on what I observe! I will do another task instead to satisfy my conditions and not reveal that I have pre-knowledge."
LLMs are literal douche genies. The less you say, generally, the better
This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.
But it runs in the browser and works with any supplied ROM; none of it is Pokémon-specific, so I should set aside time to serve it and make the code available.
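For anyone curious what such a harness looks like in outline, here is a minimal Python sketch of the loop (the commenter's version runs in the browser, so this is only a stand-in). It assumes the PyBoy Game Boy emulator's 2.x API, which should be checked against the installed version, and choose_button is a placeholder for the actual model call.

```python
from pyboy import PyBoy  # pip install pyboy; method names assumed from PyBoy 2.x

VALID_BUTTONS = {"a", "b", "start", "select", "up", "down", "left", "right"}

def choose_button(screenshot):
    """Placeholder: send the screenshot to an LLM and parse one button name."""
    return "a"

def play(rom_path, steps=1000):
    pyboy = PyBoy(rom_path)  # any supplied ROM, nothing Pokemon-specific
    try:
        for _ in range(steps):
            pyboy.tick()                 # advance one frame
            frame = pyboy.screen.image   # current screen as a PIL image (assumed API)
            button = choose_button(frame)
            if button in VALID_BUTTONS:
                pyboy.button(button)     # press-and-release the chosen button
    finally:
        pyboy.stop()

# play("crystal.gbc")
```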
And yeah, it's not the insanely priced AI Ultra plan, but if there are any hard limits on Gemini Pro usage I haven't found them. I have played a lot with really long Antigravity sessions to try to figure out what this thing is good for, and it seems like it will pretty much sit there and run all day. (And I can't really blame anyone for still remaining mad about AI to be completely honest, but the technology is too neat by this point to just completely ignore it.)
Seeing as Google is still giving away a bunch of free access, I'm guessing they're still in the ultra-cash-burning phase of things. My hope (hopium, realistically) is that by the time all of the cash burning is over, there will be open-weight local models that are striking near where Gemini 3 Pro strikes today. It doesn't have to be as good, getting nearby on hardware consumers can afford would be awesome.
But I'm not holding my breath, so let's hope the cash burning continues for a few years.
(There is, of course, the other way to look at it, which is that looking at the pricing per token may not tell the whole story. Given that Google is running their own data centers, it's possible the economic proposition isn't as bad as it looks. OTOH, it's also possible it is worse than it looks, if they happen to be selling tokens at a loss... but I quite doubt it, given they are currently SOTA and can charge a premium.)
I was unclear if this meant that the API was overloaded or if he was on a subscription plan and had hit his maximum.
If you limit your token count to a fraction of 2 billion tokens, you can try it on your own game, and of course have it complete a shorter fraction of the game.
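A minimal sketch of enforcing that kind of budget around an agent loop; call_model and count_tokens are placeholders for whatever SDK is in use.

```python
class TokenBudget:
    """Stop an agent run once a cumulative token budget has been spent."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def record(self, tokens_this_call):
        self.used += tokens_this_call

    @property
    def exhausted(self):
        return self.used >= self.max_tokens

# Usage inside an agent loop (placeholders, not a real SDK):
# budget = TokenBudget(max_tokens=20_000_000)  # e.g. 1% of ~2B tokens
# while not budget.exhausted and not game_finished():
#     response = call_model(current_screenshot())
#     budget.record(count_tokens(response))
```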
Did the streamer get subsidized by Google?
(The stream isn't run by Google themselves, is it?)
In other words, how much of this improvement is true generalization vs memorization?
Just don’t confuse it with a random benchmark!
That said, this writeup itself will probably be scraped and influence Gemini 4.
(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
Haven't tried a Figma design, but I built an internal tool entirely via instructions to an agent. The kind of work I would previously have quoted at 3 weeks.
It's hard to generalize but modern frontend is very good at isolating you from dealing with complex state machine states and you're dealing with single user/limited concurrency. It's usually easy to find all references/usecases for something.
Most modern backend is building consistent distributed state machines, you need to cover all the edge cases, deal with concurrency, different clients/contracts etc. I would say getting BE right (beyond simple CRUD) is going to be hard for LLM simply because the context is usually wider and hard to compress/isolate.
Seeing the kind of complexity that agents (not standalone LLMs) are able to navigate, I can only start to believe it's just a matter of time before they can do all kinds of programming, including state-of-the-art backend programming, even writing a database on their own. The good thing about backend is that it's easily testable, and if there is documentation a developer can read and comprehend, an LLM/agent will be able to do the same, not very far from today.
AI is the best at adding standard things into standard boilerplate situations, all those frameworks just makes it easier for AI. They also make it easier for humans once you know them and have seen examples, that is why they exist, once you know those frontend is not hard.
Testing is one of the things that's generally tedious in front end applications, but not inherently complex. There may be lots of config needed (e.g. for setting up and controlling a headless browser), and long turnarounds because tests are slow and shaky. But they are also boilerplatey.
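As an illustration of that boilerplate, a minimal headless-browser test using Playwright's Python API; the URL and selectors are made up.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def test_login_form():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless browser setup
        page = browser.new_page()
        page.goto("http://localhost:3000/login")    # hypothetical local app
        page.fill("#email", "user@example.com")     # made-up selectors
        page.fill("#password", "hunter2")
        page.click("button[type=submit]")
        page.wait_for_selector("text=Welcome")      # assert a visible outcome
        browser.close()

if __name__ == "__main__":
    test_login_form()
```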
That said, I still get surprising results from time to time, it just takes a lot more curation and handholding.
- History, most likely
It did ultimately decide Ozzy was alive. I pushed back on that, and it instantly corrected itself and partially blamed my query "what is he up to" for being formulated as if he was alive.
Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
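Mechanically, yes: in a chat loop, everything the model has already emitted is appended to the context and fed back in, so an assumption stated earlier becomes part of the "state" that later turns build on. A toy sketch of that loop with a placeholder generate function, including the kind of "Wait..." nudge mentioned above (shown here as an assistant-prefill trick, which only some APIs support):

```python
def generate(messages):
    """Placeholder for a real LLM call: takes a message list, returns text."""
    raise NotImplementedError

def chat_turn(history, user_message, reconsider=False):
    """Append a user turn, generate a reply, and feed everything back next time.

    Because the model's own earlier output is part of `history`, any assumption
    it stated earlier is treated as established context on later turns.
    Prefixing the new assistant turn with "Wait," is one crude way to prompt it
    to re-examine that context instead of building on it.
    """
    history = history + [{"role": "user", "content": user_message}]
    if reconsider:
        history = history + [{"role": "assistant",
                              "content": "Wait, let me re-check that assumption."}]
    reply = generate(history)
    return history + [{"role": "assistant", "content": reply}], reply
```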