Gemini 3 Pro Model Card [pdf]
Key topics
The Gemini 3 Pro model card was accidentally released by Google DeepMind, revealing details about the model's training data and performance, sparking discussion among the HN community about its capabilities and implications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 20m after posting
- Peak period: 60 comments in 2-4h
- Avg / period: 16
Based on 160 loaded comments
Key moments
- Story posted: Nov 18, 2025 at 6:12 AM EST
- First comment: Nov 18, 2025 at 6:31 AM EST (20m after posting)
- Peak activity: 60 comments in 2-4h, the hottest window of the conversation
- Latest activity: Nov 19, 2025 at 12:03 PM EST
Well, don't complain when you're using Gmail and your emails are being used to train Gemini.
I don't expect them to follow their own privacy policies.
[0] https://www.yahoo.com/news/articles/google-sued-over-gemini-...
Also notable which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub against Grok 4 / Grok 4.1.
https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests Grok has about 0.6% of chat use cases, well below the other big names, and I suspect its share in chat is higher than in other scenarios like business usage. Given all that, I can see how Gemini might not be focused on competing with them.
It is understandable that Grok is not popular.
I would want to hear more detail about prompts, frameworks, thinking time, etc., but they don't matter too much. The main caveat would be that this is probably on the public test set, so could be in pretraining, and there could even be some ARC-focussed post-training - I think we don't know yet and might never know.
But for any reasonable setup, if no egregious cheating, that is an amazing score on ARC 2.
* https://x.com/arcprize/status/1990820655411909018
* https://arcprize.org/guide
One month Gemini is on top, then ChatGPT, then Anthropic. Not sure why everyone gets FOMO whenever a new version gets released.
I don't think any other company has all these ingredients.
Microsoft has the best chance of changing habits, by virtue of being bundled into business contracts with companies whose policies don't allow any other product in the workplace.
Elaborate please. Are you saying that MS is forcing customers to make Copilot the only allowed LLM product?
Microsoft has contracts to provide software to companies. Those companies have policies that allow only the software and AI provided under such contracts. Ipso facto.
They have a long way to go to become profitable, though. Those users will get less sticky when OpenAI starts upping their pricing/putting ads everywhere/making the product worse to save money/all of the above.
Even other search competitors have not proven to be a danger to Google.
Or maybe Google just benchmaxxed and this doesn't translate at all in real world performance.
TBD if that performance generalizes to other real world tasks.
2) Google's search revenue last quarter was $56 billion, a 14% increase over Q3 2024.
2) I'm not suggesting this will happen overnight, but younger people especially gravitate towards LLMs for information search + actively use some sort of ad blocking. In the long run it doesn't look great for Google.
[1] The binomial formula gives a 95% confidence interval of ±3.7 percentage points, using p=0.77 and N=500.
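A minimal sketch of that arithmetic, assuming the commenter used the standard normal (Wald) approximation for a binomial proportion; the p, N, and z values below come from the footnote above:

```python
import math

# Normal-approximation (Wald) margin of error for a binomial proportion.
p = 0.77   # observed pass rate
n = 500    # number of benchmark instances
z = 1.96   # z-score for a 95% confidence level

margin = z * math.sqrt(p * (1 - p) / n)
print(f"95% CI: {p:.2f} +/- {margin:.3f}")  # ~0.77 +/- 0.037, i.e. about 3.7 points
```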
But we'll have to wait a few weeks to see if the nerfed model post-release is still as good.
Having said that, OpenAI's ridiculous hype cycle has been living on borrowed time. OpenAI has zero moat, and are just one vendor in a space with many vendors, and even incredibly competent open source models by surprise Chinese entrants. Sam Altman going around acting like he's a prophet and they're the gatekeepers of the future is an act that should be super old, but somehow fools and their money continue to be parted.
Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?
Also I really hoped for a 2M+ context. I'm living on the context edge even with 1M.
So far, IMHO, Claude Code remains significantly better than Gemini CLI. We'll see whether that changes with Gemini 3.
Not that Google didn't use to have problems shipping useful things. But it's gotten a lot worse.
That's because coding is currently the only reliable benchmark where reasoning capabilities transfer to and predict capabilities in other professions like law. Coding is also the only area where they are shy about releasing numbers. All these exam scores are fakeable by gaming those benchmarks.
EDIT: Don't disagree that Gemini CLI has a lot of rough edges, though.
Claude Code seems to be more compatible with the model (or the reverse), whereas gemini-cli still feels a bit awkward (as of 2.5 Pro). I'm hoping it's better with 3.0!
https://www.reddit.com/r/Bard/comments/1p093fb/gemini_3_in_c...
EDIT: formatting, hopefully a bit more mobile friendly
What do you mean? These coding leaderboards were at single digits about a year ago and are now in the seventies. These frontier models are arguably already better at the benchmark than any single human - it's unlikely that any particular human dev is knowledgeable enough to tackle the full range of diverse tasks, even in the smaller SWE-Bench Verified, within a reasonable time frame; to the best of my knowledge, no one has tried that.
Why should we expect this to be the limit? Once the frontier labs figure out how to train these fully with self-play (which shouldn't be that hard in this domain), I don't see any clear limit to the level they can reach.
What field are you in where you feel that there might not have been any growth in capabilities at all?
What makes me even more curious is the following
> Model dependencies: This model is not a modification or a fine-tune of a prior model
So did they start from scratch with this one?
My hunch is that, aside from "safety" reasons, the Google Books lawsuit left some copyright wounds that Google did not want to reopen.
As of a couple weeks ago (the last time I checked) if you are signed in to multiple Google accounts and you cannot accept the non-commercial terms for one of them for AI Studio, the site is horribly broken (the text showing which account they’re asking you to agree to the terms for is blurred, and you can’t switch accounts without agreeing first).
In Google’s very slight defense, Anthropic hasn’t even tried to make a proper sign in system.
Like, kind of unreasonably good. You’d expect some perfunctory Electron app that just barely wraps the website. But no, you get something that feels incredibly polished…more so than a lot of recent apps from Apple…and has powerful integrations into other apps, including text editors and terminals.
Gemini 1.0 was strictly worse than GPT-3.5 and was unusable due to "safety" features.
Google followed that up with 1.5 which was still worse than GPT-3.5 and unbelievably far behind GPT-4. At this same time Google had their "black nazi" scandals.
With Gemini 2.0, Google finally had a model that was at least useful for OCR, and with their fash series a model that, while not up to par in capabilities, was sufficiently inexpensive that it found uses.
Only with Gemini-2.5 did Google catch up with SoTA. It was within "spitting distance" of the leading models.
Google did indeed drop the ball, very, very badly.
I suspect that Sergey coming back helped immensely, somehow. I suspect that he was able to tame some of the more dysfunctional elements of Google, at least for a time.
To be fair, for my use case (apart from GitHub Copilot stuff with Claude Sonnet 4.5) I've never noticed too big of a difference between the actual models, and am more inclined to judge them by their ancillary services and speed, which Google excels at.
Unfortunate typo.
Anyone with money can trivially catch up to a state of the art model from six months ago.
And as others have said, late is really a function of spigot, guardrails, branding, and ux, as much as it is being a laggard under the hood.
How come Apple is struggling then?
They may want to use a 3rd party, or just wait for AI to become more stable and see how people actually use it, instead of adding slop to the core of their product.
Announcing a load of AI features on stage and then failing to deliver them doesn't feel very strategic.
To be fair to Apple, so far the only mass-market LLM use case is just a simple chatbot, and they don't seem to be interested in that. It remains to be seen if what Apple wants to do ("private" LLMs with access to your personal context, acting as intimate personal assistants) is even possible to do reliably. It sounds useful, and I do believe it will eventually be possible, but no one is there yet.
They did botch the launch by announcing the Apple Intelligence features before they were ready, though.
Enter late, enter great.
The biggest strides in the last 6-8 months have been in generative AIs, specifically for animation.
Their major version number bumps are a new pre-trained model. Minor bumps are changes/improvements to post-training on the same foundation.
On Terminal-Bench 2 for example, the leader is currently "Codex CLI (GPT-5.1-Codex)" at 57.8%, beating this new release.
This chart (comparing base models to base models) probably gives a better idea of the total strength of each model.
I can't wait to try 3.0; hopefully it continues this trend. Raw numbers in a table don't mean much, you can only get a true feeling once you use it on existing code, in existing projects. Anyway, the top labs keeping each other honest is great for us, the consumers.
I hope it isn't such a sycophant like the current Gemini 2.5 models; that makes me doubt its output, which is maybe a good thing now that I think about it.
What's with the hyperbole? It'll tighten the screws, but saying that it's "over for the other labs" might be a tad premature.
It's not over and never will be for two-decade-old accounting software, and it definitely will not be over for other AI labs.
The new Gemini is not THAT far of a jump that you'd switch your org to a new model if you've already invested in e.g. OpenAI.
The difference must be night and day to call it "it's over".
Right, they're all marginally different. Today Google fine-tuned their model to be better, tomorrow it will be a new Kimi, and after that DeepSeek.
Because it seems to lead by a decent margin on the former and trails behind on the latter.
LCB Pro is LeetCode-style questions, and SWE-bench Verified is a set of heavily benchmaxxed, very old Python tasks.
However, going above 75%, it is likely about the same. The remaining instances are likely underspecified despite the effort of the authors that made the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for other case Y' - which leads to losing or winning an instance.
Models have begun to fairly thoroughly saturate "knowledge" and such, but there are still considerable bumps there.
But the _big news_, and the demonstration of their achievement here, are the incredible scores they've racked up here for what's necessary for agentic AI to become widely deployable. t2-bench. Visual comprehension. Computer use. Vending-Bench. The sorts of things that are necessary for AI to move beyond an auto-researching tool, and into the realm where it can actually handle complex tasks in the way that businesses need in order to reap rewards from deploying AI tech.
Will be very interesting to see what papers are published as a result of this, as they have _clearly_ tapped into some new avenues for training models.
And here I was, all wowed, after playing with Grok 4.1 for the past few hours! xD
While the private questions don't seem to be included in the performance results, HLE will presumably flag any LLM that appears to have gamed its scores based on the differential performance on the private questions. Since they haven't yet, I think the scores are relatively trustworthy.
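As a minimal sketch of how such a differential check could work (the threshold, function name, and example scores below are hypothetical, not HLE's actual procedure):

```python
def flag_possible_gaming(public_acc: float, private_acc: float,
                         max_gap: float = 0.05) -> bool:
    """Flag a model whose accuracy on the public split exceeds its accuracy
    on the held-out split by more than max_gap (a hypothetical threshold)."""
    return (public_acc - private_acc) > max_gap

# A model scoring much higher on public questions than on private ones
# looks like it has gamed the benchmark; roughly equal scores do not.
print(flag_possible_gaming(0.38, 0.29))  # True
print(flag_possible_gaming(0.31, 0.30))  # False
```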
There are some Googlers behind the threads though (e.g. Hugh Zhang)
This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.
If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.
I feel like many will be pretty disappointed by their self-created expectations for this model when they end up actually using it and it turns out to be fairly similar to other frontier models.
Personally I'm very interested in how they end up pricing it.
The Wayback Machine still has it: https://web.archive.org/web/20251118111103/https://storage.g...
here’s the archived pdf: https://web.archive.org/web/20251118111103/https://storage.g...