Gemini 3
Mood: excited
Sentiment: mixed
Category: tech
Key topics: AI, Gemini 3
Google released Gemini 3, its latest multimodal AI model, available for early testing in AI Studio, Vertex AI, and Google Antigravity, with users sharing mixed experiences and benchmarking results.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 5m after posting
Peak period: 61 comments in Hour 1
Avg / period: 10.7 comments
Based on 160 loaded comments
Key moments
- Story posted: 11/18/2025, 3:09:38 PM (1d ago)
- First comment: 11/18/2025, 3:14:22 PM (5m after posting)
- Peak activity: 61 comments in Hour 1, the hottest window of the conversation
- Latest activity: 11/19/2025, 12:48:54 PM (6h ago)
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Edit: Now that I have access to the Gemini 3 preview, I've compared the results of the same one-shot prompts on Gemini 2.5 in the Gemini app's Canvas vs Gemini 3 in AI Studio, and they're very similar. I think the rumor of a stealth launch might be true.
How come I can't even see prices without logging in... are they doing regional pricing?
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
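For anyone who does set up billing, here is a minimal sketch of calling the preview model with a backoff on the quota error quoted above. It assumes the google-genai Python SDK; the model id is the one quoted later in the thread, and the API key placeholder and retry policy are purely illustrative.

```python
# Minimal sketch, assuming the google-genai Python SDK; key and retry policy are placeholders.
import time

from google import genai
from google.genai import errors

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder

def generate_with_backoff(prompt: str, retries: int = 3) -> str:
    delay = 10  # seconds between attempts, doubled on each retry
    for attempt in range(retries):
        try:
            response = client.models.generate_content(
                model="gemini-3-pro-preview",
                contents=prompt,
            )
            return response.text
        except errors.APIError as exc:
            # 429 / RESOURCE_EXHAUSTED is the "quota exceeded" case people are hitting.
            if exc.code == 429 and attempt < retries - 1:
                time.sleep(delay)
                delay *= 2
            else:
                raise

print(generate_with_backoff("Say hello in one sentence."))
```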
when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".
i haven't thrown a difficult problem at gemini 3 pro yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better; one was always more succinct and i liked its "style", but they usually offered about the same solution.
I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
For comparison:
Gemini 2.5 Pro was $1.25/M for input and $10/M for output.
Gemini 1.5 Pro was $1.25/M for input and $5/M for output.
https://goo.gle/enable-preview-features
-> https://github.com/google-gemini/gemini-cli/blob/release/v0....
--> https://goo.gle/geminicli-waitlist-signup
---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...
Honestly Google models have this mix of smart/dumb that is scary. Like if the universe is turned into paperclips then it'll probably be by a Google model.
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per grounded prompt (previous models) to per search query (Gemini 3).
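Back-of-envelope, using only the quoted prices: the break-even point is 35/14 = 2.5 searches per grounded prompt, so prompts that fan out into more searches than that get more expensive under the new scheme. A quick sketch; the searches-per-prompt figures are made-up workloads, not measurements.

```python
# Rough comparison of the quoted grounding prices; searches_per_prompt is an assumption.
old_price_per_prompt = 35 / 1000   # $35 per 1,000 grounded prompts
new_price_per_search = 14 / 1000   # $14 per 1,000 search queries

for searches_per_prompt in (1, 2.5, 4):
    new_price_per_prompt = new_price_per_search * searches_per_prompt
    print(f"{searches_per_prompt} searches/prompt: "
          f"old ${old_price_per_prompt:.3f} vs new ${new_price_per_prompt:.3f}")
# Break-even at 35/14 = 2.5 searches per grounded prompt.
```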
2026: cure cancer
Standard Context (≤ 200K tokens)
* Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)
* Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)
Long Context (> 200K tokens)
* Input: $4.00 vs $2.50 (same +60%)
* Output: $18.00 vs $15.00 (same +20%)
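To turn those per-million-token rates into a per-request cost, a small sketch; the token counts are arbitrary example workloads, and the prices are the standard-context numbers quoted above.

```python
# Cost per request at the standard-context (<= 200K token) rates quoted above.
# The example token counts are arbitrary, for illustration only.
def request_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Prices are USD per million tokens."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

gemini_25 = request_cost(100_000, 10_000, price_in=1.25, price_out=10.00)
gemini_3 = request_cost(100_000, 10_000, price_in=2.00, price_out=12.00)
print(f"Gemini 2.5 Pro: ${gemini_25:.3f}  Gemini 3 Pro: ${gemini_3:.3f} "
      f"({gemini_3 / gemini_25:.0%} of the 2.5 cost)")
```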
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
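In that spirit, a minimal sketch of what a private, held-out eval harness can look like; `ask_model` stands in for whichever client you use, and the two cases are placeholders for your own out-of-sample problems.

```python
# Minimal personal eval harness sketch; cases are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # pass/fail judgment on the raw reply

CASES = [
    Case("Reverse the string 'benchmark' and return only the result.",
         lambda reply: "kramhcneb" in reply),
    Case("What is 17 * 23? Answer with just the number.",
         lambda reply: "391" in reply),
]

def run_eval(ask_model: Callable[[str], str]) -> float:
    passed = sum(case.check(ask_model(case.prompt)) for case in CASES)
    score = passed / len(CASES)
    print(f"{passed}/{len(CASES)} passed ({score:.0%})")
    return score
```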
If you make some assumptions about the species of the snake, it can count as a basic python benchmark ;)
it's easy to focus on what they can't do
Ask it to implement tic-tac-toe in Python for command line. Or even just bring your own tic-tac toe code.
Then make it imagine playing against you and it's gonna be fast and reliable.
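For reference, a minimal sketch of the kind of command-line tic-tac-toe described here, roughly the shape of program you'd expect back from a one-shot prompt (two humans alternating at one terminal, no computer opponent).

```python
# Minimal CLI tic-tac-toe sketch: two players share one terminal, no AI opponent.
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
        (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
        (0, 4, 8), (2, 4, 6)]              # diagonals

def show(board):
    print("\n".join(" ".join(board[r * 3:r * 3 + 3]) for r in range(3)))

def winner(board):
    for a, b, c in WINS:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play():
    board, player = ["."] * 9, "X"
    while "." in board:
        show(board)
        raw = input(f"{player}, pick a cell (0-8): ").strip()
        if not raw.isdigit() or not 0 <= int(raw) <= 8 or board[int(raw)] != ".":
            print("Invalid move, try again.")
            continue
        board[int(raw)] = player
        if winner(board):
            show(board)
            print(f"{player} wins!")
            return
        player = "O" if player == "X" else "X"
    show(board)
    print("Draw.")

if __name__ == "__main__":
    play()
```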
Yeah, I have my own set of tests and the results are a bit unsettling, in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it does now.
GPT-4/o3 might be the best we will ever have
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
I like to compare them using chathub using the same prompts
Gemini still calls me "the architect" in half of the prompts. It's very cringe.
Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later once they'd had more time to use it and learn its strengths and weaknesses.
Is the only thing that prevents a benchmark from being meaningful publicity?
I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar as it encapsulates my coding preferences and communication style, that's the proper benchmark for me.
I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?
Thanks
Why? This is not obvious to me at all.
But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.
Problem solving ability is largely not from the pretraining data.
I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)
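A rough sketch of that idea: derive each question from a seed so every run is a fresh, unseen instance, with the ground-truth answer computed programmatically rather than looked up. The task here (drop the k smallest values, sum the rest) is just an illustrative stand-in.

```python
# Dynamically generated eval sketch: fresh problem instances with computed answers.
import random

def make_question(seed: int) -> tuple[str, str]:
    rng = random.Random(seed)
    xs = [rng.randint(1, 99) for _ in range(5)]
    k = rng.randint(1, 3)
    prompt = (f"Given the list {xs}, drop the {k} smallest values and report "
              "the sum of the rest. Answer with just the number.")
    answer = str(sum(sorted(xs)[k:]))
    return prompt, answer

def grade(reply: str, answer: str) -> bool:
    # Accept the exact number as the final token of a short answer.
    tokens = reply.strip().split()
    return bool(tokens) and tokens[-1].rstrip(".") == answer

prompt, answer = make_question(seed=42)
print(prompt)
print("expected:", answer)
```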
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.
While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.
my bad to the google team for the cursory brush off.
> my python one needs to be down weighted or supplanted.
To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.
I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".
I constantly see failures in trivial vector projections, broken bash scripts that don't properly quote variables (they fail if there's a space in a filename), and a near-complete inability to do relatively basic image-processing tasks (if they don't rely on template matches).
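For the vector-projection case specifically, the reference answer is just proj_b(a) = (a·b / b·b)·b, which makes a mechanical check easy; a small NumPy illustration (NumPy is assumed here purely for the example).

```python
# Orthogonal projection of a onto b: proj_b(a) = (a.b / b.b) * b
import numpy as np

def project(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the component of vector a along vector b."""
    return (np.dot(a, b) / np.dot(b, b)) * b

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])
print(project(a, b))  # -> [3. 0.], the component of a along the x-axis
```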
I accidentally spent $50 on Gemini 2.5 Pro last week, with Roo, trying to make a simple mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
"Create me a SVG of a pelican riding on a bicycle"
So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt.
Like replacing named concepts with nonsense words in reasoning benchmarks.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
The questions AND the answers are public.
If the LLM manages through reasoning OR memory to repeat back the answer then they win.
The scores represent the % of correct answers they recalled.
You could question how well this works, but it’s not like the answers are just hanging out on the public internet.
The person also claims that with thinking on, the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard
Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.
So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini 3 created a great "Executive Summary", identified the speakers' names, and then gave me a second-by-second transcript:
[00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
Super impressive!

Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
Temperature continues to be gated to a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
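For comparison, a minimal sketch of where those knobs live when you call the API directly instead of going through the AI Studio UI, assuming the google-genai Python SDK; whether the backend actually honors them for the preview model is exactly the commenter's open question, and the key and prompt are placeholders.

```python
# Sketch of setting sampling parameters via the API, assuming the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # hypothetical placeholder

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Summarize the trade-offs of a low sampling temperature.",
    config=types.GenerateContentConfig(
        temperature=0.2,   # the ceiling the commenter reports in AI Studio
        top_k=64,          # the hidden default mentioned above
        top_p=0.95,
        max_output_tokens=512,
    ),
)
print(response.text)
```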
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
854 more comments available on Hacker News