Back to Home11/18/2025, 3:09:38 PM

Gemini 3

preek

1638 points

1014 comments

Mood

excited

Sentiment

mixed

Discussion Activity

Very active discussion

First comment

Peak period

Hour 1

Avg / period

10.7

Comment distribution160 data points

Based on 160 loaded comments

Key moments

01Story posted
11/18/2025, 3:09:38 PM
1d ago
Step 01
02First comment
11/18/2025, 3:14:22 PM
5m after posting
Step 02
03Peak activity
61 comments in Hour 1
Hottest window of the conversation
Step 03
04Latest activity
11/19/2025, 12:48:54 PM
6h ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (1014 comments)

Showing 160 comments of 1014

sd9

1d ago

6 replies

How long does it typically take after this to become available on https://gemini.google.com/app ?

I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

mpeg

1d ago

1 reply

Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code

Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.

sd9

1d ago

Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar.

Romario77

1d ago

1 reply

It's available in cursor. Should be there pretty soon as well.

ionwake

1d ago

are you sure its available in cursor? ( I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment. )

netdur

1d ago

On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro

csomar

1d ago

It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e...

magicalhippo

1d ago

> https://gemini.google.com/app

How come I can't even see prices without logging in... they doing regional pricing?

Squarex

1d ago

Today I guess. They were not releasing the preview models this time and it seems the want to synchronize the release.

mil22

1d ago

5 replies

It's available to be selected, but the quota does not seem to have been enabled just yet.

"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."

"You've reached your rate limit. Please try again later."

Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.

misiti3780

1d ago

1 reply

seeing the same issue.

sottol

1d ago

you can bring your google api key to try it out, and google used to give $300 free when signing up for billing and creating a key.

when i signed up for billing via cloud console and entered my credit card, i got $300 "free credits".

i haven't thrown a difficult problem at gemini 3 pro it yet, but i'm sure i got to see it in some of the A/B tests in aistudio for a while. i could not tell which model was clearly better, one was always more succinct and i liked its "style" but they usually offered about the same solution.

sarreph

1d ago

Looks to be available in Vertex.

I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now.

lousken

1d ago

I hope some users will switch from cerebras to free up those resources

r0fl

1d ago

Works for me.

CjHuber

1d ago

For me it’s up and running. I was doing some work with AI Studio when it was released and reran a few prompts already. Interesting also that you can now set thinking level low or high. I hope it does something, in 2.5 increasing maximum thought tokens never made it think more

guluarte

1d ago

1 reply

it is live in the api

> gemini-3-pro-preview-ais-applets

> gemini-3-pro-preview

spudlyo

1d ago

Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name.

__jl__

1d ago

4 replies

API pricing is up to $2/M for input and $12/M for output

For comparison: Gemini 2.5 Pro was $1.25/M for input and $10/M for output Gemini 1.5 Pro was $1.25/M for input and $5/M for output

jhack

1d ago

1 reply

With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5.

xnx

1d ago

1 reply

There's a waitlist for using Gemini 3 for Gemini CLI free users: https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...

eevmanu

1d ago

In case anyone wants to confirm if this link is official, it is.

https://goo.gle/enable-preview-features

-> https://github.com/google-gemini/gemini-cli/blob/release/v0....

--> https://goo.gle/geminicli-waitlist-signup

---> https://docs.google.com/forms/d/e/1FAIpQLScQBMmnXxIYDnZhPtTP...

raincole

1d ago

1 reply

Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output.

brianjking

1d ago

2 replies

It is so impressive that Anthropic has been able to maintain this pricing still.

Aeolun

1d ago

1 reply

Because every time I try to move away I realize there’s nothing equivalent to move to.

Alex-Programs

1d ago

1 reply

People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste.

andybak

1d ago

1 reply

Taste in what?

js4ever

20h ago

Wines!

bottlepalm

1d ago

2 replies

Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me.

Honestly Google models have this mix of smart/dumb that is scary. Like if the universe is turned into paperclips then it'll probably be by a Google model.

epolanski

1d ago

Idk Anthropic has the least consistent models out there imho.

int_19h

20h ago

Well, it depends. Just recently I had Opus 4.1 spend 1.5 hours looking at 600+ sources while doing deep research, only to get back to me with a report consisting of a single sentence: "Full text as above - the comprehensive summary I wrote". Anthropic acknowledged that it was a problem on their side but refused to do anything to make it right, even though all I asked them to do was to adjust the counter so that this attempt doesn't count against their incredibly low limit.

fosterfriends

1d ago

Thrilled to see the cost is competitive with Anthropic.

dktp

20h ago

It's interesting that grounding with search cost changed from

* 1,500 RPD (free), then $35 / 1,000 grounded prompts

* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries

It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)

DeathArrow

1d ago

1 reply

It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh

rixed

1d ago

2025: solve the biking pelican problem

2026: cure cancer

GodelNumbering

1d ago

3 replies

And of course they hiked the API prices

Standard Context(≤ 200K tokens)

Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5)

Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5)

Long Context(> 200K tokens)

Input $4.00 vs $2.50 (same +60%)

Output $18.00 vs $15.00 (same +20%)

CjHuber

1d ago

3 replies

Is it the first time long context has separate pricing? I hadn’t encountered that yet

brianjking

1d ago

1 reply

Google has always done this.

CjHuber

1d ago

Ok wow then I‘ve always missed that.

1ucky

1d ago

Anthropic is also doing this for long context >= 200k Tokens on Sonnet 4.5

Topfi

1d ago

Google has been doing that for a while.

panarky

1d ago

Claude Opus is $15 input, $75 output.

xnx

18h ago

If the model solves your needs in fewer prompts, it costs less.

aliljet

1d ago

2 replies

When will this be available in the cli?

_ryanjsalva

1d ago

2 replies

Gemini CLI team member here. We'll start rolling out today.

evandena

1d ago

How about for Pro (not Ultra) subscribers?

aliljet

1d ago

This is the heroic move everyone is waiting for. Do you know how this will be priced?

Sammi

1d ago

I'm already seeing it in https://aistudio.google.com/

skerit

1d ago

1 reply

Not the preview crap again. Haven't they tested it enough? When will it be available in Gemini-CLI?

CjHuber

1d ago

Honestly I liked 2.5 Pro preview much more than the final version

prodigycorp

1d ago

14 replies

I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).

For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).

This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make some extrapolation about society, and what this means for others. Meanwhile.. I'm still wondering how they're still getting this problem wrong.

edit: I've a lot of good feedback here. I think there are ways I can improve my benchmark.

Filligree

1d ago

4 replies

What's the benchmark?

ahmedfromtunis

1d ago

1 reply

I don't think it would be a good idea to publish it on a prime source of training data.

Hammershaft

1d ago

3 replies

He could post an encrypted version and post the key with it to avoid it being trained on?

stefs

20h ago

Every AI corp has people reading HN.

benterix

1d ago

What makes you think it wouldn't end up in the training set anyway?

rs186

1d ago

I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today.

prodigycorp

1d ago

1 reply

nice try!

ankit219

1d ago

you already sent the prompt to gemini api - and they likely recorded it. So in a way they can access it anyway. Posting here or not would not matter in that aspect.

petters

1d ago

Good personal benchmarks should be kept secret :)

GuB-42

6h ago

NIBBLES.BAS maybe [1]

If you make some assumptions about the species of the snake, it can count as a basic python benchmark ;)

[1] https://en.wikipedia.org/wiki/Nibbles_(video_game)

mupuff1234

1d ago

1 reply

Could also just be rollout issues.

prodigycorp

1d ago

Could be. I'll reply to my comment later with pass/fail results of a re-run.

ddalex

1d ago

1 reply

I moved to using the model from python coding to golang coding and got incredible speedups in writing the correct version of the code

layer8

1d ago

Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up?

testartr

1d ago

1 reply

and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much

it's easy to focus on what they can't do

big-and-small

1d ago

1 reply

Everything is about context. When you just ask non-concrete task it's still have to parse your input and figure what is tic-tac-toe in this context and what exactly you expect it to do. This is why all "thinking".

Ask it to implement tic-tac-toe in Python for command line. Or even just bring your own tic-tac toe code.

Then make it imagine playing against you and it's gonna be fast and reliable.

testartr

20h ago

prompt was very concrete: draw a tic tac toe ASCII table and let's play. gemini 2.5 thought for pages particular moves

benterix

1d ago

3 replies

> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.

Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.

adastra22

1d ago

I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything.

Iulioh

1d ago

A lot of newer models are geared towards efficency and if you add the fact that more efficent models are trained on the output of less efficent (but more accurate) models....

GPT4/3o might be the best we will ever have

Archer6621

12h ago

I wonder whether it could be related to some kind of over-fitting, i.e. a prompting style that tends to work better with the older models, but performs worse with the newer ones.

WhitneyLand

1d ago

1 reply

>>benchmarks are meaningless

No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.

>>my fairly basic python benchmark

I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.

NaomiLehman

1d ago

3 replies

they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing.

I like to compare them using chathub using the same prompts

Gemini still calls me "the architect" in half of the prompts. It's very cringe.

mpalmer

18h ago

    Gemini still calls me "the architect" in half of the prompts. It's very cringe.

Can't say I've ever seen this in my own chats. Maybe it's something about your writing style?

beepbooptheory

22h ago

I get that one can perhaps have an intuition about these things, but doesn't this seem like a somewhat flawed attitude to have all things considered? That is, saying something to the effect of "well I know its not too sycophantic, no measurement needed, I have some special prompts of my own and it passed with flying colors!" just sounds a little suspect on first pass, even if its not like totally unbelievable I guess.

sothatsit

18h ago

It’s very different to get a “vibe check” for a model than to get an actual robust idea of how it works and what it can or can’t do.

This exact thing is why people strongly claimed that GPT-5 Thinking was strictly worse than o3 on release, only for people to change their minds later when they’ve had more time to use it and learn its strengths and weaknesses.

sosodev

1d ago

1 reply

How can you be sure that your benchmark is meaningful and well designed?

Is the only thing that prevents a benchmark from being meaningful publicity?

prodigycorp

1d ago

1 reply

I didn't tell you what you should think about the model. All I said is that you should have your own benchmark.

I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me.

gregsadetsky

1d ago

1 reply

I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs?

I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen?

Thanks

[0] https://news.ycombinator.com/item?id=45968665

adastra22

1d ago

1 reply

> if it's not public, presumably LLMs would never get better at them.

Why? This is not obvious to me at all.

gregsadetsky

1d ago

1 reply

You're correct of course - LLMs may get better at any task of course, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at the task. If the eval was actually picked up / used in the training loop, of course.

adastra22

1d ago

1 reply

That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem.

But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric.

Problem solving ability is largely not from the pretraining data.

gregsadetsky

1d ago

Yeah, great point.

I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization)

thefourthchime

1d ago

3 replies

I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.

ofa0e

1d ago

2 replies

Your benchmarks should not involve IP.

ComplexSystems

1d ago

2 replies

Why? This seems like a reasonable task to benchmark on.

ofa0e

1d ago

2 replies

Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls.

scragz

1d ago

1 reply

correction: pacman is not a human and has no soul.

WhyOhWhyQ

18h ago

Why do you have to willfully misinterpret the person you're replying to? There's truth in their comment.

tomalbrc

1d ago

tech bros hate reality

adastra22

1d ago

Because you hit guard rails.

sowbug

1d ago

1 reply

The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it.

bongodongobob

1d ago

1 reply

It's not an ethics thing. It's a guardrails thing.

sowbug

23h ago

That's a valid point, though an average LLM would certainly understand the difference between trademark other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls").

bitexploder

1d ago

1 reply

Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result.

amelius

22h ago

I thought this kind of chaining was already part of these systems.

Workaccount2

1d ago

It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive.

dekhn

1d ago

2 replies

Using a single custom benchmark as a metric seems pretty unreliable to me.

Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

prodigycorp

1d ago

2 replies

after taking a walk for a bit i decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.

This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.

While i still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.

my bad to the google team for the cursory brush off.

nomel

21h ago

1 reply

> This probably means my test is a little too niche.

> my python one needs to be down weighted or supplanted.

To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.

I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".

relaytheurgency

19h ago

I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.

chermi

23h ago

Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.

lofaszvanitt

13h ago

No, do not share it. The bigger black hole these models are in, the better.

luckydata

1d ago

1 reply

I'm dying to know what you're giving to it that's choking on. It's actually really impressive if that's the case.

nomel

21h ago

I find this hard to understand. I have AI completely choke on my code constantly. What are you doing where it performs so well? Web?

I constantly see failures in trivial vectors projections, broken bash scripts that don't properly quote variables (fail if space in filenames), and near completely inability to do relatively basic image processing tasks (if they don't rely on template matches).

I accidentally spent $50 on Gemeni 2.5 Pro last week, with Roo, trying to make a simple Mock interface for some lab equipment. The result: it asks permission to delete everything it did and start over...

m00dy

1d ago

that's why everyone using AI for code should code in rust only.

t0mas88

21h ago

Google reports a lower score for Gemini 3 Pro on SWEBench than Claude Sonnet 4.5, which is comparing a top tier model with a smaller one. Very curious to see whether there will be an Opus 4.5 that does even better.

Rover222

1d ago

curious if you tried grok 4.1 too

mring33621

1d ago

I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not.

I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.

IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.

This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!

nickandbro

1d ago

2 replies

What we have all been waiting for:

"Create me a SVG of a pelican riding on a bicycle"

https://www.svgviewer.dev/s/FfhmhTK1

Thev00d00

1d ago

2 replies

That is pretty impressive.

So impressive it makes you wonder if someone has noticed it being used a benchmark prompt.

burkaman

1d ago

2 replies

Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

jmmcd

1d ago

1 reply

"Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people.

BoorishBears

1d ago

The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task.

Like replacing named concepts with nonsense words in reasoning benchmarks.

ddalex

1d ago

https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari

rixed

1d ago

I have tried combinations of hard to draw vehicle and animals (crocodile, frog, pterodactly, riding a hand glider, tricycle, skydiving), and it did a rather good job in every cases (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalise.

bitshiftfaced

1d ago

1 reply

It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate.

xnx

1d ago

Very aero

golfer

1d ago

2 replies

Supposedly this is the model card. Very impressive results.

https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...

Also, the full document:

https://archive.org/details/gemini-3-pro-model-card/page/n3/...

tweakimp

1d ago

2 replies

Every time I see a table like this numbers go up. Can someone explain what this actually means? Is there just an improvement that some tests are solved in a better way or is this a breakthrough and this model can do something that all others can not?

rvnx

1d ago

1 reply

This is a list of questions and answers that was created by different people.

The questions AND the answers are public.

If the LLM manages through reasoning OR memory to repeat back the answer then they win.

The scores represent the % of correct answers they recalled.

tylervigen

22h ago

1 reply

That is not entirely true. At least some of these tests (like HLE and ARC) take steps to keep the evaluation set private so that LLMs can’t just memorize the answers.

You could question how well this works, but it’s not like the answers are just hanging out on the public internet.

slaterbug

16h ago

Excuse my ignorance, how do these companies evaluate their models against the evaluation set without access to it?

stavros

1d ago

I estimate another 7 months before models start getting 115% on Humanity's Last Exam.

HardCodedBias

1d ago

1 reply

If you believe another thread the benchmarks are comparing Gemini-3 (probably thinking) to GPT-5.1 without thinking.

The person also claims that with thinking on the gap narrows considerably.

We'll probably have 3rd party benchmarks in a couple of days.

iamdelirium

1d ago

This is easily shown that the numbers are for GPT 5.1 thinking high.

Just go to the leaderboard website and see for yourself: https://arcprize.org/leaderboard

santhoshr

1d ago

4 replies

Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png

mohsen1

1d ago

1 reply

Some time I think I should spend $50 on Upwork to get a real human artist to do it first to know what is that we're going for. What a good pelican riding a bicycle SVG is actually looking like?

AstroBen

1d ago

IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent?

robterrell

1d ago

1 reply

At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles.

notatoad

1d ago

1 reply

i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be.

imiric

1d ago

It would be next to impossible for anyone without insider knowledge to prove that to be the case.

Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well.

So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point.

bn-l

1d ago

1 reply

It’s a good pelican. Not great but good.

cubefox

23h ago

The blue lines indicating wind really sell it.

xnx

1d ago

4 replies

2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg...

knownjorbist

1d ago

1 reply

Did you notice that this embedded a Gemini API connection within the app itself? Or am I not understanding what that is?

xnx

1d ago

I hadn't! It looks like that is there to power the text box at the bottom of the app that allows for AI-powered changes to the scene.

agnosticmantis

1d ago

1 reply

This says Gemini 2.5 though.

xnx

23h ago

Good observation. The app was created with Gemini 3 Pro Preview, but the app calls out to Gemini 2.5 if you use the embedded prompt box.

Alex-Programs

1d ago

Incredible. Thanks for sharing.

nick32661123

23h ago

Great improvement by only adding one feedback prompt: Change the rotation axis of the wheels by 90 degrees in the horizontal plane. Same for the legs and arms

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

ttul

1d ago

4 replies

My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.

iagooar

1d ago

2 replies

What prompt do you use for that?

gregsadetsky

1d ago

1 reply

I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.

3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:

    [00:00] Greg: Hello.
    [00:01] X: You great?
    [00:02] Greg: Hi.
    [00:03] X: I'm X.
    [00:04] Y: I'm Y.
    ...

Super impressive!

HPsquared

1d ago

1 reply

Does it deduce everyone's name?

gregsadetsky

1d ago

It does! I redacted them, but yes. This was a 3-person call.

punnerud

1d ago

I made a simple webpage to grab text from YouTube videos: https://summynews.com Great for this kind of testing? (want to expand to other sources in the long run)

valtism

1d ago

1 reply

Parakeet TDT v3 would be really good at that

kridsdale3

19h ago

Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.

satvikpendem

1d ago

1 reply

I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.

trvz

12h ago

Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.

renegade-otter

1d ago

It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.

RobinL

1d ago

- Anyone have any idea why it says 'confidential'?

- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)

[Edit: working for me now in ai studio]

nilsingwersen

1d ago

Feeling great to see something confidential

samuelknight

1d ago

"Gemini 3 Pro Preview" is in Vertex

CjHuber

1d ago

Interesting that they added an option to select your own API key right in AI studio‘s input field. I sincerely hope the times of generous free AIstudio usage are not over

Der_Einzige

1d ago

When will they allow us to use modern LLM samplers like min_p, or even better samplers like top N sigma, or P-less decoding? They are provably SOTA and in some cases enable infinite temperature.

Temperature continues to be gated to maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off.

I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...

ponyous

1d ago

Can’t wait to test it out. Been running a tons of benchmarks (1000+ generations) for my AI to CAD model project and noticed:

- GPT-5 medium is the best

- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster

Really wonder how well Gemini 3 will perform

informal007

1d ago

It seem that Google doesn't prepare well to release Gemini 3 but leak many contents, include the model card early today and gemini 3 on aistudio.google.com

854 more comments available on Hacker News

View full discussion on Hacker News

ID: 45967211Type: storyLast synced: 11/19/2025, 7:26:56 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Read Article View on HN