Gemini 3 Flash
blog.google
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
For example, the Gemini 3 Pro collection: https://blog.google/products/gemini/gemini-3-collection/
But having everything linked at the bottom of the announcement post itself would be really great too!
Also, I don't see it written in the blog post, but Flash supports more granular settings for reasoning: minimal, low, medium, high (like OpenAI models), while Pro only has low and high.
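For anyone wanting to try it, this is roughly what selecting a reasoning level looks like with the google-genai Python SDK. A sketch only: the thinking_level field on ThinkingConfig and the model id are assumptions based on the Gemini 3 API docs, not something stated in the blog post.

    # Sketch only: thinking_level on ThinkingConfig and the model id are assumptions
    # based on the Gemini 3 API docs, not confirmed by the announcement post.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3-flash-preview",  # Flash: minimal / low / medium / high
        contents="Summarize this changelog in three bullet points.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="minimal"),
        ),
    )
    print(response.text)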
Wasn't this the case with the 2.5 Flash models too? I remember being very confused at that time.
To me it seems like the big model has been "look what we can do", and the smaller model is "actually use this one though".
> Matches the “no thinking” setting for most queries. The model may think very minimally for complex coding tasks. Minimizes latency for chat or high throughput applications.
I'd prefer a hard "no thinking" rule than what this is.
I only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
I tried to be quite clear with showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, far enough of the way to 100x that yeah, I don't feel bad saying it's nearly two orders.
Otherwise, if it's a short prompt or answer, a SOTA (state of the art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong and a lot more time is spent on "checking/debugging" any issue or hallucination, so again SOTA is better.
Or for any privacy/IP protection at all? There is zero privacy when using cloud-based LLMs.
What I do think is that I can't take seriously someone's opinion on an enterprise service's privacy after they write "LMAO" in caps lock in their post.
The second thing to consider is the whole geopolitical situation. I know companies in Europe are really reluctant to give US companies access to their internal data.
It's different if they proclaimed outright that they won't use it and then do.
Not that any of this would be right, but it wouldn't be a massive betrayal.
After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.
The results are better AND the response times have stayed the same. What an insane gain - especially considering the price compared to 2.5 Pro. I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to see a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.
Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have a quite nice internal benchmark suite for it, so would love to toy with the new ones as they come out.
Examples from the wild are a great learning tool, anything you’re able to share is appreciated.
For my product, I run a video through a multimodal LLM with multiple steps, combine data and spit out the outputs + score for the video.
I have a dataset of videos that I manually marked for my usecase, so when a new model drops, I run it + the last few best benchmarked models through the process, and check multiple things:
- Diff between the outputted score and the manual one
- Processing time for each step
- Input/output tokens
- Request time for each step
- Price of the request
And the classic stats: average score delta, average time, p50, p90, etc. Plus one fun thing, which is finding the edge cases: even if the average score delta is low (meaning it's spot-on), there are usually some videos where the absolute delta is higher, and these usually indicate niche edge cases the model might have.
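Mechanically the aggregation step is nothing fancy; a rough sketch of what it can look like (the per-video result fields are made up for illustration, not the actual schema):

    # Rough sketch of the comparison stage: for each manually scored video, keep the
    # model's score plus timing/token/cost data, then aggregate. Field names are illustrative.
    import statistics

    def summarize(results):
        # results: list of dicts like
        # {"manual_score": ..., "model_score": ..., "latency_s": ..., "cost_usd": ...}
        deltas = [abs(r["model_score"] - r["manual_score"]) for r in results]
        latencies = sorted(r["latency_s"] for r in results)

        def pct(values, p):  # crude percentile, good enough for a benchmark report
            return values[min(len(values) - 1, int(p * len(values)))]

        return {
            "avg_score_delta": statistics.mean(deltas),
            "avg_latency_s": statistics.mean(latencies),
            "p50_latency_s": pct(latencies, 0.50),
            "p90_latency_s": pct(latencies, 0.90),
            "total_cost_usd": sum(r["cost_usd"] for r in results),
            # the fun part: the videos where the model disagrees most with the manual label
            "worst_cases": sorted(results,
                                  key=lambda r: abs(r["model_score"] - r["manual_score"]),
                                  reverse=True)[:5],
        }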
Gemini 3 Flash nails it, sometimes even better than the Pro version, with nearly the same times as 2.5 Pro on that use case. Actually, pushed it to prod yesterday and, looking at the data, it seems it's 5 seconds faster than Pro on average, with my cost-per-user going down from 20 cents to 12 cents.
IMO it's pretty rudimentary, so let me know if there's anything else I can explain.
But pretty rudimentary, nothing special. Also did not know about deepwalker, looks quite interesting - you building it?
Anyone tried something similar already?
BTW: I have the same impression, Claude was working better for me for coding tasks.
I have not worked with Sonnet enough to give an opinion there.
/s
Abandoning our most useful sense, vision, is a recipe for a flop.
The amount of money sloshing around in these acquisitions makes you wonder what they're really for
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pin down the reason, but presumably it was to get Japan to attack the Soviets from the other side, which, however, didn't happen). Another could have been to consolidate after the surrender/capitulation of France, rather than continuing to attack further.
Markets seem to be in a "show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI Kool-Aid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
[0] At least the guys who publish where you or I can read them.
They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal
Claude has been a coding model from the start, but GPT is more and more becoming a coding model too.
I hope open-source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open-sources it, but let's be honest, Google isn't open-sourcing Gemini 3 Flash, and I guess the best bet in open source nowadays is probably GLM or DeepSeek Terminus, or maybe Qwen/Kimi too.
For me the bigger concern, which I have mentioned on other AI-related topics, is that AI is eating all the production of computer hardware, so we should be worried about hardware prices getting out of hand and making it harder for the general public to run open-source models. Hence I am rooting for China to reach parity on node size and crash PC hardware prices.
So I don't think we are on any sigmoid curve or anything like that. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.
(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)
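A toy way to see the composition argument (the effort schedule and the log are just stand-ins for "ramps up then levels off" and "diminishing but never-saturating returns"; all numbers are made up):

    # Toy illustration: per-model returns look like log(effort) - diminishing but never
    # saturating - and the apparent S-curve over calendar time comes from composing that
    # with how much effort the field pours in each year. Numbers are invented.
    import math

    def cumulative_effort(year):
        # assumed schedule: effort ramps up steeply around 2023, then levels off
        return 1e3 * (1 + math.tanh((year - 2023) / 2))

    def performance(effort):
        return math.log(1 + effort)  # never saturates, but marginal returns shrink fast

    for year in range(2018, 2031):
        print(year, round(performance(cumulative_effort(year)), 2))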
And now I am saying the same for gemini 3 flash.
I still feel the same way though: sure, there is an increase, but I somewhat believe that Gemini 3 is good enough and the returns on training from now on might not be worth that much, IMO. But I am not sure either, and I can be wrong; I usually am.
If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.
Enterprise will follow.
I don't see any distinction in target markets - it's the same market.
Also, I do not really use agentic tasks, but I am not sure whether Gemini 3 / 3 Flash have MCP support or skills support for agentic tasks.
If not, I feel like those are very low-hanging fruit and something Google could try in order to win the agentic-task market from Claude too, perhaps.
So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.
> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
Does that mean y'all never evaluated Gemini at all or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced stats away from Gemini, but I am a Claude Code and heavy Anthropic user myself so shrug.
Pretty much every person in the first (and second) world is using AI now, and only a small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago, which found programming to be only 4% of tokens.
This sounds like you live in a huge echo chamber. :-(
Apart from my very old grandmothers, I don't know anyone not using AI.
Just googling means you use AI nowadays.
Remember, really back in the day the A* search algorithm was part of AI.
If you had asked anyone in the 1970s about a box that, given a query, pinpoints the right document that answers it (aka Google search in the early 2000s), they definitely would have called it AI.
A lot of public religious imagery is very clearly AI generated, and you can find a lot of it on social media too. "I asked ChatGPT" is a common refrain at family gatherings. A lot of regular non-techie folks (local shopkeepers, the clerk at the gas station, the guy at the vegetable stand) have been editing their WhatsApp profile pictures using generative AI tools.
Some of my lawyer and journalist friends are using ChatGPT heavily, which is concerning. College students too. Bangalore is plastered with ChatGPT ads.
There's even a low-cost ChatGPT plan called ChatGPT Go you can get if you're in India (not sure if this is available in the rest of the world). It costs ₹399/mo or $4.41/mo, but it's completely free for the first year of use.
So yes, I'd say many people outside of tech circles are using AI tools. Even outside of wealthy first-world countries.
I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.
Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.
So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).
So, guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing a "I can find the answer to this in the top 3 hits in a Google search for a niche topic" test better than any of the other models.
Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.
However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
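The pattern is easy to sketch with a stand-in task (the task below is invented for illustration; the actual parlor-game-style benchmark is different, this just shows "generate thousands of cases, verify answers mechanically"):

    # Toy example of a programmatically generated, automatically verifiable benchmark.
    # It only demonstrates the generate-then-verify pattern, not the real benchmark.
    import random

    WORDS = ["apple", "pear", "plum", "fig", "kiwi", "mango",
             "grape", "peach", "lime", "date", "melon", "cherry"]

    def make_case(rng, n=8):
        words = rng.sample(WORDS, n)
        prompt = "Sort these words alphabetically, comma-separated: " + ", ".join(words)
        expected = ", ".join(sorted(words))
        return prompt, expected

    def check(model_answer: str, expected: str) -> bool:
        return model_answer.strip().lower() == expected.lower()

    rng = random.Random(42)
    cases = [make_case(rng) for _ in range(1000)]  # thousands of cases, trivially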
For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629
1. What is the purpose of the benchmark?
2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?
To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.
> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
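A trivial sketch of that strategy (metric names and weights are placeholders):

    # Composite "quality" score from several individually gameable metrics.
    # The weights are kept out of band (and can be rotated over time), so optimizing
    # any single metric doesn't reliably move the composite.
    SECRET_WEIGHTS = {
        "test_coverage": 0.35,
        "cyclomatic_complexity": -0.25,   # lower is better, hence the negative weight
        "review_comment_density": 0.15,
        "duplication_ratio": -0.25,
    }

    def composite_quality(metrics: dict[str, float]) -> float:
        return sum(SECRET_WEIGHTS[name] * metrics.get(name, 0.0) for name in SECRET_WEIGHTS)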
Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!
This is the second reason I find the idea of publicly discussing secret benchmarks silly.
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
You don't train on your test data because you need it to check whether training is improving things or not.
"When was the last time England beat Scotland at rugby union"
new variant "Without using search when was the last time England beat Scotland at rugby union"
It is amazing how bad ChatGPT is at this question, and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web, so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff: it almost always reports the wrong year, wrong location and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist - cool, standard hallucinations. But even within the text it generates itself, it cannot keep things consistent with how reality works. It often reports draws as wins for England. It frequently states that the team it just said scored the most points lost the match, etc.
It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
They probably use an old Flash Lite model, something super small, and just summarize the search...
After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.
Get an API key and try to use it for classification of text or classification of images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that are important to you, use an LLM.
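Roughly what that looks like in practice - a sketch with the google-genai Python SDK; the column name, label set and model id are placeholders, and pandas/openpyxl are assumed to be installed:

    # Sketch: classify each row of a spreadsheet with an LLM, then filter.
    # "description", the label set, and the model id are illustrative placeholders.
    import pandas as pd
    from google import genai

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    df = pd.read_excel("entries.xlsx")  # needs openpyxl

    def classify(text: str) -> str:
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",
            contents=("Classify this entry as exactly one of: "
                      "relevant, irrelevant, unclear.\n\n" + text),
        )
        return resp.text.strip().lower()

    df["label"] = df["description"].map(classify)  # one call per row; batch in practice
    important = df[df["label"] == "relevant"].head(10)
    print(important)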
Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier; without training on someone's voice, it can handle anyone's voice.
Fixing up text is of course also big.
Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air, it will hallucinate like a madman.
The tasks that LLMs are good at are used in the background by people creating actually useful software on top of LLMs, but those problems are not seen by the general public, who see a chat box.
Maybe the scale is different with genAI and there are some painful learnings ahead of us.
I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things; I just hope one day they will actually know that their "memory" is bad or non-existent and will tell me so instead of hallucinating something.
Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it, and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
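A minimal sketch of that unstructured-to-structured step, using the google-genai SDK's structured-output support (the schema fields are made up for illustration, and the model id is a placeholder):

    # Sketch: turn free-form replies into rows with a fixed schema.
    # The Answer fields are invented; response_mime_type/response_schema follow the
    # SDK's documented structured-output options, and the model id is illustrative.
    from pydantic import BaseModel
    from google import genai
    from google.genai import types

    class Answer(BaseModel):
        name: str
        attends: bool
        dietary_notes: str

    client = genai.Client()

    def parse_answer(free_text: str) -> Answer:
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",
            contents="Extract the fields from this reply:\n\n" + free_text,
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Answer,
            ),
        )
        return resp.parsed  # SDK parses the JSON into the pydantic model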
I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.
So I think LLMs can be good for finding niche info.
Which also implies that (for most tasks) most of the weights in an LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).
Today I had to resolve performance problems with a SQL Server statement. I've been doing this for years, know the regular pitfalls, and sometimes have to find the "right" words to explain to the customer why X is bad and such.
I described the issue to GPT5.2, gave the query, the execution plan and asked for help.
It was spot on: high-quality responses, actionable items and explanations of why this or that is bad, how to improve it, and why SQL Server in particular may have generated such a query plan. I could instantly validate the response given my experience in the field. I even answered the customer with some parts of ChatGPT's output because of how well it explained things. However, I did mention that to the customer and told them I approved the answer.
Ask a high-quality question and receive a high-quality answer. And I am happy that I found out about a SQL Server flag with which I can influence that particular decision. But the suggestion was not limited to that; there were multiple points given that would help.
no such thing