Gemini 3 Flash
blog.google
Developer Blog: https://blog.google/technology/developers/build-with-gemini-...
Model Card [pdf]: https://deepmind.google/models/model-cards/gemini-3-flash/
Gemini 3 Flash in Search AI mode: https://blog.google/products/search/google-ai-mode-update-ge...
For example, the Gemini 3 Pro collection: https://blog.google/products/gemini/gemini-3-collection/
But having everything linked at the bottom of the announcement post itself would be really great too!
Also, I don't see it written in the blog post, but Flash supports more granular settings for reasoning: minimal, low, medium, high (like OpenAI models), while Pro only has low and high.
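For anyone wanting to try it, this is roughly what selecting a reasoning level looks like with the google-genai Python SDK. A sketch only: the thinking_level field on ThinkingConfig and the model id are assumptions based on the Gemini 3 API docs, not something stated in the blog post.

    # Sketch only: thinking_level on ThinkingConfig and the model id are assumptions
    # based on the Gemini 3 API docs, not confirmed by the announcement post.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment

    response = client.models.generate_content(
        model="gemini-3-flash-preview",  # Flash: minimal / low / medium / high
        contents="Summarize this changelog in three bullet points.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="minimal"),
        ),
    )
    print(response.text)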
Wasn't this the case with the 2.5 Flash models too? I remember being very confused at that time.
To me it seems like the big model has been "look what we can do", and the smaller model is "actually use this one though".
> Matches the “no thinking” setting for most queries. The model may think very minimally for complex coding tasks. Minimizes latency for chat or high throughput applications.
I'd prefer a hard "no thinking" rule than what this is.
I only assume "if you're not getting charged, you are the product" has to be somewhat in play here. But when working on open source code, I don't mind.
I tried to be quite clear with showing my work here. I agree that 17x is much closer to a single order of magnitude than two. But 60x is, to me, far enough of the way to 100x that yeah, I don't feel bad saying it's nearly two orders.
Otherwise, if it's a short prompt or answer, a SOTA (state of the art) model will be cheap anyway, and if it's a long prompt/answer, it's way more likely to be wrong and a lot more time is spent on "checking/debugging" any issue or hallucination, so again SOTA is better.
Or for any privacy/IP protection at all? There is zero privacy when using cloud-based LLMs.
What I do think is that I can't take seriously someone's opinion on an enterprise service's privacy after they write "LMAO" in caps lock in their post.
The second thing to consider is the whole geopolitical situation. I know companies in Europe are really reluctant to give US companies access to their internal data.
It's different if they proclaimed outright that they won't use it and then do.
Not that any of this would be right, but it wouldn't be a massive betrayal.
After reading your comment I ran my product benchmark against 2.5 flash, 2.5 pro and 3.0 flash.
The results are better AND the response times have stayed the same. What an insane gain - especially considering the price compared to 2.5 Pro. I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to see a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.
Also wondering, how did you get early access? I'm using the Gemini API quite a lot and have a quite nice internal benchmark suite for it, so would love to toy with the new ones as they come out.
Examples from the wild are a great learning tool, anything you’re able to share is appreciated.
For my product, I run a video through a multimodal LLM with multiple steps, combine data and spit out the outputs + score for the video.
I have a dataset of videos that I manually marked for my usecase, so when a new model drops, I run it + the last few best benchmarked models through the process, and check multiple things:
- Diff between the outputted score and the manual one
- Processing time for each step
- Input/output tokens
- Request time for each step
- Price of the request
And the classic stats: average score delta, average time, p50, p90, etc. Plus one fun thing, which is finding the edge cases: even if the average score delta is low (meaning it's spot-on), there are usually some videos where the absolute delta is higher, and these usually indicate niche edge cases the model might have.
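Mechanically the aggregation step is nothing fancy; a rough sketch of what it can look like (the per-video result fields are made up for illustration, not the actual schema):

    # Rough sketch of the comparison stage: for each manually scored video, keep the
    # model's score plus timing/token/cost data, then aggregate. Field names are illustrative.
    import statistics

    def summarize(results):
        # results: list of dicts like
        # {"manual_score": ..., "model_score": ..., "latency_s": ..., "cost_usd": ...}
        deltas = [abs(r["model_score"] - r["manual_score"]) for r in results]
        latencies = sorted(r["latency_s"] for r in results)

        def pct(values, p):  # crude percentile, good enough for a benchmark report
            return values[min(len(values) - 1, int(p * len(values)))]

        return {
            "avg_score_delta": statistics.mean(deltas),
            "avg_latency_s": statistics.mean(latencies),
            "p50_latency_s": pct(latencies, 0.50),
            "p90_latency_s": pct(latencies, 0.90),
            "total_cost_usd": sum(r["cost_usd"] for r in results),
            # the fun part: the videos where the model disagrees most with the manual label
            "worst_cases": sorted(results,
                                  key=lambda r: abs(r["model_score"] - r["manual_score"]),
                                  reverse=True)[:5],
        }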
Gemini 3 Flash nails it, sometimes even better than the Pro version, with nearly the same times as 2.5 Pro on that use case. Actually, pushed it to prod yesterday and, looking at the data, it seems it's 5 seconds faster than Pro on average, with my cost-per-user going down from 20 cents to 12 cents.
IMO it's pretty rudimentary, so let me know if there's anything else I can explain.
But pretty rudimentary, nothing special. Also did not know about deepwalker, looks quite interesting - you building it?
Anyone tried something similar already?
BTW: I have the same impression, Claude was working better for me for coding tasks.
I have not worked with Sonnet enough to give an opinion there.
/s
Abandoning our most useful sense, vision, is a recipe for a flop.
The amount of money sloshing around in these acquisitions makes you wonder what they're really for
I ask this question about Nazi Germany. They adopted the Blitzkrieg strategy and expanded unsustainably, but it was only a matter of time until powers with infinite resources (US, USSR) put an end to it.
Not saying that the Nazi strategy was without flaws, of course. But your specific critique is a bit too blunt.
The most obvious decision points were betraying the USSR and declaring war on the US (no one has really been able to pin down the reason, but presumably it was to get Japan to attack the Soviets from the other side, which, however, didn't happen). Another could have been to consolidate after the surrender/capitulation of France, rather than continuing to attack further.
Markets seem to be in a "show me the OpenAI money" mood at the moment.
And even financial commentators who don't necessarily know a thing about AI can realize that Gemini 3 Pro and now Gemini 3 Flash are giving ChatGPT a run for its money.
Oracle and Microsoft have other sources of revenue, but for those really drinking the OpenAI Kool-Aid, including OpenAI itself, I sure as heck don't know what the future holds.
My safe bet however is that Google ain't going anywhere and shall keep progressing on the AI front at an insane pace.
[0] At least the guys who publish where you or I can read them.
They always had the best talent, but with Brin at the helm, they also have someone with the organizational heft to drive them towards a single goal
Claude has been a coding model from the start, but GPT is more and more becoming a coding model too.
I hope open-source AI models catch up to Gemini 3 / Gemini 3 Flash. Or Google open-sources it, but let's be honest, Google isn't open-sourcing Gemini 3 Flash, and I guess the best bet in open source nowadays is probably GLM or DeepSeek Terminus, or maybe Qwen/Kimi too.
For me the bigger concern, which I have mentioned on other AI-related topics, is that AI is eating all the production of computer hardware, so we should be worried about hardware prices getting out of hand and making it harder for the general public to run open-source models. Hence I am rooting for China to reach parity on node size and crash PC hardware prices.
So I don't think we are on any sigmoid curve or anything like that. Though if you plot the performance of the best model available at any point in time against time on the x-axis, you might see a sigmoid curve, but that's a combination of the logarithm and the amount of effort people are willing to spend on making new models.
(I'm not sure about it specifically being the logarithm. Just any curve that has rapidly diminishing marginal returns that nevertheless never go to zero, ie the curve never saturates.)
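A toy way to see the composition argument (the effort schedule and the log are just stand-ins for "ramps up then levels off" and "diminishing but never-saturating returns"; all numbers are made up):

    # Toy illustration: per-model returns look like log(effort) - diminishing but never
    # saturating - and the apparent S-curve over calendar time comes from composing that
    # with how much effort the field pours in each year. Numbers are invented.
    import math

    def cumulative_effort(year):
        # assumed schedule: effort ramps up steeply around 2023, then levels off
        return 1e3 * (1 + math.tanh((year - 2023) / 2))

    def performance(effort):
        return math.log(1 + effort)  # never saturates, but marginal returns shrink fast

    for year in range(2018, 2031):
        print(year, round(performance(cumulative_effort(year)), 2))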
And now I am saying the same for gemini 3 flash.
I still feel the same way though: sure, there is an increase, but I somewhat believe that Gemini 3 is good enough and the returns on training from now on might not be worth that much, IMO. But I am not sure either, and I can be wrong; I usually am.
If Google released their weights today, it would technically be open weight; but I doubt you'd have an easy time running the whole Gemini system outside of Google's datacentres.
Enterprise will follow.
I don't see any distinction in target markets - it's the same market.
Also, I do not really use agentic tasks, but I am not sure whether Gemini 3 / 3 Flash have MCP support or skills support for agentic tasks.
If not, I feel like those are very low-hanging fruit and something Google could try in order to win the agentic-task market from Claude too, perhaps.
So far they seem faster with Flash, and with less corruption of files using the Edit tool - or at least it recovered faster.
> Google Gemini was never in the mix, never on the table and still isn’t. Every one of our engineers has 1k a month allocated in Claude tokens for Claude enterprise and Claude code.
Does that mean y'all never evaluated Gemini at all or just that it couldn't compete? I'd be worried that prior performance of the models prejudiced stats away from Gemini, but I am a Claude Code and heavy Anthropic user myself so shrug.
Pretty much every person in the first (and second) world is using AI now, and only a small fraction of those people are writing software. This is also reflected in OAI's report from a few months ago, which found programming to be only 4% of tokens.
This sounds like you live in a huge echo chamber. :-(
Apart from my very old grandmothers, I don't know anyone not using AI.
Just googling means you use AI nowadays.
Remember, really back in the day the A* search algorithm was part of AI.
If you had asked anyone in the 1970s about a box that, given a query, pinpoints the right document that answers it (aka Google search in the early 2000s), they definitely would have called it AI.
A lot of public religious imagery is very clearly AI generated, and you can find a lot of it on social media too. "I asked ChatGPT" is a common refrain at family gatherings. A lot of regular non-techie folks (local shopkeepers, the clerk at the gas station, the guy at the vegetable stand) have been editing their WhatsApp profile pictures using generative AI tools.
Some of my lawyer and journalist friends are using ChatGPT heavily, which is concerning. College students too. Bangalore is plastered with ChatGPT ads.
There's even a low-cost ChatGPT plan called ChatGPT Go you can get if you're in India (not sure if this is available in the rest of the world). It costs ₹399/mo or $4.41/mo, but it's completely free for the first year of use.
So yes, I'd say many people outside of tech circles are using AI tools. Even outside of wealthy first-world countries.
I periodically ask them questions about topics that are subtle or tricky, and somewhat niche, that I know a lot about, and find that they frequently provide extremely bad answers. There have been improvements on some topics, but there's one benchmark question that I have that just about every model I've tried has completely gotten wrong.
Tried it on LMArena recently, got a comparison between Gemini 2.5 flash and a codenamed model that people believe was a preview of Gemini 3 flash. Gemini 2.5 flash got it completely wrong. Gemini 3 flash actually gave a reasonable answer; not quite up to the best human description, but it's the first model I've found that actually seems to mostly correctly answer the question.
So, it's just one data point, but at least for my one fairly niche benchmark problem, Gemini 3 Flash has successfully answered a question that none of the others I've tried have (I haven't actually tried Gemini 3 Pro, but I'd compared various Claude and ChatGPT models, and a few different open weights models).
So, guess I need to put together some more benchmark problems, to get a better sample than one, but it's at least now passing a "I can find the answer to this in the top 3 hits in a Google search for a niche topic" test better than any of the other models.
Still a lot of things I'm skeptical about in all the LLM hype, but at least they are making some progress in being able to accurately answer a wider range of questions.
What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.
Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.
The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.
However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.
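The pattern is easy to sketch with a stand-in task (the task below is invented for illustration; the actual parlor-game-style benchmark is different, this just shows "generate thousands of cases, verify answers mechanically"):

    # Toy example of a programmatically generated, automatically verifiable benchmark.
    # It only demonstrates the generate-then-verify pattern, not the real benchmark.
    import random

    WORDS = ["apple", "pear", "plum", "fig", "kiwi", "mango",
             "grape", "peach", "lime", "date", "melon", "cherry"]

    def make_case(rng, n=8):
        words = rng.sample(WORDS, n)
        prompt = "Sort these words alphabetically, comma-separated: " + ", ".join(words)
        expected = ", ".join(sorted(words))
        return prompt, expected

    def check(model_answer: str, expected: str) -> bool:
        return model_answer.strip().lower() == expected.lower()

    rng = random.Random(42)
    cases = [make_case(rng) for _ in range(1000)]  # thousands of cases, trivially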
For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629
1. What is the purpose of the benchmark?
2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?
To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.
> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.
This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.
Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
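A trivial sketch of that strategy (metric names and weights are placeholders):

    # Composite "quality" score from several individually gameable metrics.
    # The weights are kept out of band (and can be rotated over time), so optimizing
    # any single metric doesn't reliably move the composite.
    SECRET_WEIGHTS = {
        "test_coverage": 0.35,
        "cyclomatic_complexity": -0.25,   # lower is better, hence the negative weight
        "review_comment_density": 0.15,
        "duplication_ratio": -0.25,
    }

    def composite_quality(metrics: dict[str, float]) -> float:
        return sum(SECRET_WEIGHTS[name] * metrics.get(name, 0.0) for name in SECRET_WEIGHTS)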
Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!
This is the second reason I find the idea of publicly discussing secret benchmarks silly.
Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of some of this massive dataset they've had for years.
I guess they get such a large input of queries that they can only realistically check and therefore use a small fraction? Though maybe they've come up with some clever trick to make use of it anyway?
I'll need to find a new one, or actually put together a set of questions to use instead of just a single benchmark.
You don't train on your test data because you need it to check whether training is improving things or not.
"When was the last time England beat Scotland at rugby union"
new variant "Without using search when was the last time England beat Scotland at rugby union"
It is amazing how bad ChatGPT is at this question, and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web, so this is _hard_ for it - but how badly it reports the answer. Starting from the small stuff: it almost always reports the wrong year, wrong location and wrong score - that's the boring facts stuff that I would expect it to stumble on. It often creates details of matches that didn't exist - cool, standard hallucinations. But even within the text it generates itself, it cannot keep things consistent with how reality works. It often reports draws as wins for England. It frequently states that the team it just said scored the most points lost the match, etc.
It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.
So I want to have a general idea of how good it is at this.
I found something that was niche, but not super niche; I could easily find a good, human written answer in the top couple of results of a Google search.
But until now, all LLM answers I've gotten for it have been complete hallucinated gibberish.
Anyhow, this is a single data point, I need to expand my set of benchmark questions a bit now, but this is the first time that I've actually seen progress on this particular personal benchmark.
They probably use an old Flash Lite model, something super small, and just summarize the search...
After all, it's the same search engine team that didn't care about its search results - its main draw - actively going to shit for over a decade.
Get an API key and try to use it for classification of text or classification of images. If you have an Excel file with 10k somewhat random-looking entries that you want to classify, or filter down to the 10 that are important to you, use an LLM.
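Roughly what that looks like in practice - a sketch with the google-genai Python SDK; the column name, label set and model id are placeholders, and pandas/openpyxl are assumed to be installed:

    # Sketch: classify each row of a spreadsheet with an LLM, then filter.
    # "description", the label set, and the model id are illustrative placeholders.
    import pandas as pd
    from google import genai

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    df = pd.read_excel("entries.xlsx")  # needs openpyxl

    def classify(text: str) -> str:
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",
            contents=("Classify this entry as exactly one of: "
                      "relevant, irrelevant, unclear.\n\n" + text),
        )
        return resp.text.strip().lower()

    df["label"] = df["description"].map(classify)  # one call per row; batch in practice
    important = df[df["label"] == "relevant"].head(10)
    print(important)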
Get it to do audio transcription. You can now just talk and it will take notes for you at a level that was not possible earlier; without training on someone's voice, it can handle anyone's voice.
Fixing up text is of course also big.
Data classification is easy for an LLM. Data transformation is a bit harder but still great. Creating new data is hard, so for things like answering questions where it has to generate stuff from thin air, it will hallucinate like a madman.
The tasks that LLMs are good at are used in the background by people creating actually useful software on top of LLMs, but those problems are not seen by the general public, who see a chat box.
Maybe the scale is different with genAI and there are some painful learnings ahead of us.
I know that without the ability to search it's very unlikely the model actually has accurate "memories" about these things; I just hope one day they will actually know that their "memory" is bad or non-existent and will tell me so instead of hallucinating something.
Basically, making sense of unstructured data is super cool. I can get 20 people to write an answer however they feel like it, and the model can convert it to structured data - something I would otherwise have to spend time on, or I would have to make a form with mandatory fields that annoy the audience.
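A minimal sketch of that unstructured-to-structured step, using the google-genai SDK's structured-output support (the schema fields are made up for illustration, and the model id is a placeholder):

    # Sketch: turn free-form replies into rows with a fixed schema.
    # The Answer fields are invented; response_mime_type/response_schema follow the
    # SDK's documented structured-output options, and the model id is illustrative.
    from pydantic import BaseModel
    from google import genai
    from google.genai import types

    class Answer(BaseModel):
        name: str
        attends: bool
        dietary_notes: str

    client = genai.Client()

    def parse_answer(free_text: str) -> Answer:
        resp = client.models.generate_content(
            model="gemini-3-flash-preview",
            contents="Extract the fields from this reply:\n\n" + free_text,
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Answer,
            ),
        )
        return resp.parsed  # SDK parses the JSON into the pydantic model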
I am already building useful tools with the help of models. Asking tricky or trivia questions is fun and games. There are much more interesting ways to use AI.
So I think LLMs can be good for finding niche info.
Which also implies that (for most tasks) most of the weights in an LLM are unnecessary, since they are spent on memorizing the long tail of Common Crawl... but maybe memorizing infinite trivia is not a bug but actually required for the generalization to work? (Humans don't have far transfer though... do transformers have it?)
Kinda sounds like you're testing two things at the same time then, right? The knowledge of the thing (was it in the training data and was it memorized?) and the understanding of the thing (can they explain it properly even if you give them the answer in context).
Today I had to resolve performance problems with a SQL Server statement. I've been doing this for years, know the regular pitfalls, and sometimes have to find the "right" words to explain to the customer why X is bad and such.
I described the issue to GPT5.2, gave the query, the execution plan and asked for help.
It was spot on: high-quality responses, actionable items and explanations of why this or that is bad, how to improve it, and why SQL Server in particular may have generated such a query plan. I could instantly validate the response given my experience in the field. I even answered the customer with some parts of ChatGPT's output because of how well it explained things. However, I did mention that to the customer and told them I approved the answer.
Ask a high-quality question and receive a high-quality answer. And I am happy that I found out about a SQL Server flag with which I can influence that particular decision. But the suggestion was not limited to that; there were multiple points given that would help.
no such thing