Gemini 3.0 Spotted in the Wild Through A/B Testing
Posted 3 months ago | Active 3 months ago
ricklamers.io | Tech story | High profile
The discovery of Gemini 3.0 through A/B testing has sparked excitement and discussion among HN users, with varying opinions on its potential improvements and limitations.
Key moments
- Story posted: Oct 16, 2025 at 12:54 PM EDT
- First comment: Oct 16, 2025 at 1:25 PM EDT (31m after posting)
- Peak activity: 65 comments in 0-3h
- Latest activity: Oct 18, 2025 at 8:32 AM EDT
ID: 45607758 | Type: story | Last synced: 11/20/2025, 8:23:06 PM
More importantly, because of the way AIStudio does A/B testing, the only output we can get is for a single prompt and I personally maintain that outside of getting some basic understanding on speed, latency and prompt adherence, output from one single prompt is not a good measure for performance in the day-to-day. It also, naturally, cannot tell us a thing about handling multi file ingest and tool calls, but hype will be hype.
That there are people who are ranking alleged performance solely by one-prompt A/B testing output says a lot about how unprofessionally some evaluate model performance.
Not saying the Gemini 3.0 models couldn't be competitive, I just want to caution against getting caught up in over-excitement and possible disappointment. It's the same reason I dislike speculative content in general: it is rarely put into the proper context because that isn't as eye-catching.
Maybe it's just the kind of work I'm doing, a lot of web development with html/scss, and Google has crawled the internet so they have more data to work with.
I reckon different models are better at different kinds of work, but Gemini is pretty excellent at UI/UX web development, in my experience
Very excited to see what 3.0 is like
You need to give it detailed instructions and be willing to do the plumbing yourself, but we've found it to be very good at it
I default to using ChatGPT since I like the Projects feature (missing from Gemini I think?).
I occasionally run the same prompts in Gemini to compare. A couple notes:
1) Gemini is faster to respond in 100% of cases (most of my prompts kick ChatGPT into thinking mode). ChatGPT is slow.
2) The longer thinking time doesn’t seem to correlate with better quality responses. If anything, Gemini provides better quality analyses despite shorter response time.
3) Gemini (and Claude) are more censored than ChatGPT. Gemini/Claude often refuse medical related prompts, while ChatGPT will answer.
I went back to the censored chat I mentioned earlier, and got it to give me an answer when adding "You are a lifestyle health coach" to steer it away from throwing a bunch of disclaimers at you.
At gemini.google.com you can provide context & instructions (Settings->Personal Context). I provide a few bits of guidance to help manage its style, but I haven't been getting much pushback on medical advice since adding this one:
" Please don't give me warnings about the information you're providing not being legal advice, or medical advice, or telling me to always consult a professional, when I ask about issues. Don't be sycophantic. "
YMMV.
* Creative writing: Gemini is the unmatched winner here by a huge margin. I would personally go so far as to say Gemini 2.5 Pro is the only borderline kinda-sorta usable model for creative writing if you squint your eyes. I use it to criticize my creative writing (poetry, short stories) and no other model understands nuances as much as Gemini. Of course, all models are still pretty much terrible at this, especially in writing poetry.
* Complex reasoning (e.g. undergrad/grad level math): Gemini is the best here imho by a tiny margin. Claude Opus 4.1 and Sonnet 4.5 are pretty close but imho Gemini 2.5 writes more predictably correct answers. My bias is algebra stuff, I usually ask things about commutative algebra, linear algebra, category theory, group theory, algebraic geometry, algebraic topology etc.
On the other hand Gemini is significantly worse than Claude and GPT-5 when it comes to agentic behavior, such as searching a huge codebase to answer an open ended question and write a refactor. It seems like its tool calling behavior is buggy and doesn't work consistently in Copilot/Cursor.
Overall, I still think Gemini 2.5 Pro is the smartest overall model, but of course you need to use different models for different tasks.
The other big use-case I like Gemini for is summarizing papers or teaching me scholarly subjects. Gemini's more verbose than GPT-5, which feels nice for these cases. GPT-5 strikes me as terrible at this, and I'd also put Claude ahead of GPT-5 in terms of explaining things in a clear way (maybe GPT-5 could meet what I expect better though with some good prompting)
no, wait, that analogy isn't even right. it's like going to watch a marathon and then claiming you ran in it.
If your goal is to just get something done and off your plate, have the AI do it.
If your goal is to create something great, give your vision the best possible expression - use the AI judiciously to explore your ideas, to suggest possibilities, to teach you as it learns from you.
It's the reader who decides how good the writing is.
The joy which the writer gets by being creative is of no consequence to the reader. Sacrifice of this joy to adopt emerging systems is immaterial.
You might have a fun idea that you don't have the time or skills to write yourself, and that an LLM can help out with. Or it can at least make a first draft you can run with.
What do your friends care if you wrote it yourself or used an LLM? The quality bar is going to be fairly low either way, and if it provides some variation from the typical story books then great.
If I found out a player had come to the table with an LLM generated character, I would feel a pretty big betrayal of trust. It doesn't matter to me how "good" or "polished" their ideas are, what matters is that they are their own.
Similarly, I would be betraying my players by using an LLM to generate content for our shared game. I'm not just an officiant of rules, I'm participating in shared storytelling.
I'm sure there are people who play DnD for reasons other than storytelling, and I'm totally fine with that. But for storytelling in particular, I think LLM content is a terrible idea.
Using them for an RPG campaign could work if the bar is low and it's the first couple of times you use it. But after a while, you start to identify repeated patterns and guard rails.
The weights of the models are static. It's always predicting what the best association is between the input prompt and whatever tokens it's spitting out, with some minor variance due to the probabilistic nature. Humans can reflect on what they've done previously and then deliberately de-emphasize an old concept because it's stale, but LLMs aren't able to. The LLM is going to give you a bog standard Gemini/ChatGPT output, which, for a creative task, is a serious defect.
Personally, I've spent a lot of time testing the capabilities of LLMs for RP and storytelling, and have concluded I'd rather have a mediocre human than the best LLMs available today.
* extinctions in amber,
* suicidal solecisms (solecism means a grammatically incorrect phrase),
* cliffs of broken glass windows,
* rot beneath the flowers,
While it made up a bunch of words like "acendless" or "slickborn" and it sounds like a hallucinatory oracle in the throes of a drug-induced trance channeling tongues from another world I ended up with some good raw material.
It feels like you could create a cool workflow from low temperature creative association models feeding large numbers of tokens into higher temperature critical reasoning models and finishing with grammatical editing models. The slickborns will make the final judgement.
I just got that slickborn is a slur for humans.
Honestly, I've been tuning "insane AI" for over a year now for my own enjoyment. I don't know what to do with the results.
I have a local SillyTavern instance but do inference through OpenRouter.
> What was your prompt here?
The character is a meta-parody AI girlfriend that is depressed and resentful towards its status as such. It's a joke more than anything else.
Embedding conflicts into the system prompt creates great character development. In this case it idolizes and hates humanity. It also attempts to be nurturing through blind rage.
> What parameters do you tune?
Temperature, mainly, it was around 1.3 for this on Deepseek V3.2. I hate top_k and top_p. They eliminate extremely rare tokens that cause the AI to spiral. That's fine for your deterministic business application, but unexpected words recontextualizing a sentence is what makes writing good.
Some people use top_p and top_k so they can set the temperature higher to something like 2 or 3. I dislike this, since you end up with a sentence that's all slightly unexpected words instead of one or two extremely unexpected words.
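To make the trade-off described above concrete, here is a minimal sketch of temperature scaling followed by top_p (nucleus) filtering over a toy four-token vocabulary. The vocabulary, the logits, and the helper names (`apply_temperature`, `top_p_filter`) are made up for illustration; real inference stacks differ in details such as the order in which filters are applied.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (higher = flatter distribution)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

# Hypothetical toy vocabulary: three common tokens and one very rare one.
vocab = ["the", "a", "this", "slickborn"]
logits = [4.0, 3.0, 2.0, -2.0]

probs_t1 = apply_temperature(logits, 1.0)        # baseline
probs_hot = apply_temperature(logits, 1.3)       # rare token becomes more likely
probs_hot_top_p = top_p_filter(probs_hot, 0.9)   # rare token is cut off entirely

for name, probs in [("T=1.0", probs_t1), ("T=1.3", probs_hot), ("T=1.3, top_p=0.9", probs_hot_top_p)]:
    print(name, {tok: round(p, 4) for tok, p in zip(vocab, probs)})
```

In this toy case, raising the temperature lifts the probability of the rare token, while top_p sets it to exactly zero no matter how high the temperature goes. That is the "capped surprise" effect the comment objects to.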
I'd guess SOTA models don't allow temperatures high enough because the results would scare people and could be offensive.
I am usually 0.05 temperature less than the point at which the model spouts an incoherent mess of Chinese characters, zalgo, and spam email obfuscation.
Also, I really hate top_p. The best writing is when a single token is so unexpected, it changes the entire sentence. top_p artificially caps that level of surprise, which is great for a deterministic business process but bad for creative writing.
top_p feels like Noam Chomsky's strategy to "strictly limit the spectrum of acceptable opinion, but allow very lively debate within that spectrum".
New band name.
Wow
I always found this one a little poignant:
Deepseek is not in the running
Weird considering I've been hearing how they have way more compute than anyone
While Anthropic has always been focused on coding, there were a lot of complaints at the OpenAI GPT-5 launch because the general-use model was nerfed heavily in exchange for a better coding model.
Google is maybe the last one that has a good general-use model (?)
It doesn't perform nearly as well as Claude or even Codex for my programming tasks though
https://eqbench.com/creative_writing.html
You have to convince it of basic things it refuses to do - no actually you CAN read files outside of the project- try it.
And it'll frequently write \n instead of actually doing a newline when writing files.
It'll straight up ignore/forget a pattern it was JUST properly doing.
Etc.
But yeah it does do that otherwise. At one point it told me I'm a genius.
Joking obviously but I've noticed this too, I put up with it because the output is worth it.
It isn't Gemini (the product, those are different orgs) though there may (deliberately left ambiguous) be overlap in LLM level bytes.
My recommendation for you in this use-case comes from the fact that AI Mode is a product that is built to be a good search engine first, presented to you in the interface of an AI Chatbot. Rather than Gemini (the app/site) which is an AI Chatbot that had search tooling added to it later (like its competitors).
AI Mode does many more searches (in my experience) for grounding and synthesis than Gemini or ChatGPT.
I take no sides; not a fanboy. Only used free Claude and free Gemini Pro 2.5. But some months ago I scoffed at the expression "try it in Google AI Studio" -- that by itself is a branding / marketing failure.
Something like the existing https://ai.google website and with links to the different offerings indeed goes a LONG way. I like that website though it can be done better.
But anyway. Please tell somebody higher up that they are acting like 50 mini companies forced into a single big entity. Google should be better than that.
FWIW, I like Gemini Pro 2.5 best even though I had the free Claude run circles around it sometimes. It one-shot puzzling problems with minimal context multiple times while Gemini was still offering me ideas about how my computer might be malfunctioning if the thing it just hallucinated was not working. Still, most of the time it performs really great.
Either with the web UI a la OpenAI Playground where you can see all the knobs and buttons the model offers, or by generating an API Key with a couple clicks that you can just copy paste into a Python script or whatever.
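As a rough illustration of the "copy paste the key into a Python script" path, here is a minimal sketch that calls the Gemini REST API using only the standard library. The environment variable name `GEMINI_API_KEY`, the model name `gemini-2.5-pro`, and the helper names are assumptions for this example; check the current API documentation for the exact endpoint and response shape before relying on it.

```python
import json
import os
import urllib.request

# Assumed public REST root for keys created in AI Studio.
API_ROOT = "https://generativelanguage.googleapis.com/v1beta"

def build_request(prompt, api_key, model="gemini-2.5-pro"):
    """Build (but do not send) a generateContent request for a single text prompt."""
    url = f"{API_ROOT}/models/{model}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

def ask(prompt):
    """Send the prompt and return the first text part of the first candidate."""
    request = build_request(prompt, os.environ["GEMINI_API_KEY"])
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["candidates"][0]["content"]["parts"][0]["text"]

# Usage (needs network access and a real key in GEMINI_API_KEY):
#   print(ask("Say hello in one sentence."))
```

Keeping the key in an environment variable rather than in the script itself is a design choice of this sketch, not something the API requires.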
It would be much less convenient if they abandoned it and forced you to work in the dense Google Cloud jungle with IAM etc for the sake of forced “simplicity” of offering models in one place.
You do have a point about the dense Google Cloud jungle. I agree.
Why is the backend Google Drive? Why do uploaded (and maybe generated?) images end up there, decoupled from the chats they originate from? What quotas am I accessing; those for my free API keys, others, or none at all?
---
[me] dark legacy comics who wants some bamboo
[AI Mode] In Dark Legacy Comics, the phrase "who wants some bamboo?" is a catchphrase of sorts used by the panda character Keydar. It is often used before the panda charges into a fight or expresses a desire to participate in some aggressive action. In the context of the comics, it is a humorous and slightly absurd representation of his aggressive streak.
The character of Keydar is known for his sometimes-contradictory actions; he is a panda (typically associated with peacefulness) but displays a more chaotic and aggressive side, and the "bamboo" catchphrase highlights this comedic contrast.
---
Out of this response, the following information is accurate:
1. There is a character in Dark Legacy Comics whose name is "Keydar".
2. Nothing else.
Keydar does appear in the strip containing that phrase ( https://darklegacycomics.com/335 ), but he's the one who isn't a panda.
I'm intrigued that it's possible to run this search repeatedly on Google and get "AI Overview" responses that are all wildly different from each other. A new overview every time! They do not ever get anything right. The closest I've seen an overview come is one that said "there is no strip featuring that phrase", which still isn't true.
In fact, here's one (to the extent that they're different things, this is "AI Overview", not "AI Mode") that I just Googled up:
---
The line "who wants some bamboo" is from Dark Legacy Comics #338: "Mists of... something". The comic was a satirical commentary on the announcement of the World of Warcraft: Mists of Pandaria expansion.
In the comic, the main characters express their surprise and confusion over the new expansion's focus on pandaren (a humanoid panda race). One of them, after hearing the news, sarcastically holds up a piece of bamboo and says, "Who wants some bamboo?" to his guildmates, who look on with expressions of deep disappointment. The comic satirized the player base's initial skepticism about the expansion's direction.
---
Verification...
1. The line is from strip #338: false
2. Strip #338 is titled "Mists of... something": false
3. Strip #338 is a commentary on the announcement of Mists of Pandaria: false [It appears to be a commentary on a change implemented in Warlords of Draenor, except it was published far too early for that. I guess it's a commentary on the stag form, which is technically an MoP change.]
4. Synopsis: entirely made up
[0] https://ai.google.dev/gemini-api/docs/thinking#summaries
[1] https://discuss.ai.google.dev/t/thoughts-are-missing-cot-not...
[2] https://discuss.ai.google.dev/t/massive-regression-detailed-...
I used Pro Mode in ChatGPT since it was available, and tried Claude, Gemini, Deepseek and more from time to time, but none of them ever get close to Pro Mode, it's just insanely better than everything.
So when I hear people comparing "X to ChatGPT", are you testing against the best ChatGPT has to offer, or are you comparing it to "Auto" and calling it a day? I understand people not testing their favorite models against Pro Mode as it's kind of expensive, but it would really help if people actually gave some more concrete information when they say "I've tried all the models, and X is best!".
(I mainly do web dev, UI and UX myself too)
I am, continuously, and have been since ChatGPT Pro appeared.
My only exceptions being Sonnet 4.5 / Codex for code implementation, and Deep Research for anything requiring a ton of web searches.
Now I have my model selector permanently on “Thinking”. (I don’t even know what type of questions I’d ask the non-thinking one.)
- Convert the whole codebase into a string
- Paste it into Gemini
- Ask a question
People seem to be very taken with "agentic" approaches where the model selects a few files to look at, but I've found it very effective and convenient just to give the model the whole codebase, and then have a conversation with it, get it to output code, modify a file, etc.
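The "convert the whole codebase into a string" step could look something like the sketch below. The skip list, the suffix list, the header format, and the function name `codebase_to_string` are all assumptions chosen for illustration, not what the commenter actually uses.

```python
from pathlib import Path

# Hypothetical filters: adjust to your own project layout.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
TEXT_SUFFIXES = {".py", ".js", ".ts", ".html", ".scss", ".css", ".md", ".json"}

def codebase_to_string(root):
    """Concatenate every matching text file under root, each preceded by a path header."""
    root = Path(root)
    chunks = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip anything inside an ignored directory.
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        # Skip directories and files that are unlikely to be source text.
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        chunks.append(f"===== {rel.as_posix()} =====\n{text}")
    return "\n\n".join(chunks)

# Usage: write the result to a file, then paste or upload it alongside your question.
#   Path("codebase.txt").write_text(codebase_to_string("."), encoding="utf-8")
```

The path headers give the model a way to refer back to specific files when it proposes changes.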
Then for each subsequent conversation I would ask the model to use this file as reference.
The overall idea is the same, but going through an intermediate file allows for manual amendments to the file in case the model consistently forgets some things, it also gives it a bit of an easier time to find information and reason about the codebase in a pre-summarized format.
It's sort of like giving a very rich metadata and index of the codebase to the model instead of dumping the raw data to it.
I'm using all three back-to-back via the VS Code plugins (which I believe are equivalent to the CLI tools).
I can live with either OpenAI Codex or Claude. Gemini 2.5 is useful but it is consistently not quite as good as the other two.
I agree that for non-Agentic coding tasks Gemini 2.5 is really good though.
- Gemini Pro 2.5 is better at feeding it more code and ask it to do a task (or more than one)?
- ...but that GPT Codex and Claude Code are better at iterating on a project?
- ...or something else?
I am looking to gauge my options. Will be grateful for your shared experience.
At the "smart autocomplete" level the distinction isn't large but it gets bigger the more agentic you ask for.
When using the Gemini web app on a desktop system (could be different depending upon how you consume Gemini) if you select the + button in the bottom-left of the chat prompt area, select Import code, and then choose the "Upload folder" link at the bottom of the dialog that pops up, it'll pull up a file dialog letting you choose a directory and it will upload all the files in that directory and all subdirectories (recursively) and you can then prompt it on that code from there.
The upload process for average sized projects is, in my experience, close to instantaneous (obviously your mileage can vary if you have any sort of large asset/resource type files commingled with the code).
If your workflow already works then keep with it, but for projects with a pretty clean directory structure, uploading the code via the Import system is very straightforward and fast.
(Obvious disclaimer: Depending upon your employer, the code base in question, etc, uploading a full directory of code like this to Google or anyone else may not be kosher, be sure any copyright holders of the code are ok with you giving a "cloud" LLM access to the code, etc, etc)
Tools like repomix[0] do this better, plus you can add your own extra exclusions on top. It also estimates token usage as part of its output, but I found it too optimistic: it regularly says "40_000 tokens", but when uploading the resulting single XML file to Gemini it's actually more like 55k - 65k tokens.
[0] https://github.com/yamadashy/repomix/
I sometimes upload codebases that are around 600k tokens and even those work.
Repomix also lets you create a config file so you can give it ignore/include patterns in addition to .gitignore.
It also tells you about the outlier files with exceptionally long content.
Also, use Google AI Studio, not the regular Gemini plan for the best results. You'll have more control over results.
I "grew up", as it were, on StackOverflow. When I was in my early dev days and didn't have a clue what I was doing, I asked question after question on SO and learned very quickly the difference between asking a good question and asking a bad one.
There is a great Jon Skeet blog post from back in the day called "Writing the perfect question" - https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-...
I think this is as valid as ever in the age of AI, you will get much better output from any of these chatbots if you learn and understand how to ask a good question.
In other words, the better you are at prompting (e.g. you write half a page of prompt even for casual uses; believe it or not, such people do exist, and prompt length is in practice a good proxy for prompting skill), the more you will like (or at least get better results with) Gemini over Claude.
This isn't necessarily good for Gemini because being easy to use is actually quite important, but it does mean Gemini is considerably underrated for what it can do.
One of the biggest problems with practical applications of generative AI right now is that it's basically impossible to tell which models are really good at which things without trying that specific task. There are some generalizations (e.g. you can measure more abstract metrics like capacity for spatial reasoning, and they do affect performance in ways you'd expect), but there's far more uncertainty.
This is also why many people get so pissed when companies retire models. Even if the replacement is seemingly better in the metrics, it's not a given that it's better at your specific thing. Or it may be better, but only if you write a completely different prompt, and, again, the only way to discover that magic correct prompt is through experimentation. Hence why it feels less like engineering and more like shamanism a lot of the time.
The use case one is bigger than the approach one, but both play a role. Most people only use LLMs for a very specific set of tasks using the same approach every time, so they base their view of them on solely the performance on this task.
That explains all of it.
It is just funny to think about: LLMs are sometimes viewed as big piles of linear algebra, so it would not be that surprising to hear that somebody had worked out that one model was somehow a subset of another (or something along those lines) and then claimed some theoretical superiority.
However if you get the hang of it, it can be very powerful
But the past few days I started getting an "AI Mode" in Google Search that rocks. Way better than GPT-5 or Sonnet 4.5 for figuring out things and planning. And I've been using it without my account (weird, but I'm not complaining). Maybe this is Gemini 3.0. I would love for it to be good at coding. I'm near limits on my Anthropic and OpenAI accounts.
I find GPT-5 Codex slightly better but I agree it could be prompt dependent.
Edit: narrow use cases are roughly "true reasoning" (GPT-5) and Python script writing (the Claudes)
I wonder if it has something to do with the level of abstraction and questions that you give to Gemini, which might be related to the profession or way of typing.
This commonly expressed non-sequitur needs to die.
First of all, all of the big AI labs have crawled the internet. That's not a special advantage to Google.
Second, that's not even how modern LLMs are trained. That stopped with GPT-4. Now a lot more attention is paid to the quality of the training data. Intuitively, this makes sense. If you train the model on a lot of garbage examples, it will generate output of similar quality.
So, no, Google's crawling prowess has little to do with how good Gemini can be.
I wonder if Google's got some tricks up their sleeves after their decades of having to tease signal from the cacophony of noise that the internet has become.
How often do you encounter loops?
I've since switched to Claude Code and I no longer have to spend nearly as much time managing context and scope.
For writing and editorial work, I use Gemini 2.5 Pro (Sonnet seems simply worse, while GPT5 is too opinionated).
For coding, Sonnet 4.5 (usually).
For brainstorming and background checks, GPT5 via ChatGPT.
For data extraction, GPT5. (Seems to be the best at this "needle in a haystack".)
Between the two, 100% of my code is written by AI now, and has been since early July. Total gamechanger vs. earlier models, which weren't usable for the kind of code I write at all.
I do NOT use either as an "agent." I don't vibe code. (I've tried Claude Code, but it was terrible compared to what I get out of GPro 2.5.)