Gemini 3.0 Pro – Early Tests
Posted 3 months ago · Active 3 months ago
Source: twitter.com · Tech story · High profile
Sentiment: skeptical / mixed · Debate
Score: 80/100
Key topics
Gemini 3.0 Pro
AI Models
Google AI
Large Language Models
The post discusses early tests of Gemini 3.0 Pro, a new AI model from Google, with comments expressing skepticism about its capabilities and concerns about the company's product culture and data privacy.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment after 20m
Peak period: 68 comments in 0-3h
Average per period: 11.4
Comment distribution: 125 data points (based on 125 loaded comments)
Key moments
1. Story posted: Oct 2, 2025 at 2:26 PM EDT (3 months ago)
2. First comment: Oct 2, 2025 at 2:46 PM EDT (20m after posting)
3. Peak activity: 68 comments in 0-3h, the hottest window of the conversation
4. Latest activity: Oct 4, 2025 at 5:47 AM EDT (3 months ago)
ID: 45453448 · Type: story · Last synced: 11/20/2025, 7:55:16 PM
A few more in this genre:
https://x.com/cannn064/status/1973818263168852146 - "Make a SVG of a PlayStation 4 controller"
https://x.com/cannn064/status/1973415142302830878 "Create a single, self-contained HTML5 file that mimics a macOS Sonoma-style desktop: translucent menu bar with live clock, magnifying dock, draggable/resizable windows, and a dynamic wallpaper. No external assets; use inline SVG for icons."
https://x.com/synthwavedd/status/1973405539708056022 "Write full HTML, CSS and Javascript for a very realistic page on Apple's website for the new iPhone 18"
I've not seen it myself so I'm not sure how confident they are that it's Gemini 3.0.
The only thing I've found that gives me some quantitative idea of how good a new model is, is my own private benchmark suite. It doesn't cover everything I want to use LLMs for, and it only has 20-30 tests per "category", but at least I'm 99% sure it isn't in the training datasets.
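To make the shape of such a private benchmark concrete, here is a minimal sketch of a runner; the categories, prompts, and pass/fail checks are purely hypothetical stand-ins (not anyone's actual suite), and it assumes an OpenAI-compatible chat endpoint via the openai Python package:

    # Hypothetical private-benchmark runner: a few hand-written prompts per
    # "category", each with a simple pass/fail check. Prompts, checks and
    # model name are placeholders, not anyone's real private suite.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    TESTS = {
        "svg": [
            ("Output only an SVG of a red circle on a white background.",
             lambda out: "<svg" in out and "circle" in out),
        ],
        "arithmetic": [
            ("What is 17 * 23? Reply with just the number.",
             lambda out: "391" in out),
        ],
    }

    def run(model: str) -> None:
        for category, cases in TESTS.items():
            passed = 0
            for prompt, check in cases:
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content or ""
                passed += check(reply)
            print(f"{category}: {passed}/{len(cases)}")

    run("gpt-4o-mini")  # swap in whichever new model is being evaluated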
I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!
Cue intro: "The gang wastes their time cheating on a dumb benchmark"
But now I'm worried that, since you've shared that you do an SVG-of-an-X-riding-a-Y thing, these models will try to cheat on the whole SVG-of-X-riding-Y category instead of hyper-focusing on the pelican.
So now I suppose you might need to come up with an entirely new thing though :)
A duck-billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surfboard?
The point is that in order to cheat on @simonw's benchmark across any arbitrary combination, they'd have to come up with an absurd number of human-crafted input-output training pairs with human-produced drawings. You can't just ask ChatGPT to generate every combination, because all it'll produce is garbage that gets a lot worse the further you get from a pelican riding a bicycle.
It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war riding a pyrosome? I asked every model I have access to for an SVG of a "man o' war riding a pyrosome" and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war, except as a generic ellipsoid-shaped jellyfish with a few tentacles.
Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.
Man o' war on a pyrosome. I don't know what you expected it to look like, and maybe it could be more whitish and translucent instead of orange, but it looks fairly reasonable to me. Took a bit over a minute with the ChatGPT app.
Simonw's test is for the text-only output from an LLM writing an SVG, not "can a multimodal AI in 2025 generate a PNG". Because people wanted to see a pelican on a bicycle after reading his blog, there are now raster images from image-generation models, in PNG format, that fairly convincingly look as described, and they are in the training data. Now that there are PNGs of pelicans on bicycles, we would expect GPT-6 to be better at generating SVGs of something it's already "seen".
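For anyone unclear on the distinction: the benchmark asks the LLM to emit SVG markup as plain text, with no image model involved. A minimal sketch of that workflow, assuming the openai Python package (the model name is a placeholder; any chat-style API works the same way):

    # Ask a chat model to write SVG markup as text, then save it to a file.
    # The LLM literally authors the XML; nothing rasterizes an image.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle. "
                       "Reply with only the SVG markup.",
        }],
    )
    with open("pelican.svg", "w") as f:
        f.write(resp.choices[0].message.content or "")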
We don't know what simonw's secret X-and-Y combo is, nor do I want to know, because that would ruin the benchmark (if it isn't ruined already by virtue of him having asked it). 200k nouns is definitely high, though. A bit of thought could cut it down to exclude concepts and a lot of other things. How much spare GPU capacity OpenAI has, I have no idea. But if I were there, I'd want the GPUs running as hot as the cloud provider would let me run them, because they're paying per hour, not per watt, and I'd keep a low-priority queue of jobs for employees to generate whatever extra training data they can think of on their off hours.
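The back-of-the-envelope numbers make the point; taking the 200k-noun figure above at face value:

    # Rough combinatorics of "an X riding a Y" prompts.
    nouns = 200_000                       # the 200k-noun estimate from above
    print(nouns * (nouns - 1))            # 39,999,800,000 ordered pairs (~4e10)

    # Even trimmed to 10k plausibly drawable nouns, that's ~1e8 combinations,
    # far too many to hand-craft SVG training pairs for.
    drawable = 10_000
    print(drawable * (drawable - 1))      # 99,990,000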
Oh and here's the pelican PNG so the other platforms can crawl this comment and slurp it up.
https://chatgpt.com/share/68def958-3008-8009-91fa-99127fc053...
Granted not an SVG, but still awesome.
https://imgur.com/a/KsbyVNP
I don't think it's necessarily "cheating"; it just happens as they're discovering and ingesting large ranges of content. That's the problem with public content: it's bound to be included sooner or later, directly or indirectly.
Nice to hear you have some sort of contingency, though, and I'm looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)
How would you determine that improvements to SVG pelicans on bicycles (and not your secret X on Ys) are from an OpenAI employee cheating your benchmark versus a genuine improvement on pelicans on bicycles, thanks to that picture from Reddit and everywhere else in the training data?
> Absolutely — the “pelican riding a bicycle” SVG test is a quirky but clever benchmark created by Simon Willison to evaluate how well different large language models (LLMs) can generate SVG (Scalable Vector Graphics) images from a prompt that’s both unusual and unlikely to be in their training data.
Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.
Obviously they're only getting the question and not a perfect answer, but with today's process of generating hundreds of potential answers and getting another model to choose the best/correct one for training, I don't think that matters.
More important would be post-training, where the labs specifically train on the exact question. But it doesn't seem like this is happening for most amateur benchmarks, at least. All the models that are good at the pelican-on-a-bicycle test have been good at whatever else you throw at them to render as SVG.
The things I see represented here may or may not be impressive, but they sure as hell have never been the major blockers to making progress on complex tasks and software.
I understand you're merely reporting (thank you for that; I'm not criticizing you), but those tests are absolutely irrelevant.
Is this supposed to be a good example?
It looks like something I'd put together, and you don't want me doing design work.
I tried it with Claude Code CLI. It didn't follow instructions correctly (I had a Claude.md file with clear instructions), stopped after a few implementations (less than 3 minutes), and produced code that does not work.
To give it the benefit of the doubt, I changed the instructions to target NextJS, since it's a well-known framework and I thought it might do better, but still: same quality issues.
I still think ultimately (and somewhat sadly) Google will win the AI race due to its engineering talent and the sheer amount of data it has (and Android integration potential).
It may well be that they also didn't have a product culture as an organization, but were willing to experiment or let small teams do so.
It's still a lesson, but maybe a different one.
With organizational scale it becomes harder and harder to launch experiments under the brand. Red tape increases, outside scrutiny increases. Retaining the ability to do that is difficult.
Google does experiment a fair bit (including in AI; NotebookLM and its podcast feature are, I think, a standout example of trying to see what sticks), but they also tend to hide their experiments in developer portals nowadays, which makes it difficult to get a signal from a general consumer audience.
OpenAI mistakenly thought Anthropic was about to launch a chatbot, and ChatGPT was a scrappy, rushed-out-the-door product built on an intermediate version of GPT-3.5, meant to one-up them. Of course, they were surprised at how popular it became.
OpenAI forced Google to release, and as a result we have all of the AI tooling, integrations, and models. Meta leaning into the leaked Llama weights took this further and sparked the open-source LLM revolution (in addition to the myriad contributors and researchers who built on that).
If we had left it to Google, I suspect they'd have released tooling (as they did with TensorFlow) but not an LLM that might compete with their core product.
I feel like Google tried to solve for this with their `withgoogle.com` domain and it just ends up being confusing or worse still, frustrating when you see something awesome and then nothing ever comes of it.
Google's AI offering is a complete nightmare to use. Three different APIs, at least two different subscriptions, documentation that uses them interchangeably.
For Gemini's API it's often much simpler to just pay OpenRouter the 5% surcharge to BYOK than deal with it all.
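For reference, the OpenRouter route is just the standard OpenAI-style client pointed at a different base URL; a minimal sketch (the model slug is illustrative, and BYOK itself is configured on the OpenRouter account rather than in code):

    # Calling Gemini through OpenRouter's OpenAI-compatible endpoint:
    # one API shape for every provider, no Google-specific SDK required.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # OpenRouter key; BYOK is set up in their dashboard
    )
    resp = client.chat.completions.create(
        model="google/gemini-2.5-pro",  # illustrative model slug
        messages=[{"role": "user", "content": "Summarize this thread."}],
    )
    print(resp.choices[0].message.content)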
I still can't use my Google AI Pro account with gemini-cli.
It's amazing how they can show useless data while completely obfuscating what matters.
Not enough brain cycles to figure out a way to give Google money, whereas the OpenAI subscription was basically a no-brainer.
The Python library is not well documented and has some pretty basic issues that need looking at: terrible, unhelpful errors, and "oh, so this works if I put it in camel case" sort of stuff.
I find Gemini is their first API that works like that. Not like their pre-Gemini vision, speech recognition, Sheets, etc. Those were/are a nightmare to set up indeed.
Google personally reached out to someone trying to reproduce GPT3 and convinced him to abandon his plan of releasing it to the public.
GPT-2!!
https://medium.com/@NPCollapse/the-hacker-learns-to-trust-62...
This is evident in Android and the Pixel lineup, which could be my favorite phone if not for some of the most baffling and frustrating decisions, which lead to a very weirdly disjointed app experience (compared to something like iOS's first-party tools).
Like removing location-based reminders from Google Tasks, for some reason? There's still no Apple Shortcuts-like automation built in. Keep can still do location-based reminders, but it's a notes app, so which am I supposed to use, Google Tasks or Keep? Well, Gemini adds reminders to Google Tasks and not Keep, even if I wanted to use Keep primarily.
If they just spent some time polishing and integrating these tools, and added some of their ML magic, they'd blow Apple out of the park.
All of Google's tech is cool and interesting from a tech standpoint, but it's not well integrated into a full consumer experience.
https://www.thevoiceofuser.com/google-clouds-cuts-and-the-bi...
Could it be argued that perhaps UX Research was not working at all? Or that their recommendations were not being incorporated? Or that things will get even worse now without them?
I get Google can't force it on all the OEMs with their custom skins, but they can at least control their own PixelOS and their own apps.
> Some teams in the Google Cloud org just laid off all UX researchers below L6
That’s not all UX researchers below L6 in the entire company. It doesn’t even sound like it’s all UX researchers below L6 in Google Cloud.
It would be weird to release that as a serious company. They tried making a deliberately-wacky chatbot but it was not fun.
Letting OpenAI release it first was the right move.
I remember building a dead-simple SvelteKit website back in the ChatGPT 3 days. It was good, it was mind-blowing, and I was proud of it.
The only interactivity was a button which would go from one color to another and then lead to a PDF.
If I'm being honest, the UI was genuinely good. It was great, though, and still gives me more nostalgia and good vibes than current models. Em-dashes weren't that common in ChatGPT 3, IIRC, but I have genuinely forgotten what it was like to talk to it.
> In June 2022, LaMDA gained widespread attention when Google engineer Blake Lemoine made claims that the chatbot had become sentient. The scientific community has largely rejected Lemoine's claims...
From https://en.wikipedia.org/wiki/LaMDA
Damn, that's crazy. Or at least it is in hindsight. I don't remember any big deal being made about it back then.
There were other, less-available prototypes prior to that.
For example, does a CSS expert know how to design a great website? _Maybe_… but knowing the CSS spec in its entirety doesn't (by itself) help you understand how to make useful or delightful products.
Fair criticism that it took someone else to make something of the tech that Google initially invented, but Google is furiously experimenting with all their active products since Sundar's "code red" memo.
Without RLHF, LLM-based chat was a psychotic liability.
Nearly all the people that matter use iPhone... Yet Apple really hasn't had much success in the AI world, despite being in a position to win if their product is even only vaguely passable.
The fact that Attention Is All You Need was freely available online was, in hindsight, unbelievably fortunate.
It's amazing how pervasive company cultures can be, how this comes from the top, and how it can only be fixed by replacing leadership with an extremely talented CEO who knows the company inside out and can change its course. Nadella at Microsoft comes to mind, although that was more about Microsoft going back to its roots (replacing sales-oriented leadership with product-oriented leadership again).
Google never had product oriented leadership in the same way that Amazon, Apple and Microsoft had.
I don’t think this will ever change at this point.
For those who haven’t read it, Steve Yegge’s rant about Google is worth your time:
[1] https://gist.github.com/chitchcock/1281611
Now Grok is publicly boasting PhD-level reasoning while Surge AI and Scale AI are focusing on high-quality datasets curated by actual human PhDs.
Surge AI is boasting $1B in revenue, and I am wondering how much of that was paid in X.ai stock: https://podcasts.apple.com/us/podcast/the-startup-powering-t...
In my opinion, the major advancements of 2025 have been more efficient models. The labs have made smaller models much, much better (including MoE models) but have failed to meaningfully push the SoTA on huge models, at least among the US companies.
You can try to build a monster the size of GPT-4.5, but even if you could actually make training stable and efficient at that scale, you would still struggle to serve it to users.
The next generation of AI hardware should put such models in reach, and I expect model scale to grow in lockstep with new hardware becoming available.
It was one of the first things I tried when Claude Code went GA:
https://gondolaprime.pw/hex-balls
One of the biggest issues holding Gemini back, IMO, compared to the competitors.
Many LLMs are still plagued by "it's easier to reset the conversation than to unfuck the conversation", but Gemini 2.5 is among the worst.
The other day I asked 2.5 Pro for suggestions. It would provide one, which I rejected with some reasoning. It would provide another, which I also rejected. Asked for more, it would then loop between the two, repeating the previous suggestions verbatim. This went on three or four times, even after it was told to reflect on it and even though it could recite the rejection reasons.
* Gemini has the highest ceiling of all the models, but has consistently struggled with token-level accuracy. In other words, its conceptual thinking is well beyond other models, but it sometimes makes stupid errors when talking. This makes it hard to use reliably for tool calling or structured output (see the sketch after this comment). Gemini is also very hard to steer, so when it's wrong, it's really hard to correct.
* Claude is extremely consistent and reliable. It's very, very good at the details, but will start to forget things if the task gets too complex. The good news is Claude is very steerable and will remember those details if you remind it.
* GPT-5 seems to be completely random for me. It's so inconsistent that it's extremely hard to use.
I tend to use Claude because I'm the most familiar with it and I'm confident that I can get good results out of it.
It's really the only model that can do large(er) codebase work.
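A common workaround for the token-level flakiness described in the list above, whichever provider you use, is to validate structured output and re-ask on failure. A minimal sketch, with a placeholder model and an ad-hoc field check rather than a real schema:

    # Validate-and-retry loop for structured output: parse the reply as JSON,
    # check for the required fields, and re-ask instead of trusting the first
    # attempt. Model name and the example prompt are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    def get_structured(prompt: str, required: list[str], retries: int = 3) -> dict:
        for _ in range(retries):
            reply = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder
                messages=[{"role": "user",
                           "content": prompt + " Reply with JSON only."}],
            ).choices[0].message.content or ""
            try:
                data = json.loads(reply)
                if all(key in data for key in required):
                    return data
            except json.JSONDecodeError:
                pass  # malformed output; fall through and ask again
        raise RuntimeError("model never produced valid structured output")

    print(get_structured("Extract the city and country from: 'I live in Oslo.'",
                         ["city", "country"]))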
It's honestly crazy how good it is, coming from Claude. I never thought I could already pass it a design doc and have it one-shot the entire thing with that level of accuracy. Even with Opus, I always need to either steer it or fix the stuff it forgot by hand / have another phase afterwards to get it from 90% to 100%.
Yes, the Codex TUI sucks, but the model with high reasoning is an absolute beast, and it convinced me to switch from Claude Max to ChatGPT Pro.
It's like the personality of a person. Employee A is better at talking to customers than Employee B, but Employee B is better at writing code than Employee A. Is one better than the other? Is one smarter than the other? Nope. Different training data.
I run the claude CLI as my primary and just ask it nicely to consult the gemini CLI (but not let it do any coding). It works surprisingly well. OpenAI just fell out of my view; I even cancelled my ChatGPT subscription. Gemini is leaping forward, and it _feels like_ GPT-5 is a regression. I can't put my finger on it, tbh.
One advantage Gemini had (or still has; I'm not sure about the other providers) was its large context window combined with the ability to use PDF documents. It probably saved me weeks of work on an integration with a government system: I could upload hundreds of pages of documentation and immediately start asking questions, generating rules, and troubleshooting payloads that were producing generic, computer-says-no errors.
No need to go through RAG shenanigans, and all of it within the free token allowance.
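As I understand it, that workflow is only a couple of calls in the google-genai Python SDK; a minimal sketch (file path and model are placeholders, and the exact SDK surface may differ between versions):

    # Upload a large PDF once, then ask questions against it directly,
    # leaning on the long context window instead of building a RAG pipeline.
    # Assumes the google-genai package; path and model name are placeholders.
    from google import genai

    client = genai.Client()  # reads the Gemini API key from the environment
    doc = client.files.upload(file="government_spec.pdf")

    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[doc, "Which fields are mandatory in the payment payload?"],
    )
    print(resp.text)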
It took me way too long to figure out how to even access & use Veo 3.
It’s like Google doesn’t know how to package a product.
I can't even get GPT-5 to create a new feature without generating completely awful code: making up facts where it can't work out how the feature fits into the rest of the code, and spawning an error-ridden, unmaintainable mess of functionality.
I've spent this whole week debugging AI trash. And it's not fun.
They are literally the worst major provider in terms of privacy for a consumer paid service.
Not to mention every team will have the bouncing-balls-in-a-polygon test in their dataset now.