Gemini 3.0 Pro – Early Tests
Posted 3 months ago · Active 3 months ago
Source: twitter.com · Tech story · High profile
Sentiment: skeptical / mixed · Debate
Score: 80/100
Key topics
Gemini 3.0 Pro
AI Models
Google AI
Large Language Models
The post discusses early tests of Gemini 3.0 Pro, a new AI model from Google, with comments expressing skepticism about its capabilities and concerns about the company's product culture and data privacy.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment after 20m
Peak period: 68 comments in 0-3h
Average per period: 11.4
Comment distribution: 125 data points (based on 125 loaded comments)
Key moments
1. Story posted: Oct 2, 2025 at 2:26 PM EDT (3 months ago)
2. First comment: Oct 2, 2025 at 2:46 PM EDT (20m after posting)
3. Peak activity: 68 comments in 0-3h, the hottest window of the conversation
4. Latest activity: Oct 4, 2025 at 5:47 AM EDT (3 months ago)
ID: 45453448 · Type: story · Last synced: 11/20/2025, 7:55:16 PM
A few more in this genre:
https://x.com/cannn064/status/1973818263168852146 - "Make a SVG of a PlayStation 4 controller"
https://x.com/cannn064/status/1973415142302830878 "Create a single, self-contained HTML5 file that mimics a macOS Sonoma-style desktop: translucent menu bar with live clock, magnifying dock, draggable/resizable windows, and a dynamic wallpaper. No external assets; use inline SVG for icons."
https://x.com/synthwavedd/status/1973405539708056022 "Write full HTML, CSS and Javascript for a very realistic page on Apple's website for the new iPhone 18"
I've not seen it myself so I'm not sure how confident they are that it's Gemini 3.0.
The only thing I've found that gives me some quantitative idea of how good a new model is, is my own private benchmark suite. It doesn't cover everything I want to use LLMs for, and it only has 20-30 tests per "category", but at least I'm 99% sure it isn't in the training datasets.
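To make the shape of such a private benchmark concrete, here is a minimal sketch of a runner; the categories, prompts, and pass/fail checks are purely hypothetical stand-ins (not anyone's actual suite), and it assumes an OpenAI-compatible chat endpoint via the openai Python package:

    # Hypothetical private-benchmark runner: a few hand-written prompts per
    # "category", each with a simple pass/fail check. Prompts, checks and
    # model name are placeholders, not anyone's real private suite.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    TESTS = {
        "svg": [
            ("Output only an SVG of a red circle on a white background.",
             lambda out: "<svg" in out and "circle" in out),
        ],
        "arithmetic": [
            ("What is 17 * 23? Reply with just the number.",
             lambda out: "391" in out),
        ],
    }

    def run(model: str) -> None:
        for category, cases in TESTS.items():
            passed = 0
            for prompt, check in cases:
                reply = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content or ""
                passed += check(reply)
            print(f"{category}: {passed}/{len(cases)}")

    run("gpt-4o-mini")  # swap in whichever new model is being evaluated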
I would be so entertained if I found out an AI lab had wasted their time cheating on my dumb benchmark!
Cue intro: "The gang wastes their time cheating on a dumb benchmark"
But now I'm worried that, since you've shared that you do an SVG-of-an-X-riding-a-Y thing, these models will try to cheat on the whole SVG-of-X-riding-Y category instead of hyper-focusing on the pelican.
So now I suppose you might need to come up with an entirely new thing though :)
A duck-billed platypus riding a unicycle? A man o' war riding a pyrosome? A chicken riding a Quetzalcoatlus? A tardigrade riding a surfboard?
The point is that in order to cheat on @simonw's benchmark across any arbitrary combination, they'd have to come up with an absurd number of human-crafted input-output training pairs with human-produced drawings. You can't just ask ChatGPT to generate every combination, because all it'll produce is garbage that gets a lot worse the further you get from a pelican riding a bicycle.
It might work at first for the pelican and a few other animal/transport combinations, but what does it even mean for a man o' war riding a pyrosome? I asked every model I have access to for an SVG of a "man o' war riding a pyrosome" and not a single one managed to draw anything resembling a pyrosome. Most couldn't even produce something resembling a man o' war, except as a generic ellipsoid-shaped jellyfish with a few tentacles.
Expand that to every weird noun-noun combination and it's just not practical to train even a tiny fraction of them.
Man o' war on a pyrosome. I don't know what you expected it to look like, and maybe it could be more whitish and translucent instead of orange, but it looks fairly reasonable to me. Took a bit over a minute with the ChatGPT app.
Simonw's test is for the text-only output from an LLM writing an SVG, not "can a multimodal AI in 2025 generate a PNG". Because people wanted to see a pelican on a bicycle after reading his blog, there are now raster images from image-generation models, in PNG format, that fairly convincingly look as described, and they are in the training data. Now that there are PNGs of pelicans on bicycles, we would expect GPT-6 to be better at generating SVGs of something it's already "seen".
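For anyone unclear on the distinction: the benchmark asks the LLM to emit SVG markup as plain text, with no image model involved. A minimal sketch of that workflow, assuming the openai Python package (the model name is a placeholder; any chat-style API works the same way):

    # Ask a chat model to write SVG markup as text, then save it to a file.
    # The LLM literally authors the XML; nothing rasterizes an image.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle. "
                       "Reply with only the SVG markup.",
        }],
    )
    with open("pelican.svg", "w") as f:
        f.write(resp.choices[0].message.content or "")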
We don't know what simonw's secret X-and-Y combo is, nor do I want to know, because that would ruin the benchmark (if it isn't ruined already by virtue of him having asked it). 200k nouns is definitely high, though. A bit of thought could cut it down to exclude concepts and a lot of other things. How much spare GPU capacity OpenAI has, I have no idea. But if I were there, I'd want the GPUs running as hot as the cloud provider would let me run them, because they're paying per hour, not per watt, and I'd keep a low-priority queue of jobs for employees to generate whatever extra training data they can think of on their off hours.
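The back-of-the-envelope numbers make the point; taking the 200k-noun figure above at face value:

    # Rough combinatorics of "an X riding a Y" prompts.
    nouns = 200_000                       # the 200k-noun estimate from above
    print(nouns * (nouns - 1))            # 39,999,800,000 ordered pairs (~4e10)

    # Even trimmed to 10k plausibly drawable nouns, that's ~1e8 combinations,
    # far too many to hand-craft SVG training pairs for.
    drawable = 10_000
    print(drawable * (drawable - 1))      # 99,990,000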
Oh and here's the pelican PNG so the other platforms can crawl this comment and slurp it up.
https://chatgpt.com/share/68def958-3008-8009-91fa-99127fc053...
Granted not an SVG, but still awesome.
https://imgur.com/a/KsbyVNP
I don't think it's necessarily "cheating"; it just happens as they're discovering and ingesting large ranges of content. That's the problem with public content: it's bound to be included sooner or later, directly or indirectly.
Nice to hear you have some sort of contingency, though, and I'm looking forward to the inevitable blog post announcing the change to a different bird and vehicle :)
How would you determine that improvements to SVG pelicans on bicycles (and not your secret X on Ys) are from an OpenAI employee cheating your benchmark versus a genuine improvement on pelicans on bicycles, thanks to that picture from Reddit and everywhere else in the training data?
> Absolutely — the “pelican riding a bicycle” SVG test is a quirky but clever benchmark created by Simon Willison to evaluate how well different large language models (LLMs) can generate SVG (Scalable Vector Graphics) images from a prompt that’s both unusual and unlikely to be in their training data.
Not that I mind; I want models to be good at generating SVG! Makes icons much simpler.
Obviously they're only getting the question and not a perfect answer, but with today's process of generating hundreds of potential answers and getting another model to choose the best/correct one for training, I don't think that matters.
More important would be post-training, where the labs specifically train on the exact question. But it doesn't seem like this is happening for most amateur benchmarks, at least. All the models that are good at the pelican-on-a-bicycle test have been good at whatever else you throw at them to render as SVG.
The things I see represented here may or may not be impressive, but they sure as hell have never been the major blockers to making progress on complex tasks and software.
I understand you're merely reporting (thank you for that; I'm not criticizing you), but those tests are absolutely irrelevant.
Is this supposed to be a good example?
It looks like something I'd put together, and you don't want me doing design work.
I tried it with Claude Code CLI. It didn't follow instructions correctly (I had a Claude.md file with clear instructions), stopped after a few implementations (less than 3 minutes), and produced code that does not work.
To give it the benefit of the doubt, I changed the instructions to target NextJS, since it's a well-known framework and I thought it might do better, but still: same quality issues.
I still think ultimately (and somewhat sadly) Google will win the AI race due to its engineering talent and the sheer amount of data it has (and Android integration potential).
It may well be that they also didn't have a product culture as an organization, but were willing to experiment or let small teams do so.
It's still a lesson, but maybe a different one.
With organizational scale it becomes harder and harder to launch experiments under the brand. Red tape increases, outside scrutiny increases. Retaining the ability to do that is difficult.
Google does experiment a fair bit (including in AI; NotebookLM and its podcast feature are, I think, a standout example of trying to see what sticks), but they also tend to hide their experiments in developer portals nowadays, which makes it difficult to get a signal from a general consumer audience.
OpenAI mistakenly thought Anthropic was about to launch a chatbot, and ChatGPT was a scrappy, rushed-out-the-door product built on an intermediate version of GPT-3.5, meant to one-up them. Of course, they were surprised at how popular it became.
OpenAI forced Google to release, and as a result we have all of the AI tooling, integrations, and models. Meta leaning into the leaked Llama weights took this further and sparked the open-source LLM revolution (in addition to the myriad contributors and researchers who built on that).
If we had left it to Google, I suspect they'd have released tooling (as they did with TensorFlow) but not an LLM that might compete with their core product.
I feel like Google tried to solve for this with their `withgoogle.com` domain and it just ends up being confusing or worse still, frustrating when you see something awesome and then nothing ever comes of it.
Google's AI offering is a complete nightmare to use. Three different APIs, at least two different subscriptions, documentation that uses them interchangeably.
For Gemini's API it's often much simpler to just pay OpenRouter the 5% surcharge to BYOK than deal with it all.
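For reference, the OpenRouter route is just the standard OpenAI-style client pointed at a different base URL; a minimal sketch (the model slug is illustrative, and BYOK itself is configured on the OpenRouter account rather than in code):

    # Calling Gemini through OpenRouter's OpenAI-compatible endpoint:
    # one API shape for every provider, no Google-specific SDK required.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # OpenRouter key; BYOK is set up in their dashboard
    )
    resp = client.chat.completions.create(
        model="google/gemini-2.5-pro",  # illustrative model slug
        messages=[{"role": "user", "content": "Summarize this thread."}],
    )
    print(resp.choices[0].message.content)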
I still can't use my Google AI Pro account with gemini-cli.
It's amazing how they can show useless data while completely obfuscating what matters.
Not enough brain cycles to figure out a way to give Google money, whereas the OpenAI subscription was basically a no-brainer.
The Python library is not well documented and has some pretty basic issues that need looking at: terrible, unhelpful errors, and "oh, so this works if I put it in camel case" sort of stuff.
I find Gemini is their first API that works like that. Not like their pre-Gemini vision, speech recognition, Sheets, etc. Those were/are a nightmare to set up indeed.
Google personally reached out to someone trying to reproduce GPT3 and convinced him to abandon his plan of releasing it to the public.
GPT-2!!
https://medium.com/@NPCollapse/the-hacker-learns-to-trust-62...
This is evident in Android and the Pixel lineup, which could be my favorite phone if not for some of the most baffling and frustrating decisions, which lead to a very weirdly disjointed app experience (compared to something like iOS's first-party tools).
Like removing location-based reminders from Google Tasks, for some reason? There's still no Apple Shortcuts-like automation built in. Keep can still do location-based reminders, but it's a notes app, so which am I supposed to use, Google Tasks or Keep? Well, Gemini adds reminders to Google Tasks and not Keep, even if I wanted to use Keep primarily.
If they just spent some time polishing and integrating these tools, and added some of their ML magic, they'd blow Apple out of the park.
All of Google's tech is cool and interesting from a tech standpoint, but it's not well integrated into a full consumer experience.
https://www.thevoiceofuser.com/google-clouds-cuts-and-the-bi...
Could it be argued that perhaps UX Research was not working at all? Or that their recommendations were not being incorporated? Or that things will get even worse now without them?
I get Google can't force it on all the OEMs with their custom skins, but they can at least control their own PixelOS and their own apps.
> Some teams in the Google Cloud org just laid off all UX researchers below L6
That’s not all UX researchers below L6 in the entire company. It doesn’t even sound like it’s all UX researchers below L6 in Google Cloud.
It would be weird to release that as a serious company. They tried making a deliberately-wacky chatbot but it was not fun.
Letting OpenAI release it first was the right move.
I remember building a dead-simple SvelteKit website back in the ChatGPT 3 days. It was good, it was mind-blowing, and I was proud of it.
The only interactivity was a button which would go from one color to another and then lead to a PDF.
If I'm being honest, the UI was genuinely good. It was great, though, and still gives me more nostalgia and good vibes than current models. Em-dashes weren't that common in ChatGPT 3, IIRC, but I have genuinely forgotten what it was like to talk to it.
> In June 2022, LaMDA gained widespread attention when Google engineer Blake Lemoine made claims that the chatbot had become sentient. The scientific community has largely rejected Lemoine's claims...
From https://en.wikipedia.org/wiki/LaMDA
Damn, that's crazy. Or at least it is in hindsight. I don't remember any big deal being made about it back then.
There were other, less-available prototypes prior to that.
For example, does a CSS expert know how to design a great website? _Maybe_… but knowing the CSS spec in its entirety doesn't (by itself) help you understand how to make useful or delightful products.
Fair criticism that it took someone else to make something of the tech that Google initially invented, but Google is furiously experimenting with all their active products since Sundar's "code red" memo.
Without RLHF, LLM-based chat was a psychotic liability.
Nearly all the people that matter use iPhone... Yet Apple really hasn't had much success in the AI world, despite being in a position to win if their product is even only vaguely passable.
The fact that Attention Is All You Need was freely available online was, in hindsight, unbelievably fortunate.
It's amazing how pervasive company cultures can be, how this comes from the top, and how it can only be fixed by replacing leadership with an extremely talented CEO who knows the company inside out and can change its course. Nadella at Microsoft comes to mind, although that was more about Microsoft going back to its roots (replacing sales-oriented leadership with product-oriented leadership again).
Google never had product oriented leadership in the same way that Amazon, Apple and Microsoft had.
I don’t think this will ever change at this point.
For those who haven’t read it, Steve Yegge’s rant about Google is worth your time:
[1] https://gist.github.com/chitchcock/1281611
Now Grok is publicly boasting PhD-level reasoning while Surge AI and Scale AI are focusing on high-quality datasets curated by actual human PhDs.
Surge AI is boasting $1B in revenue, and I am wondering how much of that was paid in X.ai stock: https://podcasts.apple.com/us/podcast/the-startup-powering-t...
In my opinion, the major advancements of 2025 have been more efficient models. The labs have made smaller models much, much better (including MoE models) but have failed to meaningfully push the SoTA on huge models, at least among the US companies.
You can try to build a monster the size of GPT-4.5, but even if you could actually make training stable and efficient at that scale, you would still struggle to serve it to users.
The next generation of AI hardware should put such models in reach, and I expect model scale to grow in lockstep with new hardware becoming available.
It was one of the first things I tried when Claude Code went GA:
https://gondolaprime.pw/hex-balls
One of the biggest issues holding Gemini back, IMO, compared to the competitors.
Many LLMs are still plagued by "it's easier to reset the conversation than to unfuck the conversation", but Gemini 2.5 is among the worst.
The other day I asked 2.5 Pro for suggestions. It would provide one, which I rejected with some reasoning. It would provide another, which I also rejected. Asked for more, it would then loop between the two, repeating the previous suggestions verbatim. This went on three or four times, even after it was told to reflect on it and even though it could recite the rejection reasons.
* Gemini has the highest ceiling of all the models, but has consistently struggled with token-level accuracy. In other words, its conceptual thinking is well beyond other models, but it sometimes makes stupid errors when talking. This makes it hard to use reliably for tool calling or structured output (see the sketch after this comment). Gemini is also very hard to steer, so when it's wrong, it's really hard to correct.
* Claude is extremely consistent and reliable. It's very, very good at the details, but will start to forget things if the task gets too complex. The good news is Claude is very steerable and will remember those details if you remind it.
* GPT-5 seems to be completely random for me. It's so inconsistent that it's extremely hard to use.
I tend to use Claude because I'm the most familiar with it and I'm confident that I can get good results out of it.
It's really the only model that can do large(er) codebase work.
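A common workaround for the token-level flakiness described in the list above, whichever provider you use, is to validate structured output and re-ask on failure. A minimal sketch, with a placeholder model and an ad-hoc field check rather than a real schema:

    # Validate-and-retry loop for structured output: parse the reply as JSON,
    # check for the required fields, and re-ask instead of trusting the first
    # attempt. Model name and the example prompt are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    def get_structured(prompt: str, required: list[str], retries: int = 3) -> dict:
        for _ in range(retries):
            reply = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder
                messages=[{"role": "user",
                           "content": prompt + " Reply with JSON only."}],
            ).choices[0].message.content or ""
            try:
                data = json.loads(reply)
                if all(key in data for key in required):
                    return data
            except json.JSONDecodeError:
                pass  # malformed output; fall through and ask again
        raise RuntimeError("model never produced valid structured output")

    print(get_structured("Extract the city and country from: 'I live in Oslo.'",
                         ["city", "country"]))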
It's honestly crazy how good it is, coming from Claude. I never thought I could already pass it a design doc and have it one-shot the entire thing with that level of accuracy. Even with Opus, I always need to either steer it or fix the stuff it forgot by hand / have another phase afterwards to get it from 90% to 100%.
Yes, the Codex TUI sucks, but the model with high reasoning is an absolute beast, and it convinced me to switch from Claude Max to ChatGPT Pro.
It's like the personality of a person. Employee A is better at talking to customers than Employee B, but Employee B is better at writing code than Employee A. Is one better than the other? Is one smarter than the other? Nope. Different training data.
I run the claude CLI as my primary and just ask it nicely to consult the gemini CLI (but not let it do any coding). It works surprisingly well. OpenAI just fell out of my view; I even cancelled my ChatGPT subscription. Gemini is leaping forward, and it _feels like_ GPT-5 is a regression. I can't put my finger on it, tbh.
One advantage Gemini had (or still has; I'm not sure about the other providers) was its large context window combined with the ability to use PDF documents. It probably saved me weeks of work on an integration with a government system: I could upload hundreds of pages of documentation and immediately start asking questions, generating rules, and troubleshooting payloads that were producing generic, computer-says-no errors.
No need to go through RAG shenanigans, and all of it within the free token allowance.
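As I understand it, that workflow is only a couple of calls in the google-genai Python SDK; a minimal sketch (file path and model are placeholders, and the exact SDK surface may differ between versions):

    # Upload a large PDF once, then ask questions against it directly,
    # leaning on the long context window instead of building a RAG pipeline.
    # Assumes the google-genai package; path and model name are placeholders.
    from google import genai

    client = genai.Client()  # reads the Gemini API key from the environment
    doc = client.files.upload(file="government_spec.pdf")

    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[doc, "Which fields are mandatory in the payment payload?"],
    )
    print(resp.text)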
It took me way too long to figure out how to even access & use Veo 3.
It’s like Google doesn’t know how to package a product.
I can't even get GPT-5 to create a new feature without generating completely awful code: making up facts where it can't work out how the feature fits into the rest of the code, and spawning an error-ridden, unmaintainable mess of functionality.
I've spent this whole week debugging AI trash. And it's not fun.
They are literally the worst major provider in terms of privacy for a consumer paid service.
Not to mention every team will have the bouncing-balls-in-a-polygon test in their dataset now.