Gemini 2.5 Computer Use Model
Posted 3 months ago · Active 3 months ago
blog.google · Tech story · High profile
Sentiment: excited / mixed
Debate: 70/100
Key topics
Artificial Intelligence
Automation
User Interface
Google's Gemini 2.5 Computer Use model can interact with computer interfaces directly, sparking both excitement and concern among HN users about its potential applications and implications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 36m
Peak period: 115 comments (0-6h)
Avg / period: 17.8
Comment distribution: 160 data points
Based on 160 loaded comments
Key moments
- 01 Story posted: Oct 7, 2025 at 3:49 PM EDT (3 months ago)
- 02 First comment: Oct 7, 2025 at 4:24 PM EDT, 36m after posting
- 03 Peak activity: 115 comments in the 0-6h window, the hottest period of the conversation
- 04 Latest activity: Oct 10, 2025 at 2:00 PM EDT (3 months ago)
ID: 45507936 · Type: story · Last synced: 11/22/2025, 11:47:55 PM
Want the full context? Read the primary article or dive into the live Hacker News thread when you're ready.
See this section: https://googledevai.devsite.corp.google.com/gemini-api/docs/...
And the repo has a sample setup for using the default computer use tool: https://github.com/google/computer-use-preview
Because right now it's way too slow: perform an action, then read the results, then wait for the next tool call, and so on.
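A minimal sketch of that perceive/decide/act loop, with the model call stubbed out; the helper and action names are illustrative and not taken from the google/computer-use-preview repo. Each iteration costs a full model round trip, which is where the per-step slowness comes from.

```python
from dataclasses import dataclass
from playwright.sync_api import sync_playwright

@dataclass
class Action:
    name: str           # e.g. "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def propose_action(goal: str, screenshot: bytes) -> Action:
    # Stub: in the real setup this is one model call per step (see the docs and
    # repo linked above), which is exactly where the latency accumulates.
    raise NotImplementedError("call the computer-use model here")

def run(goal: str, url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            shot = page.screenshot()                  # perceive: capture current state
            action = propose_action(goal, shot)       # decide: one model round trip
            if action.name == "done":
                break
            if action.name == "click":
                page.mouse.click(action.x, action.y)  # act: replay the model's choice
            elif action.name == "type":
                page.keyboard.type(action.text)
```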
You can build your own trails, publish them on our registry, compose them ... You can also run them in a distributed fashion over several Herd clients, where we take care of the signaling and communication while you simply call functions. The CLI and the npm and Python packages [4, 5] might be interesting as well.
Note: the automation stack is entirely home-grown to enable distributed orchestration and doesn't rely on Puppeteer or Playwright, but the browser automation API [6] is kept relatively similar to ease adoption. We also don't use the Chrome DevTools Protocol and therefore have a different tradeoff profile.
0: https://herd.garden
1: https://herd.garden/trails
2: https://herd.garden/docs/trails-automations
3: https://herd.garden/docs/reference-mcp-server
4: https://www.npmjs.com/package/@monitoro/herd
5: https://pypi.org/project/monitoro-herd/
6: https://herd.garden/docs/reference-page
This has always felt like a natural best use for LLMs: let them "figure something out", then write/configure a tool to do the same thing. Throwing the full might of an LLM at every task that could be scripted is a massive waste of compute, not to mention that the LLM's output is inconsistent.
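A rough sketch of that pattern, assuming a hypothetical agent helper: the LLM explores once and emits a plain script, and every later run replays that script with no model in the loop.

```python
import subprocess
from pathlib import Path

SCRIPT = Path("tasks/export_report.py")  # hypothetical cached automation script

def explore_with_agent(instructions: str) -> str:
    # Hypothetical: drive a computer-use agent once and have it emit a
    # deterministic browser-automation script for the task.
    raise NotImplementedError("one-off LLM exploration goes here")

def run_task(instructions: str) -> None:
    if not SCRIPT.exists():
        SCRIPT.parent.mkdir(parents=True, exist_ok=True)
        SCRIPT.write_text(explore_with_agent(instructions))  # expensive, happens once
    subprocess.run(["python", str(SCRIPT)], check=True)       # cheap, fast, consistent
```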
Although I find the Electron MCP better for what I'm doing at the moment.
Case in point: last week I wrote a scraper for Rate Your Music, but found it frustrating. I'm not experienced with Playwright, so I used VS Code with Claude to iterate on the project. Constantly diving into devtools, copying outer HTML, inspecting specific elements, etc. is a chore that this could get around, making for faster development of complex tests.
https://github.com/grantcarthew/scripts/blob/main/get-webpag...
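For illustration, a small Playwright sketch of the chore described above: dumping the outer HTML of matching elements instead of copying it out of devtools by hand. The URL and selector are placeholders, not real Rate Your Music markup.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    for item in page.locator(".album-row").all():          # hypothetical selector
        outer_html = item.evaluate("el => el.outerHTML")   # same data you'd copy from devtools
        print(outer_html)
    browser.close()
```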
You should check out our most recent announcement about Web Bot Auth
https://www.browserbase.com/blog/cloudflare-browserbase-pion...
I asked ChatGPT about it and it revealed
Very nice solve, ChatGPT.
https://chatgpt.com/share/68e5e68e-00c4-8011-b806-c936ac657a...
I also found it interesting that despite me suggesting it might be a password generator or API key, ChatGPT doesn't appear to have given that much consideration.
Impressive. This could legitimately have been a tricky puzzle on some Easter egg hunt, even for nerds.
When I told it that was cheating, it decided to lie to me:
The original thought, along with the search results, being: You should check out our most recent announcement about Web Bot Auth
https://www.browserbase.com/blog/cloudflare-browserbase-pion...
https://developers.cloudflare.com/bots/concepts/bot/verified...
more like continued employment.
On a serious note: what the fuck is happening in the world?
[Looks around and sees people not making APIs for everything]
Well that didn't work.
That'd be neat.
But for now: The web exists, and is universal. We have programs that can render websites to an image in memory (solved for ~30 years), and other programs that can parse images of fully-rendered websites (solved for at least a few years), along with bots that can click on links (solved much more recently).
Maybe tomorrow will be different.
If you want to make something that can book every airline? Better be able to navigate a website.
That's just one obvious example, but the principle holds more generally.
It'll never happen, so companies need to deal with the reality we have.
Adoption of new technology is slow due to risk aversion; it's very rare for people to just tear up what they already have and re-implement it with new technology from the ground up. We always have to shoehorn new technology into old systems to prove it first.
There are just so many factors that get solved by working with what already exists.
I lean more toward the idea that an efficient self-driving car wouldn't even need a steering wheel, pedals, or thin pillars to help the passengers see the outside environment or be seen by pedestrians.
The way this ties back to computer-use models is that a lot of webpages have stuff designed for humans that makes it difficult for a model to navigate them well. I think this was the goal of the "semantic web".
We always make our way back to trains
Both openpilot and Tesla FSD use regular cameras (i.e., eyes) to try to understand the environment just as a human would. That is where my analogy comes from.
I could say the same about using a humanoid robot to log on to your computer and open Chrome. My point is also that we made no changes to the road network to enable FSD.
While the self-driving car industry aims to replace all humans with machines, I don't think this is the case with browser automation.
I see this technology as more similar to a crash dummy than a self-driving system. It's designed to simulate a human in very niche scenarios.
Obviously much harder with a UI than with agent events, similar to the links below.
https://docs.claude.com/en/docs/claude-code/hooks
https://google.github.io/adk-docs/callbacks/
Do you think callbacks are how this gets done?
But my bet: we will not deploy a single agent into any real environment without deterministic guarantees. Hooks are a means...
Browserbase with hooks would be really powerful: governance beyond RBAC (but of course enabling relevant guardrails as well, e.g. "does the agent have permission to access this SharePoint right now, within this context, to conduct action X?").
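As a sketch of what such a deterministic guard could look like, here is a standalone pre-action hook that reads a proposed action as JSON and allows or blocks it. The event shape and policy are invented for illustration; they are not Claude Code's or Browserbase's actual hook format.

```python
#!/usr/bin/env python3
import json
import sys

ALLOWED_DOMAINS = {"internal.example.com"}       # hypothetical allowlist
BLOCKED_ACTIONS = {"delete", "share_external"}   # hypothetical deny list

def main() -> int:
    event = json.load(sys.stdin)          # proposed action, supplied by the agent runtime
    action = event.get("action", "")
    domain = event.get("target_domain", "")
    if action in BLOCKED_ACTIONS or (domain and domain not in ALLOWED_DOMAINS):
        print(f"blocked: {action} on {domain}", file=sys.stderr)
        return 2                           # nonzero exit = deny, regardless of what the model "thinks"
    return 0                               # allow

if __name__ == "__main__":
    sys.exit(main())
```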
I would love to meet with you, actually; my shop cares intimately about agent verification and governance. We're soon to release the tool I originally designed for Claude Code hooks.
Knowing how many times Claude Code has breezed through a hook call, actually computed the hook's answer, thrown it away, and then proceeded without integrating the hook results, I think the concept of "governance" is laughable.
LLMs are so much further from determinism/governance than people seem to realize.
I've even seen earlier CC breeze through a hook that ends with a halting test failure and "DO NOT PROCEED" verbiage. The only hook that is guaranteed to work when called is a big, theoretical, dangerous Claude-killing hook.
I think screenshots are a really good and robust idea. It bothers the more structured-minded people, but apps are often not built that well; they are built to the point where they look fine and people are able to use them. I'm pretty sure people who rely on accessibility systems have lots of complaints about this.
Few things are more frustrating for a team than maintaining a slow E2E browser test suite.
Reminds me of an anecdote where Amazon invested however many person-lifetimes in building AI for Alexa, only to discover that alarms, music, and weather make up the large majority of what people actually use smart speakers for. They're making these things worse at their main jobs so they can sell the sizzle of AI to investors.
It's like Google staff are saying, "If it means promotion, we don't give a shit about users."
> I am back in the Google Sheet. I previously typed "Zip Code" in F1, but it looks like I selected cell A1 and typed "A". I need to correct that first. I'll re-type "Zip Code" in F1 and clear A1. It seems I clicked A1 (y=219, x=72) then F1 (y=219, x=469) and typed "Zip Code", but then maybe clicked A1 again.
It did not work; multiple times it just got stuck after going to Hacker News.
Unfortunately it really struggled in the demos for me. It took nearly 18 attempts to click the comment link on the HN demo, each a few pixels off.
This was long before computer vision was mature enough to do anything like that. I found out that there are instead magnetic systems that can detect cars passing over (trivial hardware and software), and I concluded that my approach was just far too complicated and expensive.
Similarly, when I look at computers, I typically want the ML/AI system to operate on structured data that is codified for computer use. But I guess the world is complicated enough, and computers have gotten fast enough, that having an AI look at a computer screen and move/click a mouse makes sense.
Indoors I tend to cast some show or YouTube video. Often enough I want to change the video or show using voice commands. I can do this for YouTube, but the results are horrible unless I know exactly which video I want to watch; for other services it's largely not possible at all.
In a perfect world Google would provide superb APIs for these integrations, and all app providers would integrate them and keep them up to date. But if we can bypass that and get good results across the board, I would find it very valuable.
I understand this is a very specific scenario, but one I would be excited about nonetheless.
Also, when biking on roads you should never count on sound to guide you; you should always use vision. For example, when making a left you have to visually establish that the driver coming straight has made eye contact with you, or at least looked at you.
Can you share an example of how you are using sound to help you ride a bike with other vehicles on the road? Are you maybe talking about honking? That you will hear over podcasts.
Audio cues are less and less useful as electric vehicles become more popular. (I am a city biker and there are plenty already.)
I don't understand why everybody's so happy to discount ears in this thread. Haven't they been vital to our survival since forever? Yes, eyes are more important in this case, but I'll take whatever sensory aid I can get on my morning commute.
If such AI tools allow automating this soul-crushing drudgery, it will be great. I know that you can technically script things with Selenium, AutoHotkey, and whatnot, but you can imagine that's a nonstarter in a regular office. This kind of tool could make things like that much more efficient. And it's not as if it will obviate the jobs entirely (at least not right away); these offices often have immense backlogs and are understaffed as is.
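For context, a bare-bones Selenium sketch of the kind of form-filling that is "technically scriptable" today; the CSV, URL, and field names are placeholders invented for illustration.

```python
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
with open("backlog.csv", newline="") as f:      # hypothetical export from the legacy system
    for row in csv.DictReader(f):
        driver.get("https://intranet.example.com/case/new")  # placeholder form URL
        driver.find_element(By.NAME, "case_id").send_keys(row["id"])
        driver.find_element(By.NAME, "summary").send_keys(row["summary"])
        driver.find_element(By.NAME, "submit").click()
driver.quit()
```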
Knowing it's technically possible is one thing, but giving it a short command and seeing it go log in to a site, scroll around, reply to posts, etc. is eerie.
Also, it tied me at Wordle today, making the same mistake I did on the second-to-last guess. Too bad you can't talk to it while it's working.
165 more comments available on Hacker News