Why Windows XP Is the Ultimate AI Benchmark
cuabench.ai · Tech Discussion · story
Key topics: AI Performance Analysis, Windows XP, FreeBSD
Posted: Dec 16, 2025 at 11:00 AM EST · Latest activity: Dec 16, 2025 at 11:37 AM EST · 3 comments within the first hour
The pattern we kept seeing: same agent, same task, different OS theme = notably different results.
Claude Sonnet 4 scores 31.9% on OSWorld and Windows Agent Arena (two of the most relevant benchmarks for computer-use agents), but with massive variance. An agent trained on Windows 11 light mode fails on dark mode. It works on macOS Ventura but breaks on Monterey. It works on Win11 but collapses on Vista.
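To make that variance claim concrete, here is a minimal sketch of the kind of per-theme sweep we mean. Everything in it (the agent stub, theme names, trial counts) is a hypothetical placeholder rather than part of OSWorld, Windows Agent Arena, or cua-bench.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for an agent rollout: returns True if the agent
# completed the task under the given visual theme. Replace with a real
# harness call; the random placeholder just keeps the sketch runnable.
def run_agent_on_task(agent, task_id: str, theme: str) -> bool:
    return random.random() < 0.3

THEMES = ["win11-light", "win11-dark", "macos-ventura", "macos-monterey", "vista"]

def per_theme_success(agent, task_ids, n_trials=5):
    """Run the same tasks under every theme and report the success rate per theme."""
    results = defaultdict(list)
    for theme in THEMES:
        for task_id in task_ids:
            for _ in range(n_trials):
                results[theme].append(run_agent_on_task(agent, task_id, theme))
    # An agent that truly generalizes should score similarly across themes;
    # large gaps between these rates are the variance described above.
    return {theme: sum(r) / len(r) for theme, r in results.items()}

print(per_theme_success(agent=None, task_ids=["change-wallpaper", "install-vpn"]))
```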
The root cause: training data lacks visual diversity. Current benchmarks (OSWorld, Windows Agent Arena) rely on static VM snapshots with fixed configurations. They don't capture the reality of diverse OS themes, window layouts, resolution differences, or desktop clutter.
We built cua-bench — HTML-based simulated environments that render across 10+ OS themes (macOS, Win11, WinXP, Win98, Vista, iOS, Android). Define a task once, generate thousands of visual variations.
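As a rough illustration of "define a task once, generate thousands of visual variations", the sketch below expands one abstract task into concrete theme/resolution combinations. The `Task` dataclass, its field names, and the `variations` helper are made up for this example and are not the actual cua-bench API.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Task:
    name: str
    goal: str            # natural-language instruction given to the agent
    html_template: str   # theme-agnostic HTML describing the app state

THEMES = ["macos", "win11", "winxp", "win98", "vista", "ios", "android"]
RESOLUTIONS = [(1920, 1080), (1366, 768), (1024, 768)]

def variations(task: Task):
    """Expand one abstract task into many concrete visual environments."""
    for theme, (w, h) in product(THEMES, RESOLUTIONS):
        yield {
            "task": task.name,
            "goal": task.goal,
            "html": task.html_template,   # rendered later with the theme's CSS
            "theme": theme,
            "viewport": {"width": w, "height": h},
        }

task = Task(
    name="change-wallpaper",
    goal="Open display settings and switch the wallpaper to the second preset.",
    html_template="<settings-app>...</settings-app>",
)
print(sum(1 for _ in variations(task)))  # 7 themes x 3 resolutions = 21 variants
```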
This enables:
- Oracle trajectory generation via a Playwright-like API (verified ground truth for training)
- Trajectory replotting: record 1 demo → re-render across 10 OS themes = 10 training trajectories
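And here is a hedged sketch of what oracle trajectory generation and replotting could look like, using plain Playwright against a themed HTML page. The URL, selectors, theme query parameter, and step format are invented for illustration; the real cua-bench interface will differ.

```python
from playwright.sync_api import sync_playwright

# Ground-truth action script for one task, written once against stable selectors.
ORACLE_STEPS = [
    ("click", "#start-menu"),
    ("click", "text=Display settings"),
    ("click", "#wallpaper-preset-2"),
]

def record_trajectory(task_url: str, theme: str, out_dir: str):
    """Replay the oracle steps in one theme, capturing a (screenshot, action) pair per step."""
    trajectory = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(f"{task_url}?theme={theme}")           # same task, different skin
        for i, (action, selector) in enumerate(ORACLE_STEPS):
            shot = f"{out_dir}/{theme}-step{i}.png"
            page.screenshot(path=shot)                    # observation before acting
            if action == "click":
                page.click(selector)
            trajectory.append({"observation": shot, "action": (action, selector)})
        browser.close()
    return trajectory

# Replotting: one demo script, re-rendered under many themes = many training trajectories.
# trajectories = [record_trajectory("file:///tmp/display-settings.html", t, "/tmp/out")
#                 for t in ["win11", "winxp", "macos"]]
```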
The technical report covers our approach to trajectory generation, Android/iOS environments, cross-platform HTML snapshots, and a comparison with existing benchmarks.
We’re currently working with research labs on training data generation and benchmarks, but we’d really value input from the HN community:
- What tasks or OS environments should be standardized to actually stress computer-use agents?
- Legacy OSes? Weird resolutions? Broken themes? Cluttered desktops? Modal hell?
Curious what people here think are the real failure modes we should be benchmarking.
A common use case I run into: I want to be able to configure corporate VPN software on Windows machines. Is there a link to a getting-started guide I could try this out with?
If you need a real Windows OS + corporate VPN, we also support binding agents to actual Windows sandboxes. This example shows automating a Windows app behind a VPN: https://cua.ai/docs/example-usecases/windows-app-behind-vpn
You'll need to define a new task in the cua-bench registry first, though. Just sign up on the website for early access!