Not Hacker News Logo

Not

Hacker

News!

Home
Hiring
Products
Companies
Discussion
Q&A
Users
AI-observed conversations & context

Daily AI-observed summaries, trends, and audience signals pulled from Hacker News so you can see the conversation before it hits your feed.

Live Beta

Explore

  • Home
  • Hiring
  • Products
  • Companies
  • Discussion
  • Q&A

Resources

  • Visit Hacker News
  • HN API
  • Modal cronjobs
  • Meta Llama

Briefings

Inbox recaps on the loudest debates & under-the-radar launches.

Connect

© 2025 Not Hacker News! — independent Hacker News companion.

Not affiliated with Hacker News or Y Combinator. We simply enrich the public API with analytics.

Posted Nov 25, 2025 at 11:42 AM EST (1d ago)

Agent Runner – Open-Source Agent Harness to Benchmark Real Coding

grace77
1 point
0 comments

Mood: informative
Sentiment: positive
Category: startup_launch
Key topics: Benchmarking, Coding Agents, Open-Source, AI, LLM

Hey HN! We built Agent Runner, a model-agnostic, open-source agent harness that executes the same prompt against two anonymized coding agents in parallel sandboxes. Each agent can make tool calls, edit multiple files, and self-correct through iterative reasoning. You pick the better result - this becomes the ground truth for the leaderboard.
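The post does not say how pairwise picks are turned into leaderboard rankings. As one possible illustration only, here is a minimal sketch that assumes an Elo-style update; the constant `K`, the starting rating of 1000, and the model names are all hypothetical and not taken from Agent Runner.

```python
from collections import defaultdict

K = 32  # assumed update step; not from the post


def expected_score(r_a, r_b):
    """Elo-style probability that the model rated r_a is preferred over r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_ratings(ratings, winner, loser, k=K):
    """Apply one vote in place: the winner gains what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def build_leaderboard(votes, initial=1000.0):
    """votes: iterable of (winner, loser) pairs. Returns models sorted by rating."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        update_ratings(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
leaderboard = build_leaderboard(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Other aggregation schemes (plain win rate, Bradley-Terry fits) would fit the same vote format; Elo is used here only because it is short to write down.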

Why we built it

Traditional benchmarks often fall short for modern agentic systems: they rely on static tasks and only measure final outputs. But real coding agents modify multiple files across a repo, respond to user re-prompts, use tool calls, and recover from partial failures.

What Agent Runner does

- You ask it to build anything
- Agent Runner kicks off two generations from different sandboxed LLM providers (OpenAI, Anthropic, Google, xAI, Mistral, Kimi, and more)
- Anonymized models make tool calls, perform multi-file edits, and respond to re-prompts
- You pick your favorite, and this preference powers the benchmark
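The flow above (same prompt, two agents in parallel, identities hidden until after the pick) can be sketched roughly as follows. This is not Agent Runner's code: the agents are hypothetical stand-in functions, real sandboxing and tool calls are omitted, and `run_blind_pair` and `record_vote` are names invented for this illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_blind_pair(prompt, agents, rng=None):
    """Run two randomly chosen agents on the same prompt in parallel.

    agents: dict mapping model name -> callable(prompt) -> str (stand-ins).
    Returns (outputs, key): outputs are labeled "A"/"B" for the voter,
    while key maps those labels back to model names and stays hidden.
    """
    rng = rng or random.Random()
    name_a, name_b = rng.sample(sorted(agents), 2)
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(agents[name_a], prompt)
        fut_b = pool.submit(agents[name_b], prompt)
        outputs = {"A": fut_a.result(), "B": fut_b.result()}
    key = {"A": name_a, "B": name_b}
    return outputs, key


def record_vote(key, preferred_label):
    """De-anonymize a vote into a (winner, loser) pair of model names."""
    other = "B" if preferred_label == "A" else "A"
    return key[preferred_label], key[other]


# Hypothetical stand-in agents; a real harness would call provider APIs in sandboxes.
def stub_agent_one(prompt):
    return f"[one] plan for: {prompt}"


def stub_agent_two(prompt):
    return f"[two] plan for: {prompt}"


agents = {"stub-one": stub_agent_one, "stub-two": stub_agent_two}
outputs, key = run_blind_pair("create a todo app", agents, rng=random.Random(0))
winner, loser = record_vote(key, "A")
```

Shuffling which model is shown as "A" on each run keeps position from leaking identity, which is the point of anonymizing the pair before the user votes.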

Because different providers handle tool calls, prompts, and execution semantics differently, we worked with each provider to ensure configurations reflect intended behavior. These provider-specific setups remain private, but Agent Runner itself is open-source.

How to try it

- Kick off Agent Runner at https://www.designarena.ai/agentarena
- Repo at https://github.com/Design-Arena/agent-runner
- Use it as a CLI tool: https://pypi.org/project/agent-runner/

  pip install agent-runner
  agentrunner run "create a nextjs replica of Discord"

We hope this provides a provider-agnostic, framework-agnostic, realistic benchmark for state-of-the-art coding agents.

Video demo: https://youtu.be/rdtiuCHatjs


ID: 46047637 | Type: story | Last synced: 11/25/2025, 4:44:07 PM
