Agent Runner – Open-Source Agent Harness to Benchmark Real Coding
Why we built it
Traditional benchmarks often fall short for modern agentic systems: they rely on static tasks and only measure final outputs. But real coding agents modify multiple files across a repo, respond to user re-prompts, make tool calls, and recover from partial failures.
What Agent Runner does
You ask it to build anything. Agent Runner kicks off two generations from different sandboxed LLM providers (OpenAI, Anthropic, Google, xAI, Mistral, Kimi, and more). The anonymized models make tool calls, perform multi-file edits, and respond to re-prompts. You pick your favorite, and that preference powers the benchmark.
Because different providers handle tool calls, prompts, and execution semantics differently, we worked with each provider to ensure configurations reflect intended behavior. These provider-specific setups remain private, but Agent Runner itself is open-source.
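To make the pairwise flow concrete, here is a minimal sketch of how an anonymized head-to-head round could be structured. The provider list, the run_generation stub, and the preference bookkeeping are illustrative assumptions for this sketch, not the real agent-runner API.

import random

# Hypothetical provider identifiers; the real harness supports more and keeps
# its provider-specific configurations private.
PROVIDERS = ["openai", "anthropic", "google", "xai", "mistral", "kimi"]

def run_generation(provider: str, prompt: str) -> str:
    """Stand-in for a sandboxed agent run: in the real harness this would
    involve tool calls and multi-file edits inside an isolated environment."""
    return f"[{provider} output for: {prompt!r}]"

def pairwise_round(prompt: str, preferences: dict[tuple[str, str], int]) -> None:
    """Run one anonymized head-to-head round and record which side won."""
    a, b = random.sample(PROVIDERS, 2)
    outputs = {"Model A": run_generation(a, prompt), "Model B": run_generation(b, prompt)}
    for label, out in outputs.items():
        print(f"{label}: {out}")
    # In the real flow the user picks a favorite; here the vote is simulated.
    choice = random.choice(["Model A", "Model B"])
    winner, loser = (a, b) if choice == "Model A" else (b, a)
    preferences[(winner, loser)] = preferences.get((winner, loser), 0) + 1

if __name__ == "__main__":
    prefs: dict[tuple[str, str], int] = {}
    pairwise_round("create a nextjs replica of Discord", prefs)
    print(prefs)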
How to try it
Kick off Agent Runner at https://www.designarena.ai/agentarena
Repo at https://github.com/Design-Arena/agent-runner
Use it as a CLI tool (https://pypi.org/project/agent-runner/):
  pip install agent-runner
  agentrunner run "create a nextjs replica of Discord"
We hope this provides a provider-agnostic, framework-agnostic, realistic benchmark for state-of-the-art coding agents.
Video demo: https://youtu.be/rdtiuCHatjs