Evaluating Agents
Mood: calm
Sentiment: positive
Category: other
Key topics: The post surveys approaches to evaluating AI agents; commenters discuss the challenges and best practices of agent evaluation.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 2m after posting
- Peak period: 4 comments in Hour 5
- Avg / period: 2.3
Key moments
- Story posted: Sep 3, 2025 at 7:32 PM EDT (3 months ago)
- First comment: Sep 3, 2025 at 7:35 PM EDT, 2m after posting
- Peak activity: 4 comments in Hour 5, the hottest window of the conversation
- Latest activity: Sep 4, 2025 at 12:03 AM EDT (3 months ago)
This is the biggest problem I've encountered with evals for agents so far, especially with agents that run multiple turns of user input > perform task > more user input > perform another task > etc.
Creating evals for these flows has been difficult. Mocking the conversation up to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to generate dynamic responses at the points that require additional user input in E2E flows, which adds its own complexity and nondeterministic behavior. Both approaches are time consuming and difficult to set up in their own ways.
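To make the second approach concrete, here is a minimal sketch of an E2E eval loop where an LLM plays the user. Everything in it is an assumption for illustration: `run_agent` (the agent under test), `call_llm` (a chat-completion client), and the persona prompt are hypothetical stand-ins, not any specific library's API.

```python
# Hypothetical sketch: an LLM simulates the user so a multi-turn agent
# can be evaluated end to end. run_agent and call_llm are placeholders
# for the agent under test and a chat-completion client.

USER_PERSONA = (
    "You are simulating a user who wants to book a flight to Boston. "
    "Answer the agent's questions tersely. Reply DONE once your goal is met."
)

def run_agent(transcript: list[dict]) -> str:
    """Agent under test; returns its next reply given the transcript."""
    raise NotImplementedError  # wire up the real agent here

def call_llm(system_prompt: str, transcript: list[dict]) -> str:
    """Model client that generates the simulated user's next turn."""
    raise NotImplementedError  # wire up a real chat-completion call here

def run_e2e_eval(max_turns: int = 8) -> list[dict]:
    transcript = [{"role": "user", "content": "I need a flight to Boston."}]
    for _ in range(max_turns):
        reply = run_agent(transcript)
        transcript.append({"role": "assistant", "content": reply})
        user_turn = call_llm(USER_PERSONA, transcript)
        if "DONE" in user_turn:
            break  # simulated user's goal is met
        transcript.append({"role": "user", "content": user_turn})
    # Score the finished transcript with assertions or an LLM judge.
    return transcript
```

Pinning the responder's temperature or seed narrows, but does not eliminate, the nondeterminism the comment mentions.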
I'm still thinking about good ways to mitigate this issue, will share.
The idea is to keep updating this post with a few more approaches I've been using.