Evaluating Agents

Posted Sep 3, 2025 at 7:32 PM EDT · Last activity 3 months ago

mfalcon · 42 points · 9 comments

Mood: calm
Sentiment: positive
Category: other
Key topics: AI Agents, Evaluation Methods, Software Development
Debate intensity: 20/100

The post discusses various approaches to evaluating AI agents, sparking a discussion among commenters on the challenges and best practices for agent evaluation.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion
First comment: 2m after posting
Peak period: 4 comments in Hour 5
Avg / period: 2.3

Key moments

  1. Story posted: Sep 3, 2025 at 7:32 PM EDT (3 months ago)
  2. First comment: Sep 3, 2025 at 7:35 PM EDT, 2m after posting
  3. Peak activity: 4 comments in Hour 5, the hottest window of the conversation
  4. Latest activity: Sep 4, 2025 at 12:03 AM EDT (3 months ago)


Discussion (9 comments)
codazoda
3 months ago
1 reply
Would love to see some examples
mfalcon (Author)
3 months ago
Good idea for a follow up post :)
yuzhun
3 months ago
1 reply
I'm a beginner user. My current agent is built using Java. I'm unsure whether to use Python to call the API for evaluation, or to introduce some evaluation tools into the Java project, such as those related to OpenTelemetry.
mfalcon (Author)
3 months ago
You can evaluate with your programming language of choice.
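A minimal sketch of that language-agnostic setup, assuming the Java agent is exposed behind a hypothetical HTTP endpoint and driven from a small Python harness (the endpoint, payload shape, and pass criterion below are illustrative, not from the thread):

```python
# Hypothetical sketch: drive a Java agent from a small Python harness over HTTP.
# The endpoint, payload shape, and pass criterion are assumptions for illustration.
import requests

CASES = [
    {"prompt": "Cancel my last order", "expect_tool": "cancel_order"},
    {"prompt": "What's the weather in Paris?", "expect_tool": "get_weather"},
]

def run_case(case):
    resp = requests.post(
        "http://localhost:8080/agent/run",   # assumed agent endpoint
        json={"input": case["prompt"]},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()
    # Score the trace rather than only the final text: did the expected tool get called?
    called_tools = [step.get("tool") for step in result.get("steps", [])]
    return case["expect_tool"] in called_tools

if __name__ == "__main__":
    passed = sum(run_case(c) for c in CASES)
    print(f"{passed}/{len(CASES)} cases passed")
```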
localbuilder
3 months ago
1 reply
> There’s one issue with this, you’ll have to be careful to keep the “N – 1” interactions updated whenever you make some changes because you will be “simulating” something that will never happen again in your agent.

This is the biggest problem I've encountered with evals for agents so far. Especially with agents that might do multiple turns of user input > perform task > more user input > perform another task > etc.

Creating evals for these flows has been difficult because I've found that mocking the conversation to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to create dynamic responses at points that require additional user input in E2E flows, which adds its own complexity and nondeterministic behavior. Both approaches are time-consuming and difficult to set up in their own ways.

mfalcon (Author)
3 months ago
Yes, and these problems are most pronounced in the first iterations, when you are still trying to get good enough agent behaviour.

I'm still thinking about good ways to mitigate this issue, will share.
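One possible mitigation, sketched loosely here: rather than freezing the mocked "N – 1" turns, keep only the scripted user side fixed and regenerate the agent's side of the prefix on every run, so the context cannot drift away from the current system (`agent.run_turn` and the asserted phrase are hypothetical placeholders):

```python
# Loose sketch: replay scripted user turns through the *current* agent instead of
# replaying frozen agent responses, so the "N - 1" prefix cannot drift.
# `agent.run_turn` and the asserted phrase are hypothetical placeholders.
SCRIPTED_USER_TURNS = [
    "I want to book a flight to Tokyo",
    "Next Friday, economy class",
    "Yes, please confirm it",
]

def eval_final_step(agent):
    history = []
    for user_turn in SCRIPTED_USER_TURNS[:-1]:
        reply = agent.run_turn(history, user_turn)      # regenerate the prefix live
        history.append({"user": user_turn, "agent": reply})

    # Only the last turn is scored; the earlier turns exist to build realistic context.
    final_reply = agent.run_turn(history, SCRIPTED_USER_TURNS[-1])
    assert "confirmed" in final_reply.lower(), final_reply
```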

CuriouslyC
3 months ago
Feed your failure traces into Gemini to get a distillate, then use DSPy to optimize the tools/prompts that are failing.
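A hedged sketch of what that loop might look like with DSPy, where distilled failure traces become a small trainset and a BootstrapFewShot optimizer tunes the failing predictor (the signature fields, metric, model string, and examples are all illustrative, not details from the comment):

```python
# Illustrative only: distilled failure traces become a trainset and a DSPy
# optimizer tunes the failing prompt. Field names, metric, model string, and
# the examples are assumptions, not details from the comment.
import dspy

dspy.configure(lm=dspy.LM("gemini/gemini-1.5-flash"))  # assumed Gemini model string

class PickTool(dspy.Signature):
    """Choose the right tool for a user request."""
    request: str = dspy.InputField()
    tool: str = dspy.OutputField()

program = dspy.ChainOfThought(PickTool)

# Failure traces distilled into (request, correct tool) pairs.
trainset = [
    dspy.Example(request="Cancel order 1234", tool="cancel_order").with_inputs("request"),
    dspy.Example(request="I never got my refund", tool="issue_refund").with_inputs("request"),
]

def tool_match(example, prediction, trace=None):
    return example.tool == prediction.tool

optimizer = dspy.BootstrapFewShot(metric=tool_match)
optimized_program = optimizer.compile(program, trainset=trainset)
```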
mfalcon (Author)
3 months ago
Hey fellow hners, OP here. Been working on agents for a while so I started sharing some things.

The idea is to keep updating this post with a few more approaches I've been using.

mailswept_dev
3 months ago
Totally agree with this — especially the part about end-to-end evals. I’ve seen too many teams rely only on manual testing and miss obvious regressions. Checkpoints + lightweight e2e evals feel like the sweet spot before things get too costly.
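As a rough illustration of that sweet spot, a lightweight e2e gate can be little more than a handful of scenarios and a pass-rate threshold wired into CI (`run_agent_e2e`, the scenarios, and the 90% threshold are made-up placeholders):

```python
# Rough sketch of a lightweight end-to-end regression gate for CI.
# `run_agent_e2e`, the scenarios, and the 90% threshold are placeholders.
SCENARIOS = [
    ("book a one-way flight to Tokyo", "booking confirmed"),
    ("request a refund for order 42", "refund issued"),
    ("change my delivery address", "address updated"),
]

def run_suite(run_agent_e2e):
    results = []
    for prompt, expected_phrase in SCENARIOS:
        output = run_agent_e2e(prompt)                  # full multi-step agent run
        results.append(expected_phrase in output.lower())
    pass_rate = sum(results) / len(results)
    # Catch obvious regressions cheaply, before the heavier eval suites run.
    assert pass_rate >= 0.9, f"e2e pass rate dropped to {pass_rate:.0%}"
    return pass_rate
```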
View full discussion on Hacker News
ID: 45121547 · Type: story · Last synced: 11/20/2025, 6:12:35 PM
