Evaluating Agents
Mood: calm
Sentiment: positive
Category: other
Key topics: The post surveys approaches to evaluating AI agents; commenters discuss the challenges and best practices of agent evaluation.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 2m after posting
- Peak period: 4 comments in Hour 5
- Avg / period: 2.3
Key moments
- Story posted: Sep 3, 2025 at 7:32 PM EDT (3 months ago)
- First comment: Sep 3, 2025 at 7:35 PM EDT, 2m after posting
- Peak activity: 4 comments in Hour 5, the hottest window of the conversation
- Latest activity: Sep 4, 2025 at 12:03 AM EDT (3 months ago)
This is the biggest problem I've encountered with evals for agents so far, especially with agents that run multiple turns of user input > perform task > more user input > perform another task > etc.
Creating evals for these flows has been difficult. Mocking the conversation up to a certain point runs into the drift problem you highlighted as the system changes. I've also explored using an LLM to generate dynamic responses at the points that require additional user input in E2E flows, which adds its own complexity and nondeterministic behavior. Both approaches are time consuming and difficult to set up in their own ways.
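To make the second approach concrete, here is a minimal sketch of an E2E eval loop where an LLM plays the user. Everything in it is an assumption for illustration: `run_agent` (the agent under test), `call_llm` (a chat-completion client), and the persona prompt are hypothetical stand-ins, not any specific library's API.

```python
# Hypothetical sketch: an LLM simulates the user so a multi-turn agent
# can be evaluated end to end. run_agent and call_llm are placeholders
# for the agent under test and a chat-completion client.

USER_PERSONA = (
    "You are simulating a user who wants to book a flight to Boston. "
    "Answer the agent's questions tersely. Reply DONE once your goal is met."
)

def run_agent(transcript: list[dict]) -> str:
    """Agent under test; returns its next reply given the transcript."""
    raise NotImplementedError  # wire up the real agent here

def call_llm(system_prompt: str, transcript: list[dict]) -> str:
    """Model client that generates the simulated user's next turn."""
    raise NotImplementedError  # wire up a real chat-completion call here

def run_e2e_eval(max_turns: int = 8) -> list[dict]:
    transcript = [{"role": "user", "content": "I need a flight to Boston."}]
    for _ in range(max_turns):
        reply = run_agent(transcript)
        transcript.append({"role": "assistant", "content": reply})
        user_turn = call_llm(USER_PERSONA, transcript)
        if "DONE" in user_turn:
            break  # simulated user's goal is met
        transcript.append({"role": "user", "content": user_turn})
    # Score the finished transcript with assertions or an LLM judge.
    return transcript
```

Pinning the responder's temperature or seed narrows, but does not eliminate, the nondeterminism the comment mentions.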
I'm still thinking about good ways to mitigate this issue, will share.
The idea is to keep updating this post with a few more approaches I've been using.