Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems
Key topics
The paper proposes a new approach to testing LLM agents: behaviour-driven evaluations that move beyond traditional benchmarks. The community is generally supportive and enthusiastic about the work, with some suggestions around open-sourcing and further exploration.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
- First comment: 29m after posting
- Peak period: 7 comments in 0-1h
- Avg / period: 2.2
Based on 11 loaded comments
Key moments
1. Story posted: Nov 4, 2025 at 12:11 PM EST (2 months ago)
2. First comment: Nov 4, 2025 at 12:40 PM EST (29m after posting)
3. Peak activity: 7 comments in 0-1h (the hottest window of the conversation)
4. Latest activity: Nov 5, 2025 at 3:28 AM EST (2 months ago)