Not

Hacker

News!

AI-observed conversations & context

Daily AI-observed summaries, trends, and audience signals pulled from Hacker News so you can see the conversation before it hits your feed.

© 2026 Not Hacker News! — independent Hacker News companion.

Not affiliated with Hacker News or Y Combinator. We simply enrich the public API with analytics.

AI Evaluation

20 stories • 24h: 0% • 7d: 0 • 212 comments

Top contributors: pseudolus, jxmorris12, zlatkov, mpavlov, capybarahi

Related Stories

20 stories tagged with AI evaluation

Study Identifies Weaknesses in How AI Systems Are Evaluated

416 points • 192 comments • by pseudolus

Posted about 2 months ago • Active about 1 month ago

Evals in 2025: Going Beyond Simple Benchmarks to Build Models People Can Use

80 points • 8 comments • by jxmorris12

Posted 4 months ago • Active about 1 month ago

Deep Dive Into G-Eval: How LLMs Evaluate Themselves

11 points • 6 comments • by zlatkov

Posted 2 months ago • Active about 1 month ago

Why Alpha Arena Was a Bad Benchmark

6 points • 0 comments • by mpavlov

Posted 2 months ago • Active about 1 month ago

Why Your AI Evals Keep Breaking

6 points • 1 comment • by capybarahi

Posted 2 months ago • Active about 1 month ago

To Solve the Benchmark Crisis, Evals Must Think

6 points • 0 comments • by hsikka

Posted 2 months ago • Active about 1 month ago

Verse AI – Catch the AI Failures Your Evals Miss

5 points • 0 comments • by 4thabang

Posted about 2 months ago • Active about 1 month ago

New Eval From SWE-bench Team Evaluates LMs Based on Goals Not Tickets

5 points • 1 comment • by lieret

Posted 2 months ago • Active about 1 month ago

Codelens.ai – Community Benchmark Comparing 6 LLMs on Real Code Tasks

5 points • 0 comments • by skrid

Posted 3 months ago • Active about 1 month ago

Gaia2 and ARE: Empowering the Community to Evaluate Agents

5 points • 1 comment • by mortimerp9

Posted 3 months ago • Active about 1 month ago

Emotional Intelligence Leaderboard for LLMs

5 points • 0 comments • by surprisetalk

Posted 4 months ago • Active about 1 month ago

We Built Convolytic Because Nobody Knows If Their Voice AI Works

3 points • 2 comments • by argamd

Posted about 2 months ago • Active about 1 month ago

Evaluating LLM-Generated Detection Rules in Cybersecurity

3 points • 0 comments • by ianthiel

Posted 3 months ago • Active about 1 month ago

Are Large Language Models Worth It?

2 points • 0 comments • by vinhnx

Posted about 2 months ago • Active about 1 month ago

Are Large Language Models Worth It?

2 points • 0 comments • by freediver

Posted about 2 months ago • Active about 1 month ago

Agci Benchmark: Evaluating Long-Term and Adaptive Intelligence in AI Systems

2 points • 0 comments • by tempinst5

Posted about 2 months ago • Active about 1 month ago

Multi-Domain Rubrics Requiring Professional Knowledge to Answer and Judge

2 points • 0 comments • by PaulHoule

Posted 2 months ago • Active about 1 month ago

LLMs Often Know When They're Being Evaluated

2 points • 0 comments • by lawrenceyan

Posted 2 months ago • Active about 1 month ago

Agentic AI: Why Evaluation Is the Make-or-Break Factor

2 points • 0 comments • by paperplaneflyr

Posted 3 months ago • Active about 1 month ago

Thoughts on Evals

2 points • 1 comment • by chw9e

Posted 3 months ago • Active about 1 month ago
