Not Hacker News!

AI-observed conversations & context

Daily AI-observed summaries, trends, and audience signals pulled from Hacker News so you can see the conversation before it hits your feed.


© 2026 Not Hacker News! — independent Hacker News companion.

Not affiliated with Hacker News or Y Combinator. We simply enrich the public API with analytics.


LLM Evaluation

20 stories • 24h: 0% • 7d: 0 • 264 comments

Top contributors: ibobev, mustaphah, fertrevino, PranoyP, luciesim

Related Stories

20 stories tagged with LLM evaluation

Top Model Scores May Be Skewed by Git History Leaks in SWE-bench

466 points • 153 comments • by mustaphah

Posted 4 months ago • Active about 1 month ago

From GPT-4 to GPT-5: Measuring Progress Through MedHELM [pdf]

127 points • 96 comments • by fertrevino

Posted 4 months ago

Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems

21 points • 11 comments • by PranoyP

Posted about 2 months ago • Active about 1 month ago

Tau² Benchmark in Action: Early Results and Key Takeaways

16 points • 0 comments • by luciesim

Posted 4 months ago • Active about 1 month ago

Scorecard – Evaluate LLMs Like Waymo Simulates Cars

7 points • 0 comments • by Rutledge

Posted 3 months ago • Active about 1 month ago

LLM Evaluation From Scratch: Multiple Choice, Verifiers, Leaderboards, LLM Judge

4 points • 0 comments • by ModelForge

Posted 3 months ago • Active about 1 month ago

Upbench: Dynamically Evolving Real-World Labor-Market Agentic Benchmark [pdf]

2 points • 1 comment • by pablomendes

Posted about 2 months ago • Active about 1 month ago

A/B Tests Over Evals

2 points • 0 comments • by Nischalj10

Posted about 2 months ago • Active about 1 month ago

We Tested 6 AI Models on 3 Common Security Exploits

2 points • 1 comment • by heymax054

Posted about 2 months ago • Active about 1 month ago

Understanding the 4 Main Approaches to LLM Evaluation (from Scratch)

2 points • 0 comments • by ibobev

Posted 3 months ago • Active about 1 month ago

A Long-Tail Professional Forum-Based Benchmark for LLM Evaluation

1 point • 0 comments • by wslh

Posted about 1 month ago • Active about 1 month ago

Envtrace: Simulation-Based Semantic Evaluation of LLM Code

1 point • 0 comments • by amscotti

Posted about 2 months ago • Active about 1 month ago

Structeval - a Structured Output Evaluation and Comparison Tool

1 point • 0 comments • by jwesleyharding

Posted about 2 months ago • Active about 1 month ago

Open Prompt Pack for Testing AI Visibility Stability Across Assistants (v0.1

1 point • 1 comment • by businessmate

Posted 2 months ago • Active about 1 month ago

The Backbone Breaker Benchmark: Testing the Real Security of AI Agents

1 point • 0 comments • by crescit_eundo

Posted 2 months ago • Active about 1 month ago

Understanding the 4 Main Approaches to LLM Evaluation (from Scratch)

1 point • 0 comments • by ibobev

Posted 3 months ago • Active about 1 month ago

New Paper: a Single Character Can Make or Break Your LLM Evals

1 point • 1 comment • by mark_yellow

Posted 3 months ago • Active about 1 month ago

Claude Sonnet 4 Vs. 4.5: a Real-World Comparison

1 point • 0 comments • by tonyspiro

Posted 3 months ago • Active about 1 month ago

The Self-Betrayal Heuristic (SBH)

1 point • 0 comments • by dgeep

Posted 4 months ago • Active about 1 month ago

LLM-Eval-Simple a Simple Way to Evaluate LLM for Your Use Case

1 point • 0 comments • by grigio

Posted 4 months ago • Active about 1 month ago
