Not Hacker News!
Home
Hiring
Products
Discussion
Q&A
Users
LLM Evaluation
20 stories • 24h: 0% • 7d: 0 • 264 comments
Top contributors:
ibobev
mustaphah
fertrevino
PranoyP
luciesim
20 stories tagged with llm evaluation
Top Model Scores May Be Skewed by Git History Leaks in SWE-bench
466
153 comments
by mustaphah
Posted
4 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
software engineering
From GPT-4 to GPT-5: Measuring Progress Through MedHELM [pdf]
127
96 comments
by fertrevino
Posted
4 months ago
GPT-5
LLM evaluation
healthcare AI
Testing LLM Agents Like Software – Behaviour Driven Evals of AI Systems
21
11 comments
by PranoyP
Posted
about 2 months ago
Active
about 1 month ago
LLM evaluation
behaviour driven testing
AI systems
Tau² Benchmark in Action: Early Results and Key Takeaways
16
0 comments
by luciesim
Posted
4 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
AI agent testing
Scorecard – Evaluate LLMs Like Waymo Simulates Cars
7
0 comments
by Rutledge
Posted
3 months ago
Active
about 1 month ago
LLM evaluation
AI testing
software development
LLM Evaluation From Scratch: Multiple Choice, Verifiers, Leaderboards, LLM Judge
4
0 comments
by ModelForge
Posted
3 months ago
Active
about 1 month ago
LLM Evaluation
AI Research
Machine Learning
UpBench: Dynamically Evolving Real-World Labor-Market Agentic Benchmark [pdf]
2
1 comment
by pablomendes
Posted
about 2 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
labor market
A/B Tests Over Evals
2
0 comments
by Nischalj10
Posted
about 2 months ago
Active
about 1 month ago
A/B testing
LLM evaluation
software development
We Tested 6 AI Models on 3 Common Security Exploits
2
1 comment
by heymax054
Posted
about 2 months ago
Active
about 1 month ago
AI security
LLM evaluation
model bias
Understanding the 4 Main Approaches to LLM Evaluation (from Scratch)
2
0 comments
by ibobev
Posted
3 months ago
Active
about 1 month ago
LLM Evaluation
AI
Machine Learning
A Long-Tail Professional Forum-Based Benchmark for LLM Evaluation
1
0 comments
by wslh
Posted
about 1 month ago
Active
about 1 month ago
llm evaluation
natural language processing
benchmarking
EnvTrace: Simulation-Based Semantic Evaluation of LLM Code
1
0 comments
by amscotti
Posted
about 2 months ago
Active
about 1 month ago
LLM evaluation
code simulation
AI research
Structeval - A Structured Output Evaluation and Comparison Tool
1
0 comments
by jwesleyharding
Posted
about 2 months ago
Active
about 1 month ago
LLM evaluation
structured output comparison
CLI tools
Open Prompt Pack for Testing AI Visibility Stability Across Assistants (v0.1
1
1 comment
by businessmate
Posted
2 months ago
Active
about 1 month ago
AI testing
LLM evaluation
AI stability
The Backbone Breaker Benchmark: Testing the Real Security of AI Agents
1
0 comments
by crescit_eundo
Posted
2 months ago
Active
about 1 month ago
AI security
benchmarking
LLM evaluation
Understanding the 4 Main Approaches to LLM Evaluation (from Scratch)
1
0 comments
by ibobev
Posted
3 months ago
Active
about 1 month ago
LLM evaluation
Artificial Intelligence
machine learning
New Paper: A Single Character Can Make or Break Your LLM Evals
1
1 comment
by mark_yellow
Posted
3 months ago
Active
about 1 month ago
LLM evaluation
AI research
natural language processing
Claude Sonnet 4 vs. 4.5: A Real-World Comparison
1
0 comments
by tonyspiro
Posted
3 months ago
Active
about 1 month ago
AI comparison
Claude Sonnet
LLM evaluation
The Self-Betrayal Heuristic (SBH)
1
0 comments
by dgeep
Posted
4 months ago
Active
about 1 month ago
AI alignment
AI safety
LLM evaluation
LLM-Eval-Simple a Simple Way to Evaluate LLM for Your Use Case
1
0 comments
by grigio
Posted
4 months ago
Active
about 1 month ago
LLM evaluation
artificial intelligence
machine learning
natural language processing