Not Hacker News!
Home
Hiring
Products
Discussion
Q&A
Users
AI Benchmarking
17 stories • 24h: 0% • 7d: 0 • 250 comments
Top contributors:
mustaphah
blndrt
tosh
luciesim
codelensai
17 stories tagged with AI benchmarking
Top Model Scores May Be Skewed by Git History Leaks in SWE-bench
466
153 comments
by mustaphah
Posted
4 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
software engineering
Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%
197
65 comments
by blndrt
Posted
4 months ago
Active
about 1 month ago
LLM optimization
prompt engineering
AI benchmarking
SWE-Bench Pro
101
28 comments
by tosh
Posted
3 months ago
Active
about 1 month ago
AI benchmarking
software development
machine learning
Tau² Benchmark in Action: Early Results and Key Takeaways
16
0 comments
by luciesim
Posted
4 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
AI agent testing
Benchmark AI on Your Actual Code (GPT-5, Claude, Grok, Gemini, o3)
7
0 comments
by codelensai
Posted
3 months ago
Active
about 1 month ago
AI benchmarking
Large Language Models
software development
Context-Bench: Benchmarking LLMs on Agentic Context Engineering
5
0 comments
by janpio
Posted
2 months ago
Active
about 1 month ago
Large Language Models
AI benchmarking
context engineering
Epoch Capabilities Index Aggregates AI Benchmark Scores Into One Metric
4
0 comments
by finder83
Posted
2 months ago
Active
about 1 month ago
AI benchmarking
machine learning
Epoch AI
Flashinfer Bench: a Benchmark Suite for AI Systems That Improve Themselves
4
0 comments
by yiyan
Posted
2 months ago
Active
about 1 month ago
AI benchmarking
self-improving AI systems
machine learning
Measuring What Matters: Construct Validity in Large Language Model Benchmarks
3
2 comments
by Cynddl
Posted
2 months ago
Active
about 1 month ago
AI benchmarking
large language models
scientific rigor
Upbench: Dynamically Evolving Real-World Labor-Market Agentic Benchmark [pdf]
2
1 comment
by pablomendes
Posted
about 2 months ago
Active
about 1 month ago
AI benchmarking
LLM evaluation
labor market
Claude Haiku 4.5 vs. GLM-4.6 vs. GPT-5 Mini: Job Queue System Benchmark
2
0 comments
by heymax054
Posted
2 months ago
Active
about 1 month ago
AI benchmarking
LLM comparison
job queue system
Gemini 2.5 Pro Still Tops Text and Vision Benchmarks
2
0 comments
by robertwt7
Posted
2 months ago
Active
about 1 month ago
AI benchmarking
Gemini 2.5 Pro
machine learning
AI Agent Benchmark Compendium
2
0 comments
by nkko
Posted
3 months ago
Active
about 1 month ago
AI benchmarking
machine learning
artificial intelligence
Terminal-Bench 2.0 and Harbor
1
1 comment
by falcor84
Posted
about 2 months ago
Active
about 1 month ago
AI benchmarking
Terminal-Bench
Harbor tool
IMO-Bench – Towards Robust Mathematical Reasoning
1
0 comments
by stared
Posted
about 2 months ago
Active
about 1 month ago
mathematical reasoning
AI benchmarking
robustness evaluation
SEAL Showdown Technical Report (AI Benchmark) [pdf]
1
0 comments
by freeqaz
Posted
3 months ago
Active
about 1 month ago
AI benchmarking
technical report
performance evaluation
MLPerf Inference v5.1 Results Land with New Benchmarks and Record Participation
1
0 comments
by rbanffy
Posted
4 months ago
Active
about 1 month ago
MLPerf
AI Benchmarking
Machine Learning
AI Benchmarking | Trending Topic on Hacker News | Not Hacker News!