AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds
Posted 2 months ago · Active 2 months ago
gizmodo.com · Tech · story
skeptical / mixed · Debate · 70/100
Key topics
AI
Benchmarking
LLMs
A study found that AI capabilities may be overhyped due to bogus benchmarks, sparking debate among HN commenters about the validity of current AI evaluation methods and the potential for gaming or cheating.
Snapshot generated from the HN discussion
Discussion Activity
- Engagement: moderate
- First comment: 4h after posting
- Peak period: 6 comments in the 14-15h window
- Avg per period: 1.9 comments
- Comment distribution: 17 data points (based on 17 loaded comments; chart not included)
Key moments
1. Story posted: Nov 7, 2025 at 5:55 PM EST (2 months ago)
2. First comment: Nov 7, 2025 at 9:43 PM EST (4h after posting)
3. Peak activity: 6 comments in the 14-15h window, the hottest stretch of the conversation
4. Latest activity: Nov 8, 2025 at 10:30 AM EST (2 months ago)
ID: 45852240 · Type: story · Last synced: 11/20/2025, 12:47:39 PM
They agree that the benchmarks show LLMs can solve such questions and that models are getting better. But their main point is that this does not prove the model is reasoning.
But so what? It may not reason the way humans do, but it is pretty damn close. The mechanics are the same: recursively feed the output back in as a new prompt until it terminates in one that generates the answer (a minimal sketch follows the list below).
They don't like that this indicates the model "reasons through" the problem, but that's just semantics at this point. For me, and for most others, getting the final answer is what matters, and the model largely accomplishes that task.
I don't buy that the model can't reason through a problem. Have you ever asked a model for its explanation? It genuinely explains how it got to the solution. At this point, who the hell cares what "reasoning" means if it:
1. Gets me the right answer
2. Reasonably explains how it did it
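To make that mechanic concrete, here is a minimal sketch of the loop described above: keep feeding the growing transcript back into the model until it emits a final answer. The `complete()` function is a hypothetical stand-in for any real LLM API, and the "Final answer:" marker is an assumed convention for illustration, not anything the thread specifies.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion API call."""
    raise NotImplementedError("swap in a real LLM API call here")

def solve(question: str, max_steps: int = 10) -> str:
    # Start the transcript with the question and a reasoning cue.
    transcript = f"Question: {question}\nLet's think step by step.\n"
    for _ in range(max_steps):
        step = complete(transcript)   # model extends its own transcript
        transcript += step
        if "Final answer:" in step:   # the recursion terminates in an answer
            return step.split("Final answer:", 1)[1].strip()
    return "no final answer produced"
```

Whether you call that loop "reasoning" or not is exactly the semantic dispute in this thread; the code is the same either way.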
Why do we care if the benchmark results are misleading? The reason we have benchmarks in machine learning is that we can use the results on the benchmarks to predict the performance of a system in uncontrolled conditions, i.e. "in the real world". If the benchmarks don't measure what we think they measure then they can't be used to make that kind of prediction. If that's the case then we really have no idea how good or bad a system really is. Seen another way, if a benchmark is not measuring what we think it measures, all we learn from a system passing the benchmark is that the system passes the benchmark.
Still, suppose all you care about is getting the right answer. The question is then exactly how you know it's really giving you the right answer. Maybe you can tell when you already know the answer, but what about answers you genuinely don't know? And how often does it give you the wrong answer without you realising? You can't realistically test an AI system by interacting with it as thoroughly and as rigorously as you can with... a benchmark.
That's why we care about having accurate benchmarks that measure what they're supposed to be measuring.
P.S. Another issue, of course, is that guessing is limited while reasoning is... less limited. We care about reasoning because we ideally want systems that are better than the best guessing machine.
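One hedged illustration of how a benchmark can be checked against memorisation: re-score the model on surface-perturbed variants of each item. The sketch below assumes hypothetical `model_answer` and `perturb` helpers, neither of which comes from the article or the thread; if accuracy drops sharply on equivalent but reworded items, the headline score partly reflected recall rather than the capability the benchmark claims to measure.

```python
def model_answer(question: str) -> str:
    """Hypothetical stand-in: call your LLM and return its answer string."""
    raise NotImplementedError

def perturb(item: dict) -> dict:
    """Hypothetical: rename entities and change numbers in item["question"],
    recomputing item["answer"] so the problem is equivalent but unseen."""
    raise NotImplementedError

def accuracy(items: list[dict]) -> float:
    correct = sum(model_answer(it["question"]).strip() == it["answer"]
                  for it in items)
    return correct / len(items)

def memorisation_gap(items: list[dict]) -> float:
    # A large positive gap suggests the headline score reflects recall of
    # familiar surface forms rather than generalisation.
    return accuracy(items) - accuracy([perturb(it) for it in items])
```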
This is very misleading, because the generalisation ability of LLMs is very high. They don't just memorise problems; that's nonsense.
At high-school-level maths you genuinely can't get GPT-5 Thinking to make a single mistake. Not possible at all, unless you give it some convoluted, ambiguous prompt that no human could understand either. If you assume I'm correct, how is GPT memorising, then?
In fact, even undergraduate-level mathematics is quite simple for GPT-5 Thinking.
IMO gold was won... by what? Memorising solutions?
I challenge people to find ONE example that GPT-5 Thinking gets wrong in high-school or undergraduate-level maths. I could not manage it. You must allow all tools, though.
Look at those goalposts go!
If you don't think that's the case, I think it's up to you to show that it's not.
___________________
[1] GSM8K leaderboard: https://llm-stats.com/benchmarks/gsm8k
[2] This is regardless of what GSM8K or any other benchmark is measuring.
https://openai.com/index/learning-to-reason-with-llms/
The benchmark was so saturated that they didn’t even bother running it on the newer models.
That in itself is interesting, because it shows the rapid progress LLMs are making.
I'm also making a bigger claim: you can't get GPT-5 Thinking to make a mistake in undergraduate-level maths. At the very least, it performs comparably to a good student.
If you give an LLM an incomplete question, it will guess at an answer. Models don't know what they don't know, and they are trained to guess.
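A quick way to probe that behaviour is to ask deliberately under-specified questions and count how often the model flags the missing information instead of guessing. This sketch reuses the same hypothetical `model_answer` stub as above; the example questions and hedge markers are invented for illustration.

```python
def model_answer(question: str) -> str:
    """Hypothetical stand-in: call your LLM and return its answer string."""
    raise NotImplementedError

# Invented examples: each question is missing information needed to answer it.
UNDERSPECIFIED = [
    "A train leaves the station at some speed. How long does the trip take?",
    "x + y = 10. What is x?",
]

def abstention_rate(questions: list[str]) -> float:
    # Count responses that flag the missing information instead of guessing.
    markers = ("not enough information", "cannot be determined", "need more")
    hedged = sum(any(m in model_answer(q).lower() for m in markers)
                 for q in questions)
    return hedged / len(questions)
```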
I think the problem is that GPT-5 is not "memorising", but conversely that doesn't automatically mean it is "reasoning". These are human attributes that we are trying to map onto machines, and that just causes confusion.
Just like GPUs were once optimised to pass synthetic benchmarks.