From GPT-4 to GPT-5: Measuring Progress Through MedHELM [pdf]
I found this to be an interesting finding. Here are the detailed results: https://www.fertrevino.com/docs/gpt5_medhelm.pdf
The author evaluated GPT-5's performance on healthcare tasks using MedHELM and found a slight regression compared to GPT-4, sparking discussion on the model's strengths, weaknesses, and implications for real-world applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 48m after posting
- Peak period: 56 comments in 0-6h
- Avg / period: 13.7
- Based on 96 loaded comments
Key moments
- Story posted: Aug 21, 2025 at 6:52 PM EDT
- First comment: Aug 21, 2025 at 7:40 PM EDT (48m after posting)
- Peak activity: 56 comments in 0-6h (hottest window of the conversation)
- Latest activity: Aug 25, 2025 at 7:59 PM EDT
I think pdf.js even defaults to not running scripts in PDFs (I'd need to double-check), if you want to view it in the browser's sandbox. Of course there are still text-rendering-based security attacks and such, but again, there's nothing unique to that versus a webpage in a browser.
“Did you try running it over and over until you got the results you wanted?”
Maybe I’m misunderstanding, but it sounds like you’re framing a completely normal process (try, fail, adjust) as if it’s unreasonable?
In reality, when something doesn’t work, the obvious next step is to adapt and try again. That’s not a radical approach; it’s largely how problem solving works.
For example, when I was a kid trying to push start my motorcycle, it wouldn’t fire no matter what I did. Someone suggested a simple tweak, try a different gear. I did, and instantly the bike roared to life. What I was doing wasn’t wrong, it just needed a slight adjustment to get the result I was after.
1. this is magic and will one-shot your questions 2. but if it goes wrong, keep trying until it works
Plus, knowing it's all probabilistic, how do you know, without knowing ahead of time already, that the result is correct? Is that not the classic halting problem?
> 1. this is magic and will one-shot your questions 2. but if it goes wrong, keep trying until it works
Ah that makes sense. I forgot the "magic" part, and was looking at it more practically.
For LLMs none of it sticks. You keep “teaching” it and the next time it forgets everything.
So again you keep trying until you get the results you want, which you need to know ahead of time.
So it makes sense to me that you should try until you get the results you want (or fail to do so). And it makes sense to ask people what they've tried. I haven't done the work yet to try this for gpt5 and am not that optimistic, but it is possible it will turn out this way again.
I skimmed through the paper and I didn't see any mention of what parameters they used, other than that they used GPT-5 via the API.
What was the reasoning_effort? verbosity? temperature?
These things matter.
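For reference, setting these on a GPT-5 call looks roughly like the sketch below. This is my assumption about how such an evaluation might be wired up with the OpenAI Python SDK's Responses API; the paper doesn't say which settings it used, and the values here are illustrative only.

    # Minimal sketch, not the paper's actual harness. Assumes the OpenAI Python
    # SDK's Responses API and that GPT-5 exposes reasoning-effort and verbosity
    # controls there; temperature is omitted because reasoning models typically
    # don't accept it.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "high"},   # e.g. minimal / low / medium / high
        text={"verbosity": "low"},      # e.g. low / medium / high
        input="Example MedHELM-style prompt goes here.",
    )
    print(resp.output_text)

Results can shift noticeably between, say, minimal and high effort, which is exactly why the missing settings make the comparison hard to interpret.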
One thing that's hard to wrap my head around is that we are giving more and more trust to something we don't understand, with the assumption (often unchecked) that it just works. Basically, your refrain is used to justify all sorts of odd setups of AIs, agents, etc.
I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.
As one might expect, because the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.
To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."
This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?
I'm eager to see more works showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without the grain of salt.
Science starts with a guess, and you run experiments to test it.
I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.
What would be your argument against:
1. CoT models performing way better on benchmarks than normal models
2. people choosing to use CoT models in day-to-day life because they actually find they give better performance
I know what you're saying here, and I know it is primarily a critique of my phrasing, but establishing something like this is the objective of in-context learning theory and mathematical applications of deep learning. It is possible to prove that sufficiently well-trained models will generalize for certain unseen classes of patterns, e.g. a transformer acting like gradient descent. There is still a long way to go in the theory; it is difficult research!
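For readers who haven't seen the "transformer acting like gradient descent" result, the setup is roughly the following (a paraphrase of the in-context linear regression construction from von Oswald et al., not the exact theorem statement):

    % In-context linear regression: the prompt holds pairs (x_i, y_i) and a query x_q.
    \mathcal{L}(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( w^\top x_i - y_i \right)^2,
    \qquad w' = w - \eta \, \nabla_w \mathcal{L}(w)
    % Paraphrased claim: a single linear self-attention layer, with suitably
    % constructed weight matrices, outputs (w')^\top x_q at the query position,
    % i.e. it implements one step of gradient descent on the in-context loss.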
> performance collapses under modest distribution shift
The problem is that the notion of "modest" depends on the scale here. With enough varied data and/or enough parameters, what was once out-of-distribution can become in-distribution. The paper is purposely ignorant of this fact. Yes, the claims hold for tiny models, but I don't think anyone ever doubted this.
- https://arcprize.org/leaderboard
- https://aider.chat/docs/leaderboards/
- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Surely the IMO problems weren't "within the bounds" of Gemini's training data.
Certainly they weren't training on the unreleased problems. Defining out of distribution gets tricky.
This is false.
https://x.com/YiTayML/status/1947350087941951596
This is false even for the OpenAI model
https://x.com/polynoamial/status/1946478250974200272
"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."
GPT-5 fast gets many things wrong, but switching to the thinking model fixes the issues very often.
Regardless of whether it's in a conversation or a thinking context, nothing prevents the model from stating the wrong answer, so the paper on the illusion of thinking makes sense.
What actually seems to be happening is a form of conversational prompting. Of course, with the right back-and-forth with an LLM you can inject knowledge in a way that shifts the natural distribution (again, a side effect of the LLM tech), but by itself it won't naturally get the answer perfect every time.
If this extended thinking were actually working, you would expect the LLM to be able to logically conclude the answer with very high accuracy, essentially 100% of the time, which it does not.
"Did you try a room full of chimpanzees with typewriters?"
[1] https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard
e.g. GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).
But it then slips on structured queries (EHRSQL), fairness (RaceBias), and evidence QA (PubMedQA).
Hallucination resistance is better, but only modestly.
Latency seems uneven (maybe needs more testing?): faster on long tasks, slower on short ones.
I wonder if part of the degraded performance is that when the model thinks you're going into a dangerous area it gets more and more vague, like they demoed on launch day with the fireworks example. It gets very vague when talking about non-abusable prescription drugs, for example. I wonder if that sort of nerfing gradient is affecting medical queries.
After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.
Currently, GPT-5 sits at $10/1M output tokens, o3-pro at $80, and o1-pro at a whopping $600: https://platform.openai.com/docs/pricing
Of course this is not indicative of actual performance or quality per $ spent, but according to my own testing, their performance does seem to scale in line with their cost.
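For a rough sense of what that pricing gap means in practice, here's the back-of-the-envelope arithmetic on those list prices (output tokens only; the query and token counts below are made up for illustration):

    # Cost comparison using the output-token list prices quoted above
    # (USD per 1M output tokens). Input tokens, caching, and hidden
    # reasoning-token overhead are ignored.
    PRICE_PER_M_OUTPUT = {"gpt-5": 10.0, "o3-pro": 80.0, "o1-pro": 600.0}

    def run_cost(model: str, avg_output_tokens: int, num_queries: int) -> float:
        """Rough cost of a benchmark run at avg_output_tokens per query."""
        return PRICE_PER_M_OUTPUT[model] * avg_output_tokens * num_queries / 1_000_000

    # Hypothetical run: 5,000 queries at ~2,000 output tokens each (10M tokens).
    for model in PRICE_PER_M_OUTPUT:
        print(f"{model}: ${run_cost(model, 2_000, 5_000):,.2f}")
    # -> gpt-5: $100.00, o3-pro: $800.00, o1-pro: $6,000.00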
Supposedly it fires off multiple parallel thinking chains and then essentially debates with itself to net a final answer.
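If that description is accurate, the mechanism would resemble self-consistency sampling: run several chains independently, then aggregate their answers. A minimal sketch of the idea (the sampling function and the majority vote are my assumptions about how such a scheme could work, not anything OpenAI has documented):

    from collections import Counter
    from typing import Callable

    def self_consistency(sample_chain: Callable[[str], str],
                         question: str, n: int = 8) -> str:
        """Sample n independent reasoning chains and return the most common
        final answer. `sample_chain` runs one chain (e.g. one API call with
        nonzero temperature) and returns only its final answer string."""
        answers = [sample_chain(question) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

A "debates with itself" variant would feed the competing chains back to the model and ask it to adjudicate rather than taking a plain majority vote, but the aggregation idea is the same.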
Things like:
Me: Is this thing you claim documented? Where in the documentation does it say this?
GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.
Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.
GPT: Exact same response, word-for-word.
Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.
GPT: Exact same response, word-for-word.
Me: Here are some random words to test if you are listening to me: foo, bar, baz.
GPT: Exact same response, word-for-word.
It’s so repetitive I wonder if it’s an engineering fault, because it’s weird that the model would be so consistent in its responses regardless of the input. Once it gets stuck, it doesn’t matter what I enter, it just keeps saying the same thing over and over.
It's impressive, but a regression for now in direct comparison to just a high-parameter model.
"in my experience [x model] one shots everything and [y model] stumbles and fumbles like a drunkard", for _any_ combination of X and Y.
I get the idea of sharing what's working and what's not, but at this point it's clear that there are more factors to using these with success and it's hard to replicate other people's successful workflows.
codex -m gpt-5 -c model_reasoning_effort="high"
Are they really understanding, or putting out a stream of probabilities?
The idea is: if you have a substantive point, make it thoughtfully; if not, please don't comment until you do.
Probabilities have nothing to do with it; by any appropriate definition, there exist statistical models that exhibit "understanding" and "reasoning".
Lays out pretty well what our current knowledge on understanding is
The "lie detector" is used to misguide people, the polygraph is used to measure autonomic arousal.
I think these misnomers can cause real issues like thinking the LLM is "reasoning".
The previous truncation ("From GPT-4 to GPT-5: Measuring Progress in Medical Language Understanding") was baity in the sense that the word 'understanding' was provoking objections and taking us down a generic tangent about whether LLMs really understand anything or not. Since that wasn't about the specific work (and since generic tangents are basically always less interesting*), it was a good idea to find an alternate truncation.
So I took out the bit that was snagging people ("understanding") and instead swapped in "MedHELM". Whatever that is, it's clearly something in the medical domain and has no sharp edge of offtopicness. Seemed fine, and it stopped the generic tangent from spreading further.
* https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Generic Tangents is my new band's name.
I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)
Maybe in practice it's better to look at RAG benchmarks, since a lot of AI tools will search online for information before giving you an answer anyways? (Memorization of info would matter less in that scenario)
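To make the distinction concrete, a retrieval-augmented benchmark would score something like the retrieve-then-read loop below, where memorized facts matter less than grounding and synthesis. The `search` and `ask_llm` callables are placeholders assumed for illustration, not any specific tool's API:

    from typing import Callable, List

    def rag_answer(question: str,
                   search: Callable[[str, int], List[str]],
                   ask_llm: Callable[[str], str],
                   k: int = 5) -> str:
        """Retrieve k passages, then answer strictly from them."""
        passages = search(question, k)  # e.g. PubMed or web search; placeholder
        context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        prompt = ("Answer the question using only the numbered passages below, "
                  "and cite the passage numbers you relied on.\n\n"
                  f"{context}\n\nQuestion: {question}")
        return ask_llm(prompt)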