Famous Cognitive Psychology Experiments That Failed to Replicate
The article discusses famous cognitive psychology experiments that failed to replicate, sparking a discussion on the replication crisis in psychology and the field's research practices.
Snapshot generated from the HN discussion
Key moments
- Story posted: Sep 17, 2025 at 2:55 PM EDT (4 months ago)
- First comment: Sep 17, 2025 at 3:01 PM EDT (6 minutes after posting)
- Peak activity: 79 comments in the first 6 hours of the thread
- Latest activity: Sep 20, 2025 at 10:14 PM EDT (3 months ago)
https://www.nature.com/articles/nature.2015.18248
> We use publicly available data to show that published papers in top psychology, economics, and general interest journals that fail to replicate are cited more than those that replicate. This difference in citation does not change after the publication of the failure to replicate. Only 12% of postreplication citations of nonreplicable findings acknowledge the replication failure.
https://www.science.org/doi/10.1126/sciadv.abd1705
Press release: https://rady.ucsd.edu/why/news/2021/05-21-a-new-replication-...
What we need is for every paper to be published alongside a stats card that is kept up to date. How many times it's been cited, how many times people tried to replicate it, and how many times they failed.
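A minimal sketch of what such a living stats card could look like as a data record; the field names and structure here are hypothetical, not taken from any existing registry:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReplicationStatsCard:
    """Hypothetical per-paper record, updated as citations and replications accrue."""
    doi: str
    citation_count: int = 0
    replication_attempts: int = 0
    successful_replications: int = 0
    failed_replications: int = 0
    last_updated: date = field(default_factory=date.today)

    @property
    def replication_rate(self) -> float:
        """Fraction of attempts that replicated; NaN when nothing has been attempted."""
        if self.replication_attempts == 0:
            return float("nan")
        return self.successful_replications / self.replication_attempts


# Example: a heavily cited paper with 4 attempted replications, 1 successful.
card = ReplicationStatsCard(doi="10.1000/example", citation_count=300,
                            replication_attempts=4,
                            successful_replications=1, failed_replications=3)
print(f"{card.doi}: replication rate {card.replication_rate:.0%}")
```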
> Most results in the field do actually replicate and are robust[citation needed]
Disclosure: Physics PhD.
The ego depletion effect seems intuitively surprising to me. Science is often unintuitive. I do know that it is easier to make forward-thinking decisions when I am not tired, so I don't know.
I don't like Giancotti's claims. He wrote:
> This post is a compact reference list of the most (in)famous cognitive science results that failed to replicate and should, for the time being, be considered false.
I don't agree with Giancotti's epistemological claims, but today I will not bloviate at length about the epistemology of science. I will try to be brief.
If I understand Marco Giancotti correctly, one particular claim is that Hagger et al. have impressively debunked Baumeister et al.
The ego depletion "debunking" is not really what I would call a refutation. It says, "Results from the current multilab registered replication of the ego-depletion effect provide evidence that, if there is any effect, it is close to zero. ... Although the current analysis provides robust evidence that questions the strength of the ego-depletion effect and its replicability, it may be premature to reject the ego-depletion effect altogether based on these data alone."
Maybe Baumeister's protocol was fundamentally flawed, but the counter-argument from Hagger et al. does not convince me. I wasn't thrilled with Baumeister's claims when they came out, but now I am somehow even less thrilled with the claims of Hagger et al., and I absolutely don't trust Giancotti's assessment. I could believe that Hagger executed Baumeister's protocol correctly, but I can't believe Giancotti has a grasp of what scientific claims "should" be "believed."
That might be true, but this article's comment section isn't a good place for it, because the article itself doesn't seem entirely fair. I would not call it dishonest, but there isn't the certainty and finality needed to conclude that these papers have been definitively shown not to replicate.
I think that can be subtly confused by people thinking you can't get better at self-control with practice. That is, I would think a deliberate practice of exerting more and more self-control every day should build up your ability to exert it. And it would be easy to think that that means you have a stamina for self-control that depletes, the same way aerobic fitness works. But those don't necessarily follow from each other.
I can't help chuckling at the idea that over 1.98 * 10^87 people were involved in the paper.
http://www.psychpage.com/learning/library/intell/mainstream....
in fact, the foundational statistical models considered the gold standard for statistics today were developed for this testing.
The normal distribution predates the general factor model of IQ by hundreds of years.[0]
You can try other distributions yourself, it's going to be hard to find one that better fits the existing IQ data than the normal (bell curve) distribution.
[0] https://en.wikipedia.org/wiki/Normal_distribution#History
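As a rough illustration of the "try other distributions yourself" point, here is a sketch that fits a few candidate distributions to simulated IQ-like scores and compares them by AIC; the data are generated, not real test results, so this only shows the mechanics of the comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated scores on the conventional IQ scale (mean 100, SD 15).
scores = rng.normal(loc=100, scale=15, size=5_000)

candidates = {"normal": stats.norm, "logistic": stats.logistic, "gumbel": stats.gumbel_r}
for name, dist in candidates.items():
    params = dist.fit(scores)                      # maximum-likelihood fit
    loglik = np.sum(dist.logpdf(scores, *params))
    aic = 2 * len(params) - 2 * loglik             # lower AIC means a better fit
    print(f"{name:>8}: AIC = {aic:,.1f}")
```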
Not realizing he was hundreds of years late to the game, he still went ahead and coined the term "median".
More tidbits here: https://en.wikipedia.org/wiki/Francis_Galton#Statistical_inn...
I have no doubt that IQ tests reproducibly measure the test taker's ability to pass tests, as well as to perform in a society that the tests are based on.
I think it's disingenuous to attribute IQ to intelligence as a whole though, and it is better understood as an indicator of cultural intelligence.
I would expect that, for cultures whose members score below average on IQ tests from the US, an equivalent IQ test created within that culture would show average members of that culture scoring higher than average members of US culture.
https://en.wikipedia.org/wiki/Flynn_effect
https://pubmed.ncbi.nlm.nih.gov/24104504/
My belief was reinforced when companies switched to remote work, and management at many companies complained that it was difficult to tell who was and wasn't working, when the managers didn't get to watch the workers. Abstracting the social relationship from the results of work will make it easier to judge the work itself, but more difficult to enforce the social relationship. When the abstraction occurred, those who were basing the status of their employees on the social relationship, and not the work output, were especially disadvantaged.
I was told this in the context of "cultural psychology": many tests, psychological observations, and metrics translate poorly across cultures (especially when you try to pin them to some success metric).
A moment from the show "Good Times" in 1974. https://m.youtube.com/watch?v=DhbsDdMoHC0 at 1:25
Also, I forgot how annoying comic relief characters were in sitcoms. They are the opposite of relieving.
Also, cultures don't have IQs; there is no known link to culture.
If you're trying to say they replicate over the lifetime of the same person, I've had a 15 point swing between tests, out of the few I've taken. What did stay constant for me from age 10 to age 40 was my Myers-Briggs test (my dad was a metrics obsessive), and that's obvious horseshit. Consistency doesn't mean you're measuring what you claim to be measuring.
edit: if it matters, scores were between 137 and 152, so exactly an entire standard deviation. That's like the difference between sub-Saharan and European that racists are always crowing about, but in the same person. IQ doesn't even personally replicate for me.
If a variety of different IQ tests sort the same people the same way, even though every question on each test is different from the others, you have shown that the tests are capturing something about the subjects, not something about the tests. And that is replicable, and falsifiable.
If you follow the same people over time and give them new tests, and they continue to sort in the same relative fashion, you have increased confidence that you are measuring something relatively fixed, not variable. For statistical significance (look it up) you don't draw conclusions on the basis of one person (or one Dad) but on population samples tested under standard conditions.
This is like all study results posted here: a thousand nerds who've never studied intelligence come up with a hundred objections to what was tested, arrogantly assuming that the people who specialized and did the work haven't considered what comes off the top of a nerd's head. Better-qualified nerds did this work.
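A small sketch of what "sort the same people the same way" looks like statistically: rank agreement between two different tests, here checked with a Spearman correlation on made-up scores (the latent-ability setup below is an illustrative assumption, not data from any real study):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 200
latent = rng.normal(size=n)                       # hypothetical shared ability
test_a = latent + rng.normal(scale=0.5, size=n)   # two tests with different items,
test_b = latent + rng.normal(scale=0.5, size=n)   # each a noisy view of the same thing

rho, p = spearmanr(test_a, test_b)
print(f"Spearman rank correlation: {rho:.2f} (p = {p:.1e})")
# High rank agreement across tests with entirely different questions is the kind
# of evidence the comment above is pointing at.
```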
>Myers-Briggs...'s obvious horseshit
Myers-Briggs is not complete horseshit; it correlates closely with, though it is not as good a fit as, the generally accepted Big Five factor model, the gold standard of personality tests. You should educate yourself a bit more. Myers-Briggs essentially tries to phrase everything in a positive way, whereas the Big Five separates out and includes Neuroticism, which is a more negative (for the person) trait. All these traits should be considered adaptive until proven otherwise, so resist the urge to judge.
https://www.stat.cmu.edu/~brian/Pmka-Attack-V71-N3/pmka-2006...
(1st section).
Related: the brain is plastic and can adapt to challenges in different ways. https://www.scientificamerican.com/article/london-taxi-memor...
Wow, what are the odds?
https://en.wikipedia.org/wiki/Stern%E2%80%93Gerlach_experime...
Based on my brief stint doing data work in psychology research, amongst many other problems they are AWFUL at stats. And it isn't a skill issue as much as a cultural one. They teach it wrong and have a "well, everybody else does it" attitude towards p-hacking and other statistical malpractice.
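As a concrete example of the kind of malpractice being described, here is a small simulation sketch of optional stopping: test, and if p >= 0.05, add more subjects and test again. The sample sizes and number of peeks below are arbitrary; the point is only that the false-positive rate climbs well above the nominal 5% even when there is no true effect:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)


def false_positive_rate(peek: bool, n_sims: int = 2_000) -> float:
    """Simulate two groups with no real difference; optionally 'peek and top up'."""
    hits = 0
    for _ in range(n_sims):
        a, b = rng.normal(size=20), rng.normal(size=20)
        significant = ttest_ind(a, b).pvalue < 0.05
        if peek and not significant:
            for _ in range(5):                      # up to five rounds of extra subjects
                a = np.concatenate([a, rng.normal(size=10)])
                b = np.concatenate([b, rng.normal(size=10)])
                if ttest_ind(a, b).pvalue < 0.05:
                    significant = True
                    break
        hits += significant
    return hits / n_sims


print(f"honest single test  : {false_positive_rate(peek=False):.1%}")
print(f"with optional stops : {false_positive_rate(peek=True):.1%}")
```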
[actually], is a neutral declaration that some cognitive structure was presented, but is at odds with physically observable fact that will now be laid out to you.
SF author Michael Flynn was a process control engineer as his day job; he wrote about how designing statistically valid experiments is incredibly difficult, and the potential for fooling yourself is high, even when you really do know what you are doing and you have nearly perfect control over the measurement setup.
And on top of it you're trying to measure the behavior of people not widgets; and people change their behavior based on the context and what they think you're measuring.
There was a lab set up to do "experimental economics" at Caltech back in the late 80's/early 90's. Trouble is, people make different economic decisions when they are working with play money rather than real money.
Understated even. Ever play poker with just chips and no money behind them? Nobody cares, there is no value to the plastic coins.
>meta-analyses and systematic reviews have shown significant evidence for the effects of stereotype threat, though the phenomenon defies over-simplistic characterization.[22][23][24][25][26][27][28][9]
Failing to reproduce an effect doesn't prove it isn't real. Mythbusters would do this all the time.
On the other hand, some empires are built on publication malpractice.
One of the worst that I know is John Gottman. Marriage counselling based on 'thin slicing'/microexpressions/'Horsemen of the Apocalypse'. His studies had been exposed as fundamentally flawed, and training based on his principles performed worse than prior offerings, before he was further popularized by Malcolm Gladwell in Blink.
This type of intellectual dishonesty underlies both of their careers.
https://en.wikipedia.org/wiki/Cascade_Model_of_Relational_Di...
https://en.wikipedia.org/wiki/The_Seven_Principles_for_Makin...
https://www.gottman.com/blog/this-one-thing-is-the-biggest-p...
It's definitely a skill issue then
It's that they are the ones who need to be at the bleeding edge of statistics but often aren't.
They absolutely need Bayesian competitive hypothesis testing but are often the least likely to use it
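For readers unfamiliar with the term, a minimal sketch of one simple flavor of Bayesian model comparison: a BIC-approximated Bayes factor pitting "no effect" against "some effect" on invented data. This is only one of many possible approaches, with the variance assumed known to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=0.1, scale=1.0, size=50)   # invented data with a tiny true effect
n = len(x)


def normal_loglik(data, mu):
    """Log-likelihood of the data under a normal model with known unit variance."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (data - mu) ** 2)


ll0 = normal_loglik(x, 0.0)        # H0: mean fixed at 0 (no free parameters)
ll1 = normal_loglik(x, x.mean())   # H1: mean estimated from the data (1 free parameter)

bic0 = -2 * ll0 + 0 * np.log(n)
bic1 = -2 * ll1 + 1 * np.log(n)

# BIC approximation to the Bayes factor in favour of H0 over H1.
bf01 = np.exp((bic1 - bic0) / 2)
print(f"BF01 ~ {bf01:.2f}  (values above 1 favour the null, below 1 favour an effect)")
```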
https://people.cs.uchicago.edu/~ravenben/cargocult.html
I think some "bad people" used eugenics and phrenology to justify prior hate, but they were also effective tools at convincing otherwise "good people" to join them.
Stereotype threat for example was widely used to explain test score gaps as purely environmental, which contributed to the public seeing gaps as a moral emergency that needed to be fixed, leading to affirmative action policies.
It's just a question of power in the end. And even if you could question the legitimacy of the "studies" the people in power use to justify their rule, they would produce a dozen more flawed justifications before you could produce one serious debunking. And they wouldn't have to give your rebuttal much visibility, so you would need broad cultural and political support.
Psychology exists mostly as a new religion; it serves as a tool for justification for people in power, it is used just in the same way as the bible.
It should not be surprising to anyone that much of it isn't replicable (nor falsifiable in the first place) and when it is, the effects are so close to randomness that you can't even be sure of what it means. This is all by design, you need to keep people confused to rule over them. If they start asking questions you can't answer, you lose authority and legitimacy. Psychology is the tool that serves the dominant ideology that is used to "answer" those questions.
Science should never be taught as dogma, but the reproducibility crisis has ultimately fostered a culture where one should not question "established" results (Kahneman famously proclaimed in his famous book that one "must" accept the unbelievable priming findings), especially if one is interested in a long academic career.
The trouble is that some trust is necessary in communicating scientific observations and hypotheses to the general public. It's easy to blame the public's failure to unify around Covid on cultural divides, but the truth is that skepticism around high-stakes, hastily done science is well warranted. The trouble is that even when you can step through the research and see the conclusions are sound, the skepticism remains.
However, as someone who has spent a long career using data to understand the world, I suspect the harm directly caused by wrong conclusions is smaller than one would think. This is largely because, despite lip service to "data-driven decision making", science and statistics are very rarely the prime driver of any policy decision.
But for most people science doesn't really make much difference in how they choose and operate. Knowing the truth doesn't mean you are ready to adapt your behavior.
Let's be candid: Most policies have no backing in science whatsoever. The fact that some were backed by poor science is not an indictment of much.
You either have to change the policy and admit you were "wrong" to an electorate who can't understand nuance, or continue with the policy and accept a few bad news days before the media cycle resets to something else.
Learning styles have also been debunked for decades though they continue to be used in education. I saw an amusing line in an article that said 90% of teachers were happy to continue using them even after accepting they're nonsense.
And that's just theories that have been debunked (i.e. proven wrong).
> Claimed result: Holding a pen in your teeth (forcing a smile-like expression) makes you rate cartoons as funnier compared to holding a pen with your lips (preventing smiling). More broadly, facial expressions can influence emotional experiences: "fake it till you make it."
I read this about a decade ago, and started, when going into a situation where I wanted to have a natural smile, grimacing maniacally like I had a pencil in my teeth. The thing is, it's just so silly, it always makes me laugh at myself, at which point I have a genuine smile. I always doubted whether the claimed connection was real, but it's been a useful tool anyway.
I think there may be something to a few of these, and more consideration may be needed of how these studies are conducted.
Let’s leave open our credulities for the inquest of time.
https://www.apa.org/about/policy/chapter-4b
A heuristic I use that is unreasonably good at identifying grifters and charlatans: Unnecessarily invoking cortisol or other hormones when discussing behavioral topics. Influencers, podcasters, and pseudoscience practitioners love to invoke cortisol, testosterone, inflammation, and other generic concepts to make their ideas sound more scientific. Instead of saying "stress levels" they say "cortisol". They also try to suggest that cortisol is bad and you always want it lower, which isn't true.
Dopamine is another favorite of the grifters. Whenever someone starts talking about raising dopamine or doing something to increase dopamine, they're almost always being misleading or just outright lying. Health and fitness podcasters are the worst at this right now.
The APA has a really good style guide, but I don't trust them for actual psychology.
Meanwhile, it’s been reproduced “in vitro” in numerous episodes of atrocity, e.g. Abu Ghraib…
Also, how do you italicize text in your comment?
HN will italicize any string between a pair of asterisks. [0]
> practical ethics requirements
Practical ethics requirements :)
[0] https://news.ycombinator.com/formatdoc
https://www.democracynow.org/2007/8/20/apa_members_hold_fier...
“No objective measure” pretty much sums up the whole field, to be honest. I started on a CS & Psych double major, did about eight psych courses, and then decided it was mostly a joke once I got to the quantitative portions. But those courses were very useful for general life skills. Developmental psychology in particular was packed with dense lessons about how we learn as children… social psych was a good overview of all the “well-known” experiments… etc.
This belongs in a dungeon crawl game. You find an artifact that plays music to you. Depending on the music played (depends on the artifact's enchantment and blessed status), it can buff or debuff your intelligence by several points temporarily.
It's important to say that a psychology study can be scientific in one sense -- say, rigorous and disciplined -- but at the same time be unscientific, in the sense that it doesn't test a falsifiable, defining psychological theory, because there aren't any of those.
Or, to put it more simply, scientific fields require falsifiable theories about some aspect of nature, and the mind is not part of nature.
Future neuroscience might fix this, but don't hold your breath for that outcome. I suspect we'll have AGI in artificial brains before we have testable, falsifiable neuroscience theories about our natural brains.
If you won't trust the process, you will gain no real outcome.
What we receive from the process is not necessarily tangible, but instead a fresh perspective on what may be possible. Thus, the inversion is complete, and we may then move forward.
Setting that aside, of any scientific field I'm aware of, psychology has taken the replication crisis most seriously. Rigor across all areas of psychology is steadily increasing: https://journals.sagepub.com/doi/full/10.1177/25152459251323...
In biomedicine, Amgen could reproduce only 6/53 “landmark” preclinical cancer papers and Bayer reported widespread failures.
Is there a good list of results that do consistently replicate?
Not that it matters; most of the psychology field is inherently bullshit. These are just the cases where they went so far in insulting our intelligence that no amount of "studies" and rhetoric can save them.
(This isn't a comment on any of the individual studies listed.)
This title would be much more accurate if the author omitted “cognitive” from the title.
26 more comments available on Hacker News