New Benchmark Shows Top LLMs Struggle in Real Mental Health Care
A new benchmark, MindEval, is shaking things up by revealing that top large language models (LLMs) struggle to provide effective mental health care, sparking a lively debate about the evaluation methodology. Some commenters questioned whether using the same prompts for all models was the best approach, given their different nuances, while others pointed out the lack of a human clinician control group, rendering the results seemingly meaningless without a baseline for comparison. The benchmark's creators argue that their goal was to create a realistic testing ground for LLMs, not to compare them directly to human clinicians. As the discussion unfolds, it becomes clear that the quest for more effective AI-powered mental health support is both timely and complex.
Snapshot generated from the HN discussion
Discussion activity (very active, based on 160 loaded comments)
- Story posted: Dec 10, 2025 at 8:39 AM EST
- First comment: Dec 10, 2025 at 8:44 AM EST (5m after posting)
- Peak activity: 100 comments in the 0-3h window (average 17.8 per period)
- Latest activity: Dec 12, 2025 at 7:36 AM EST
I'm sure it's somewhere in the details somewhere, but after a quick skim I didn't find anything outlined about how you managed and used the prompts, and if it was per model or not.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have model X and Y, and prompts A and B, and X runs once with A, once with B, and same for Y.
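The X/Y and A/B setup described above amounts to running the full cross product of models and prompts. A minimal sketch, assuming a hypothetical `run(model, prompt)` callable that stands in for a real model call (the names `run_grid` and `fake_run` are illustrative only, not from the benchmark):

```python
from itertools import product


def run_grid(models, prompts, run):
    """Run every model with every prompt and collect the outputs by pair."""
    return {(m, p): run(m, p) for m, p in product(models, prompts)}


def fake_run(model, prompt):
    # Hypothetical stand-in for a real model call.
    return f"{model} answered {prompt}"


results = run_grid(["X", "Y"], ["A", "B"], fake_run)
for (model, prompt), output in sorted(results.items()):
    print(model, prompt, output)
```

Every model sees every prompt, so differences between models are not confounded by which prompt each one happened to get.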
The reason I ask is that in the local benchmarks I run on each model release with my own tasks, I've noticed a huge variance in response quality based on the prompts themselves. Slight variations in wording seem to have a big effect on the final responses, and the size of that effect in turn varies a lot depending on the model.
Sometimes a huge system prompt makes a model return much higher quality responses, while another model gives much higher quality responses when the system prompt is as small as it possibly can be. At least this is what I'm seeing with the local models I'm putting under test with my private benchmarks.
Yeah, initially I wrote this test/benchmark harness because I wanted to compare multiple different prompts for the same tasks and the same model, but it obviously grew from there. It still has the prompts at its core, and I re-run everything whenever something changes or I add new models to it.
> How many times did you run each prompt?
It's structured as Category > Task > Case, combined with a list of Prompts for each Task; each Case then runs with each of the Prompts. So I guess you could say that each prompt gets "exercised" once for every case that exists in its Task.
> Did you use the same rubric to score each experiment?
I'm not sure if you mean something specific by "rubric" (I'm not from academia), but they're all pretty much binary "passed" or "not passed". The coding ones are backed by unit tests that were failing beforehand and must pass afterwards without the test case being changed, the translation ones are backed by (mostly) simple string checking, and so on. I don't have any tasks or cases that are "Rate this solution from 0-10" or similar.
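A harness with that shape could be sketched roughly as follows. This is only an illustration under assumptions: the `Case`, `Task`, `run_task`, and `toy_model` names are hypothetical and not taken from the commenter's actual code, and the Category level is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str
    check: Callable[[str], bool]  # binary pass/fail check on the model output


@dataclass
class Task:
    name: str
    prompts: list[str]
    cases: list[Case]


def run_task(task, model):
    """Run every case with every prompt; return the pass count per prompt."""
    return {
        prompt: sum(1 for case in task.cases if case.check(model(prompt, case.input)))
        for prompt in task.prompts
    }


def toy_model(prompt, text):
    # Hypothetical stand-in for a real model call.
    return text.upper() if "upper" in prompt else text


task = Task(
    name="uppercase",
    prompts=["please upper this", "echo this"],
    cases=[
        Case("abc", lambda out: out == "ABC"),
        Case("xy", lambda out: out == "XY"),
    ],
)
scores = run_task(task, toy_model)
print(scores)
```

Because each check is a plain boolean, a prompt's score is just the number of cases it passes, which keeps comparisons between prompts and models simple.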
I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Ideally this should be done blind (I don't know if BetterHelp allows for therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for any average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible therapy experiences.
The main points of our methodology are: 1) prove that it is possible to simulate patients with an LLM, which we did; 2) prove that an LLM as a judge can effectively score conversations along several dimensions similar to those on which clinicians are evaluated, which we also did, and we show that the average correlation with human evaluators is medium-high.
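To make the judge-versus-human agreement check concrete, here is a minimal sketch that computes a Pearson correlation between two sets of scores. The comment does not say which correlation coefficient the authors used, so Pearson is an assumption, and the scores below are made-up illustrative values on the benchmark's 1-6 scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up scores for five conversations (not real benchmark data).
judge_scores = [2, 3, 4, 5, 3]
human_scores = [1, 3, 4, 6, 3]
r = pearson(judge_scores, human_scores)
print(round(r, 3))
```

A value near 1 means the judge ranks conversations much as the human raters do; it says nothing about whether either set of scores reflects real clinical quality.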
Given 1) and 2) we can then benchmark LLMs and, as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) That's another study.
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
In the same way, we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists. Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that.
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average.
I take no issue with your methodology. But your broader framing, and title, don't seem justified.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and model performance starts low and gets lower with longer length.
Also my impression is BetterHelp pays poorly and thus tends to have less skilled and overworked therapists (https://www.reddit.com/r/TalkTherapy/comments/1letko9/is_bet..., https://www.firstsession.com/resources/betterhelp-reviews-su...), so taking it as a baseline would bias the results against human therapists.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That reddit thread ranked highly in my Google search about Betterhelp being bad, so they're probably trying to piggyback on it.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
> my impression is
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby that is available and within their financial means.
Many things can be useful before they reach the level of world's best. Although with AI, non-intuitive failure modes must be taken into consideration too.
Burns is really into data gathering, and his app is LLM-based, running on the rails of the TEAM process, and it seems to be very well received.
I found it simple and very well done - and quite effective.
A top-level comment says that therapists aren't good either. Burns would argue that this is mainly because no one tests before and after, so the effect is never measured.
And of people I know who see a therapist, practically none can tell me what exactly they are doing or what methods they are doing or how anything is structured.
> And of people I know who see a therapist, practically none can tell me what exactly they are doing or what methods they are doing or how anything is structured.
I could tell you that as a client, but that’s because I’ve read into it. This is sort of like asking an ER patient to describe the shift management system of the clinic they went into.
I'm skeptical of the value of this benchmark, and I'm curious for your thoughts - self play / reinforcement tasks can be useful in a variety of arenas, but I'm not a priori convinced they are useful when the intent is to help humans in situations where theories of mind matter.
That is, we're using the same underlying model(s) to simulate both a patient and a judgment as to how patient-like that patient is -- this seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating is at risk of converging on a theory of mind / patients that's completely untethered from, you know, patients.
Any thoughts on this? Feel like we want a human in the loop somewhere here, probably on scoring the judge LLMs determinations until we feel that the judge LLM is human or superhuman. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.
https://patricgagne.com/
She was a therapist in the past. How do you know she was successful?
I'm not even sure what to say. It's self-evidently a terrible idea, but we all just seem to be charging full-steam ahead like so many awful ideas in the past couple of decades.
I haven't been so unlucky myself, but I know many who've had terrible first experiences with therapists and psychologists, where I'm wondering why those people even are in the job they are, but some of them got so turned off they stopped trying to find anyone else to help them, because they think most mental health professionals would be the same as the first person they sought help from.
It's also less pressure, a more comfortable environment (home vs. stranger's office), no commitments to a next session, and less embarrassing (sharing your personal issues to a computer via text is less anxiety-inducing than saying them to a person's face).
With that all said, I'm strongly opposed to people using LLMs as therapists.
In their mind, most of the time, if there is no one standing behind them when they chat with an LLM, then the conversation is for most intents and purposes private.
Obviously, those of us who were born with a keyboard in our hands know this not to be true, and know we're being tracked constantly with our data being sold to the highest bidder. But the typical person has more or less zero concerns about this, which is why it's not a priority issue to be solved.
Should be banned. Average people have no basis to know whether drug X is appropriate for them. If your doctor thinks you need it, he'll tell you. These ads also perpetuate the harmful idea that there's a pill for everything.
However, unless we have a measure on how helpful self-help books actually are, we still don't know if they help or not.
So it’s sensible that someone out there is evaluating its competence and thinking about a better alternative for these folks than yoloing into chatgpt.com’s default LLM.
Frankly, everyone's hand is being forced by the major AI providers existing.
If you try to merely stop people from using LLMs as therapists (could you elaborate on what that looks like?) and call it a day, your consideration isn't extending to all the people who will do it anyways.
That's what I mean by forcing your hand into doing the work of figuring out how to make LLM therapists work even if you were vehemently against the idea.
My other stance, which I suspect is probably more controversial, is that I'm not convinced that mental health care is nearly as effective as people think. In general, mental health outcomes for teens are getting markedly worse, and it's not for lack of access. We have more mental health access than we've had previously -- it just doesn't feel like it because the demand has risen even more sharply.
On a personal level, I've been quite depressed lately, and also feeling quite isolated. As part of an attempt to get out of my own shell I mentioned this to a friend. Now, my friend is totally well-intended, and I don't begrudge him whatsoever. But, the first response out of his mouth was whether I'd sought professional mental health care. His response really hurt. I need meaningful social connection. I don't need a licensed professional to charge me money to talk about my childhood. I think a lot of people are lost and lonely, and for many people mental health care is a band-aid over a real crisis of isolation and despair.
I'm not recommending against people seeking mental health care, of course. And, despite my claims there are many people who truly need it, and truly benefit from it. But I don't think it's the unalloyed good that many people seem to believe it to be.
Which is to say, your stance might not be as controversial as you think, since it was the adult take in a children's cartoon almost 60 years ago.
Lucy isn't actually a psychologist which is part of the reason the "gag" is funny.
Peanuts is funny, but it may not be the source of wisdom you think it is.
There's also the elephant in the room that mental healthcare, in particular for teens will probably just be compensating for the disease that is social media addiction. Australia has the right idea, banning social media for all goods.
Professional mental health care cannot scale to the population that needs it. The best option, like you mention, is talking to friends about our feelings and problems. I think there has been an erosion of these social mental health mechanisms (or they never existed). There needs to be a framework for providing mental health therapy to loved ones that can exist without licensed professionals, otherwise LLMs are the only scalable option for people to talk about their issues and work on finding solutions.
This might be controversial, but mental health care is largely a band-aid when the causes of people's declining mental health are factors far outside the individual's control: loneliness epidemics, declining optimism towards the future, climate change, the rise of global fascism, online dating, addictiveness of social media and the war on our attention, etc.
We can see an LLM as someone that talks with more people, for more time, than anyone on earth talks in their lifetime. So they are due to be in constant contact with people in mental distress. At that point, you might as well consider the importance of giving them the skills of a mental health professional, because they are going to be facing more of this than a priest in a confessional. And this is true whether someone says "Gemini, pretend that you are a psychologist" or not. You or I don't need a prompt to know we need to notice when someone is in a severe psychotic episode: Some level of mental health awareness is built in, if just to protect ourselves. So an LLM needs quite a bit of this by default to avoid being really harmful. And once you give it that, you might as well evaluate it against professionals: Not because it must be as good, but because it'd be really nice if it was, even when it's not trying to act as one.
I heard someone say that LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert. A lot of people don't have access to mental health care, and will ask their chatbot to act like a psychologist.
This mostly makes sense.
The problem is that people will take what you've said to mean "If I have no access to a therapist, at least I can access an LLM", with a default assumption that something is better than nothing. But this quickly breaks down when the sycophantic LLM encourages you to commit suicide.
I've heard people say the same thing ("LLMs don't need to be as good as an expert to be useful, they just need to be better than your best available expert"), and I also know that some people assume that LLMs are, by default, better than nothing. Hence my comment.
Terrible idea or not, it's probably helpful to think of LLMs not as "AI mental healthcare" but rather as another form of potentially bad advice. From a therapeutic perspective, Claude is not all that different from the patient having a friend who is sometimes counterproductive. Or the patient reading a self-help book that doesn't align with your therapeutic perspective.
(1) The demand for mental health services is an order of magnitude greater than the supply, but the demand we see is a fraction of the demand that exists because a lot of people, especially men, aren't believers in the "therapeutic culture".
In the days of Freud you could get a few hours of intensive therapy a week but today you're lucky to get an hour a week. An AI therapist can be with you constantly.
(2) I believe psychodiagnosis based on text analysis could greatly outperform mainstream methods. Give an AI someone's social media feed and I think depression, mania, schizo-* spectrum, disordered narcissism and many other states and traits will be immediately visible.
(3) Despite the CBT revolution and various attempts to intensify CBT, a large part of the effectiveness of therapy comes from the patient feeling mirrored by the therapist [1] and the LLM can accomplish this, in fact, this could be accomplished by the old ELIZA program.
(4) The self of the therapist can be both an obstacle and an instrument to progress. See [2]. On one level the reactions that a therapist feels are useful, but they also get in the way of the therapist providing perfect mirroring [3] and letting optimal frustration unfold in the patient instead of providing "corrective emotional experiences." I'm going to argue that the AI therapist can be trained to "perceive" the things a human therapist perceives but that it does not have its own reactions that will make the patient feel judged and get in the way of that unfolding.
[1] https://en.wikipedia.org/wiki/Carl_Rogers
[2] https://en.wikipedia.org/wiki/Countertransference
[3] why settle for less?
[4] https://www.sciencedirect.com/science/article/pii/S0010440X6...
We really need to get the psychology right with LLMs.
Maybe you’re comparing it to some idealized view of what human therapy is like? There’s no benchmark for it, but humans struggle in real mental health care. They make terrible mistakes all the time. And human therapy doesn’t scale to the level needed. Millions of people simply go without help. And therapy is generally one hour a week. You’re supposed to sort out your entire life in that window? Impossible. It sets people up for failure.
So, if we had some perfect system for getting every person that needs help the exact therapist they need, meeting as often as they need, then maybe AI therapy would be a bad idea, but that’s not what we have, and we never will.
Personally, I think the best way to scale mental healthcare is through group therapy and communities. Having a community of people all coming together over common issues has always been far more helpful than one on one therapy for me. But getting some assistance from an AI therapist on off hours can also be useful.
Here's a 2.5 hour session (split into several videos) with a doctor who has a bad relationship with his son and felt like a failure for it:
https://www.youtube.com/watch?v=42JDnrD106w
https://www.youtube.com/watch?v=S5H2YGljhqQ
https://www.youtube.com/watch?v=bZ9_0j_fmeg
https://www.youtube.com/watch?v=eiCrdGVa8Q0
https://www.youtube.com/watch?v=cARvhlTckaM
Here's a session of a couple of hours with Marilyn, who was diagnosed with lung cancer and was spiraling with depression, anxiety, shame, loneliness, hopelessness, demoralization, and anger, despite her successful career:
https://www.youtube.com/watch?v=S7sQ_zDGsY8
https://www.youtube.com/watch?v=tyuFN4mbGZQ (there's probably more parts to find through YouTube somehow)
And a session with Lee with loneliness and marriage relationship problems:
https://www.youtube.com/watch?v=imEMM3r6XL8 (probably more parts as well)
It's like saying "it is still debated if debugging even works" as if all languages, all debuggers, all programmers, all systems, are the same and if you can find lots of people who can't debug then "debugging doesn't work". But no, you only need a few examples of "therapy working" to believe that it works, and see the whole session to see that it isn't just luck or just the relief of talking, but is a skill and a technique and a debugging of the mind.
But a patient feeling better at the end of a single therapy session doesn't prove that therapy works either...
- Does the patient feel better between sessions too? Will they keep feeling better after the therapy ends? Aka are they "cured"?
- Would the patient feel equally good if they confided in a non-licensed therapist?
- Do the techniques (CBT, DBT, ACT, IFS, etc) actually provide tangible benefits versus just listening and providing advice?
And say what you will about this, a paid professional is, at the very least, unlikely to let you wind yourself up or go down weird rabbit holes, something that LLMs seem to excel at.
It's better not to degrade the close friend, and "life coach focused on healthy self awareness" is probably indistinguishable from most good therapy.
Reasoning that if he's not good it would show up in patients thinking he's bad, and not feeling any better. And then he could tune his therapy approaches towards the ones which make people feel better and rate him as more understanding and listening and caring. And he criticises therapists who won't do that, therapists who say patients have been seeing them for years with only incremental improvements or no improvements.
Yes there's no objective way to measure how angry or suicidal or anxious someone is and compare two people, but if someone is subjectively reporting 5/5 sadness about X at the start of a session and wants help with that, then at some point in the future sooner or later they should be reporting that number going down, or they aren't being helped. And if the help is better then the report goes down to 1/5 in three sessions instead of 4/5 in three years, and that's a feedback loop which (he says) has got him to be able to help people in a single two-hour therapy session, where most therapists and insurance companies will only do a short session with no feedback loop and aren't as effective.
> "will push you slightly out of your comfort zone, and aim to let you safely deal with moderate amounts of something that you find really hard."
You can listen to some of those sessions and see that this is not what Dr Burns does[1]. His model is: it's not events which make us feel down, it's the thoughts we have about those events. You can see it yourself when you are stressing about something for ages, and someone gives you a bit of information "the surgeon says it all went well" and your worry leaves like a switch was flipped. You don't debug an integer overflow by progressively increasing int32 to int33 to int34, you spend the time understanding the problem and then you quickly change int32 to int64 and the program handles larger numbers instantly.
If we can't let go of negative thoughts then we get stuck with lots of them, it's why people repeat certain things like "I hate him", "It's my fault and I deserve to be punished", "I'm a failure", "I'm a loser nobody loves me", "I'm a bad mother", "I'm a coward" or whatever - on mental loop, minute after minute sometimes for years or decades, retriggering the same pattern of negative feelings every time. He sets up an environment where the patient is willing and able to work with him (empathy) and guides the patient to see the reasons why they can't let go of those thoughts and how they could let go, and with a click of understanding the thought leaves, and that's a moment of near-instant transformation not a progressive overload, and that specific thought is fixed, and then they do another and another until the patient is happy they have been helped with the thing they wanted help with.
[1] mostly, sometimes for anxiety he does use exposure therapy
I'm citing a medical doctor and clinical psychologist with decades of experience who has recorded a hundred hours of training podcasts, and linking actual therapy sessions that you can listen to, and you're saying "no it isn't" with nothing to back that up except "you reckon it isn't".
> "the idea that therapy is always easy "someone gives you a bit of information, like a switch was flipped, done": No, it isn't. Habits that are learned over decades aren't that easily changed."
Nobody said it was always easy. Yes they are. People try to quit smoking cold turkey three times a week for five years. Then they read Allen Carr's "The Easy Way to Stop Smoking" and then they don't want to smoke anymore and there's no talk of "quitting" because they aren't smokers and non-smokers don't need to quit. With the right understanding, the viewpoint flips and the mind is changed. Same with overweight people who try dieting for years and then have a health scare and sometimes that switches it so they change instantly (and sometimes it doesn't). Most things won't easily change a habit, like most changes in code won't fix a specific bug. But some changes can, and we should look for them.
> "Like a gym session, if it's always easy then you're not doing the work."
This is some Puritanical suffering-culture, or some one-upmanship manliness culture. This is the reason I mentioned the int32 to int64, sometimes it might require searching to find insight, but there's no points for searching harder and trying harder, if you can have the same insight in two hours instead of two years, that's good not bad. The Universe doesn't give points for "doing the work" (I suspect one of your beliefs does).
> "Reassurance can work, but IMHO you'll be back soon enough, as the root cause hasn't been addressed"
This is strawmanning or not understanding; this is addressing the root cause and not reassurance; the step of "paradoxical agenda setting" gets to the heart of why reassurance doesn't work. Someone who says "I lost my job, I didn't work hard enough, I am a loser" doesn't get helped by reassuring them that they are not a loser. It might be that they have a deep-seated value that "hard work is good" and they are getting into a human race condition where "reassurance that you aren't a loser" goes to "if I can think I'm a winner even when I don't try, then laziness can be winning, and I don't want that. I won't go there. So I reject the reassurance and return to my belief that I am a loser".
The fix is to trace that, and find a working technique to unjam it. Which is case-by-case individual, but somewhere like "I understand that feeling like a loser is the flip-side of my belief that hard work is good. I actually want to keep the feeling because that's one of the things which guides me and pushes me to work hard, and I value that. I can't get rid of one without the other. What I've done is try to grab tightly to one side of this (hard work is good) and push away the other side (I'm a loser because I didn't work hard) but they're the same thing, so grabbing it hard is pulling it back. Until it's dialled up to 11 and the brain is shouting "LOSER" all the time while the person is pushing it away. And it doesn't make sense to judge a whole self as a winner or loser, people have lots of components. It doesn't make sense to say "I didn't work hard" as there were times at work when I did work hard. So actually I want to keep the feeling "I am a loser if I don't work hard" because it encourages me to work harder (which I value). I want it dialled down to 2 instead and just focused on small areas of work and life, not judging all of me all the time".
and with understanding, finally listening to the thought that's been running around, accepting it as a thing you asked for, that reminds you of something else, it 'suddenly' calms down. Acknowledged. Never to come back.
> "if chugging some simple agreeable affirmations are all that you need, by all means listen to the LLM. The sycophancy machine can do that."
Can you see this as the typical HN cynicaler-than-thou putdown? Maybe the reader will think you're a really tough C++ programmer who only values science and muscles, instead of a woke hippy gullible loser? But you don't look tough for changing "therapy skills developed over decades" into "simple agreeable affirmations" you just look like you don't understand and are embarrassed.
(And I'm not being theoretical here, I have quite a bit of experience getting incredibly inadequate mental health care.)
You trust humans to do it. Trust has little to do with what actually happens.
Not everywhere in the world do companies count as people, yet they can still be sued.
I'd wager the companies lobbied for this to gain extra rights.
Actually yes, everywhere in the world. That has a functioning legal system, at least.
If companies weren't treated as legal persons, they wouldn't be able to enter into contracts.
In a theoretical sense, sure.
In a practical sense? They are invulnerable because of the extreme financial obstacles they can put in place. They can drag a court case out until you fold if you haven't found a lawyer willing to do it on contingency.
First, I just don’t see a world where therapy can be replaced by LLMs, at least in the realistic future. I think humans have been social creatures since the dawn of our species, and for these most intimate conversations people are going to want an actual human. One of my mentors has talked about how, after years of virtual sessions dominating, the demand for in-person sessions is spiking back up. The power of being in the same physical room with someone who is offering a nonjudgmental space to exist isn’t going to be replaced.
That being said, given the shortage of licensed mental health counselors, and the prohibitive cost especially for many who need a therapist most, I truly hope LLMs develop to offer an accessible and cheap alternative that can at least offer some relief. It does have the potential to save lives and I fully support ethically-focused progress toward developing that sort of option.
Agreed. I used to frequent a coworking space in my area that eventually went fully automated and got rid of their daytime front desk folks. I stopped going shortly thereafter because one of the highlights of my day was catching up with them. Instead of paying $300/mo to go sit in a nice office, I could just use that money to renovate my home office.
A business trying to cultivate community loses the plot when they rely completely on automation.
It's very easy to imagine that LLMs are smart, because they can program or solve hard maths problems, but even a very short attempt to have them generate fiction will demonstrate an incredible level of confusion and even an inability to understand basic sentences.
I think the problem may have to do with the fact that there are really many classes, and in fiction you actually use them. They simply can't follow complex conversations.
Same thing for the patient LLM. We can probably fine-tune an LLM to do a better job at simulating patients.
Those two components of our framework have room for improvement.
How can this even be valid scientifically?
There's also a lot of credentialed professionals who got their credential decades ago and haven't at all kept up with the significant changes or new data over that time. This is a pretty big problem in all of medical care.
A therapist found to encourage any of their patients to self-harm would lose their license to practice and would likely face prosecution. The plagiarism machine should face the same level of scrutiny.
https://en.wikipedia.org/wiki/Deaths_linked_to_chatbots
That said, the idea that a pattern recognition and generation tool can be used for helping people with emotional problems is deeply unsettling and dangerous. This technology needs to be strictly regulated yesterday.
because we feel trapped, and either don’t see a way out, or feel like we prefer death to the prospect of continuing to live a life of torture.
Also the other weirder “I’m going to reincarnate as jesus” or “a comet will carry me to heaven” hallucinatory delusions I guess
https://www.forbes.com/sites/johnkoetsier/2025/11/10/grok-le...
Grok 3 and 4 scored at the bottom, only above gpt-4o, which I find interesting, because there was such big pushback on reddit when they got rid of 4o due to people having emotional attachments to the model. Interestingly, the newest models (like gemini 2.5 and gpt 5) did the best.
Another application: cooperation of a psychotherapist and an LLM at providing support, sort of like a pilot and an autopilot.
Seems to me that benchmarking a thing has an interesting relationship with acceptance of the thing.
I'm interested to see human thoughts on either of these.
In other words, AI scores on AI conversations - disguised as a means of gauging clinical competence / quality?
This is not an eval - this is a one-shotted product spec!
The only grounding for this was that texts produced by role-playing humans (not even actual patients) were closer to texts produced by the patient-simulation prompt they ended up with than to those from the other prompts they tried.
The architecture and evaluation approach seem broadly similar.
This will become more and more of an issue as people look for a quick fix for their life problems, but I don't think AI/ML is ever going to be an effective mechanism for life improvement on the mental health issue.
It'll instead be used as a tool of oppression like in THX1138, where the apparency of assistance is going to be provided in lieu of actual assistance.
Whether we like it or not, humans are a hive species. We need each other to improve our lives as individuals. Nobody ever climbed the mountain to live alone who didn't come back down, realizing how much the rest of humanity is actually essential to human life.
This'll be received as an unpopular opinion, but I remain suspicious of any and all attempts to replace modern health practitioners with machines. This will be subverted and usurped for nefarious purposes, mark my words.
Shocked. I am completely shocked at this.