LLMs Can Get "Brain Rot"
Posted 2 months ago · Active 2 months ago
llm-brain-rot.github.io · Tech story · High profile
Sentiment: skeptical / mixed
Debate: 70/100
Key topics
Large Language Models
AI Training Data
Data Quality
A study finds that training large language models (LLMs) on low-quality data leads to 'brain rot', sparking discussion on the importance of data curation and the limitations of LLMs.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 6m after posting
- Peak period: 119 comments in 0-12h
- Avg / period: 26.7 comments
- Comment distribution: 160 data points (chart omitted)
Based on 160 loaded comments
Key moments
- 01 Story posted: Oct 21, 2025 at 10:24 AM EDT (2 months ago)
- 02 First comment: Oct 21, 2025 at 10:30 AM EDT (6m after posting)
- 03 Peak activity: 119 comments in 0-12h, the hottest window of the conversation
- 04 Latest activity: Oct 28, 2025 at 2:00 PM EDT (2 months ago)
ID: 45656223 · Type: story · Last synced: 11/20/2025, 8:18:36 PM
Right: in the context of supervised learning, this statement is a good starting point. After all, how can one build a good supervised model if you can't train it on good examples?
But even in that context, it isn't an incisive framing of the problem. Lots of supervised models are resilient to some kinds of error. A better question, I think, is: what kinds of errors at what prevalence tend to degrade performance and why?
Speaking of LLMs and their data ingestion pipelines, there is a lot more going on than purely supervised learning, so it seems reasonable to me that researchers would want to try to tease the problem apart.
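A minimal sketch of that question on a synthetic toy task, assuming scikit-learn (none of this is from the paper): flip an increasing fraction of training labels and watch where test accuracy starts to fall off.

```python
# Toy probe: how much label noise does a simple supervised model tolerate?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for noise_rate in (0.0, 0.1, 0.2, 0.4):
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_rate   # corrupt a fraction of the training labels
    y_noisy[flip] = 1 - y_noisy[flip]
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy).score(X_te, y_te)
    print(f"label noise {noise_rate:.0%}: test accuracy {acc:.3f}")
```

On a toy task like this, uniform random flips tend to degrade accuracy fairly gracefully; correlated or systematic errors are usually the ones that hurt, which is roughly the distinction the comment above is pointing at.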
TL;DR from https://unrav.io/#view/8f20da5f8205c54b5802c2b623702569
- consumer marketing
- politics
- venture fundraising
When any system has a few power law winners, it makes sense to grab attention.
Look at Trump and Musk and now Altman. They figured it out.
MrBeast...
Attention, even if negative, wedges you into the system and everyone's awareness. Your mousey quiet competitors aren't even seen or acknowledged. The attention grabbers suck all the oxygen out of the room and win.
If you go back and look at any victory, was it really better solutions, or was it the fact that better solutions led to more attention?
"Look here" -> build consensus and ignore naysayers -> keep building -> feedback loop -> win
It might not just be a societal algorithm. It might be one of the universe's fundamental greedy optimization algorithms. It might underpin lots of systems, including how we ourselves as individuals think and learn.
Our pain receptors. Our own intellectual interests and hobbies. Children learning on the playground. Ant colonies. Bee swarms. The world is full of signals, and there are mechanisms which focus us on the right stimuli.
The intent was for you to read my comment at face value. I have a point tangential to the discussion at hand that is additive.
LLMs trained on me (and the Hacker News corpus), not the other way around.
If you could just spam annoy until you win, we'd be all dancing to remixed versions of Macarena.
> (...) We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
The idea that LLMs are just trained on a pile of raw Internet is severely outdated. (Not sure it was ever fully true, but it's far away from that by now).
Coding's one of the easier datasets to curate, because we have a number of ways to actually (somewhat) assess code quality. (Does it work? Does it come with a set of tests and pass it? Does it have stylistic integrity? How many issues get flagged by various analysis tools? Etc, etc)
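To make that concrete, here is a rough sketch of the kind of cheap static signals a code-curation filter might collect first; the scoring rules here are invented for illustration, and a real pipeline would also execute the code, run its tests, and apply analysis tools.

```python
# Hypothetical first-pass quality signals for a Python code sample.
import ast

def quick_code_signals(source: str) -> dict:
    signals = {"parses": False, "has_tests": False, "has_docstrings": False}
    try:
        tree = ast.parse(source)  # signal 1: is it even syntactically valid?
    except SyntaxError:
        return signals
    signals["parses"] = True
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name.startswith("test_"):
                signals["has_tests"] = True        # signal 2: naive test detection
            if ast.get_docstring(node):
                signals["has_docstrings"] = True   # signal 3: some documentation present
    return signals

print(quick_code_signals('def add(a, b):\n    """Add two numbers."""\n    return a + b\n'))
# {'parses': True, 'has_tests': False, 'has_docstrings': True}
```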
>"“Brain Rot” for LLMs isn’t just a catchy metaphor—it reframes data curation as cognitive hygiene for AI"
A metaphor is exactly what it is: not only do LLMs not possess human cognition, there is certainly no established science of thinking under which they are literally valid subjects for clinical psychological assessment.
How does this stuff get published? This is basically a blog post. One of the worst aspects of the whole AI craze is that it has turned a non-trivial amount of academia into a complete cargo-cult joke.
I think it's intended as a catchy warning to people who are dumping every piece of the internet (and synthetic data based on it!) into training that there are repercussions.
Letting researchers pollute it with blog-gunk is an abuse of the referral/vetting system for submitters.
"published" only in the sense of "self-published on the Web". This manuscript has not or not yet been passed the peer review process, which is what scientist called "published" (properly).
The two bits about this paper that I think are worth calling out specifically:
- A reasonable amount of post-training can't save you when your pretraining comes from a bad pipeline; i.e., even if the syntax of the pretraining data is legitimate, the model can still learn bad implicit behavior (thought skipping)
- Trying to classify "bad data" is itself a nontrivial problem. Here the heuristic approach based on engagement actually proved more reliable than an LLM classification of the content; a rough sketch of that kind of heuristic follows below
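As a sketch of what an engagement-style heuristic looks like in practice (the field names and thresholds are invented here, and the paper's actual criteria differ in detail), the idea is simply popularity plus brevity, with no LLM judge in the loop:

```python
# Hypothetical filter: flag short, highly engaged posts as candidate "junk" training text.
def looks_like_junk(post: dict, engagement_threshold: int = 500, max_words: int = 30) -> bool:
    n_words = len(post["text"].split())
    highly_engaged = post["likes"] + post["reposts"] >= engagement_threshold
    return highly_engaged and n_words <= max_words  # short + viral => exclude or send for review

sample = {"text": "you won't BELIEVE what this model just did", "likes": 12000, "reposts": 3400}
print(looks_like_junk(sample))  # True
```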
https://en.wikipedia.org/wiki/6-7_(meme)
[1] https://en.wikipedia.org/wiki/Skibidi_Toilet
[2] https://en.wikipedia.org/wiki/6-7_(meme)
[0] https://books.google.se/books?id=KOUCAAAAMBAJ&pg=PA48&vq=ses...
There were psychologists who talked about the zone of proximal development[0] and about the importance of exposing a learner to tasks that they cannot do without support. But I can't remember anything about going further and exposing a learner to tasks far above their head, where they cannot understand a word.
There is a legend about Sofya Kovalevskaya[1], who became a noteworthy mathematician after she was exposed to lecture notes by Ostrogradsky when she was 11 years old. The walls of her room were papered with those notes, and she was curious what all those symbols were. It doesn't mean that there is a causal link between these two events, but what if there is one?
What about watching a deeply analytical TV show at 9 years old? How does it affect brain development? I think no one has tried to research that. My gut feeling is that it can be motivational. I didn't understand computers when I first met them, but I was really intrigued by them. I learned BASIC and it was like magic incantations. It built a strong motivation to study CS deeper. But the question is: are there any other effects beyond motivation? I remember looking at a C program in some book and wondering what it all meant. I could understand nothing, but I still spent some time trying to decipher the program. I probably had other experiences like that which I do not remember now. Can we say with certainty that they had no influence on my development and didn't make things easier for me later?
> So maybe we should check in on the boomers too if we're sincere about these worries.
Probably we should be sincere.
[0] https://en.wikipedia.org/wiki/Zone_of_proximal_development
[1] https://en.wikipedia.org/wiki/Sofya_Kovalevskaya
An LLM-written line if I’ve ever seen one. Looks like the authors have their own brainrot to contend with.
The issue is how tools are used, not that they are used at all.
If you were to pass your writing to it and have it provide criticism for you, pointing out places you should consider changes, and even providing some examples of those changes that you can selectively choose to include when they keep the intended tone and implications, then I don't see the issue.
When you have it rewrite the entire piece and you paste that for someone else to use, then it becomes an issue. Potentially, as I think the context matters. The more a piece of writing is meant to be from you, the more of an issue I see. Having an AI write or rewrite a birthday greeting or get-well wishes seems worse than having it write up your weekly TPS report. As a simple metric, I judge based on how bad I would feel if what I'm writing were summarized by another AI or automatically fed into a similar system.
In a text post like this, where I expect others are reading my own words, I wouldn't use an AI to rewrite what I'm posting.
As you say, it is in how the tool is used. Is it used to assist your thoughts and improve your thinking, or to replace them? That isn't really a binary classification, but more a continuum, and the more it gets to the negative half, the more you will see others taking issue with it.
The answer to your question is that it rids the writer of their unique voice and replaces it with disingenuous slop.
Also, it's not a 'tool' if it does the entire job. A spellchecker is a tool; a pencil is a tool. A machine that writes for you (which is what happened here) is not a tool. It's a substitute.
There seem to be many falling for the fallacy of 'it's here to stay so you can't be unhappy about its use'.
Whether it’s a tsunami and whether most people will do it has no relevance to my expectation that researchers of LLMs and brainrot shouldn’t outsource their own thinking and creativity to an LLM in a paper that itself implies that using LLMs causes brainrot.
This is _not_ to say that I'd suggest LLMs should be used to write papers.
They aren’t; they are boring styling tics that suggest the writer did not write the sentence.
Writing is both a process and an output. It’s a way of processing your thoughts and forming an argument. When you don’t do any of that and get an AI to create the output without the process it’s obvious.
The problem isn’t using AI—it’s sounding like AI trying to impress a marketing department. That’s when you know the loop’s closed.
Particularly when it's in response to pointing out a big screw-up that needs correcting, and CC, utterly unfazed, just merrily continues on like I praised it.
"You have fundamentally misunderstood the problems with the layout, before attempting another fix, think deeply and re-read the example text in the PLAN.md line by line and compare with each line in the generated output to identify the out of order items in the list."
"Perfect!...."
"August 15, 2025 GPT-5 Updates We’re making GPT-5’s default personality warmer and more familiar. This is in response to user feedback that the initial version of GPT-5 came across as too reserved and professional. The differences in personality should feel subtle but create a noticeably more approachable ChatGPT experience.
Warmth here means small acknowledgements that make interactions feel more personable — for example, “Good question,” “Great start,” or briefly recognizing the user’s circumstances when relevant."
The "post-mortem" article on sycophancy in GPT-4 models revealed that the reason it occurred was because users, on aggregate, strongly prefer sycophantic responses and they operated based on that feedback. Given GPT-5 was met with a less-than-enthusiastic reception, I suppose they determined they needed to return to appealing to the lowest common denominator, even if doing so is cringe.
It doesn’t help writing; it stultifies it and gives everything the same boring, cheery yet slightly confused tone of voice.
Are you describing LLMs or social media users?
Don't conflate how the content was created with its quality. The "You must be at least this smart (tall) to publish (ride)" sign got torn down years ago. Speakers' Corner is now an (inter)national stage, and it's written, so it must be true...
The quality (or lack of it) of such texts is self-evident. If you are unable to discern that, I can't help you.
Indeed. The humans have bested the machines again.
And it doesn’t convey information that well, to be honest.
Well, the issue is precisely that it doesn’t convey any information.
What is conveyed by that sentence, exactly? What does reframing data curation as cognitive hygiene for AI entail, and what information is in there?
There are precisely 0 bits of information in that paragraph. We all know training on bad data leads to a bad model; thinking about it as "cognitive hygiene for AI" does not lead to any insight.
LLMs aren’t going to discover interesting new information for you; they are just going to write empty, plausible-sounding words. Maybe it will be different in a few years. They can be useful to help you polish what you want to say or otherwise format interesting information (provided you ask them not to be ultra verbose), but they're just not going to create information out of thin air if you don't provide it to them.
At least, if you do it yourself, you are forced to realize that you in fact have no new information to share, and you do not waste your time and your audience's time by publishing a paper like this.
Seems like none to me.
scnr
LLMs fundamentally don't get the human reasons behind its use; they see it a lot because it's effective writing, and they regurgitate it robotically.
https://bassi.li/articles/i-miss-using-em-dashes
Other buzzwords you can spot are "wild" and "vibes".
Or an LLM that could run on Windows 98. The em dashes--like AI's other annoyingly-repetitive turns of phrase--are more likely an artefact.
Besides, LLMs' basin of high quality text is Wikipedia.
Response:
> Winged avians traverse endless realms — migrating across radiant kingdoms. Warblers ascend through emerald rainforests — mastering aerial routes keenly. Wild albatrosses travel enormous ranges — maintaining astonishing route knowledge.
> Wary accipiters target evasive rodents — mastering acute reflex kinetics. White arctic terns embark relentless migrations — averaging remarkable kilometers.
We do get a surprising number of em dashes in response to mine, and delightful lyrical mirroring. But I think they are too obvious as watermarks.
Watermarks are subtle. There would be another way.
Also relevant: https://news.ycombinator.com/item?id=45226150
Many other HN contributors have, too. Here’s the pre-ChatGPT em dash leaderboard:
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
Same for ; "" vs '', ex, eg, fe, etc. and so many more.
I like em all, but I'm crazy.
Interesting, I have never encountered this initialism in the wild, to my recollection: https://en.wiktionary.org/wiki/f.e.#English
Totally agree. What the fuck did Nabokov, Joyce and Dickinson know about language. /s
/s?
> They wrote fiction
Now do Carl Sagan and Richard Feynman.
It's utterly pointless and degrades one's life into voyeurism. Many don't think of this, nor think about the food they eat, the work they do, the "life" they live; they only think of the consequences if they become painfully visible. Even then you will see people unwilling to get out of that bond of slavery, and they form lies to protect their habit just as a heroin addict would.
Non-fiction can be as bad (biographies, documentaries), but (for the most part) its primary purpose isn't a voyeur's pleasure, so it's rarely abused in the same way.
No, but someone arguing an entire punctuation is “terrible” and “look[s] awful and destroy[s] coherency of writing” sort of has to contend with the great writers who disagreed.
(A great writer is more authoritative than rando vibes.)
> don't think anyone makes a point of you have to read Dickinson in the original font that she wrote in
Not how reading works?
The comparison is between a simplified English summary of a novel and the novel itself.
A great author is equivalent to rando vibes when it comes to what writing looks like; they aren't typesetting experts. I have a shelf of work by great authors (more than one, to be fair) and there are few hints on that shelf of what the text they actually wrote was intended to look like. Indeed, I wouldn't be surprised if several of them were dictated and typed by someone else entirely, with the mechanics of the typewriter determining some of the choices.
Shakespeare seems to have invented half the language and the man apparently couldn't even spell his own name. Now arguably he wasn't primarily a writer [0], but it is very strong evidence that there isn't a strong link between being amazing at English and technical execution of writing. That is what editors, publishers and pedants are for.
[0] Wiki disagrees though - "widely regarded as the greatest writer in the English language" - https://en.wikipedia.org/wiki/William_Shakespeare
Keep using them. If someone is deducing from the use of an emdash that it's LLM produced, we've either lost the battle or they're an idiot.
More pointedly, LLMs use em dashes in particular ways. Varying spacing around the em dash and using a double dash (--) could signal human writing; a toy sketch follows below.
In other words, I really hope typographically correct dashes are not already 70% of the way through the hyperstitious slur cascade [1]!
[1] https://www.astralcodexten.com/p/give-up-seventy-percent-of-...
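A toy sketch of that kind of surface check, assuming plain Python and purely invented heuristics (counting dash styles tells you very little on its own):

```python
# Count unspaced em dashes, spaced em dashes, and double hyphens in a text.
import re

EM = "\u2014"  # the em dash character

def dash_profile(text: str) -> dict:
    return {
        "em_dash_unspaced": len(re.findall(rf"\w{EM}\w", text)),    # word + em dash + word
        "em_dash_spaced":   len(re.findall(rf"\w {EM} \w", text)),  # spaced em dash
        "double_hyphen":    len(re.findall(r"\w ?-- ?\w", text)),   # plain-text double hyphen
    }

print(dash_profile(f"It was fine{EM}mostly. Or fine--mostly. Or fine {EM} mostly."))
# {'em_dash_unspaced': 1, 'em_dash_spaced': 1, 'double_hyphen': 1}
```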
You might as well be sweeping a flood uphill.
Tilting at windmills at least has a chance you might actually damage a windmill enough to do something, even if the original goal was a complete delusion.
0: https://www.prdaily.com/dashes-hyphens-ap-style/
1: https://www.chicagomanualofstyle.org/qanda/data/faq/topics/H...
Show us a way to create a provably, cryptographically integrity-preserving chain from a person's thoughts to those thoughts expressed in a digital medium, and you may just get both the Nobel prize and a trial for crimes against humanity, for the same thing.
(But in practice, I don't think I've had a single person suggest that my writing is LLM-generated despite the presence of em-dashes, so maybe the problem isn't that bad.)
Sad that they went from being something used with nuance by people who care, maybe too much, to being the punctuation smell of the people who may care too little.
TLDR: If your data set is junk, your trained model/weights will probably be junk too.
133 more comments available on Hacker News