'Attention Is All You Need' Coauthor Says He's 'Sick' of Transformers
Key topics
The co-author of the 'Attention is all you need' paper expresses frustration with the dominance of transformer models in AI research, sparking a discussion about the limitations and potential alternatives to current AI architectures.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 2h after posting
- Peak period: 104 comments in 12-18h
- Avg / period: 13.3 comments
- Based on 160 loaded comments
Key moments
- Story posted: Oct 24, 2025 at 12:40 AM EDT (3 months ago)
- First comment: Oct 24, 2025 at 3:06 AM EDT (2h after posting)
- Peak activity: 104 comments in 12-18h (hottest window of the conversation)
- Latest activity: Oct 27, 2025 at 8:51 PM EDT (2 months ago)
isn't this what [etched](https://www.etched.com/) is doing?
Many of the breakthrough, game-changing inventions were done this way, starting from back-of-the-envelope discussions; another popular example is the Ethernet network.
Some good stories of a similar culture at AT&T Bell Labs are described in Hamming's book [1].
[1] The Art of Doing Science and Engineering, Stripe Press:
https://press.stripe.com/the-art-of-doing-science-and-engine...
According to various stories pieced together, the ideas for 4 of Pixar's early hits were conceived at or around one lunch:
A Bug's Life, WALL-E, Monsters, Inc.
I'm wondering how big an impact work from home will really have on humanity in general, when so many of our life-changing discoveries come from the odd chance of two specific people happening to be in the same place at some moment in time.
The agile treadmill (with PMs breathing down our necks) and features getting planned and delivered in two-week sprints have also reduced our ability to just do something we feel needs doing. Today you go to work to feed several layers of incompetent managers; there is no room for play, or for creativity. At least in most orgs I know.
I think innovation (or even joy of being at work) needs more than just the office, or people, or a canteen, but an environment that supports it.
Basically, I set aside as much time as I can to squeeze creativity and real engineering work into the job. Otherwise I'd go crazy from the grind of just cranking out deliverables.
As for agile: I've made it clear to my PMs that I generally plan on a quarterly/half-year basis, and my work and other people's work adheres to that schedule, not weekly sprints (we stay up to date in a Slack channel, no standups).
It seems, in those days, people at Bell Labs did get the best of both worlds: being able to have chance encounters with very smart people while also being able to just be gone for weeks to work undistracted.
A dream job that probably didn’t even feel like a job (at least that’s the impression I get from hearing Thompson talk about that time).
"""Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout,[11] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[9]"""
Related thread: https://threadreaderapp.com/thread/1864023344435380613.html
The LLM stack has enough branches of evolution within it for efficiency gains; agent-based work can power a new industrial revolution around white-collar workers on its own, while expanding self-expression and personal fulfillment for everyone else.
Well have fun sir
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
It's like if someone invented the hamburger and every single food outlet decided to only serve hamburgers from that point on, only spending time and money on making the perfect hamburger, rather than spending time and effort on making great meals. Which sounds ludicrously far-fetched, but is exactly what happened here.
But yes: the analogy is already hyperbole, and real life is even more hyperbolic. Transformers might work really well, but no one actually seems to know how to put them to real use, beyond generating billions of dollars in losses and burning the planet for it.
(In this analogy, we can take aim at the hamburger industry either birthing CAFOs, or putting them into hyper-overdrive, destroying the environment with orders of magnitude more CO2, etc. etc. It's a weirdly long-lasting analogy.)
I think you analogously just described Sun Microsystems, where Unixes (BSD originally in their case, generalized to an SVR4 (?) hybrid later) worked soooo well that NT was built as a hybridization for the Microsoft user base and Apple reabsorbed the BSD-Mach-Display PostScript hybridization spinoff NeXT, while Linux simultaneously thrived.
Realistically, I think the valuable idea is probabilistic graphical models, of which transformers are an example; combining probability with sequences, or with trees and graphs, is likely to remain a valuable area of research exploration for the foreseeable future.
As if this approach [1] does not exist.
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7197060/
This seems extremely, extremely unlikely for many reasons. The HP model is a simplification of true protein folding/structure adoption, while AlphaFold (and the open source equivalents) works with real proteins. The SAT approach uses little to no prior knowledge about protein structures, unlike AlphaFold (which has basically memorized and generalized the PDB). To express all the necessary details would likely exceed the capabilities of the best SAT solvers.
(don't get me wrong- SAT and other constraint approaches are powerful tools. But I do not think they are the best approach for protein structure prediction).
Like the OP says, it's as if such approaches don't even exist.
Such approaches exist, and they've been found wanting, and no amount of compute is going to improve their performance limits, because it isn't an ML approach with scaling laws.
This is definitely not some unfair conspiracy against SAT, and probably not against the majority of pre-transformer approaches. I am sympathetic to the concern that transformer-based research is getting too much attention at the expense of other approaches. However, I'd think the success of transformers makes it more likely than ever that proven-promising alternative approaches would get funding, as investors try to beat everyone to the next big thing. See quantum computing funding, or funding for way-out-there ASIC startups.
TL;DR: I don't know what is meant by the "same treatment" for SAT solvers. Funding is finite and goes toward promising approaches. If there are "at least as promising" approaches, go show clear evidence of that to a VC and I promise you'll get funding.
Why am I supposed to pretend SAT is being treated unfairly or whatever you guys are expounding? Based on your response and the parent's, I don't think you'd be happy if SAT approaches WERE cited.
Maybe you and the parent think no preexisting approach has been proven inferior to the transformer approach until some equivalent amount of compute has been thrown at it? That's the best I can come up with. There is no room for 'scaling' gains with SAT solvers that will be found with more compute; it's not an ML approach. That is, it doesn't learn with more data. If you mean something else more specific, I'd be interested to know.
On the other hand, it's also led to improvements in many places hidden behind the scenes. For example, vision transformers are much more powerful and scalable than many of the other computer vision models, which has probably led to new capabilities.
In general, transformers aren't just "generate text"; they're a new foundational model architecture that enables a leap in many things which require modeling!
Like, vision transformers? They seem to work best when they still have a CNN backbone, but the "transformer" component is very good at focusing on relevant information, and doing different things depending on what you want to be done with those images.
And if you bolt that hybrid vision transformer to an even larger language-oriented transformer? That also imbues it with basic problem-solving, world knowledge and commonsense reasoning capabilities - which, in things like advanced OCR systems, are very welcome.
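For the curious, a minimal sketch (assuming PyTorch/torchvision) of that kind of CNN-backbone hybrid: a ResNet extracts a feature map, which is flattened into tokens for a standard transformer encoder. The class name and layer sizes are illustrative, not from the thread.

```python
import torch.nn as nn
from torchvision.models import resnet18

class HybridViT(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4, num_classes=10):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)         # 512 = resnet18's final channels
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = self.proj(self.cnn(x))             # (B, d_model, H/32, W/32)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, num_patches, d_model)
        return self.head(self.encoder(tokens).mean(dim=1))
```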
Simultaneously discovering and leveraging the functional nature of language seems like kind of a big deal.
All that remains is to come up with a way to integrate short-term experience into long-term memory, and we can call the job of emulating our brains done, at least in principle. Everything after that will amount to detail work.
...lol. Yikes.
I do not accept your premise. At all.
> use it to compose original works and solve original problems
Which original works and original problems have LLMs solved, exactly? You might find a random article or stealth marketing paper that claims to have solved some novel problem, but if what you're saying were actually true, we'd be flooded with original works and new problems being solved. So where are all these original works?
> All that remains is to come up with a way to integrate short-term experience into long-term memory, and we can call the job of emulating our brains done, at least in principle
What experience do you have that caused you to believe these things?
If anyone still insists on hidden magical components ranging from immortal souls to Penrose's quantum woo, well... let's see what you've got.
The International Math Olympiad qualifies as solving original problems, for example. If you disagree, that's a case you have to make. Transformer models are unquestionably better at math than I am. They are also better at composition, and will soon be better at programming if they aren't already.
Every time a magazine editor is fooled by AI slop, every time an entire subreddit loses the Turing test to somebody's ethically-questionable 'experiment', every time an AI-rendered image wins a contest meant for human artists -- those are original works.
Heck, looking at my Spotify playlist, I'd be amazed if I haven't already been fooled by AI-composed music. If it hasn't happened yet, it will probably happen next week, or maybe next year. Certainly within the next five years.
No, it does not. You're just telling me you've never seen what these problems are like.
> Every time a magazine editor is fooled by AI slop, every time an entire subreddit loses the Turing test to somebody's ethically-questionable 'experiment', every time an AI-rendered image wins a contest meant for human artists -- those are original works.
That's such an absurd logical leap. If you plagiarize a paper and it fools your English teacher, you did not produce an original work. You fooled someone.
> Heck, looking at my Spotify playlist, I'd be amazed if I haven't already been fooled by AI-composed music.
Who knows, but you've already demonstrated that you're easy to fool, since you've bought all the AI hype and seem to be unwilling to accept that an AI CEO or a politician would lie to you.
> If it hasn't happened yet, it will probably happen next week, or maybe next year. Certainly within the next five years.
I can pull numbers out of my ass too, watch! 5, 18, 33, 1, 556. Impressed? But jokes aside, guesses about the future are not evidence, especially when they're based on nothing but your own misguided gut feeling.
>If anyone still insists on hidden magical components ranging from immortal souls to Penrose's quantum woo, well... let's see what you've got.
This isn't too far off from the marketing and hypesteria surrounding "AI" companies.
No they don't. Humans also know when they are pretending to know what they are talking about - put said people against the wall and they will freely admit they have no idea what the buzzwords they are saying mean.
Machines possess no such characteristic.
>No they don't.
WTAF? Maybe you're new here, but the term "hallucinate" came from a very human experience, and was only usurped recently by "AI" bros who wanted to anthropomorphize a tin can.
>Humans also know when they are pretending to know what they are talking about - put said people against the wall and they will freely admit they have no idea what the buzzwords they are saying mean.
>Machines possess no such characteristic.
"AI" will say whatever you want to hear to make you go away. That's the extent of their "characteristic". If it doesn't satisfy the user, they try again, and spit out whatever garbage it calculates should make the user go away. The machine has far less of an "idea" what it's saying.
I also do not accept your assertion, at all. Humans largely function on the basis of desire-fulfilment, be that eating, fucking, seeking safety, gaining power, or any of the other myriad human activities. Our brains, and the brains of all the animals before us, have evolved for that purpose. For evidence, start with Skinner or the millions of behavioral analysis studies done in that field.
Our thoughts lend themselves to those activities. They arise from desire. Transformers have nothing to do with human cognition because they do not contain the basic chemical building blocks that precede and give rise to human cognition. They are, in fact, stochastic parrots, that can fool others, like yourself, into believing they are somehow thinking.
[1] Libet, B., Gleason, C. A., Wright, E. W., & Pearl, D. K. (1983). Time of conscious intention to act in relation to onset of cerebral activity (readiness-potential). Brain, 106(3), 623-642.
[2] Soon, C. S., Brass, M., Heinze, H. J., & Haynes, J. D. (2008). Unconscious determinants of free decisions in the human brain. Nature Neuroscience, 11(5), 543-545.
[3] Berridge, K. C., & Robinson, T. E. (2003). Parsing reward. Trends in Neurosciences, 26(9), 507-513. (This paper reviews the "wanting" vs. "liking" distinction, where unconscious "wanting" or desire is driven by dopamine).
[4] Kavanagh, D. J., Andrade, J., & May, J. (2005). Elaborated Intrusion theory of desire: a multi-component cognitive model of craving. British Journal of Health Psychology, 10(4), 515-532. (This model proposes that desires begin as unconscious "intrusions" that precede conscious thought and elaboration).
> They are, in fact, stochastic parrots, that can fool others, like yourself, into believing they are somehow thinking.
What makes you think you're not arguing with one now?
You are not making an argument, you are just making assertions without evidence and then telling us the burden of proof is on us to tell you why not.
If you went walking down the streets yelling the world is run by a secret cabal of reptile-people without evidence, you would rightfully be declared insane.
Our feelings and desires largely determine the content of our thoughts and actions. LLMs do not function as such.
Whether I am arguing with a parrot or not has nothing to do with cognition. A parrot being able to usefully fool a human has nothing to do with cognition.
Language is like a disembodied science-fiction narration.
Wegner's The Illusion of Conscious Will
https://www.its.caltech.edu/~squartz/wegner2.pdf
Fedorenko's Language and Thought are Not The Same Thing
https://pmc.ncbi.nlm.nih.gov/articles/PMC4874898/
edit: post-transformers meaning "in the era after transformers were widely adopted" not some mystical new wave of hypothetical tech to disrupt transformers themselves.
Unless I misinterpreted the post, render me confused.
People who started their NLP work (PhDs etc; industry research projects) before the LLM / transformer craze had to adapt to the new world. (Hence 'post-mass-uptake-of-transformers')
I think this might be the ONLY example that doesn't back up the original claim, because of course an advancement in language processing is an advancement in language processing -- that's tautological! every new technology is an advancement in its domain; what's claimed to be special about transformers is that they are allegedly disruptive OUTSIDE of NLP. "Which fields have been transformed?" means ASIDE FROM language processing.
other than disrupting users by forcing "AI" features they don't want on them... what examples of transformers being revolutionary exist outside of NLP?
Claude Code? lol
saving lives
If you have something relevant to say, you can summarize for the class & include links to your receipts.
Summer's over, kid.
Reading the newspaper is such a lovely experience these days. But hey, the AI researchers are really excited so who really cares if stuff like this happens if we can declare that "therapy is transformed!"
It sure is. Could it have been that attention was all that kid needed?
Some directly, because LLMs and highly capable general purpose classifiers that might be enough for your use case are just out there, and some because of downstream effects, like GPU-compute being far more common, hardware optimized for tasks like matrix multiplication and mature well-maintained libraries with automatic differentiation capabilities. Plus the emergence of things that mix both classical ML and transformers, like training networks to approximate intermolecular potentials faster than the ab-initio calculation, allowing for accelerating molecular dynamics simulations.
I had a friend who did PhD research in NLP and I had a problem of extracting some structured data from unstructured text, and he told me to just ask ChatGPT to do it for me.
Basically, ChatGPT is almost always better at language-based tasks than most of the specialized techniques those subfields developed over decades for their specific problems (a toy sketch of such an extraction call is below).
That's a pretty effing huge deal, even if it falls short of the AGI 2027 hype
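For what it's worth, a toy sketch of that kind of extraction call, assuming the OpenAI Python SDK; the model name, prompt, and field names are illustrative placeholders, not anything from the thread.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_invoice_fields(text: str) -> dict:
    # Ask the model to pull a few named fields out of free-form text as JSON.
    prompt = (
        "Extract the vendor, date (ISO 8601), and total amount from the text below. "
        "Respond with a single JSON object with keys vendor, date, total.\n\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # constrain the reply to valid JSON
    )
    return json.loads(resp.choices[0].message.content)
```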
That's the thing with deep learning in general: people don't really understand what they are doing. It is a game of throwing stuff at the wall and seeing what sticks. NLP researchers are trying to open up these neural networks and understand where the familiar structures of language form.
I think it is important research. Both for improving models and to better understand language. Traditional NLP research is seen as obsolete by some but I think it is more relevant than ever. We can think of transformer-based LLMs as a life form we have created by accident and NLP researchers as biologists studying it, where companies like OpenAI and DeepSeek are more like breeders.
Therefore, the correct attitude to take regarding LLMs is to create ways for them to receive useful feedback on their outputs. When using a coding agent, have the agent work against tests; scaffold constraints and feedback around it (a rough sketch of such a loop is below). AlphaZero, for example, had abundant environmental feedback and achieved amazing (superhuman) results. Other Alpha models (for math, coding, etc.) that operated within validation loops reached Olympiad level in specific types of problem-solving. The limitation of LLMs is actually a limitation of their incomplete coupling with the external world.
In fact you don't even need a super intelligent agent to make progress; it is sufficient to have copying and competition. Evolution shows it can create all of life, including us and our culture and technology, without a very smart learning algorithm. Instead, what it has is plenty of feedback. Intelligence is not in the brain or the LLM; it is in the ecosystem, the society of agents, and the world. Intelligence is the result of having to pay the cost of our execution to continue to exist, a strategy to balance the cost of life.
What I mean by feedback is exploration: executing novel actions, or actions in novel environment configurations, and observing the outcomes. And adjusting, and iterating. So the feedback becomes part of the model, and the model part of the action-feedback process. They co-create each other.
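A minimal sketch of that "work against tests" loop, purely illustrative: generate_patch and apply_patch are hypothetical stand-ins for an LLM call and a file writer, and pytest is assumed as the test runner.

```python
import subprocess

def run_tests() -> tuple[bool, str]:
    # Run the project's test suite and capture its output as feedback.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def agent_loop(task: str, generate_patch, apply_patch, max_iters: int = 5) -> bool:
    feedback = ""
    for _ in range(max_iters):
        patch = generate_patch(task, feedback)   # hypothetical LLM call proposes a change
        apply_patch(patch)                       # write the change into the working tree
        ok, feedback = run_tests()               # the environment pushes back
        if ok:
            return True                          # tests pass: accept the change
    return False                                 # give up after max_iters attempts
```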
They didn't create those markets, but they're the markets for which LLMs enhance productivity and capability the best right now, because they're the ones that need the least supervision of input to and output from the LLMs, and they happen to be otherwise well-suited to the kind of work it is, besides.
> This isn't unique to models; even we, humans, when operating without feedback, generate mostly slop.
I don't understand the relevance of this.
> Curation is performed by the environment and the passage of time, which reveals consequences.
I'd say it's revealed by human judgement and eroded by chance, but either way, I still don't get the relevance.
> LLMs taken in isolation from their environment are just as sloppy as brains in a similar situation.
Sure? And clouds are often fluffy. Water is often wet. Relevance?
The rest of this is a description of how we can make LLMs work better, which amounts to more work than required to make LLMs pay off enormously for the purposes I called out, so... are we even in disagreement? I don't disagree that perhaps this will change, and explicitly bound my original claim ("so far") for that reason.
... are you actually demonstrating my point, on purpose, by responding with LLM slop?
You should hear HN talk about crypto. If the knife were invented today they'd have a field day calling it the most evil plaything of bandits, etc. Nothing about human nature, of course.
Edit: There it is! Like clockwork.
This also describes most modern software development
I just bought Robokiller. I have it set to contacts cuz the AIs were calling me all day.
Takes like this are utterly insane to me
Days that I’d normally feel overwhelmed from requests by management are just Claude Code and chill days now.
Eventually, your code will be such shit that Claude Code will struggle to even do basic CRUD, because there are four redundant functions and it keeps editing the wrong ones. Your colleagues will go to edit your code, only to realize that it's such utter garbage that they have to rewrite the whole thing, because that's easier than trying to make sense of the slop you produced under your own name.
If you were feeling overwhelmed by management, and Claude Code is alleviating that, I fear you aren't cut out for the work.
Are there any papers that compare predictive power against compute needed?
In many cases, I can't even see how many GPU hours, or what size cluster of which GPUs, the pretraining required. If I can't afford it, then it doesn't matter what it achieved. What I can afford is what I have to choose from.
quite
the transformer innovation was to bring down the cost of producing incorrect but plausible-looking content (slop) in any modality to near zero
not a positive thing for anyone other than spammers
Defenders are supposed to defend against attacks on AI, but here it misfired, so the conversation should be interesting.
That's because the defender is actually a skeptic of AI. But the first sentence sounded like a typical "nothing to see here" defense of AI.
Wish there were more hours in the day.
As somebody who was a biiiiig user of probabilistic graphical models, and felt kind of left behind in this brave new world of stacked nets, I would love for my prior knowledge and experience to become valuable for a broader set of problem domains. However, I don't see it yet. Hope you are right!
Source: I am a PhD student, this is kinda my wheelhouse
I haven't actually read these to see if they achieved anything. I'm just sharing the results from a quick search in your sub-field in case it helps you PGM folks.
https://arxiv.org/abs/2104.12053
https://pmc.ncbi.nlm.nih.gov/articles/PMC7831091/
And here's an intro for those wondering what PGM is:
https://arxiv.org/abs/2507.17116
I agree. Causal inference and symbolic reasoning would be SUPER juicy nuts to crack, more so than what we got from transformers.
The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general, this decision boundary being Euclidean dot products isn't actually optimal for everything; there are many classes of problem where you want polyhedral cones [3]. Positional embeddings are also janky af, and so is RoPE tbh; I think Cannon layers are a more promising alternative for horizontal alignment [4].
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is synthetically coming up with ideas and tests that mathematically should not work well and proving that current architectures struggle with them. A great example of this is the ViTs Need Glasses paper [5], or belief state transformers with their star task [6]. The Google one about the limits of embedding dimensions is also great and shows how the dimension of the QK part is actually important to getting good retrieval [7].
[1] https://arxiv.org/abs/2309.17453
[2] https://arxiv.org/abs/2410.01104
[3] https://arxiv.org/abs/2505.17190
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330
[5] https://arxiv.org/abs/2406.04267
[6] https://arxiv.org/abs/2410.23506
[7] https://arxiv.org/abs/2508.21038
No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
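For concreteness, a minimal sketch (assuming PyTorch) of that "parking spot" idea: one extra learnable sink logit is appended before the softmax, so heads can dump attention mass there instead of being forced to spread it over real tokens. Shapes and names are illustrative, not GPT-OSS's actual implementation.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    # q, k, v: (batch, heads, seq, d); sink_logit: (heads,) learnable parameter
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5                      # (B, H, S, S)
    sink = sink_logit.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    weights = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)   # (B, H, S, S+1)
    return weights[..., :-1] @ v                                     # sink mass is simply dropped
```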
The reason why we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way by which we can sift through this gajillion of "revolutionary new architecture ideas" and get at the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains when you account for: there not being an unlimited amount of compute. Especially not when it comes to frontier training runs.
Memorization vs generalization is a well known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.
Literally every new "something point X" release from every major player includes some benchmark graphs to show off.
Which is what humans do too - it's not magic. Humans just get more metacognitive juice for free. Resulting in a hallucination rate significantly lower than that of LLMs, but significantly higher than zero.
Now, having the skills you need to avoid hallucinations is good, even if they're weak and basic skills. But is an LLM willing to actually put them to use?
OpenAI cooked o3 with reckless RL using hallucination-unaware reward calculation - which punished reluctance to answer and rewarded overconfident guesses. And their benchmark suite didn't catch it, because the benchmarks were hallucination-unaware too.
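A toy illustration of that reward-design failure, purely hypothetical: a hallucination-unaware reward scores abstaining the same as answering wrong, so confident guessing is always the optimal policy, while a hallucination-aware reward makes a wrong guess strictly worse than saying "I don't know".

```python
def unaware_reward(answer, correct):
    # Abstaining (answer is None) scores the same as a wrong answer,
    # so the model is always better off guessing.
    return 1.0 if answer == correct else 0.0

def aware_reward(answer, correct):
    if answer is None:
        return 0.0                              # abstaining is neutral
    return 1.0 if answer == correct else -1.0   # confident wrong guesses are penalized
```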
Not familiar with this topic, but intrigued; anywhere I can read more about it?
https://news.ycombinator.com/item?id=44834918
Having done my PhD in probabilistic programming... what?
In biology, PGMs were one of the first successful forms of "machine learning": given a large set of examples, train a graphical model's probabilities using EM, and then pass many more examples through the model for classification. The HMM for proteins is pretty straightforward, basically just a probabilistic extension of using dynamic programming to do string alignment.
My perspective (which is a massive simplification) is that sequence models are a form of graphical model, although the graphs tend to be fairly "linear" and the predictions generate sequences (lists) rather than trees or graphs.
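As a small reminder of what that machinery looks like, a toy sketch (numpy) of scoring a sequence under an HMM with the forward algorithm, the probabilistic dynamic programming referred to above; the parameter shapes are illustrative.

```python
import numpy as np

def forward_loglik(obs, start, trans, emit):
    # obs: list of observed symbol indices; start: (S,), trans: (S, S), emit: (S, V)
    alpha = start * emit[:, obs[0]]        # joint prob of state and first symbol
    loglik = 0.0
    for t in obs[1:]:
        alpha = (alpha @ trans) * emit[:, t]
        scale = alpha.sum()                # rescale to avoid numerical underflow
        loglik += np.log(scale)
        alpha /= scale
    return loglik + np.log(alpha.sum())    # log P(obs) under the model
```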
So, this is really just a BS hype talk. This is just trying to get more funding and VCs.
/s
It doesn't mean that you'll get good results by abandoning prior art, either with LLMs or musicians. But it does signal a sort of personal stress and insecurity, for sure.
64 more comments available on Hacker News