Why Can't Transformers Learn Multiplication?
Posted 3 months ago · Active 3 months ago
arxiv.org · Tech · story · High profile
Tone: calm, mixed
Debate: 70/100
Key topics
Transformers
Large Language Models
Mathematical Reasoning
AI Limitations
A research paper explores why transformers struggle to learn multiplication, sparking a discussion on the limitations of LLMs in mathematical reasoning and potential alternatives.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3d after posting
Peak period: 54 comments in 72-84h
Avg / period: 21.4
Comment distribution: 107 data points (based on 107 loaded comments)
Key moments
- 01 Story posted: Oct 21, 2025 at 3:47 PM EDT (3 months ago)
- 02 First comment: Oct 24, 2025 at 1:49 PM EDT (3d after posting)
- 03 Peak activity: 54 comments in 72-84h (hottest window of the conversation)
- 04 Latest activity: Oct 26, 2025 at 1:55 PM EDT (3 months ago)
ID: 45660753 · Type: story · Last synced: 11/20/2025, 4:50:34 PM
Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their results until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic to sit here with the aforementioned 15 orders of magnitude of improvement over the Commodore PET, use that level of symbolic-manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication, for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about it that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing as you do this represent a staggering amount of raw computation that is vastly, vastly in excess of what is nominally needed to achieve that computation.
Mathematics was born out of very careful reasoning that we do through language; we only use formalisms because they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation came out of our already existing abilities of symbolic manipulation through language.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
The paper looks at multiplication of numbers represented with the least significant digit first, as a toy task requiring several additions as intermediate steps, to study why a model that is in principle large enough to perform those additions fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
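To make that setup concrete, here is a rough sketch of what the two kinds of training examples could look like. This is only a guess at the shape of the data, not the paper's exact tokenization: digits are written least significant first, and the optional chain of thought spells out the partial products that get added up (the `rev` and `example` helpers are purely illustrative).

```python
# Sketch only: operands and results written least-significant-digit first,
# with an optional "chain of thought" listing the partial products.
# The paper's actual formatting and tokenization may differ.
def rev(n: int) -> str:
    return str(n)[::-1]

def example(a: int, b: int, with_cot: bool = True) -> str:
    digits_of_b = [int(d) for d in str(b)[::-1]]
    partials = [a * d * 10**i for i, d in enumerate(digits_of_b)]
    cot = (" + ".join(rev(p) for p in partials) + " -> ") if with_cot else ""
    return f"{rev(a)} * {rev(b)} = {cot}{rev(a * b)}"

print(example(1234, 5678, with_cot=True))   # CoT version, with partial products
print(example(1234, 5678, with_cot=False))  # plain A*B=C version
```

In the second setup described above, that CoT portion is what gets progressively shortened during training until only the plain A*B=C form remains.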
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
I suppose you're probably right, but LLMs probably have a lot of log tables in their training data so I'm not so sure.
Also, it’s interesting that one of the big goals/measures of models is their capacity to “generalize”, but the training methods optimize for loss/accuracy, and only after training do we test for generalization to validate it.
Are there training methods/curriculums that explicitly maximize generalization?
But the downside is that in some cases you end up digging in the wrong direction if you leave it to a generalist system instead of a professional community, which is counterproductive.
Getting burnt is a good way to learn not to sometimes though...
I haven't used the deep research features much but their ability to hash out concepts and build knowledge or even provide an amplified search experience is something...
i.e. enduring countless generations of evolutionary selection and cross breeding, then fine-tuning a bit?
Although it could be interesting, I don't think training on progressively more complex strings entirely recapitulates this.
I guess if you really wanted to start from scratch, you could figure out how to evolve the whole system from a single cell or something like that. In some ways neural networks have kind of evolved in that way, assisted by humans. They started with a single perceptron, and have gone all the way to deep learning and convolutional networks
I also remember a long time ago studying genetic and evolutionary algorithms, but they were pretty basic in terms of what they could learn and do, compared to modern LLMs
Although recently I saw some research in which they were applying essentially genetic algorithms to merge model weights and produce models with new/evolved capabilities
Whether anyone likes it or not, these systems have co-evolved with us.
Hundreds of researchers contributing and just like English for example, it's ever-changing and evolving.
Given this trend, it's highly unlikely we won't achieve ASI.
It's not like hardware engineers stop innovating or venture capital stops wanting more. There might be a massive dip or even another AI winter but like the last one, eventually it picks up momentum again because there's clearly utility in these systems.
I've been coding for 25+ years and only a couple of days ago did it hit me that my profession has changed in a very dramatic way - I'm very critical of AI output, but I can read and comprehend code much quicker than I can write it relative to these systems.
Of course, that creates a barrier to holding a system in your head so going slow is something that should be pushed for when appropriate.
You don't need general intelligence to make a decent coding tool like Cursor.
You don't need general intelligence to improve SERPs.
You don't need general intelligence to sell a subscription for a decent AI assistant.
There's tons of value already added without anything general.
The revenues are already in the high tens of billions per year.
Models will get better from here, especially on the low end.
Costs will eventually approach peanuts for current capabilities.
Given enough time, this will pay for existing investments. If growth slows, future spending will slow as well.
I mean, probably, LLMs as they are today are already changing the world. But I do think a lot of the ongoing investment is propped up on the promise of another breakthrough that is looking less likely.
Would be a far more accurate statement. Training != Learning.
Or will we need to produce a host of documents and (re)train a new one in order for the concept to be deeply integrated?
This distinction is subtle but lost on many who think that our current path will get us to AGI...
That isn't to say we haven't created a meaningful tool, but the sooner we get candid and realistic about what it is and how it works, the sooner we can get down to the business of building practical applications with it. (And, as an aside, scaling it, something we aren't doing well with now.)
The "thinking" paradigm might also be a way of combatting this issue, ensuring the model is primed to say "wait a minute" - but this to me is cheating in a way, it's likely that it works because real thought is full of backtracking and recalling or "gut feelings" that something isn't entirely correct.
The models don't "know". They're just more likely to say one thing over another which is closer to recall of information.
These "databases" that talk back are an interesting illusion but the inconsistency is what you seem to be trying to nail here.
They have all the information encoded inside but don't layer that information logically and instead surface it based on "vibes".
The reason that the models don't learn continuously is because it's currently prohibitively expensive. Imagine OpenAI retraining a model each time one of its 800m users sends a message. That'd make it aware instantly of every new development in the world or your life without any context engineering. There's a research gap here too but that'll be fixed with time and money.
But it's not a fundamental limitation of transformers as you make it out to be. To me it's just that things take time. The exact same architecture will be continuously learning in 2-3 years, and all the "This is the wrong path" people will need to shift goalposts. Note that I didn't argue for AGI, just that this isn't a fundamental limitation.
LLMs are trained. While they are training, they are not doing anything useful. Once they are trained, they do not learn.
That's the distinction.
Would a single human/entity learn more in, say, three million years, or would short-lived ones evolving over three million years and then getting ~20 years of education learn more?
The current AI tech cycle is focusing on the first, but we don't really know if there are benefits of both.
There's no obvious way to combine these yet.
That said, optimising for capability of maximal learning seems to be a natural occurrence in nature.
I think the non-obvious emergent effects are something to look into.
Culling bad models in favour of the A/B version, plus checkpointing, is a kind of combination of the two, as is the feedback loop of models trained on new snapshots of Internet data written by both humans and AI.
There's an unintended long-form training loop which I think is going to get weirder as time goes on.
Take the wave of models able to manipulate Cursor / Windsurf etc., trained to be smarter and more efficient at this and then retrained for other purposes: even though a given model is deleted, the pattern of data can be saved and trained into more advanced models over time.
(I’m specifically talking about commercial hosted ones that have the capability I describe - obviously your run-of-the-mill one downloaded off of the internet cannot do this).
When I multiply, I take it in chunks.
Put the LLM into a loop, instruct it to keep track of where it is and have it solve a digit at a time.
I bet it does just fine. See my other comment as to why I think that is.
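For what it's worth, here is a sketch of that kind of loop, where the scaffold (not the model) owns the digit position, the carry, and the partial result, and each model call only has to answer a single-digit product plus a carry. `long_multiply` and `arithmetic_stub` are hypothetical names; the stub stands in for whatever real model API you would call.

```python
import re

def arithmetic_stub(prompt: str) -> str:
    # Stand-in for a hypothetical LLM call: does the tiny arithmetic locally so
    # the scaffold can be tested end to end. Swap in a real model client to experiment.
    da, db, acc, carry = map(int, re.findall(r"\d+", prompt))
    return str(da * db + acc + carry)

def long_multiply(a: int, b: int, ask_llm=arithmetic_stub) -> int:
    a_digits = [int(d) for d in str(a)[::-1]]   # least significant digit first
    b_digits = [int(d) for d in str(b)[::-1]]
    result = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            # One tiny, checkable question per iteration; the loop tracks everything else.
            value = int(ask_llm(f"Compute {da} * {db} + {result[i + j]} + {carry}. Reply with only the number."))
            result[i + j] = value % 10
            carry = value // 10
        result[i + len(b_digits)] += carry
    return int("".join(map(str, reversed(result))))

assert long_multiply(98765, 43210) == 98765 * 43210
```

This is exactly the tool-use/scaffolding move discussed elsewhere in the thread: the loop supplies the writable state, and the model is only trusted with one small step at a time.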
If we were to take the stance of "ok, that happened so it must be the case" we wouldn't be better off in many cases; most likely we would still be accusing people of being witches.
Science is about coming up with a theory and trying to poke holes in it until you can't, at which point, after careful peer review to ensure you're not just tricking yourself into seeing something that isn't there, a consensus is approached on which we can continue to build more truth and knowledge.
I tried to raise this question yesterday. https://news.ycombinator.com/item?id=45683113#45687769
Declaring victory on "reasoning" based on cherry-picking a correct result about arithmetic is, of course, very narrow and absurdly optimistic. Even if it correctly works for all NxM calculations. Moving on from arithmetic to any kind of problem that fundamentally reduces to model-checking behind the scenes.. we would be talking about exploring a state-space with potentially many thousands of state-transitions for simple stuff. If each one even has a small chance of crapping out due to hallucination, the chance of encountering errors at the macro-scale is going to be practically guaranteed.
Everyone will say, "but you want tool-use or code-gen for this anyway". Sure! But carry-digits or similar is just one version of "correct matters" and putting some non-local kinds of demands on attention, plus it's easier to check than code. So tool-use or code-gen is just pushing the same problem somewhere else to hide it.. there's still a lot of steps involved, and each one really has to be correct if the macro-layer is going to be correct and the whole thing is going to be hands-off / actually automated. Maybe that's why local-models can still barely handle nontrivial tool-calling.
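A quick back-of-the-envelope illustration of that compounding error, assuming (purely for illustration) that each state transition independently succeeds with some fixed probability:

```python
# Probability that an n-step chain completes with zero errors, assuming each
# step independently succeeds with probability p_step. Numbers are illustrative.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (10, 100, 1000, 5000):
    print(f"{n:>5} steps at 99.9% per step -> {chain_success(0.999, n):.1%} overall")
# Even a 99.9%-reliable step leaves only ~37% overall success at 1,000
# transitions and under 1% at 5,000.
```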
Here we can see roughly how much data a high-end, traditional, non-SoC CPU holds:
> For a recent high-end non-SoC desktop CPU:
> Cache: ~40-100 MB total (L1 + L2 + shared L3)
> Register files: tens to a few hundred KB total across cores (e.g., ~200-300 KB)
> Combined: roughly ~40-100 MB of total on-chip caches + registers.
I'm sure we can reduce these caches to fit in the context windows of today's LLMs (~500,000 tokens).
Then, with temperature 0 we get more "discrete" operations. Now, we still have the rare problem of hallucinations, but it should be small with temperature 0.
And temperature 0 makes outputs deterministic, not magically correct.
For reasons I don't claim to really understand, I don't think it even makes them deterministic. Floating point something something? I'm not sure temperature even has a static technical definition or implementation everywhere at this point. I've been ignoring temperature and using nucleus sampling anywhere that's exposed and it seems to work better.
Random but typical example.. pydantic-ai has a caveat that doesn't reference any particular model: "Note that even with temperature of 0.0, the results will not be fully deterministic". And of course this is just the very bottom layer of model-config and in a system of diverse agents using different frameworks and models, it's even worse.
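For reference, here is a minimal sketch of what temperature and nucleus (top-p) sampling do at a single decoding step, over made-up logits. This is only the sampling layer; as noted above, real serving stacks add batching and floating-point effects that can break exact determinism even at temperature 0.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed for the sketch
logits = np.array([2.0, 1.5, 0.3, -1.0])             # made-up next-token scores

def sample(logits, temperature=1.0, top_p=1.0):
    if temperature == 0.0:
        return int(np.argmax(logits))                # greedy: always the same token
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by probability, descending
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                            # smallest set covering top_p mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

print(sample(logits, temperature=0.0))               # deterministic pick
print(sample(logits, temperature=0.8, top_p=0.9))    # random pick from the nucleus
```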
Your previous example shows the best case, which is that a model can sometimes follow a textual recipe for long multiplication on short inputs. That's not the same as learning a length-generalizing, bit-exact algorithm.
Basically, what you've shown is that the model can describe the algorithm. It doesn't show it can execute it at scale. Without writable state and bit-exact ops, errors grow with length, and "focus more" only slows that failure, it doesn’t eliminate it.
Well, modern LLM coding agent products (eg. Claude Code) are able to store state in files in the current repository. So, you could have the model keep the "CPU State", and the files in the repository be the "RAM".
Also, could this https://arxiv.org/html/2402.17764v1 possibly reduce errors when doing inference? There are no floating point operations.
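Sketching the "registers in files" idea from above in the simplest possible way, with a made-up `state.json` layout (this has nothing to do with how Claude Code actually persists anything): the loop reads the state, does one small step, and writes it back, so the durable state lives on disk rather than in the context window.

```python
import json
from pathlib import Path

# Made-up layout: {"step": int, "registers": {"acc": int, "carry": int}}
STATE = Path("state.json")

def load_state() -> dict:
    if STATE.exists():
        return json.loads(STATE.read_text())
    return {"step": 0, "registers": {"acc": 0, "carry": 0}}

def save_state(state: dict) -> None:
    STATE.write_text(json.dumps(state, indent=2))

# One turn of a hypothetical agent loop: in a real setup the model would read this
# state and propose the next small update; here we just bump a counter.
state = load_state()
state["registers"]["acc"] += 1
state["step"] += 1
save_state(state)
```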
The focus here is the LLM being able to do it unaided.
The space of all combinations of steps is so large for many problems that require precision and usually one incorrect step breaks everything. "I forgot to carry the 1".
Even then, while brilliant, Claude does screw up sometimes - we're not there yet but it doesn't prevent it from being adequately useful.
If an LLM were sufficiently trained to be able to roll-forward and correctly set the current state of some registers written into the conversation..? I wouldn't trust it though, leaves too much to chance.
I too make mistakes trying to keep track of things; I end up using tools too.
In their prompt, they told it to leave itself a note and to accomplish something each time.
Then they put the model in a loop and it worked. In one instance, a model removed itself from the loop by editing a file or some other basic means.
To me, iterative tasks like multiplication and long division look an awful lot like the code-port experiment.
Putting models into loops so they get more than one bite at the task seems to be a logical progression to improve capability.
The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.
That's one hell of a criterion. Test-time inference undergoes a similar scaling law to pretraining, and has resulted in dramatically improved performance on many complex tasks. Law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.
> akin to taking a few more stabs at RNG
Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, the mental model you seem to be implying here). They only appear if the underlying optimisation is globally connected and roughly convex. It's no different than gradient descent in this regard.
There's an obvious trend going on here, of course we're still just growing these systems and going with whatever works.
It's worked well so far, even if it's more convoluted than elegant...
What puts my mind at ease is that the current state of these AI systems isn't going to go backwards, because the data they generate contributes to the pool of possible knowledge for more advanced systems.
What we end up with however is a model good at coding for example but bad at something else. And without enough general coding, good at one language over another.
And we're back to square one. The problem of being able to achieve true intelligence by distilling the essence of it not just knowing the answers to specific problems.
Given enough time, we'll plug the gaps and maybe get good enough but it's not true intelligence until it can learn in a way that excels at all fields in a cross-disciplinary way - much better than the side-effect way it's doing now where some other knowledge does actually contribute to achieving goals in other domains.
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
There is an inherent numeric-ness and logic to math that I don't think we can represent well using LLMs and transformers.
3 isn't about the word "three" - it is a quantity or a measurement. And 3x4 is a specific numerical operation that is not really contained in that sequence of symbols.
Maybe you could say that algebra is just symbol manipulation.
And in any case - "set of rules" is exactly what transformers aren't good at. Transformers are good at capturing the essence of what you meant and responding in a sensible, but not rule-bound way. This works well for language problems.
Perhaps you could argue that transformers are just a set of rules (weights/parameters) being applied, and you might similarly argue that numbers reduce to logical symbols like S(0), S(S(0)), but then I'd argue that you're missing the point.
I like to visualize them as cuts and spans in a continuum, such as a number line. They make up the full picture. One exists only because of the other. One can't do the job of the other and one is defined only in terms of the other.
Banks wouldn't use AI to compute the account balance after a transaction or for authenticating a customer. Network software wouldn't use AI for encryption and decryption of the TLS traffic. Also, banks wouldn't mind a x% error in computation of a credit rating, fraud detection or industry trends analysis.
Writing code is a probabilistic task with many variations possible, while the work done by the code during runtime, is a precision task, in most of the cases.
Don't both of them rely on randomness in real use cases? And it's only once you have fixed seeds that cryptography becomes deterministic; then you can make the same claim for most of ML: when the seeds are fixed, you get fixed replies.
It happens to be that most people seem to use LLM clients that aren't deterministic, as they're using temperature + random seeds for each inference, but that doesn't mean someone couldn't do it in a different way.
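As a trivial illustration of the "fixed seeds mean fixed replies" point, here is sampling from a toy token distribution with the standard library; production inference adds batching and floating-point effects on top, so this only covers the sampling layer.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary
weights = [5, 3, 4, 2, 1]                        # toy next-token distribution

def sample_tokens(seed: int, n: int = 8) -> list[str]:
    rng = random.Random(seed)                    # fixed seed -> fixed sequence of draws
    return rng.choices(vocab, weights=weights, k=n)

assert sample_tokens(42) == sample_tokens(42)    # same seed, same "reply"
print(sample_tokens(42))
```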
Disclaimer: Not a crypto expert. Just like reading about it. Check actual sources for a better insight. Very interesting technology and much smarter people working in this field who deserve a lot of praise.
Not directly, no. But they might use AI to write the code that computes account balance, or authenticates a user, or encrypts/decrypts TLS.
Especially when financials are on the line, it's not like they don't have the money to ensure excruciatingly painful amounts of scrutiny here.
I did note that you said "might". So, I would hope not but I've seen things so maybe you're upsettingly right haha
Most languages and their stdlibs cannot deal with numbers properly at all. Most overflow without errors, most integers cannot keep precision, and most cannot promote types properly.
I only know of Common Lisp, Scheme, Python 3, Ruby, Erlang, Haskell, and Raku, which handle numbers properly by default (though Python is extremely slow).
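For instance, Python 3's built-in `int` is arbitrary precision, while fixed-width machine integers (here via NumPy) silently wrap around:

```python
import numpy as np

# Python 3 ints are arbitrary precision: this is exact, no overflow.
print(2**64 * 3 + 7)                      # 55340232221128654855

# Fixed-width machine integers wrap around silently, with no error raised.
x = np.array([2**62], dtype=np.int64)
print(x * 4)                              # [0]  (2**64 wrapped modulo 2**64)
```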
Clean separation matters; it's really strange to force models to mimic numbers and math via incredibly unfit token-mangling, imho.