Why Can't Transformers Learn Multiplication?
Posted 3 months ago · Active 3 months ago
arxiv.org · Tech · story · High profile
Tone: calm, mixed
Debate: 70/100
Key topics
Transformers
Large Language Models
Mathematical Reasoning
AI Limitations
A research paper explores why transformers struggle to learn multiplication, sparking a discussion on the limitations of LLMs in mathematical reasoning and potential alternatives.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3d after posting
Peak period: 54 comments in 72-84h
Avg / period: 21.4
Comment distribution: 107 data points (based on 107 loaded comments)
Key moments
- 01 Story posted: Oct 21, 2025 at 3:47 PM EDT (3 months ago)
- 02 First comment: Oct 24, 2025 at 1:49 PM EDT (3d after posting)
- 03 Peak activity: 54 comments in 72-84h (hottest window of the conversation)
- 04 Latest activity: Oct 26, 2025 at 1:55 PM EDT (3 months ago)
ID: 45660753 · Type: story · Last synced: 11/20/2025, 4:50:34 PM
Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their results until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic to sit here with the aforementioned 15 orders of magnitude of improvement over the Commodore PET, use that level of symbolic-manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication, for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about it that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing as you do this represent a staggering amount of raw computation that is vastly, vastly in excess of what is nominally needed to achieve that computation.
Mathematics was born out of very careful reasoning that we do through language; we only use formalisms because they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation came out of our already existing abilities of symbolic manipulation through language.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
The paper looks at multiplication of numbers represented with the least significant digit first, as a toy task requiring several additions as intermediate steps, to study why a model that is in principle large enough to perform those additions fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
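To make that setup concrete, here is a rough sketch of what the two kinds of training examples could look like. This is only a guess at the shape of the data, not the paper's exact tokenization: digits are written least significant first, and the optional chain of thought spells out the partial products that get added up (the `rev` and `example` helpers are purely illustrative).

```python
# Sketch only: operands and results written least-significant-digit first,
# with an optional "chain of thought" listing the partial products.
# The paper's actual formatting and tokenization may differ.
def rev(n: int) -> str:
    return str(n)[::-1]

def example(a: int, b: int, with_cot: bool = True) -> str:
    digits_of_b = [int(d) for d in str(b)[::-1]]
    partials = [a * d * 10**i for i, d in enumerate(digits_of_b)]
    cot = (" + ".join(rev(p) for p in partials) + " -> ") if with_cot else ""
    return f"{rev(a)} * {rev(b)} = {cot}{rev(a * b)}"

print(example(1234, 5678, with_cot=True))   # CoT version, with partial products
print(example(1234, 5678, with_cot=False))  # plain A*B=C version
```

In the second setup described above, that CoT portion is what gets progressively shortened during training until only the plain A*B=C form remains.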
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
I suppose you're probably right, but LLMs probably have a lot of log tables in their training data so I'm not so sure.
Also, it’s interesting that one of the big goals/measures of models is their capacity to “generalize”, but the training methods optimize for loss/accuracy, and only after training do we test for generalization to validate it.
Are there training methods/curriculums that explicitly maximize generalization?
But the downside is that in some cases you end up digging in the wrong direction if you leave it to a generalist system instead of a professional community, which is counterproductive.
Getting burnt is a good way to learn not to sometimes though...
I haven't used the deep research features much but their ability to hash out concepts and build knowledge or even provide an amplified search experience is something...
i.e. enduring countless generations of evolutionary selection and cross breeding, then fine-tuning a bit?
Although it could be interesting, I don't think training on progressively more complex strings entirely recapitulates this.
I guess if you really wanted to start from scratch, you could figure out how to evolve the whole system from a single cell or something like that. In some ways neural networks have kind of evolved in that way, assisted by humans. They started with a single perceptron, and have gone all the way to deep learning and convolutional networks
I also remember a long time ago studying genetic and evolutionary algorithms, but they were pretty basic in terms of what they could learn and do, compared to modern LLMs
Although recently I saw some research in which they were applying essentially genetic algorithms to merge model weights and produce models with new/evolved capabilities
Whether anyone likes it or not, these systems have co-evolved with us.
Hundreds of researchers contributing and just like English for example, it's ever-changing and evolving.
Given this trend, it's highly unlikely we won't achieve ASI.
It's not like hardware engineers stop innovating or venture capital stops wanting more. There might be a massive dip or even another AI winter but like the last one, eventually it picks up momentum again because there's clearly utility in these systems.
I've been coding for 25+ years and only a couple of days ago did it hit me that my profession has changed in a very dramatic way - I'm very critical of AI output, but I can read and comprehend code much quicker than I can write it relative to these systems.
Of course, that creates a barrier to holding a system in your head so going slow is something that should be pushed for when appropriate.
You don't need general intelligence to make a decent coding tool like Cursor.
You don't need general intelligence to improve SERPs.
You don't need general intelligence to sell a subscription for a decent AI assistant.
There's tons of value already added without anything general.
The revenues are already in the high tens of billions per year.
Models will get better from here, especially on the low end.
Costs will eventually approach peanuts for current capabilities.
Given enough time, this will pay for existing investments. If growth slows, future spending will slow as well.
I mean, probably, LLMs as they are today are already changing the world. But I do think a lot of the ongoing investment is propped up on the promise of another breakthrough that is looking less likely.
Would be a far more accurate statement. Training != Learning.
Or will we need to produce a host of documents and (re)train a new one in order for the concept to be deeply integrated?
This distinction is subtle but lost on many who think that our current path will get us to AGI...
That isn't to say we haven't created a meaningful tool, but the sooner we get candid and realistic about what it is and how it works, the sooner we can get down to the business of building practical applications with it. (And, as an aside, scaling it, something we aren't doing well with now.)
The "thinking" paradigm might also be a way of combatting this issue, ensuring the model is primed to say "wait a minute" - but this to me is cheating in a way, it's likely that it works because real thought is full of backtracking and recalling or "gut feelings" that something isn't entirely correct.
The models don't "know". They're just more likely to say one thing over another which is closer to recall of information.
These "databases" that talk back are an interesting illusion but the inconsistency is what you seem to be trying to nail here.
They have all the information encoded inside but don't layer that information logically and instead surface it based on "vibes".
The reason that the models don't learn continuously is because it's currently prohibitively expensive. Imagine OpenAI retraining a model each time one of its 800m users sends a message. That'd make it aware instantly of every new development in the world or your life without any context engineering. There's a research gap here too but that'll be fixed with time and money.
But it's not a fundamental limitation of transformers as you make it out to be. To me it's just that things take time. The exact same architecture will be continuously learning in 2-3 years, and all the "This is the wrong path" people will need to shift goalposts. Note that I didn't argue for AGI, just that this isn't a fundamental limitation.
LLMs are trained. While they are training, they are not doing anything useful. Once they are trained, they do not learn.
That's the distinction.
Would a single human/entity learn more in, say, three million years, or would short-lived ones evolving over three million years and then getting ~20 years of education learn more?
The current AI tech cycle is focusing on the first, but we don't really know if there are benefits of both.
There's no obvious way to combine these yet.
That said, optimising for capability of maximal learning seems to be a natural occurrence in nature.
I think the non-obvious emergent effects are something to look into.
Culling bad models in favour of the A/B version, plus checkpointing, is a kind of combination of the two, as is the feedback loop of models trained on new snapshots of Internet data written by both humans and AI.
There's an unintended long-form training loop which I think is going to get weirder as time goes on.
Take the wave of models able to manipulate Cursor / Windsurf etc., trained to be smarter and more efficient at this and then retrained for other purposes: even though a given model is deleted, the pattern of data can be saved and trained into more advanced models over time.
(I’m specifically talking about commercial hosted ones that have the capability I describe - obviously your run-of-the-mill one downloaded off of the internet cannot do this).
When I multiply, I take it in chunks.
Put the LLM into a loop, instruct it to keep track of where it is and have it solve a digit at a time.
I bet it does just fine. See my other comment as to why I think that is.
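For what it's worth, here is a sketch of that kind of loop, where the scaffold (not the model) owns the digit position, the carry, and the partial result, and each model call only has to answer a single-digit product plus a carry. `long_multiply` and `arithmetic_stub` are hypothetical names; the stub stands in for whatever real model API you would call.

```python
import re

def arithmetic_stub(prompt: str) -> str:
    # Stand-in for a hypothetical LLM call: does the tiny arithmetic locally so
    # the scaffold can be tested end to end. Swap in a real model client to experiment.
    da, db, acc, carry = map(int, re.findall(r"\d+", prompt))
    return str(da * db + acc + carry)

def long_multiply(a: int, b: int, ask_llm=arithmetic_stub) -> int:
    a_digits = [int(d) for d in str(a)[::-1]]   # least significant digit first
    b_digits = [int(d) for d in str(b)[::-1]]
    result = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            # One tiny, checkable question per iteration; the loop tracks everything else.
            value = int(ask_llm(f"Compute {da} * {db} + {result[i + j]} + {carry}. Reply with only the number."))
            result[i + j] = value % 10
            carry = value // 10
        result[i + len(b_digits)] += carry
    return int("".join(map(str, reversed(result))))

assert long_multiply(98765, 43210) == 98765 * 43210
```

This is exactly the tool-use/scaffolding move discussed elsewhere in the thread: the loop supplies the writable state, and the model is only trusted with one small step at a time.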
If we were to take the stance of "ok, that happened so it must be the case" we wouldn't be better off in many cases; most likely we would still be accusing people of being witches.
Science is about coming up with a theory and trying to poke holes in it until you can't, at which point, after careful peer review to ensure you're not just tricking yourself into seeing something that isn't there, a consensus is approached on which we can continue to build more truth and knowledge.
I tried to raise this question yesterday. https://news.ycombinator.com/item?id=45683113#45687769
Declaring victory on "reasoning" based on cherry-picking a correct result about arithmetic is, of course, very narrow and absurdly optimistic. Even if it correctly works for all NxM calculations. Moving on from arithmetic to any kind of problem that fundamentally reduces to model-checking behind the scenes.. we would be talking about exploring a state-space with potentially many thousands of state-transitions for simple stuff. If each one even has a small chance of crapping out due to hallucination, the chance of encountering errors at the macro-scale is going to be practically guaranteed.
Everyone will say, "but you want tool-use or code-gen for this anyway". Sure! But carry-digits or similar is just one version of "correct matters" and putting some non-local kinds of demands on attention, plus it's easier to check than code. So tool-use or code-gen is just pushing the same problem somewhere else to hide it.. there's still a lot of steps involved, and each one really has to be correct if the macro-layer is going to be correct and the whole thing is going to be hands-off / actually automated. Maybe that's why local-models can still barely handle nontrivial tool-calling.
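A quick back-of-the-envelope illustration of that compounding error, assuming (purely for illustration) that each state transition independently succeeds with some fixed probability:

```python
# Probability that an n-step chain completes with zero errors, assuming each
# step independently succeeds with probability p_step. Numbers are illustrative.
def chain_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (10, 100, 1000, 5000):
    print(f"{n:>5} steps at 99.9% per step -> {chain_success(0.999, n):.1%} overall")
# Even a 99.9%-reliable step leaves only ~37% overall success at 1,000
# transitions and under 1% at 5,000.
```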
Here we can see roughly how much data a high-end, traditional, non-SoC CPU holds:
> For a recent high-end non-SoC desktop CPU:
> Cache: ~40-100 MB total (L1 + L2 + shared L3)
> Register files: tens to a few hundred KB total across cores (e.g., ~200-300 KB)
> Combined: roughly ~40-100 MB of total on-chip caches + registers.
I'm sure we can reduce these caches to fit in the context windows of today's LLMs (~500,000 tokens).
Then, with temperature 0 we get more "discrete" operations. Now, we still have the rare problem of hallucinations, but it should be small with temperature 0.
And temperature 0 makes outputs deterministic, not magically correct.
For reasons I don't claim to really understand, I don't think it even makes them deterministic. Floating point something something? I'm not sure temperature even has a static technical definition or implementation everywhere at this point. I've been ignoring temperature and using nucleus sampling anywhere that's exposed and it seems to work better.
Random but typical example.. pydantic-ai has a caveat that doesn't reference any particular model: "Note that even with temperature of 0.0, the results will not be fully deterministic". And of course this is just the very bottom layer of model-config and in a system of diverse agents using different frameworks and models, it's even worse.
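For reference, here is a minimal sketch of what temperature and nucleus (top-p) sampling do at a single decoding step, over made-up logits. This is only the sampling layer; as noted above, real serving stacks add batching and floating-point effects that can break exact determinism even at temperature 0.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed for the sketch
logits = np.array([2.0, 1.5, 0.3, -1.0])             # made-up next-token scores

def sample(logits, temperature=1.0, top_p=1.0):
    if temperature == 0.0:
        return int(np.argmax(logits))                # greedy: always the same token
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by probability, descending
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]                            # smallest set covering top_p mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

print(sample(logits, temperature=0.0))               # deterministic pick
print(sample(logits, temperature=0.8, top_p=0.9))    # random pick from the nucleus
```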
Your previous example shows the best case, which is that a model can sometimes follow a textual recipe for long multiplication on short inputs. That's not the same as learning a length-generalizing, bit-exact algorithm.
Basically, what you've shown is that the model can describe the algorithm. It doesn't show it can execute it at scale. Without writable state and bit-exact ops, errors grow with length, and "focus more" only slows that failure, it doesn’t eliminate it.
Well, modern LLM coding agent products (eg. Claude Code) are able to store state in files in the current repository. So, you could have the model keep the "CPU State", and the files in the repository be the "RAM".
Also, could this https://arxiv.org/html/2402.17764v1 possibly reduce errors when doing inference? There are no floating point operations.
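Sketching the "registers in files" idea from above in the simplest possible way, with a made-up `state.json` layout (this has nothing to do with how Claude Code actually persists anything): the loop reads the state, does one small step, and writes it back, so the durable state lives on disk rather than in the context window.

```python
import json
from pathlib import Path

# Made-up layout: {"step": int, "registers": {"acc": int, "carry": int}}
STATE = Path("state.json")

def load_state() -> dict:
    if STATE.exists():
        return json.loads(STATE.read_text())
    return {"step": 0, "registers": {"acc": 0, "carry": 0}}

def save_state(state: dict) -> None:
    STATE.write_text(json.dumps(state, indent=2))

# One turn of a hypothetical agent loop: in a real setup the model would read this
# state and propose the next small update; here we just bump a counter.
state = load_state()
state["registers"]["acc"] += 1
state["step"] += 1
save_state(state)
```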
The focus here is the LLM being able to do it unaided.
The space of all combinations of steps is so large for many problems that require precision and usually one incorrect step breaks everything. "I forgot to carry the 1".
Even then, while brilliant, Claude does screw up sometimes - we're not there yet but it doesn't prevent it from being adequately useful.
If an LLM were sufficiently trained to be able to roll-forward and correctly set the current state of some registers written into the conversation..? I wouldn't trust it though, leaves too much to chance.
I too make mistakes trying to keep track of things; I end up using tools too.
In their prompt, they told it to leave itself a note and to accomplish something each time.
Then they put the model in a loop and it worked. In one instance, a model removed itself from the loop by editing a file or some other basic means.
To me, iterative tasks like multiplication and long division look an awful lot like the code-port experiment.
Putting models into loops so they get more than one bite at the task seems to be a logical progression to improve capability.
The approach of “try a few more things before stopping” is a great strategy akin to taking a few more stabs at RNG. It’s not the same as saying keep trying until you get there - you won’t.
That's one hell of a criterion. Test-time inference undergoes a similar scaling law to pretraining, and has resulted in dramatically improved performance on many complex tasks. Law of diminishing returns kicks in of course, but this doesn't mean it's ineffective.
> akin to taking a few more stabs at RNG
Assuming I understand you correctly, I disagree. Scaling laws cannot appear with glassy optimisation procedures (essentially iid trials until you succeed, the mental model you seem to be implying here). They only appear if the underlying optimisation is globally connected and roughly convex. It's no different than gradient descent in this regard.
There's an obvious trend going on here, of course we're still just growing these systems and going with whatever works.
It's worked well so far, even if it's more convoluted than elegant...
What puts my mind at ease is that the current state of these AI systems isn't going to go backwards, because the data they generate contributes to the pool of possible knowledge for more advanced systems.
What we end up with however is a model good at coding for example but bad at something else. And without enough general coding, good at one language over another.
And we're back to square one. The problem of being able to achieve true intelligence by distilling the essence of it not just knowing the answers to specific problems.
Given enough time, we'll plug the gaps and maybe get good enough but it's not true intelligence until it can learn in a way that excels at all fields in a cross-disciplinary way - much better than the side-effect way it's doing now where some other knowledge does actually contribute to achieving goals in other domains.
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
There is an inherent numeric-ness and logic to math that I don't think we can represent well using LLMs and transformers.
3 isn't about the word "three" - it is a quantity or a measurement. And 3x4 is a specific numerical operation that is not really contained in that sequence of symbols.
Maybe you could say that algebra is just symbol manipulation.
And in any case - "set of rules" is exactly what transformers aren't good at. Transformers are good at capturing the essence of what you meant and responding in a sensible, but not rule-bound way. This works well for language problems.
Perhaps you could argue that transformers are just a set of rules (weights/parameters) being applied, and you might similarly argue that numbers reduce to logical symbols like S(0), S(S(0)), but then I'd argue that you're missing the point.
I like to visualize them as cuts and spans in a continuum, such as a number line. They make up the full picture. One exists only because of the other. One can't do the job of the other and one is defined only in terms of the other.
Banks wouldn't use AI to compute the account balance after a transaction or for authenticating a customer. Network software wouldn't use AI for encryption and decryption of the TLS traffic. Also, banks wouldn't mind a x% error in computation of a credit rating, fraud detection or industry trends analysis.
Writing code is a probabilistic task with many variations possible, while the work done by the code during runtime, is a precision task, in most of the cases.
Don't both of them rely on randomness in real use cases? And it's only once you have fixed seeds that cryptography becomes deterministic; then you can make the same claim for most of ML: when the seeds are fixed, you get fixed replies.
It happens to be that most people seem to use LLM clients that aren't deterministic, as they're using temperature + random seeds for each inference, but that doesn't mean someone couldn't do it in a different way.
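As a trivial illustration of the "fixed seeds mean fixed replies" point, here is sampling from a toy token distribution with the standard library; production inference adds batching and floating-point effects on top, so this only covers the sampling layer.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary
weights = [5, 3, 4, 2, 1]                        # toy next-token distribution

def sample_tokens(seed: int, n: int = 8) -> list[str]:
    rng = random.Random(seed)                    # fixed seed -> fixed sequence of draws
    return rng.choices(vocab, weights=weights, k=n)

assert sample_tokens(42) == sample_tokens(42)    # same seed, same "reply"
print(sample_tokens(42))
```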
Disclaimer: Not a crypto expert. Just like reading about it. Check actual sources for a better insight. Very interesting technology and much smarter people working in this field who deserve a lot of praise.
Not directly, no. But they might use AI to write the code that computes account balance, or authenticates a user, or encrypts/decrypts TLS.
Especially when financials are on the line, it's not like they don't have the money to ensure excruciatingly painful amounts of scrutiny here.
I did note that you said "might". So, I would hope not but I've seen things so maybe you're upsettingly right haha
Most languages and their stdlibs cannot deal with numbers properly at all. Most overflow without errors, most integers cannot keep precision, and most cannot promote types properly.
I only know of Common Lisp, Scheme, Python 3, Ruby, Erlang, Haskell, and Raku, which handle numbers properly by default (though Python is extremely slow).
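For instance, Python 3's built-in `int` is arbitrary precision, while fixed-width machine integers (here via NumPy) silently wrap around:

```python
import numpy as np

# Python 3 ints are arbitrary precision: this is exact, no overflow.
print(2**64 * 3 + 7)                      # 55340232221128654855

# Fixed-width machine integers wrap around silently, with no error raised.
x = np.array([2**62], dtype=np.int64)
print(x * 4)                              # [0]  (2**64 wrapped modulo 2**64)
```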
Clean separation matters; it's really strange to force models to mimic numbers and math via incredibly unfit token-mangling, imho.