Transformers Know More Than They Can Tell: Learning the Collatz Sequence

Postedabout 1 month agoActive23 days ago

Xcelerate

127 points

45 comments

arxiv.orgResearchstory

informativeneutral

Debate

20/100

DetrsMachine LearningNumber Theory

Key topics

Detrs

Machine Learning

Number Theory

Diving into the mysteries of artificial intelligence, researchers have discovered that transformers can grasp more than they can explicitly convey, as evidenced by their experiment with the Collatz sequence. Commenters were quick to unpack the findings, with some noting that the model's 99.7% accuracy in certain bases is impressive, while others pointed out that a 0.3% error rate is still a significant limitation. As one commenter astutely observed, a calculator that produces incorrect results 0.3% of the time would be practically unusable, highlighting the tension between the model's capabilities and its reliability. The discussion also touched on the research itself, with some speculating that the focus on the Collatz sequence might be a red herring, and others wondering why the authors didn't explore their findings further.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

132-144h

Avg / period

15.3

Comment distribution46 data points

Loading chart...

Based on 46 loaded comments

Key moments

01Story posted
Dec 3, 2025 at 12:50 PM EST
about 1 month ago
Step 01
02First comment
Dec 9, 2025 at 8:41 AM EST
6d after posting
Step 02
03Peak activity
29 comments in 132-144h
Hottest window of the conversation
Step 03
04Latest activity
Dec 10, 2025 at 5:57 PM EST
23 days ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (45 comments)

Showing 46 comments

niek_pas

24 days ago

3 replies

Can someone ELI5 this for a non-mathematician?

poszlem

24 days ago

1 reply

A transformer can. Here gemini:

The Experiment: Researchers trained AI models (Transformers) to solve a complex arithmetic problem called the "long Collatz step".

The "Language" Matters: The AI's ability to solve the problem depended entirely on how the numbers were written. Models using bases divisible by 8 (like 16 or 24) achieved nearly 100% accuracy, while those using odd bases struggled significantly.

Pattern Matching, Not Math: The AI did not learn the actual arithmetic rules. Instead, it learned to recognize specific patterns in the binary endings of numbers (zeros and ones) to predict the answer.

Principled Errors: When the AI failed, it didn't hallucinate random answers. It usually performed the correct calculation but misjudged the length of the sequence, defaulting to the longest pattern it had already memorized.

Conclusion: These models solve complex math by acting as pattern recognizers rather than calculators. They struggle with the "control structure" (loops) of algorithms unless the input format reveals the answer through shortcuts.

embedding-shape

24 days ago

1 reply

Do you think maybe OP would have asked a language model for the answer if they felt like they wanted a language model to give an answer? Or in your mind parent doesn't know about LLMs, and this is your way of introducing them to this completely new concept?

NitpickLawyer

24 days ago

1 reply

Funny that the "human" answer above took 2 people to be "complete" (i.e. an initial answer, followed by a correction and expansion of concepts), while the LLM one had mostly the same explanation, but complete and in one answer.

embedding-shape

24 days ago

2 replies

Maybe most of us here don't seek just whatever answer to whatever question, but the human connection part of it is important too, that we're speaking with real humans that have real experience with real situations.

Otherwise I'd just be sitting chatting with ChatGPT all day instead of wast...spending all day on HN.

NitpickLawyer

24 days ago

Oh, I agree. What I found funny is the gut reaction of many other readers that downvoted the message (it's greyed out for me at time of writing this comment). Especially given that the user clearly mentioned that it was LLM generated, while also being cheeky with the "transformer" pun, on a ... transformer topic.

pixl97

24 days ago

If life is a jobs program, why don't we dig ditches with spoons?

esafak

24 days ago

1 reply

The model partially solves the problem but fails to learn the correct loop length:

> An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.

The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture

spuz

24 days ago

1 reply

That's a bit uncharitable. In bases 8, 12, 16, 24 and 32 their model achieved 99.7% accuracy. They would never expect it to achieve 100% accuracy. It would be like if you trained a model to predict whether or not a given number is prime. A model that was 100% accurate would defy mathematical knowledge but a model that was 99.7% would certainly be impressive.

In this case, they prove that the model works by categorising inputs into a number of binary classes which just happen to be very good predictors for this otherwise random seeming sequence. I don't know whether or not some of these binary classes are new to mathematics but either way, their technique does show that transformer models can be helpful in uncovering mathematical patterns even in functions that are not continuous.

jacquesm

24 days ago

6 replies

A pocket calculator that would give the right numbers 99.7% of the time would be fairly useless. The lack of determinism is a problem and there is nothing 'uncharitable' about that interpretation. It is definitely impressive, but it is fundamentally broken, because when you start making chains of things that are 99.7% correct you end up with garbage after very few iterations. That's precisely why digital computers won out over analog ones, the fact that they are deterministic.

fkarg

24 days ago

2 replies

yeah it's only correct in 99.7% of all cases, but what if it's also 10'000 times faster? There's a bunch of scenarios where that combination provides a lot of value

lkey

24 days ago

Ridiculous counterfactual. The LLM started failing 100% of the time 60! orders of magnitude sooner than the point at which we have checked literally every number.

This is not even to mention the fact that asking a GPU to think about the problem will always be less efficient than just asking that GPU to directly compute the result for closed algorithms like this.

jacquesm

24 days ago

Correctness in software is the first rung of the ladder, optimizing before you have correct output is in almost all cases a complete waste of time. Yes, there are a some scenarios where having a ballpark figure quickly can be useful if you can produce the actual result as well and if you are not going to output complete nonsense the other times but something that approaches the final value. There are a lot of algorithms that do this (for instance: Newton's method for finding square roots).

99.7% of the time good and 0.3% of the time noise is not very useful, especially if there is no confidence indicating that the bad answers are probably incorrect.

pixl97

24 days ago

1 reply

Why do people keep using LLMs as algorithms?

LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.

>That's precisely why digital computers won out over analog ones, the fact that they are deterministic.

I mean, no not really, digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).

Again, if you have a deterministic solution that is 100% correct all the time, use it, it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or the deterministic solution uses more energy than will ever be available in the local part of our universe. Furthermore a lot of AI (not even LLMs) use random noise at particular steps as a means to escape local maxima.

jacquesm

24 days ago

> Why do people keep using LLMs as algorithms?

I think they keep coming back to this because a good command of math underlies a vast domain of applications and without a way to do this as part of the reasoning process the reasoning process itself becomes susceptible to corruption.

> LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.

If only it were that simple.

> I mean, no not really, digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).

Try building a practical analog computer for a non-trivial problem.

> Again, if you have a deterministic solution that is 100% correct all the time, use it, it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or the deterministic solution uses more energy than will ever be available in the local part of our universe. Furthermore a lot of AI (not even LLMs) use random noise at particular steps as a means to escape local maxima.

No, people use LLMs for anything and one of the weak points in there is that as soon as it requires slightly more complex computation there is a fair chance that the output is nonsense. I've seen this myself in a bunch of non-trivial trials regarding aerodynamic calculations, specifically rotation of airfoils relative to the direction of travel. It tends to go completely off the rails if the problem is non-trivial and the user does not break it down into roughly the same steps as you would if you were to work out the problem by hand (and even then it may subtly mess up).

spuz

24 days ago

1 reply

It's uncharitable because the comment purports to summarise the entire paper while simply cherry picking the worst result. It would be like if asked how did I do on my test and you said well you got question 1 wrong and then didn't elaborate.

Now I get your point that a function that is 99.7% accurate will eventually always be incorrect but that's not what the comment said.

esafak

24 days ago

I just tried to get to the heart of the claim based on a skim. Please feel free to refine my summary.

beambot

24 days ago

1 reply

Most primality tests aren't 100% accurate either (eg Miller Rabin), they just are "reasonably accurate" while being very fast to compute. You can use them in conjunction to improve your confidence in the result.

jacquesm

24 days ago

1 reply

Yes, and we know they are inaccurate and we know that if you find a prime that way you can only use it to reject, not confirm so if you think that something is prime you need to check it.

But now imagine that instead of it being a valid reject 0.3% of the time it would also reject valid primes. Now it would be instantly useless.

brokensegue

24 days ago

I don't know people are saying it's useful. Just interesting

briandw

24 days ago

2 replies

[delayed]

jacquesm

24 days ago

There is a massive difference between an 'unsolved problem' and a problem solved 'the wrong way'. Yes, 99.7% is surprisingly good. But it did not detect the errors in its own output. And it should have.

Besides, we're all stuck on the 99.7% as if that's the across the board output, but that's a cherry picked result:

"The best models (bases 24, 16 and 32) achieve a near-perfect accuracy of 99.7%, while odd-base models struggle to get past 80%."

I do think it is a very interesting thing to do with a model and it is impressive that it works at all.

godelski

24 days ago

Category error.

The problem here is deterministic. *It must be for accuracy to even be measured*.

The model isn't trying to solve the Collatz conjecture, it is learning a pretty basic algorithm and then doing this a number of times. The instructions it needs to learn is

  if x % 2:
      x /= 2
  else:
      x = x*3 + 1

It also needs to learn to put that in a loop and for that to be a variable, but the algorithm is static.

On the other hand, the Collatz conjecture states that for C(x) (the above algorithm) has a fixed point of 1 for all x (where x \in Z+). Meaning that eventually any input will collapse to the loop 1 -> 4 -> 2 -> 1 (or just terminate at 1). You can probably see we know this is true for at least an infinite set of integers...

famouswaffles

24 days ago

>A pocket calculator that would give the right numbers 99.7% of the time would be fairly useless.

Well that's great and all, but the vast majority of llm use is not for stuff you can just pluck out a pocket calculator (or run a similarly airtight deterministic algorithm) for, so this is just a moot point.

People really need to let go of this obsession with a perfect general intelligence that never makes errors. It doesn't and has never existed besides in fiction.

robot-wrangler

24 days ago

I'll take a shot at it. Using collatz as the specific target for investigating the underlying concepts here seems like a big red-herring that's going to generate lots of confused takes. (I guess it was done partly to have access to tons of precomputed training data and partly to generate buzz. The title also seems kind of poorly chosen and/or misleading)

Really the paper is about mechanistic interpretation and a few results that are maybe surprising. First, the input representation details (base) matters a lot. This is perhaps very disappointing if you liked the idea of "let the models work out the details, they see through the surface features to the very core of things". Second, learning was burst'y with discrete steps, not smooth improvement. This may or may not be surprising or disappointing.. it depends how well you think you can predict the stepping.

rikimaru0345

24 days ago

3 replies

Ok, I've read the paper and now I wonder, why did they stop at the most interesting part?

They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?

embedding-shape

24 days ago

1 reply

Why release one paper when you can release two? Easier to get citations if you spread your efforts, and if you're lucky, someone needs to reference both of them.

A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall for scope-creep, which is easier said than done.

Y_Y

24 days ago

2 replies

For interest, this popular pastime goes by several delicious names: https://en.wikipedia.org/wiki/Least_publishable_unit

kkylin

24 days ago

2 replies

[delayed]

godelski

24 days ago

2 replies

Not only that but in the academic world 20 papers with 50 citations is worth more than one paper with 1000. Even though the total citation count is the same the former gives you an h-index of 20 (and an i-10 of 20) but the latter only gives you an h-index of 1 (ditto for i-10).

Though truthfully it's hard to say what's better. All can be hacked (a common way to hack citations is to publish surveys. You also just get more by being at a prestigious institution or being prestigious yourself). The metric is really naïve but it's common to use since actual evaluating the merits of individual works is quite time consuming and itself an incredibly noisy process. But hey, publish or perish, am I right?[0]

[0] https://www.sciencealert.com/peter-higgs-says-he-wouldn-t-ha...

jacquesm

24 days ago

1 reply

That's a fantastic example of that that which gets measured gets optimized. The academic world's fascination with this citation metrics is hilarious, it is so reminiscent of programmers optimizing for whatever metric managements has decided is the true measure of programmer productivity. Object code size, lines of code, tickets closed and so on...

godelski

24 days ago

It's definitely a toxic part of academia. Honestly if it weren't for that I'd take an academic job over an industry one in a heartbeat.

Some irony is my PhD was in machine learning. Every intro course I now (including mine) discusses reward hacking (aka Goodhart's Law). The irony being that the ML community had dialed this problem up to 11. My peers that optimized this push out 10-20 papers a year. I think that's too many and means most of the papers are low impact. I have similar citation counts to them but lower h-index and they definitely get more prestige for that even though it's harder to publish more frequently in my domain (my experiments take a lot longer). I'm with Higgs though, it's a lazy metric and imo does more harm than good.

p1esk

24 days ago

in the academic world 20 papers with 50 citations is worth more than one paper with 1000

It depends. If your goal is to get a job at OpenAI or DeepMind, one famous paper might be better.

Y_Y

24 days ago

Well put. Nobody want salami slices, but nobody wants War and Peace, either (most of the time). Both are problems, even if papers are more often too short than too long.

senkora

24 days ago

Relevant SMBC: https://www.smbc-comics.com/index.php?db=comics&id=1624

fcharton

24 days ago

2 replies

Author, here. The paper is about the Collatz sequence, how experiments with a transformer can point at interesting facts about a complex mathematical phenomenon, and how, in supervised math transformers, model predictions and errors can be explained (this part is a follow-up to a similar paper about GCD). From a ML research perspective, the interesting (but surprising) take away is the particular way the long Collatz function is learned: "one loop at a time".

To me, the base conversion is a side quest. We just wanted to rule out this explanation for the model behavior. It may be worth further investigation, but it won't be by us. Another (less important) reason is paper length, if you want to submit to peer reviewed outlets, you need to keep pages under a certain number.

observationist

24 days ago

1 reply

It might be a side quest, or it could be an elegant way to frame a category of problems that are resistant to the ways in which transformers can learn; in turn, by solving that structural deficiency in order to enable a model to effectively learn that category of problems, you might empower a new leap in capabilities and power.

We're a handful of breakthroughs before models reach superhuman levels across any and all domains of cognition. It's clear that current architectures aren't going to be the end-all solution, but all we need might simply be a handful of well-posed categorical deficiencies that allow a smooth transition past the current jagged frontiers.

jacquesm

24 days ago

> We're a handful of breakthroughs before models reach superhuman levels across any and all domains of cognition.

That's a pretty bold claim to make.

godelski

24 days ago

I'm curious about 2 things.

1) Why did you not test the standard Collatz sequence? I would think that including that, as well as testing on Z+, Z+\2Z, and 2Z+, would be a bit more informative (in addition to what you've already done). Even though there's the trivial step it could inform how much memorization the network is doing. You do notice the model learns some shortcuts so I think these could help confirm that and diagnose some of the issues.

2) Is there a specific reason for the cross attention?

Regardless, I think it is an interesting paper (these wouldn't be criteria for rejection were I reviewing your paper btw lol. I'm just curious about your thoughts here and trying to understand better)

fiveMoreCents

24 days ago

cuz you don't sell nonsense in one piece. it used to be "repeat a lie often enough" ... now lies are split into pieces ...

you'll see more of all that in the next few years.

but if you wanna stay in awe, at your age and further down the road, don't ask questions like you just asked.

be patient and lean into the split.

brains/minds have been FUBARed. all that remains is buying into the fake, all the way down to faking it when your own children get swooped into it all.

Onavo

24 days ago

2 replies

Interesting, what about the old proof that neural networks can't model arbitrary length sine waves?

ChadNauseam

24 days ago

1 reply

I don't know that computers can model arbitrary length sine waves either. At least not in the sense of me being able to input any `x` and get `sin(x)` back out. All computers have finite memory, meaning they can only represent a finite number of numbers, so there is some number `x` above which they can't represent any number.

Neural networks are more limited of course, because there's no way to expand their equivalent of memory, while it's easy to expand a computer's memory.

Onavo

23 days ago

Here's the paper for your interest

https://arxiv.org/abs/2006.08195

kirubakaran

24 days ago

That proof only applies to fixed architecture feed forward multilayer perceptrons with no recurrence, iirc. Transformers are not that.

jebarker

24 days ago

1 reply

This is an interesting paper and I like this kind of mechanistic interpretability work - but I cannot figure out how the paper title "Transformers know more than they can tell" relates to the actual content. In this case what is it that they know and can't tell?

godelski

24 days ago

I believe it's a reference to the paper "Language Models (Mostly) Know What They Know".

There's definitely some link but I'd need to give this paper a good read and refresh on the other to see how strong. But I think your final sentence strengthens my suspicion

https://arxiv.org/abs/2207.05221

View full discussion on Hacker News

ID: 46137596Type: storyLast synced: 12/11/2025, 6:36:15 AM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN