Solving a Million-Step LLM Task with Zero Errors
Key topics
Researchers propose a method to decompose complex tasks into simpler steps that can be executed by relatively small LLMs with high accuracy, demonstrated on the Towers of Hanoi problem, sparking discussion on the potential and limitations of this approach.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 1h after posting. Peak period: 24 comments in the 2-4h window. Average: 4.9 comments per period. Based on 64 loaded comments.
Key moments
- Story posted: Nov 18, 2025 at 11:26 AM EST (about 2 months ago)
- First comment: Nov 18, 2025 at 12:41 PM EST (1h after posting)
- Peak activity: 24 comments in the 2-4h window, the hottest period of the conversation
- Latest activity: Nov 19, 2025 at 1:49 PM EST (about 2 months ago)
Briefly, the idea is to recursively decompose tasks into the simplest possible steps, call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to choose how to execute each step. The authors use this technique to get a relatively small LLM to solve Towers of Hanoi with 20 rings (~1M steps). All of it using natural language.
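For concreteness, here is a minimal sketch of the shape of such a pipeline. It is not the paper's implementation: `ask_llm`, `apply_move`, and `is_done` are hypothetical stand-ins, and the paper's actual voting rule and error-handling details may differ from the simple majority vote used here.

```python
# Illustrative sketch of per-step decomposition + voting (not the paper's exact scheme).
# `ask_llm` is a hypothetical stand-in for a call to a small model that, given the
# current peg state, proposes the single next Hanoi move as text like "move 1 A C".
from collections import Counter

def vote_on_step(ask_llm, state: str, samples: int = 5) -> str:
    """Sample several independent answers for one micro-step and keep the majority."""
    answers = Counter(ask_llm(state) for _ in range(samples))
    best, count = answers.most_common(1)[0]
    if count <= samples // 2:
        raise RuntimeError("no clear majority; resample or escalate")
    return best

def run_task(ask_llm, initial_state: str, apply_move, is_done, max_steps: int = 2**21):
    """Execute a long task one voted micro-step at a time."""
    state = initial_state
    for step in range(max_steps):
        if is_done(state):
            return state, step
        move = vote_on_step(ask_llm, state)
        state = apply_move(state, move)  # deterministic state update outside the LLM
    raise RuntimeError("step budget exhausted")
```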
The most obvious question is whether other tasks, more interesting -- less "rote" -- than Towers of Hanoi, can similarly be recursively decomposed into simple steps. I'm not sure that's always possible.
Either they win or don't. /s
Let’s say you break a job down into 3 tasks: A, B and C. Doing one of those tasks is too much for an LLM to accomplish in one turn (this is something you learn intuitively through experience), but an LLM could break each task into 3 subtasks. So you do that, and start by having the LLM break task A into subtasks A1, A2 and A3. And B into B1, B2 and B3. But when you break down task C, the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”.
This sort of “tunnel vision” is currently an issue with scaling 2025 agents. As useful context lengths get longer it’ll get easier, but figuring out how to pack exactly the right info into a context is tough, especially when the tool you’d reach for to automate it (LLMs) is the same tool that suffers from these context limitations.
None of this means big things aren’t possible, just that the fussiness of these systems increases with the size of the task, and that fussiness leads to more need for “human review” in the process.
Planning is definitely still something that requires a human in the loop, but I have been able to avoid the problem you are describing. It does require some trickery (not yet represented in the /plan command) when the overall plan exceeds a reasonable context window size (~20k tokens). You basically have to have the AI compare combinatorially many batches of the plan against each other, to discover and correct these dependency issues.
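A rough sketch of my reading of that pairwise-batch consistency check; this is not a documented tool, and `check_consistency` is a hypothetical LLM call that returns a list of conflicts between two batches of plan items.

```python
# Compare every pair of plan batches so that no batch is written "blind" to the others.
from itertools import combinations

def find_plan_conflicts(batches: list[str], check_consistency) -> list[tuple[int, int, str]]:
    """Return (batch_i, batch_j, issue) for every dependency conflict found."""
    conflicts = []
    for (i, a), (j, b) in combinations(enumerate(batches), 2):
        for issue in check_consistency(a, b):  # e.g. "C1 assumes A1 was never done"
            conflicts.append((i, j, issue))
    return conflicts
```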
Depends on what is considered small enough for the LLM to resolve with high confidence.
It's like humans! Everything old is new again :)
Most real world prompts can't be reduced to something so consistent and reliable.
Their key finding was that the number of votes grows linearly with the number of prompts you are trying to chain.
However, the issue is that the number of votes you need will grow exponentially with the hallucination rate.
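As a rough sanity check on those scaling claims, here is a toy calculation assuming independent errors and plain odd-k majority voting. It is a back-of-the-envelope model, not the paper's analysis or its actual voting rule.

```python
# How many votes per step does a long chain need under simple majority voting?
from math import comb

def majority_error(p: float, k: int) -> float:
    """P(majority of k independent samples is wrong) when each sample errs with prob p."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1, k + 1))

def votes_needed(p: float, steps: int, target: float = 0.99, k_max: int = 201) -> int:
    """Smallest odd k so a chain of `steps` voted steps succeeds with prob >= target."""
    for k in range(1, k_max, 2):
        if (1 - majority_error(p, k)) ** steps >= target:
            return k
    raise ValueError("no k under k_max is enough at this error rate")

# e.g. votes_needed(0.01, 1_000_000) stays small, but the required k climbs sharply
# as the per-sample error rate approaches 0.5.
```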
Combining this with those approaches that recursively reason in latent space would be interesting.
This can't be scaled to more generalised tasks. If you solve that then you've solved the hallucination issue.
A big if: that the decomposition and the voting happen accurately for anything other than toy problems.
In other words, it compensates for random error, not systematic error.
https://xkcd.com/1162/
As someone else points out, the data is the worrying aspect, as it points towards state-of-the-art models not being capable of making more than 0 consecutive steps without errors.
...
What a relief to see an obvious problem actually acknowledged. I can't even guess how many times I've been shouted down about this exact topic in the reasoning debates on HN, or seen papers just kind of glossing over it as if it were a non-issue.
The next really natural question is.. if you're committed to decomposing a problem into tons of microsteps and voting.. why aren't we just embracing hybrid symbolic systems? The decomposition step kind of implies you're in a problem domain where variables separate out somewhat cleanly and that this should be doable. As far as I can tell the "voting" discussed in the paper is about candidate outputs, i.e. solutions to subproblems? If you switch to hybrid symbolic systems, then you can vote on candidate inputs to solvers and at least be damned sure that their output is always correct.
Also the success of chain-of-code compared with chain-of-thought approaches could actually imply that having no real solver is maybe not the obstacle you'd expect! Maybe you can invent a semiformal logic just in time that appears to be expressive enough to encapsulate the problem domain, and have the LLM emulate a nonexistent solver. If the error rate with this sort of approach is still too high, then at least you know concretely what solver or formal-language you need to implement in order to improve.
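To make the hybrid idea concrete, here is a minimal sketch of what "vote on candidate inputs to a solver" could look like, using Towers of Hanoi since that is the paper's benchmark. This illustrates the suggestion above rather than anything from the paper; `propose_instance` is a hypothetical LLM call that maps a task description to solver parameters.

```python
# Keep the LLM for picking/validating *inputs*; let a deterministic solver produce the
# steps, whose correctness is then not in question.
from collections import Counter

def hanoi_moves(disks: int, source: str, target: str, spare: str):
    """Deterministic Towers of Hanoi solver: yields every move, guaranteed correct."""
    if disks == 0:
        return
    yield from hanoi_moves(disks - 1, source, spare, target)
    yield (disks, source, target)
    yield from hanoi_moves(disks - 1, spare, target, source)

def solve_with_voting(task: str, propose_instance, samples: int = 5):
    """Vote only on the solver inputs the LLM extracts from the task description."""
    proposals = Counter(tuple(sorted(propose_instance(task).items())) for _ in range(samples))
    params = dict(proposals.most_common(1)[0][0])
    # e.g. params == {"disks": 20, "source": "A", "target": "C", "spare": "B"}
    return list(hanoi_moves(params["disks"], params["source"], params["target"], params["spare"]))
```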
This is also why (edit: non-LIDAR) FSD cars are an illusion.
Humans drive around with two mid-tier cameras on a pivot mount. Which means that any sufficiently advanced AI can do the same.
When a FSD car gets into an avoidable collision, you dump the blackbox data, and what do you see? You see that the cameras had it. All the information the car needed to avoid a collision was right there in the visual stream. The car had every bit of information it needed to make the right call, and didn't make the right call.
You can acknowledge that, and focus on building better AIs. Or you can neglect AI altogether, and have a car with 6 LIDARs drag a pedestrian, because it had all the sensor coverage but zero object permanence.
Prolog seemed like a natural choice for this (at least to me :-), since it's a relatively simple language that makes it easy to build meta-interpreters and allows for fairly concise task/workflow representations.
Anyway the basic point being.. it is no wonder LLM reasoning abilities suck when we have no decent intermediate representation for "thinking" in terms of set/probability primitives. And it is no wonder LLMs suck at larger code-gen tasks when we have no decent intermediate representation for "thinking" in terms of abstract specifications. The obsession with natural-language inputs/intermediates has been a surprise to me. LLMs are compilers, and we need to walk with various spec -> spec compilers first so that we can run with spec -> code compilers
So nobody should be surprised that this also applies to LLMs.
The issue is when people assume that a zero failure rate, or even close to zero, is necessary for utility, even though we don't need that from humans for humans to be useful for complex tasks.
For a whole lot of tasks, the acceptable error rate boils down to how costly it is to work around, and that is a function of the error rate, the consequence of an error that slips past, and the cost of a "reliable enough" detector that lets us mitigate to whatever extent is cost effective by using one or more detection steps.
For a lot of uses, voting or putting the AI in a loop produces good enough results cheaply enough. For some it will require models with lower error rates first.
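The trade-off described there can be made concrete with a toy expected-cost model; the numbers below are purely illustrative assumptions, not anything from the thread.

```python
# Toy cost model: detection cost + rework on caught errors + damage from escaped errors.
def expected_cost_per_task(error_rate: float, detector_recall: float,
                           detector_cost: float, rework_cost: float,
                           escaped_error_cost: float) -> float:
    caught = error_rate * detector_recall
    escaped = error_rate * (1 - detector_recall)
    return detector_cost + caught * rework_cost + escaped * escaped_error_cost

# A 5% error rate with a 90%-recall review step can still beat "no review"
# when escaped errors are expensive:
with_review = expected_cost_per_task(0.05, 0.9, detector_cost=1, rework_cost=5, escaped_error_cost=500)
no_review = expected_cost_per_task(0.05, 0.0, detector_cost=0, rework_cost=0, escaped_error_cost=500)
```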
For some applications, sure, maybe solvers will be part of that, or in the mix. As will a lot of other tools. E.g. Claude likes to try to bisect when I ask it to fix a parser problem, and Claude is really bad at doing sensible bisection, so I had it write a dumb little bisection tool instead, and gave it steps for solving this type of problem that include using that tool. So when we can have planning steps output "microsteps" that we can automate with more deterministic tools, then we absolutely should.
Heck, the models themselves "like" to write tools to automate things if you give them long lists of tedious little tasks, to the point that it takes effort to make them not do it, even when they have to write the tools themselves.
This argument doesn't hold because it is beside the point. Human vs. LLM utility parity isn't a sensible stopping point for improvement. New technology isn't adopted for parity with what it replaces. Nor are there any specific technical barriers at human parity.
Fewer mistakes than humans, by definition, delivers unique value. People also want to spin up LLMs to handle tasks at scale in ways humans never could, where human level mistakes would be unacceptable.
So we very much do need LLMs (or whatever we call them tomorrow) to operate with lower error bars than humans. It is a reasonable demand. Lots of applications are waiting.
Given that demand, the value of avoiding any mistake, and the many people working on it, error rates will keep falling indefinitely.
“No one knows the cost of a defective product - don't tell me you do. You know the cost of replacing it, but not the cost of a dissatisfied customer.” -Deming
Who said anything about AI vs humans? The contest in this context would be AI vs classical deterministic code: algorithms, solvers.
> how costly it is to work around .. a function of the error rate, the consequence of an error that slips past, the cost of a "reliable enough" detector .. produces good enough results cheaply enough.
I mean, you're right, but only sort of. Someone can use this same argument to justify the assertion that bogosort is really the pinnacle of engineering excellence. How would you respond?
Also, if we decompose a big task into many tasks, some might be solved in a way that is incompatible with the rest of the tasks, and then you cannot combine them.
The meat is in decomposing the difficult problem into steps.
The paper itself, however, meh...
No mention of MoE. One would think this is a logical evolution of that, but there's not a mention (that I saw). Its own rubric for the task, Towers of Hanoi, was admittedly weak.
LLM papers are starting to look like the last decade of JS frameworks and tools, only with less code and more academics, and that's disappointing, because I think a lack of pragmatism and grounding is now holding the field back...
31 more comments available on Hacker News