DeepConf: Scaling LLM Reasoning with Confidence, Not Just Compute
Posted 5 months ago · Active 5 months ago
arxiviq.substack.com · Tech · story
Sentiment: calm / mixed
Debate: 60/100
Key topics
LLM Optimization
AI Research
Compute Efficiency
The article discusses a new method called DeepConf that improves LLM reasoning by using confidence measures, sparking discussion on its implications, limitations, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 51m after posting
Peak period: 34 comments in 0-12h
Avg / period: 17.5
Comment distribution: 35 data points (based on 35 loaded comments)
Key moments
- 01 Story posted: Aug 24, 2025 at 10:41 AM EDT (5 months ago)
- 02 First comment: Aug 24, 2025 at 11:32 AM EDT (51m after posting)
- 03 Peak activity: 34 comments in 0-12h (hottest window of the conversation)
- 04 Latest activity: Aug 28, 2025 at 10:02 PM EDT (5 months ago)
ID: 45004617 · Type: story · Last synced: 11/20/2025, 1:17:51 PM
It's not remotely practical to select the most probable path, but you can do a little bit of search a few tokens at a time.
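For illustration, here is a minimal sketch of that kind of limited lookahead, token-level beam search; `next_token_logprobs` and the toy vocabulary below are hypothetical stand-ins for a real model call, not any particular library's API:

```python
import math
import random

def beam_search(next_token_logprobs, prompt, width=4, steps=3):
    """Instead of exhaustively finding the single most probable
    continuation (exponential in sequence length), keep only the
    `width` best partial sequences and extend them a few tokens."""
    beams = [(0.0, list(prompt))]  # (cumulative logprob, token list)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # Prune: keep only the `width` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    return beams

# Toy demo with a fake "model" over a 3-token vocabulary.
random.seed(0)
def fake_logprobs(seq):
    ps = [random.random() + 0.01 for _ in range(3)]
    z = sum(ps)
    return {tok: math.log(p / z) for tok, p in enumerate(ps)}

for score, seq in beam_search(fake_logprobs, prompt=[101], width=2, steps=4):
    print(round(score, 3), seq)
```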
Well, the folks on this website think installing vLLM (pip install vLLM...) is hard and that ollama - a far slower and shittier inference engine - is better. Enormous damage has been done to the hobbyist LLM ecosystem due to folks not knowing what tools work on what platform.
The one exception is for mac peasants where llama.cpp is still probably the best implementation, but if you have nvidia and you're not using sglang or vLLM, you're doing it wrong.
But this is of ENORMOUS use for folks who want to run tiny models at home. Go to bed, wake up with a K=512 solution to your problem.
vLLM I'm not sure about; I can't really tell what it does from the docs.vllm.ai site. So I'm not sure you conveyed what, at least, I thought you were trying to: that llama.cpp isn't good enough for "home" use, like with a 3090 or 4000-series consumer GPU.
If you want to donate some L40S to the cause, I'll send you my P.O. box info.
The authors seem to be Chinese, and may not be that confident in their English. I suspect that we'll be seeing a lot more of this kind of stuff, as time goes on.
Also, some of the very worst English I've ever read has been technical prose written by born-and-bred native English speakers with very high educational credentials.
Clear communication is important. The best idea on Earth is worthless if it can't be articulated well.
No, it was fully or almost fully LLM generated. See: https://arxiviq.substack.com/p/coming-soon
If otherwise, then it looks like The Singularity has arrived.
It’s a perfectly valid article: an AI-generated summary of a lot of work done by humans.
Not a paper that would be presented for peer review, but rather something to be consumed by a regular mensch (like me).
That’s actually something that AI is pretty good at. I use it to summarize stuff for me all the time.
It should probably have a disclaimer somewhere saying what it is, maybe with a link to the raw source, but it’s just another way of communicating.
I’ve been reading human-generated marketing drivel for decades. This is actually a lot better than that stuff.
Careful where you place your anger. You should not be angry at the people writing the paper.
No, I think the confusing thing is that the LLM-written blog post doesn't adequately explain the screenshot.
> "Specifically, DeepConf-low uses top η= 10% (corresponding to the 90th percentile) and DeepConf-high uses top η = 90% (corresponding to the 10th percentile) uniformly across all settings. This threshold ensures that during online generation, traces are terminated when their confidence falls below the level that retains the top η% highest-confidence traces from the warmup phase."
I'm not sure if I'm parsing it right, but are they using "low" and "high" as descriptors of the number used as the percentage, meaning that the "low" 10 cuts anything outside the best 10%, while the "high" 90 keeps the best 90%? I.e., high is less selective than low?
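That parse matches the quote: "low" keeps only the top 10% of warmup traces (an aggressive, high bar for survival), while "high" keeps the top 90% (only the worst tenth is cut), so high is indeed less selective than low. A minimal sketch of the threshold computation, with toy warmup confidences rather than numbers from the paper:

```python
import numpy as np

def stopping_threshold(warmup_confidences, eta):
    """Confidence threshold that retains the top `eta` fraction of
    warmup traces; online traces falling below it are terminated."""
    return float(np.percentile(warmup_confidences, 100 * (1 - eta)))

rng = np.random.default_rng(0)
warmup = rng.normal(loc=2.0, scale=0.5, size=16)  # toy confidences

# DeepConf-low: keep top 10% -> threshold at the 90th percentile (strict).
print("low: ", stopping_threshold(warmup, eta=0.10))
# DeepConf-high: keep top 90% -> threshold at the 10th percentile (lenient).
print("high:", stopping_threshold(warmup, eta=0.90))
```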
Source: https://arxiviq.substack.com/p/coming-soon
I also do manual reviews (https://gonzoml.substack.com/), but there are many more papers for which I don't have time to write a review. So I created a multi-agentic system to help me, and I'm constantly iterating to improve it. And I like the result. It has also been validated by the paper authors a couple of times; they agree the reviews are correct. So, if you see something that is definitely wrong, please let me know.
As for me, I've become at least 10x more productive at reading papers and understanding what's happening. I hope it will also help some of you.
The previous self-consistency approach and this confidence pruning approach aren't really novel, but it's nice to see the numbers run. Fundamentally these approaches are about handling contradicting results, but not resolving the contradictions or increasing the quality of reasoning. What if the rare idea is the right answer? You can squeeze the training juice harder, but if you still get the wrong answer when it really really mattered, you're just left with a stress toy in your hand.
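For reference, the usual way those contradicting results get handled is (confidence-weighted) majority voting over final answers. A toy sketch with invented answers and confidences, which also shows that a rare answer can still win if its traces are confident enough:

```python
from collections import defaultdict

def weighted_majority_vote(traces):
    """Sum each trace's confidence per final answer instead of giving
    every trace one vote, then return the best-scoring answer."""
    scores = defaultdict(float)
    for answer, confidence in traces:
        scores[answer] += confidence
    return max(scores, key=scores.get), dict(scores)

# Invented data: three lukewarm votes for "42" vs. one very confident "17".
traces = [("42", 0.55), ("42", 0.50), ("42", 0.52), ("17", 1.90)]
best, tally = weighted_majority_vote(traces)
print(best, tally)  # -> 17 {'42': 1.57, '17': 1.9}
```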
And we're supposed to find said code at https://jiaweizzhao.github.io/deepconf at some point?
Reducing the cost of reasoning is a huge ongoing challenge in LLMs. We're spending so much energy and compute on reasoning that today's consumption rates would have been unexpected (to me) a short year ago. We're literally burning forests, polluting the atmosphere, and making electricity expensive for everyone.
DeepSeek v3.1 made a significant leap in this direction recently: much shorter thinking traces at the same quality. GPT-5's router was also one (important) attempt to reduce reasoning costs and make o3-quality answers available in the free tier without breaking the bank. This is also why Claude 4 is winning the coding wars against its reasoning peers: it provides great quality without all the added reasoning tokens.
Taking inspiration from the AlphaGo and MCTS literature, applying tree weighting, prioritization, and pruning feels extremely appropriate (to improve the quality of Deep Think, offered by Gemini and GPT-5 Pro today).
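For concreteness, a toy version of the selection rule at the heart of that literature, UCT (Upper Confidence bounds applied to Trees), with made-up node statistics:

```python
import math

def uct_score(wins, visits, parent_visits, c=1.4):
    """Balance exploitation (observed value) against exploration
    (rarely visited branches); unvisited nodes get top priority."""
    if visits == 0:
        return float("inf")
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Pick which reasoning branch to expand next; branches whose score
# stays low would be pruned (the analogue of DeepConf's early stop).
children = {"step_a": (3, 5), "step_b": (1, 4), "step_c": (0, 0)}
parent_visits = 9
scores = {k: uct_score(w, v, parent_visits) for k, (w, v) in children.items()}
print(max(scores, key=scores.get), scores)
```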
So, yes, more of this please. Totally the right direction.