Continuous Autoregressive Language Models
Posted 2 months ago · Active about 2 months ago
arxiv.org · Tech · story
calm · positive
Debate: 40/100
Key topics
Artificial Intelligence
Language Models
Machine Learning
The HN community discusses a new paper on Continuous Autoregressive Language Models, exploring its potential to improve the efficiency and capabilities of LLMs while also raising concerns about possible failure modes and limitations.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 8d
Peak period: 10 comments in 180-192h
Avg / period: 10
Key moments
- 01 Story posted: Nov 5, 2025 at 4:49 PM EST (2 months ago)
- 02 First comment: Nov 13, 2025 at 5:01 AM EST (8d after posting)
- 03 Peak activity: 10 comments in 180-192h (hottest window of the conversation)
- 04 Latest activity: Nov 13, 2025 at 3:11 PM EST (about 2 months ago)
ID: 45828523 · Type: story · Last synced: 11/20/2025, 8:00:11 PM
- Diversity: This term encourages the model to generate a diverse set of samples, preventing mode collapse.
- Fidelity: This term rewards the model for making predictions that are close to the ground truth.
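These two terms read like an energy-score-style objective. Here is a minimal sketch of how such a fidelity-plus-diversity loss could look in PyTorch, assuming `samples` are stochastic draws from the model and `target` is the ground-truth vector; the function name, shapes, and the 0.5 weighting are assumptions for illustration, not the paper's exact formulation:

```python
import torch

def energy_style_loss(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # samples: (S, D) stochastic draws from the model for one step
    # target:  (D,)   ground-truth next vector
    # Fidelity: reward predictions close to the ground truth.
    fidelity = torch.linalg.norm(samples - target, dim=-1).mean()
    # Diversity: reward pairwise spread among samples to prevent collapse.
    diversity = torch.cdist(samples, samples).mean()
    # Minimizing this pulls samples toward the target while keeping them spread out.
    return fidelity - 0.5 * diversity
```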
I'm wondering if a continuous next-vector generative approach would also increase the innate "reasoning" capabilities of the model, since it could potentially capture more of the semantics of the data than tokens alone do.
Obviously, you can't do it in pre-training. But you can add it later as an optional 'extra' vector, I think, e.g. `input_embedding + MLP(prev_output) * alpha`, where alpha is zero during pre-training.
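A minimal sketch of that gated feedback path, assuming a small MLP and a learnable `alpha` initialized to zero so pre-training behavior is unchanged; the module and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class GatedFeedback(nn.Module):
    """Mix the previous output vector back into the input embedding
    through a small MLP, gated by alpha (zero during pre-training)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))  # starts at zero: no-op at first

    def forward(self, input_embedding: torch.Tensor, prev_output: torch.Tensor) -> torch.Tensor:
        # input_embedding + MLP(prev_output) * alpha, as in the comment above
        return input_embedding + self.mlp(prev_output) * self.alpha
```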
What if you trained a separate thinking phase using the autoencoder, though? Might be more efficient, and then you've got it using neuralese internally.
Actually, reading the (summary) paper, they tried your idea and ran into trouble with it for a different reason:
When I'm thinking about math proofs, sometimes I can have a single idea which can be unfolded into a hundred lines of proof.
Maybe I'm getting the wrong analogy here, but if vectors = ideas, then K should depend on the vector.
Still, props to the team for going after the real root of inefficiency, not just piling on more layers. If nothing else, this is one to watch if you care about scaling models smarter.
I also wonder how far they can push K if other aspects are tweaked. Just doubling the parameter each time leaves a lot of space between the chosen value and the next value known not to work.
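One way to close that gap would be a bisection between the last K known to work and the first K known to fail. A hypothetical sketch, where `works(k)` stands in for whatever train-and-evaluate probe you would run at K = k:

```python
def refine_k(last_good: int, first_bad: int, works) -> int:
    # Bisect between the last K known to work and the first K known to fail,
    # rather than stopping at the coarse doubling grid.
    while first_bad - last_good > 1:
        mid = (last_good + first_bad) // 2
        if works(mid):        # hypothetical train/evaluate probe at K = mid
            last_good = mid
        else:
            first_bad = mid
    return last_good  # largest K that still passed the probe
```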