Diffusion Beats Autoregressive in Data-Constrained Settings
Posted 3 months ago · Active 3 months ago
blog.ml.cmu.edu · Tech · story
calm · mixed · Debate · 60/100
Key topics
Diffusion Models
Autoregressive Models
Machine Learning
A recent study suggests that diffusion models outperform autoregressive models in data-constrained settings, sparking discussion on the trade-offs between compute and data in machine learning.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 1h after posting
Peak period: 4 comments in 2-3h
Avg / period: 2.3
Comment distribution: 14 data points (based on 14 loaded comments)
Key moments
- 01 Story posted: Sep 22, 2025 at 2:21 PM EDT (3 months ago)
- 02 First comment: Sep 22, 2025 at 3:46 PM EDT (1h after posting)
- 03 Peak activity: 4 comments in 2-3h (hottest window of the conversation)
- 04 Latest activity: Sep 22, 2025 at 11:23 PM EDT (3 months ago)
ID: 45337433 · Type: story · Last synced: 11/20/2025, 2:27:16 PM
However, because of how diffusion models are trained, they never see their own predictions as input, so they cannot learn to store information across steps. Reasoning models work in exactly the opposite way.
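To spell out that contrast, here is a minimal, hedged sketch in PyTorch (the toy `denoiser`, the dimensions, and the `model` in the autoregressive loop are illustrative placeholders, not anything from the paper): a DDPM-style training step corrupts a ground-truth sample with fresh noise, so the network never conditions on its own earlier outputs during training, whereas an autoregressive sampler feeds each of its own predictions back in as the next input.

```python
import torch
import torch.nn as nn

# Toy denoiser: predicts the noise that was added to a clean 16-dim sample.
# (Hypothetical stand-in, not the paper's model.)
denoiser = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))

def diffusion_training_step(x0, alphas_bar):
    """One DDPM-style training step: the model only ever sees a noised
    version of the ground-truth sample x0, never its own prediction."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
    a = alphas_bar[t].unsqueeze(-1)                       # (B, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise          # corrupt ground truth
    inp = torch.cat([x_t, t.float().unsqueeze(-1) / len(alphas_bar)], dim=-1)
    pred_noise = denoiser(inp)
    return ((pred_noise - noise) ** 2).mean()             # loss against the true noise

def autoregressive_generate(model, prefix, steps):
    """Contrast: an autoregressive sampler does consume its own predictions,
    so information can be carried forward step by step."""
    seq = list(prefix)
    for _ in range(steps):
        nxt = model(seq)        # hypothetical next-token function
        seq.append(nxt)         # the model's own output becomes its next input
    return seq

# Usage sketch: one diffusion training step on random toy data.
alphas_bar = torch.linspace(0.99, 0.01, 1000)   # toy cumulative noise schedule
loss = diffusion_training_step(torch.randn(8, 16), alphas_bar)
```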
It should be trivial to make an encoder that has some memory of at least part of the prompt (say the trailing part) and do a diffusion step there too.
Edit: from the source [1], this quote pretty much sums it all up: "Our 2022 paper predicted that high-quality text data would be fully used by 2024, whereas our new results indicate that might not happen until 2028."
[1] https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-...
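A rough, hypothetical sketch of the encoder-with-memory suggestion above (every class name, helper, and dimension here is invented for illustration and is not the paper's architecture): a small GRU summarises the trailing prompt tokens into a memory vector, and that memory is updated alongside each denoising step so information can persist across steps.

```python
import torch
import torch.nn as nn

class DenoiserWithPromptMemory(nn.Module):
    """Hypothetical sketch: a GRU encodes the trailing prompt tokens into a
    memory vector that is carried and updated across denoising steps."""
    def __init__(self, dim=32, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.prompt_rnn = nn.GRU(dim, dim, batch_first=True)
        self.memory_update = nn.GRUCell(dim, dim)
        self.denoise = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def init_memory(self, prompt_ids, tail=8):
        tail_emb = self.embed(prompt_ids[:, -tail:])       # trailing part of the prompt only
        _, h = self.prompt_rnn(tail_emb)                   # (1, B, dim)
        return h.squeeze(0)

    def step(self, x_t, memory):
        eps_hat = self.denoise(torch.cat([x_t, memory], dim=-1))
        memory = self.memory_update(eps_hat, memory)       # memory persists across steps
        return eps_hat, memory

# Usage sketch: the memory is threaded through the reverse-diffusion loop.
model = DenoiserWithPromptMemory()
prompt = torch.randint(0, 1000, (2, 20))
x = torch.randn(2, 32)
mem = model.init_memory(prompt)
for _ in range(4):                  # a handful of denoising steps
    eps_hat, mem = model.step(x, mem)
    x = x - 0.1 * eps_hat           # toy update rule, not a real sampler
```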
Easier said than done.
Robotics tends to be even more data-constrained than NLP. The real world only runs at 1x speed, and if your robot breaks something it costs real money. Simulators are simplistic compared to reality and take a lot of manual effort to build.
You will always need to make efficient use of the data you have.
There is also the problem that on-device learning is not yet practical.
There is evidence that training RNN models to compute several steps with the same input and coefficients (but a different state at each step) leads to better performance. This was shown in a follow-up to [1] that performed an ablation study.
[1] https://arxiv.org/abs/1611.06188
They fixed the number of time steps instead of varying it and got better results.
Unfortunately, I forgot the title of that ablation paper.
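As a concrete illustration of that setup, here is a minimal sketch assuming a weight-tied recurrent cell (the GRU cell, dimensions, and readout are placeholders, not the ablation paper's model): the same input and the same coefficients are applied for a fixed number of steps, and only the hidden state changes.

```python
import torch
import torch.nn as nn

class FixedStepRNN(nn.Module):
    """Sketch of the setup described above: the same input and the same
    weights are applied for a fixed number of steps; only the state evolves."""
    def __init__(self, in_dim=16, hidden=32, steps=5):
        super().__init__()
        self.cell = nn.GRUCell(in_dim, hidden)   # one shared set of coefficients
        self.readout = nn.Linear(hidden, 10)
        self.steps = steps

    def forward(self, x):
        h = x.new_zeros(x.shape[0], self.cell.hidden_size)
        for _ in range(self.steps):              # fixed step count, not input-dependent
            h = self.cell(x, h)                  # same x, same weights, new state
        return self.readout(h)

model = FixedStepRNN()
logits = model(torch.randn(4, 16))               # (4, 10)
```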
https://cdn.aaai.org/AAAI/1987/AAAI87-048.pdf
The fixed-point nature of DEQs means they inherently have a notion of self-assessment: how close they are to the solution. If they are at the solution, they simply stop changing it. If not, they keep performing calculations.
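A minimal sketch of that self-assessment, assuming plain fixed-point iteration (real DEQ implementations typically use faster root-finders such as Broyden's method or Anderson acceleration, but the stopping logic is the same idea): the relative residual between f(z) and z tells the solver how far it is from the solution, and once it is small, further iterations change nothing.

```python
import torch

def fixed_point_solve(f, z0, tol=1e-5, max_iter=100):
    """Naive fixed-point iteration: the residual ||f(z) - z|| acts as the
    built-in self-assessment of how far we are from the solution."""
    z = z0
    for i in range(max_iter):
        z_next = f(z)
        residual = (z_next - z).norm() / (z.norm() + 1e-8)
        z = z_next
        if residual < tol:          # at (near) the fixed point: stop computing
            break
    return z, i + 1, residual

# Toy contraction map standing in for a DEQ layer: f(z) = W z + x.
W = 0.5 * torch.eye(8)
x = torch.randn(8)
z_star, steps, res = fixed_point_solve(lambda z: W @ z + x, torch.zeros(8))
```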