Nested Learning: a New ML Paradigm for Continual Learning
Key topics
The machine learning world is abuzz with Google Research's new paradigm, "Nested Learning," which promises to revolutionize continual learning. As researchers dig into the concept, some are already attempting to reproduce the results, with one enthusiast sharing an open-source implementation on GitHub. While some commenters are excited about the potential of Nested Learning, others are skeptical, with one pointing out that it may simply be "gradient descent wrapped in new terminology." The debate highlights the ongoing quest for more efficient and effective machine learning methods, with discussions around related research and potential applications adding to the conversation.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
- First comment: 6h after posting
- Peak period: 6 comments in the 20-22h window
- Avg / period: 2.5 comments
Key moments
- Story posted: Dec 7, 2025 at 9:47 AM EST (about 1 month ago)
- First comment: Dec 7, 2025 at 3:58 PM EST (6h after posting)
- Peak activity: 6 comments in 20-22h, the hottest window of the conversation
- Latest activity: Dec 8, 2025 at 7:18 AM EST (about 1 month ago)
This tidbit from a discussion on that repo sounds really interesting:
> You can load a pretrained transformer backbone, freeze it, and train only the HOPE/TITAN/CMS memory pathways.
In principle, you would (see the sketch after this list):
- Freeze the shared transformer spine (embeddings, attention/MLP blocks, layer norms, lm_head) and keep lm_head.weight tied to embed.weight.
- Train only the HOPE/TITAN memory modules (TITAN level, CMS levels, self-modifier projections, inner-optimizer state).
- Treat this like an adapter-style continual-learning finetune: base model provides stable representations; HOPE/CMS learn to adapt/test-time-learn on top.
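A minimal PyTorch-style sketch of that recipe, using a small adapter-style stand-in called `HopeMemory` for the HOPE/TITAN/CMS pathways (the actual repo's module names and internals will differ; this only illustrates the freeze-the-spine / train-the-memory split):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load and freeze the pretrained backbone (embeddings, attention/MLP blocks,
# layer norms, lm_head). With tied weights, lm_head.weight is the embedding
# matrix, so freezing the parameters freezes both.
base = AutoModelForCausalLM.from_pretrained("gpt2")
for p in base.parameters():
    p.requires_grad = False

# Hypothetical stand-in for the HOPE/TITAN/CMS memory pathway: a small
# residual adapter applied to the backbone's hidden states.
class HopeMemory(nn.Module):
    def __init__(self, d_model: int, d_mem: int = 128):
        super().__init__()
        self.down = nn.Linear(d_model, d_mem)
        self.up = nn.Linear(d_mem, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.tanh(self.down(h)))

memory = HopeMemory(base.config.hidden_size)

# Only the memory pathway's parameters reach the optimizer, so the frozen
# spine keeps its pretrained representations intact.
optimizer = torch.optim.AdamW(memory.parameters(), lr=1e-4)

def step(input_ids: torch.Tensor, labels: torch.Tensor) -> float:
    hidden = base.transformer(input_ids).last_hidden_state  # frozen features
    adapted = memory(hidden)                                 # trainable pathway
    logits = base.lm_head(adapted)                           # frozen, tied head
    loss = nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```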
----
Pretty cool if this works. I'm hopeful more research will go into reusing already-trained models (beyond freezing existing parts and training the rest) so all that training effort doesn't get lost. Something that can reuse that work with architecture enhancements would be truly revolutionary.
We are not at the end of AI :)
Also, someone claimed that NVIDIA combined diffusion and autoregression, making it 6 times faster, but I couldn't find a source. Big if true!
The idea is simple, in a way: with diffusion, several sentences / words get predicted at once, but they usually aren't of great quality. Autoregression then selects the correct words,
increasing both quality and speed. Sounds a bit like conscious and sub-conscious to me.
Thanks to AI search :)
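Since no source turned up, here is only a toy sketch of the generic draft-then-verify pattern that description implies (a speculative-decoding-style loop; `diffusion_draft` and `ar_model` are hypothetical stand-ins, not any confirmed NVIDIA API):

```python
import torch

def generate(ar_model, diffusion_draft, prompt_ids: torch.Tensor,
             max_new_tokens: int = 64, block: int = 8) -> torch.Tensor:
    """Draft a block of tokens in parallel with a (cheap) diffusion model,
    then let the autoregressive model verify them in a single pass, keeping
    the prefix it agrees with. Assumed (hypothetical) interfaces:
      diffusion_draft(ids, n) -> LongTensor of shape (n,) with draft token ids
      ar_model(ids)           -> logits of shape (1, seq_len, vocab)
    """
    ids = prompt_ids
    while ids.size(1) - prompt_ids.size(1) < max_new_tokens:
        draft = diffusion_draft(ids, block)           # parallel, rough draft
        candidate = torch.cat([ids, draft.view(1, -1)], dim=1)
        logits = ar_model(candidate)                  # one AR pass over the block
        # Position i-1 predicts token i, so these are the AR model's own picks
        # for each drafted position.
        greedy = logits[0, ids.size(1) - 1:-1].argmax(-1)
        agree = (greedy == draft).long().cumprod(0)   # longest agreeing prefix
        n_accept = int(agree.sum())
        if n_accept == block:
            keep = draft                              # whole block accepted
        else:
            # Accept the agreeing prefix, then advance one token using the AR
            # model's choice at the first disagreement.
            keep = torch.cat([draft[:n_accept], greedy[n_accept:n_accept + 1]])
        ids = torch.cat([ids, keep.view(1, -1)], dim=1)
    return ids
```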
You’ve got a frozen transformer and a second module still trained with SGD, so how exactly does that solve forgetting instead of just relocating it?