The Continual Learning Problem
Posted 2 months ago · Active 2 months ago
Source: jessylin.com · Tech · story
Tone: calm, positive
Debate intensity: 20/100
Key topics
Continual Learning
Machine Learning
Artificial Intelligence
The article discusses the continual learning problem in machine learning, and the discussion revolves around potential solutions, existing libraries, and the broader context of AI research.
Snapshot generated from the HN discussion
Discussion Activity
- Engagement: moderate
- First comment: 9 days after posting
- Peak period: 6 comments on Day 10
- Average per period: 4 comments
Key moments
- Story posted: Oct 25, 2025 at 2:45 AM EDT (2 months ago)
- First comment: Nov 3, 2025 at 12:31 PM EST (9 days after posting)
- Peak activity: 6 comments on Day 10 (hottest window of the conversation)
- Latest activity: Nov 4, 2025 at 4:17 AM EST (2 months ago)
ID: 45701810 · Type: story · Last synced: 11/20/2025, 12:26:32 PM
Let the search algorithm figure it out.
That said, the authors are saving this for future work. Fine-tuning is cheaper, easier, and faster to validate.
>Switching to a new architecture at pretraining time has a high cost, but there are reasons we might want this (besides the better scaling behavior). The main benefit is that the model can learn to organize its memory from scratch, and once we’ve already “allocated” this high-capacity memory pool, there’s a clearer path to learning on multiple tasks and corpora over time.
This means you could "fine-tune" the model on your custom corpus at ingestion time, without having to actually train via backprop. Your corpus would be compressed into model-readable memory that updates model behavior. Then different memory units could be swapped in and out, like programs on a floppy disk. I can see this concept being especially useful for robotics.
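To make the "swappable memory unit" idea concrete, here is a minimal sketch, assuming a frozen encoder and simple dot-product reads over an external memory built at ingestion time with no backprop; the embedder, MemoryUnit, build_memory, and attend_memory names are illustrative placeholders, not anything from the post or paper.

```python
# Rough sketch (hypothetical API): memory units built from a corpus at
# ingestion time, with no backprop, and swapped in and out per task.
import torch
import torch.nn as nn

d_model = 64

class MemoryUnit:
    """A corpus compressed into key/value slots the frozen model can attend over."""
    def __init__(self, keys: torch.Tensor, values: torch.Tensor):
        self.keys, self.values = keys, values            # (num_slots, d_model)

# Stand-in for a frozen pretrained encoder: maps token ids to hidden states.
embedder = nn.Sequential(nn.Embedding(1000, d_model), nn.LayerNorm(d_model))

@torch.no_grad()
def build_memory(chunks: list[list[int]]) -> MemoryUnit:
    """'Ingest' a corpus: encode each chunk and pool it into one memory slot."""
    slots = [embedder(torch.tensor([ids])).mean(dim=1) for ids in chunks]
    slots = torch.cat(slots, dim=0)                      # (num_chunks, d_model)
    return MemoryUnit(keys=slots, values=slots)

def attend_memory(query: torch.Tensor, mem: MemoryUnit) -> torch.Tensor:
    """Read from whichever memory unit is currently 'plugged in'."""
    scores = (query @ mem.keys.T) * mem.keys.shape[-1] ** -0.5
    return scores.softmax(dim=-1) @ mem.values           # (batch, d_model)

# Swap memory units like floppy disks: one per corpus/task, base weights untouched.
robotics_mem = build_memory([[1, 2, 3, 4], [5, 6, 7]])   # toy token-id chunks
kitchen_mem  = build_memory([[8, 9], [10, 11, 12]])
reading = attend_memory(torch.randn(2, d_model), robotics_mem)
```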
Some of the appeal here is that this (handcrafted) architecture allows ongoing gradient-descent learning as you go, but on a much smaller set of weights.
https://www.scalarlm.com/blog/tokenformer-a-scalable-transfo...
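A rough sketch of that "small trainable slice" idea, assuming a frozen base layer and a learnable pool of memory tokens; the modules and the objective below are placeholders for illustration, not the Tokenformer code from the link.

```python
# Sketch: base weights are frozen; ongoing gradient descent touches only a
# small pool of memory tokens.
import torch
import torch.nn as nn

d_model, num_slots = 64, 16
base = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
for p in base.parameters():
    p.requires_grad_(False)                              # frozen pretrained weights

memory = nn.Parameter(torch.randn(1, num_slots, d_model) * 0.02)
opt = torch.optim.AdamW([memory], lr=1e-3)               # optimizer sees only the memory

def continual_step(batch: torch.Tensor) -> float:
    """One cheap update as new data streams in: learn the memory, not the model."""
    x = torch.cat([memory.expand(batch.size(0), -1, -1), batch], dim=1)
    loss = base(x).pow(2).mean()                         # stand-in objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(continual_step(torch.randn(4, 8, d_model)))        # toy "incoming" batch
```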