The Continual Learning Problem

Posted3 months agoActive2 months ago

kiyanwang

68 points

5 comments

jessylin.comTechstory

calmpositive

Debate

40/100

Continual LearningArtificial IntelligenceMachine LearningLora

Key topics

Continual Learning

Artificial Intelligence

Machine Learning

Lora

The post discusses the continual learning problem in AI and presents a potential solution, sparking a thoughtful discussion on its similarities to existing methods like LORA and potential alternatives like context distillation.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

First comment

38m

Peak period

132-144h

Avg / period

1.7

Key moments

01Story posted
Oct 23, 2025 at 2:38 AM EDT
3 months ago
Step 01
02First comment
Oct 23, 2025 at 3:16 AM EDT
38m after posting
Step 02
03Peak activity
3 comments in 132-144h
Hottest window of the conversation
Step 03
04Latest activity
Oct 29, 2025 at 7:30 AM EDT
2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (5 comments)

Showing 5 comments

mynti

3 months ago

1 reply

Super interesting blogpost. I just wonder how this is actually different to LORA, since LORA also adds some parameters and freezes the rest of the model. This seems like a sparse, memory efficient LORA with a couple of extra steps, since it uses attention again to make the sparsity work. All while making it a lot more effective compared to LORA (performance drop of only 11% compared to 71%).

sva_

2 months ago

> LORA

I think you meant LoRA (not to be confused with LoRa)

alyxya

2 months ago

2 replies

I think the solution to continual learning is as simple as using context distillation. We know that models are good at in-context learning, so we just want an efficient way to distill context into the weights. I suspect context rot may come from how the softmax in attention gets diluted with a longer context, so this wouldn't be an issue with context distillation.

killerstorm

2 months ago

Perhaps it can work through multiple stages: ICL -> prompt/context optimization (*) -> prefix tuning / KV distillation -> context distillation.

*: it is possible to measure how much part of a prompt helps with a task e.g. measuring change in entropy

imtringued

2 months ago

The problem with continual learning is that stochastic gradient descent is already an online algorithm applied incrementally on a shuffled dataset. If you add new data, you can't train on just the new data, because you will be running what amounts to a completely different training sequence. Further training requires the old data and the new data to be shuffled together.

With reinforcement learning, specifically actor critic, the actor is not training against a dataset. It's training against the critic. The critic is supposed to approximate the value function, which contains the current cost for a given action and the predicted future cost, assuming that you choose the optimal action at every step, including its impact on future actions. If you have a simple supervised cost function, what happens is that the critic acts as an averaging of loss functions. You could say that the critic is a compressed copy of the training data. When you train the actor, you're essentially taking not only the new data, but also the old data into account.

So, in a way, catastrophic forgetting is sort of solved, but not really. If you add new data, you run into the problem that your critic will slowly drift to the new data distribution. This means the problem wasn't solved, but you certainly managed to delay it. Delaying the problem is good though. What if you can delay it even more? What if you can delay it forever?

Here is my stupid and simple unproven idea: Nest the reinforcement learning algorithm. Each critic will add one more level of delay, thereby acting as a low pass filter on the supervised reward function. Since you have two critics now, you can essentially implement a hybrid pre-training + continual learning architecture. The most interesting aspect here is that you can continue training the inner-most critic without changing the outer critic, which now acts as a learned loss function.

View full discussion on Hacker News

ID: 45678859Type: storyLast synced: 11/20/2025, 1:45:02 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN