Modern Optimizers – an Alchemist's Notes on Deep Learning
Posted 2 months ago · Active 2 months ago
notes.kvfrans.com · Tech story
Tone: calm, positive
Debate: 20/100
Key topics
Deep Learning
Optimizers
Machine Learning
The article discusses modern optimizers in deep learning, sparking a thoughtful discussion among commenters about the underlying mechanics and potential future developments.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 5d after posting
Peak period: 120-132h (6 comments)
Avg / period: 3.5
Key moments
- 01 Story posted: Nov 7, 2025 at 1:08 AM EST (2 months ago)
- 02 First comment: Nov 11, 2025 at 10:48 PM EST (5d after posting)
- 03 Peak activity: 6 comments in 120-132h (hottest window of the conversation)
- 04 Latest activity: Nov 12, 2025 at 12:58 PM EST (2 months ago)
ID: 45843927 · Type: story · Last synced: 11/20/2025, 1:26:54 PM
The entry asks "why the square root?"
On seeing it, I immediately noticed that with log-likelihood as the loss function, the whitening metric looks a lot like the Jeffreys prior, or an approximation of it (https://en.wikipedia.org/wiki/Jeffreys_prior), which is a reference prior when the CLT holds. The square root can be derived from the reference-prior structure, but in a lot of modeling scenarios it also has the effect of scaling things proportionally to the scale of the parameters (for lack of a better way of putting it; think standard error versus sampling variance).
If you think of the optimization method this way, you're essentially reconstructing a kind of Bayesian criterion with a Jeffreys prior.
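(For readers unfamiliar with the connection being alluded to, here is a rough sketch of the math; the notation is mine, not from the article or the comment. The Jeffreys prior is proportional to the square root of the determinant of the Fisher information, and the Adam-style denominator is the square root of a second-moment estimate of the gradient, which for a log-likelihood loss roughly approximates the diagonal of the Fisher information.)

    % Jeffreys prior: square root of the Fisher information determinant
    p(\theta) \propto \sqrt{\det I(\theta)}, \qquad
    I(\theta) = \mathbb{E}_x\!\left[\nabla_\theta \log p(x \mid \theta)\,
        \nabla_\theta \log p(x \mid \theta)^{\top}\right]

    % Adam-style whitened update: divide by the square root of the
    % second-moment estimate, which roughly tracks the diagonal Fisher
    \theta_{t+1} = \theta_t - \eta \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
    \qquad \hat v_t \approx \operatorname{diag} I(\theta_t)

Under that reading, dividing by the square root of the second-moment estimate is preconditioning with the square root of an approximate Fisher information, which is where the resemblance to the Jeffreys prior comes from.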
If you take SOAP and change all betas to 0, it still works well, so SOAP is effectively that already.
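(A minimal illustrative sketch of what "all betas to 0" means, assuming an Adam-style inner update; this is a hypothetical diagonal toy that omits SOAP's Shampoo-style eigenbasis preconditioner entirely. With no exponential moving averages, the first and second moments collapse to the current gradient and its square, so the whitened step is roughly a sign-of-gradient step.)

    import numpy as np

    def adam_style_step(theta, grad, state, lr=1e-3, beta1=0.0, beta2=0.0, eps=1e-8):
        # Exponential moving averages of the gradient and its square.
        # With beta1 = beta2 = 0 these are just the current grad and grad**2.
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
        # Whitened ("square-rooted") update; with the betas at 0 this is
        # approximately lr * sign(grad).
        return theta - lr * state["m"] / (np.sqrt(state["v"]) + eps), state

    # Usage: one step on a toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta.
    theta = np.array([1.0, -2.0, 0.5])
    state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta)}
    theta, state = adam_style_step(theta, grad=theta.copy(), state=state)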