Modern Optimizers – an Alchemist's Notes on Deep Learning
Posted 2 months ago · Active 2 months ago
notes.kvfrans.com · Tech story
Tone: calm, positive
Debate: 20/100
Key topics
Deep Learning
Optimizers
Machine Learning
The article discusses modern optimizers in deep learning, sparking a thoughtful discussion among commenters about the underlying mechanics and potential future developments.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 5d after posting
Peak period: 120-132h (6 comments)
Avg / period: 3.5
Key moments
- 01 Story posted: Nov 7, 2025 at 1:08 AM EST (2 months ago)
- 02 First comment: Nov 11, 2025 at 10:48 PM EST (5d after posting)
- 03 Peak activity: 6 comments in 120-132h (hottest window of the conversation)
- 04 Latest activity: Nov 12, 2025 at 12:58 PM EST (2 months ago)
ID: 45843927 · Type: story · Last synced: 11/20/2025, 1:26:54 PM
The entry asks "why the square root?"
On seeing it, I immediately noticed that with log-likelihood as the loss function, the whitening metric looks a lot like the Jeffreys prior, or an approximation of it (https://en.wikipedia.org/wiki/Jeffreys_prior), which is a reference prior when the CLT holds. The square root can be derived from the reference-prior structure, but in a lot of modeling scenarios it also has the effect of scaling things proportionally to the scale of the parameters (for lack of a better way of putting it; think standard error versus sampling variance).
If you think of the optimization method this way, you're essentially reconstructing a kind of Bayesian criterion with a Jeffreys prior.
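(For readers unfamiliar with the connection being alluded to, here is a rough sketch of the math; the notation is mine, not from the article or the comment. The Jeffreys prior is proportional to the square root of the determinant of the Fisher information, and the Adam-style denominator is the square root of a second-moment estimate of the gradient, which for a log-likelihood loss roughly approximates the diagonal of the Fisher information.)

    % Jeffreys prior: square root of the Fisher information determinant
    p(\theta) \propto \sqrt{\det I(\theta)}, \qquad
    I(\theta) = \mathbb{E}_x\!\left[\nabla_\theta \log p(x \mid \theta)\,
        \nabla_\theta \log p(x \mid \theta)^{\top}\right]

    % Adam-style whitened update: divide by the square root of the
    % second-moment estimate, which roughly tracks the diagonal Fisher
    \theta_{t+1} = \theta_t - \eta \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
    \qquad \hat v_t \approx \operatorname{diag} I(\theta_t)

Under that reading, dividing by the square root of the second-moment estimate is preconditioning with the square root of an approximate Fisher information, which is where the resemblance to the Jeffreys prior comes from.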
If you take SOAP and change all betas to 0, it still works well, so SOAP is effectively that already.
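(A minimal illustrative sketch of what "all betas to 0" means, assuming an Adam-style inner update; this is a hypothetical diagonal toy that omits SOAP's Shampoo-style eigenbasis preconditioner entirely. With no exponential moving averages, the first and second moments collapse to the current gradient and its square, so the whitened step is roughly a sign-of-gradient step.)

    import numpy as np

    def adam_style_step(theta, grad, state, lr=1e-3, beta1=0.0, beta2=0.0, eps=1e-8):
        # Exponential moving averages of the gradient and its square.
        # With beta1 = beta2 = 0 these are just the current grad and grad**2.
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
        # Whitened ("square-rooted") update; with the betas at 0 this is
        # approximately lr * sign(grad).
        return theta - lr * state["m"] / (np.sqrt(state["v"]) + eps), state

    # Usage: one step on a toy quadratic loss 0.5 * ||theta||^2, whose gradient is theta.
    theta = np.array([1.0, -2.0, 0.5])
    state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta)}
    theta, state = adam_style_step(theta, grad=theta.copy(), state=state)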