How Does Gradient Descent Work?

Posted3 months agoActive3 months ago

jxmorris12

325 points

24 comments

centralflows.github.ioResearchstory

calmpositive

Debate

40/100

Gradient DescentDeep LearningOptimization

Key topics

Gradient Descent

Deep Learning

Optimization

The article explores the dynamics of gradient descent in deep learning, introducing the concept of 'central flows' to understand its behavior, sparking a thoughtful discussion on its implications and potential applications.

Snapshot generated from the HN discussion

Discussion Activity

Active discussion

First comment

Peak period

84-96h

Avg / period

Comment distribution24 data points

Loading chart...

Based on 24 loaded comments

Key moments

01Story posted
Oct 3, 2025 at 4:59 PM EDT
3 months ago
Step 01
02First comment
Oct 7, 2025 at 12:05 PM EDT
4d after posting
Step 02
03Peak activity
14 comments in 84-96h
Hottest window of the conversation
Step 03
04Latest activity
Oct 8, 2025 at 3:20 PM EDT
3 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (24 comments)

Showing 24 comments

untrimmed

3 months ago

2 replies

So all the classic optimization theory about staying in the stable region is basically what deep learning doesn't do. The model literally learns by becoming unstable, oscillating, and then using that energy to self-correct.

The chaos is the point. What a crazy, beautiful mess.

danielmarkbruce

3 months ago

I dont think it's a good way to think of it.

Researchers are constantly looking to train more expressive models more quickly. Any method which can converge + take large jumps will be chosen. You are sort of guaranteed to end up in a place where the sharpness is high but it somehow comes under control. If we weren't there.... we'd try a new architecture until we arrived there. So deep learning doesn't "do this", we do it using any method possible and it happened to be the various architectures that currently fit into "deep learning". Keep in mind many architectures which are deep do not converge - you see survivorship bias.

pkoird

3 months ago

Reminds me of Simulated Annealing. Some randomness have always been part of optimization processes that seek a better equilbrium than local. Genetic Algorithms have mutation, Simulated Annealing has temperature, Gradient Descent similarly has random batches.

DoctorOetker

3 months ago

1 reply

Fascinating, do the gained insights allow to directly compute the central flow in order to speed up convergence? Or is this preliminary exploration to understand how it had been working?

They explicitly ignore momentum and exponentially weighted moving average, but that should result in the time-averaged gradient descent (along the valley, not across it). But that requires multiple evaluations, do any of the expressions for the central flow admit fast / computationally efficient central flow calculation?

markisus

3 months ago

1 reply

The authors somewhat address your questions in the accompanying paper https://arxiv.org/abs/2410.24206

> We emphasize that the central flow is a theoretical tool for understanding optimizer behavior, not a practical optimization method. In practice, maintaining an exponential moving average of the iterates (e.g., Morales-Brotons et al., 2024) is likely a computational feasible way to estimate the optimizer’s time-averaged trajectory.

They analyze the behavior of RMSProp (Adam without momentum) using their framework to come up with simplified mathematical models that are able to predict actual training behavior in experiments. It looks like their mathematical models explain why RMSProp works, in a way that is more satisfying than the usual hand waving explanations.

DoctorOetker

3 months ago

1 reply

Yes, it certainly provides a lot more clarity than the handwaving.

While momentum seems to work, and the authors clearly state it is not intended as a practical optimization method, I can't exclude that we can improve convergence rates by building on this knowledge.

Is it guaranteed for the oscillating behavior to have a period of 2 steps? or is say 3 step period also possible (a vector in a plane could alternately point to 0 degrees, 120 degrees and 240 degrees).

The way I read this presentation the implication seems to be that its always a period of 2. Perhaps if the top-2 sharpnesses are degenerate (identical), a period of N distinct from 2 could be possible?

It makes you wonder what if instead of storing momentum with exponential moving average one were to use the average of the last 2 iterations, so there would be less lag.

It also makes me wonder if we should perform 2 iterative steps PER sequence so that the single-sample-sequence gives feedback along it's valley instead of across it. One would go through the corpus at half the speed, but convergence may be more accurate.

kingstnap

3 months ago

I doubt it would ever be period 3.

Not a formal proof, but there is this fun theorem called period 3 implies chaos that my gut instinct says applies here.

Basically if you have a continuous mapping from [a,b] -> [a,b] and there exists a 3 cycle then that implies every other cycle length exists.

Which in this case would kinda say that if you are bouncing between three values on the y axis (and the bouncing is a continuous function which admittedly the gradient of a relu is not) you are probably in a chaotic system

Now that requires assuming that the behaviour of y is largely a function of just y. But their derivation seems to imply that it is the case.

programjames

3 months ago

1 reply

It's a little easier to see what's happening if you fully write out the central flow:

    -1/η * dw/dt = ∇L - ∇S * ⟨∇L, ∇S⟩/‖∇S‖²

We're projecting the loss gradient onto the sharpness gradient, and subtracting it off. If you didn't read the article, the sharpness S is the sum of the eigenvalues of the Hessian of the loss that are larger than 2/η, a measure of how unstable the learning dynamics are.

This is almost Sobolev preconditioning:

    -1/η * dw/dt = ∇L - ∇S = ∇(I - Δ)L

where this time S is the sum of all the eigenvalues (so, the Laplacian of L).

lcnielsen

3 months ago

2 replies

Yeah, I did a lot of traditional optimization problems during my Ph. D., this type of expression pops up all the time with higher-order gradient-based methods. You rescale or otherwise adjust the gradient based on some system-characteristic eigenvalues to promote convergence without overshooting too much.

d3m0t3p

3 months ago

1 reply

Would you have some literature about that ?

lcnielsen

3 months ago

There's a ton but it's pretty scattered. Yurii Nesterov's a big name, for example.

d3m0t3p

3 months ago

This sounds a lot like what the Muon / Shampoo optimizer do.

evnix

3 months ago

2 replies

Can someone please ELI5

yorwba

3 months ago

Imagine you're in a hilly landscape and want to go down as far as possible, but you can only see a small area around you, so you can't just directly go to the lowest point, because you don't know in which direction it is. Gradient descent is based on the idea of looking at which direction your local area is sloping upward (the gradient) and jumping in the opposite direction (i.e. down) a distance proportional to the strength of the slope.

This works well when the slope near the lowest point gets flatter and flatter, so that gradient descent makes smaller and smaller jumps, until you reach the bottom and stop. But if you end up on a very steep wall, you would make a very large jump, maybe so large that you overshoot the target and end up on an even steeper wall, make an even larger jump in the opposite direction and so on, getting farther and farther from the goal.

So one idea is to make sure that your jumps are always small enough that even the steepest wall you could possibly encounter won't throw you into a vicious cycle of increasing step sizes. For example, if you pour water on the ground, the water molecules make truly tiny jumps flowing down to the bottom, and in the article they call this path the gradient flow.

But what they show is that gradient descent typically splits off from this smooth gradient flow and instead gets into an area where the jumps get bigger and bigger for a while, but then the cycle is broken and they get smaller again. That is surprising! Even though it seems like the jumps can only keep getting farther, somehow they must've gotten close enough to the goal to calm down again.

So what the authors did is to remove the direction in which the jumps get bigger and look at what happens in the middle. You can imagine this as a valley with steep walls and a river smoothly flowing down at the bottom, and the jumps go back and forth across this river.

They call this river the central flow and show that it doesn't only flow along the direction of the gradient, but also a little bit in a direction where the steepness decreases. So when the jumps cross the river, they're also moving downriver a little, until they get to a point where the valley isn't so steep anymore and the jumps get smaller again.

hshdhdhehd

3 months ago

ELI programmer

You have a function cleverly designed so that being zero is optimal. Closer to zero the better. It has 1000 dials to control it bit otherwise input and output.

So like a AWS Lambda with 1000 env vars!

Some clever math gal designed it so if you do this gradient descent thing it learns! But let's not worry about that for now. We just want to understand gradient descent.

So you have an input you like. And a desired output and thia function that makes an actual output and a way to turn that into a score of closeness. Closer to Zero better.

So you put the input, env vats, output and you get say 0.3

Not bad. But then you decide to wiggle an env var just a bit to see if it makes it better. 0.31 doh! Ok the other way. 0.29 yay! Ok so leave it there and do the next one and so on.

Now repeat with the next input and output pair.

And again with another.

Then do the whole set again!

You will find the average amount you are wrong by gets better!

This is sort of gradient descent.

One extra trick. Using maths and calculus you can figure out how to adjust the env vars so you dont need to guess and the amount you adjust them will be more optimal.

Calculus is about the rate things change, and if you say do A + B then a change in A becomes the same change in A + B but you can also do this in reverse! This let's you calculate not guess those changes needed to the env vars.

imtringued

3 months ago

I apparently didn't get the memo and used stochastic gradient descent with momentum outside of deep learning without running into any problems given a sufficiently low learning rate.

I'm not really convinced that their explanation truly captures why this success should be exclusive to deep learning.

amelius

3 months ago

Does a biological brain do something similar?

mxwsn

3 months ago

Wow! The title suggests introductory material, but in my opinion this has strong potential to win test of time awards for research.

big_hacker

3 months ago

https://youtu.be/ajGX7odA87k?t=613 the stuff is what the stuff, brother.

ComplexSystems

3 months ago

Very neat stuff! So one question is, if we had an analog computer that could run these flows exactly, would we get better results if we ran the gradient flow or this central flow?

3abiton

3 months ago

That's one of the best intuitive explanation for Gradient Descent I've seen floating on HN!

WithinReason

3 months ago

very nice, but it's only useful if it also works with stochastic gradients.

markisus

3 months ago

First I thought this would be just another gradient descent tutorial for beginners. But the article goes quite deep into gradient descent dynamics, looking into third order approximations of the loss function and eventually motivating a concept called "central flows." Their central flow model was able to predict loss graphs for various training runs across different neural network architectures.

xnx

3 months ago

Related: Gradient descent visualizer: https://uclaacm.github.io/gradient-descent-visualiser/

View full discussion on Hacker News

ID: 45467708Type: storyLast synced: 11/20/2025, 4:56:36 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN