Who Invented Deep Residual Learning?

Posted3 months agoActive3 months ago

timlod

114 points

35 comments

people.idsia.chResearchstory

controversialmixed

Debate

80/100

Deep LearningResidual Neural NetworksCredit Attribution

Key topics

Deep Learning

Residual Neural Networks

Credit Attribution

The article discusses the origins of deep residual learning, sparking a debate about who invented the technique and how credit should be attributed in the ML community.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

Peak period

120-132h

Avg / period

8.8

Comment distribution35 data points

Loading chart...

Based on 35 loaded comments

Key moments

01Story posted
Oct 13, 2025 at 7:07 AM EDT
3 months ago
Step 01
02First comment
Oct 18, 2025 at 2:10 PM EDT
5d after posting
Step 02
03Peak activity
22 comments in 120-132h
Hottest window of the conversation
Step 03
04Latest activity
Oct 20, 2025 at 8:05 AM EDT
3 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (35 comments)

Showing 35 comments

ansk

3 months ago

2 replies

Of all Schmidhuber's credit-attribution grievances, this is the one I am most sympathetic to. I think if he spent less time remarking on how other people didn't actually invent things (e.g. Hinton and backprop, LeCun and CNNs, etc.) or making tenuous arguments about how modern techniques are really just instances of some idea he briefly explored decades ago (GANs, attention), and instead just focused on how this single line of research (namely, gradient flow and training dynamics in deep neural networks) laid the foundation for modern deep learning, he'd have a much better reputation and probably a Turing award. That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.

godelski

3 months ago

2 replies

I think one of the best things to learn from Schmidhuber is that progress involves a lot of players and over a lot of time. Attribution is actually a difficult game and usually we are only assigning credit to those at the end of some milestone. It's like giving a gold medal to the runner in the last leg of a relay race or focusing only on the lead singer of a band. It's never one person that does it alone. Shoulders of giants, but those giants are just a couple of dudes in a really big trenchcoat.

Another important lesson is that often good ideas get passed over because of hype or politics. We often like to pretend that science is all about the merit and what is correct. Unfortunately this isn't true. It is that way in the long run, but in the short run there's a lot of politics and humans still get in their own way. This is a solvable problem, but we need to acknowledge it and create systematic changes. Unfortunately a lot of that is coupled to the aforementioned one.

  > I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.

As should we all. Clearly he was upset that others got credit for his contributions. But what I do appreciate is that he has recognized that it is a problem bigger than him, and is trying to combat the problem at large and not just his own little battlefield. That's respectable.

dchftcs

3 months ago

1 reply

It's a bit of an aside but I believe this is one reason Zuckerberg's vision for establishing the superintelligence lab is misguided. Including VCs, too many people get distracted by rock stars in this gold rush.

godelski

3 months ago

Just last week I said something inline with that[0]. Many people conflated my claim that Meta has a lot of good people with "Meta /is/ winning the AI race". I just claimed they had some of who I think are some of the best researchers in the field, but do not give them nearly the same resources or capacity to further their research that they give to these "rock stars". Tbh, the same is true for any top lab, I just think this happens more at Meta because Meta is so metric and rock star focused.

So I agree. The vision is misguided. I think they'd have done better had they taken that same money and just thrown it at the people they already have but who are working in different research areas. Everyone is trying to win my doing the same things. That's not a smart strategy. You got all that money, you gotta take risks. It's all the money dumped into research that got us to this point in the first place.

It's good to shift funds around and focus on what is working now, but you also have to have a pipeline of people working on what will work tomorrow, next year, 5 years, and 10 years. The people are there that can do that work. The people are there that want to do the work. The only thing is there's little to no people that want to fund that work. Unfortunately it takes time to bake a cake.

Quite frankly, these companies also have more than enough money to do both. They have enough money to throw cash hand over fist at every wild and crazy idea. But they get caught in the hype, which is no different than an over focus on the attribution rather than the process or pipeline that got us the science in the first place.

[0] https://news.ycombinator.com/item?id=45554147

esafak

3 months ago

Also, it reminds us that the powerful write history. But history can be rewritten as the balance of power shifts. I imagine the world will hear all about China's contributions to the field if they continue their ascent.

snthpy

3 months ago

> That said, I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.

Lol, I still used to notice him before covid when he was railing against Bengio, Hinton, and LeCun. Can't believe he's still going.

ekjhgkejhgk

3 months ago

4 replies

I spent some time in the academia.

The person with whom an idea ends up associated often isn't the first person to have the idea. Most often is the person who explains why the idea is important, or find a killer application for the idea, or otherwise popularizes the idea.

That said, you can open what Schmidhuber would say is the paper which invented residual NNs. Try and see if you notice anything about the paper that perhaps would hinder the adoption of its ideas [1].

[1] https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdv...

seanmcdirmid

3 months ago

1 reply

Surely they wrote some papers in English even if they wrote their dissertation in German? Most people don’t go straight to dissertations anyway, it’s more of a place to go after you read a much shorter paper.

ekjhgkejhgk

3 months ago

1 reply

Correct, that's [2]. In [2] they even say "[we] derive de main result using the approach first proposed in " and cite [1]. So the paper that everyone knows, in English (and with Bengio), explictly say that the original idea is in a paper in German, and still the scientific community chose not to cite the German original.

[1] https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdv...

[2] https://sferics.idsia.ch/pub/juergen/gradientflow.pdf

versteegen

3 months ago

1 reply

Not to excuse ignoring the thesis, but I want to point out it's bad form to cite a supposed result from a paper you haven't even looked at (or can't read), unless you indicate (admit) you didn't read it. I have even seen it described, in a paper about the citation practices of science, as unethical. It can transform hearsay into established truth. Of course nearly everyone does it anyway when they think the chance of making an erroneous citation is low enough, and that's exactly the problem.

ekjhgkejhgk

3 months ago

That's exactly my point! My whole time in the academia I have not once cited a paper I hadn't read and I think doing so is a little bit dishonest at best.

They wrote the paper in German, so honest people who don't know German can't cite it directly. This is only even worth remarking on because Schmidhuber is annoyed for missing out on those specific citations.

MurizS

3 months ago

1 reply

I think what you're referring to is also known as Stigler's law of eponymy [1], which is interestingly self-referential and ironic in its own naming. There's also the related "Matthew effect" [2] in the sciences.

[1] https://en.wikipedia.org/wiki/Stigler's_law_of_eponymy

[2] https://en.wikipedia.org/wiki/Matthew_effect

thaumasiotes

3 months ago

The most annoying instance, to me, of Stigler's Law is De Morgan's Laws, which say the following:

1. If two things are not both true, then one or both of them must be false. (And the reverse.)

2. If neither of two things is true, then both of them are false. (And the reverse.)

You might notice that both statements are blindingly obvious, but we've named them after Augustus de Morgan anyway.

dchftcs

3 months ago

1 reply

Einstein published his relativity papers originally in German.

CamperBob2

3 months ago

3 replies

German was the lingua franca of physics at the time, so to speak.

Starting in the 1930s, though, that tradition began to change... for reasons that I'm sure won't ever apply to American English. Nosirree, Bob, we're special. Great, even.

barrenko

3 months ago

1 reply

The only constant is constant change.

kirubakaran

3 months ago

And even that's not constant sometimes.

sorokod

3 months ago

A relevant quote from David Hilbert:

"Mathematics in Göttingen?" Hilbert repiled. "There is really none any more."

https://hsm.stackexchange.com/questions/2486/source-for-hilb...

ggm

3 months ago

> German was the lingua franca

A finer expression of the matter cannot be made.

ricklamers

3 months ago

Perhaps then inventors of promising ideas should make multiple attempts at popularizing their ideas if they care about association, multiple attempts at explaining why the idea is important and demonstrations of killer applications.

alyxya

3 months ago

1 reply

The notion of inventing or creating something in ML doesn't seem very important as many people can independently come up with the same idea. Conversely, you can create novel results just by reviewing old literature and demonstrating it in a project.

ekjhgkejhgk

3 months ago

1 reply

That's how all/most science normally works.

Conversely, a huge amount of science is just scientists going "here's something I found interesting" but no one can figure out what to do with it. Then 30 or 100 years go by and it's a useful in a field that didn't even exist at the time.

alyxya

3 months ago

It doesn’t apply to empirical science because there’s a lot more variation in observations. The variation of ideas in ML model architecture is limited by being theoretical.

gwern

3 months ago

2 replies

> Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.

After reading Lang & Witbrock 1988 https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf I'm not sure how convincing I find this explanation.

imtringued

3 months ago

1 reply

For residual networks with an infinite number of layers it is absolutely correct. For a residual network with finite layers, you can get away with any non zero constant weight as long as the weight chosen appropriately for the fixed network depth. The problem is simply c^n gives you very big or very small numbers for large n and large deviations from 1.

Now let me address the other possibility that you are talking about: what if residual connections aren't necessary? What if there is another way? What are the criteria necessary to avoid exploding or vanishing gradient or slow learning in the absence of both?

For that we need to first know why residual connections work. There is no way around calculating the back propagation formula by hand, but there is an easy trick to make it simple. We don't care about the number of parameters in the network, we only care about the flow of the gradient. So just have a single input and output with hidden size 1 and two hidden layers.

Each layer has a bias and a single weight and an activation function.

Let's assume you initialize each weight and bias with zero. The forward pass returns zero for any input and the gradient is zero. In this artificial scenario the gradient starts vanished and stays vanished. The reason is pretty obvious when you apply back propagation. The second layer clips the gradient of the first layer. If there was a single layer, the gradient would be non zero and yield a non zero gradient, rescuing the network out of the vanishing gradient.

Now what if you add residual connections? The forward pass stays the same, but the backward pass changes for two layers and beyond. The gradient for the second layer consists of just the second layer activation function multiplied by the first layer activation of the forward pass. The first layer gradient consists of the second layer gradient where the first layer activation is substituted by the gradient of the first layer but because it is a residual net, you also add the gradient of just the first layer.

In other words, the first layer is trained independently of the layers that come after it, but also gets feedback from higher layers on top. This allows it to become non zero, which then lets the second layer become non zero, which lets the third be non zero and so on.

Since the degenerate case of a zero initialized network makes things easy to conceptualise, it should help you figure out what other ways there are to accomplish the same task.

For example, what if we apply the loss to every layer's output as a regularizer? That is essentially doing the same thing as a residual, but with skip connections that sum up the outputs. You could replace the sum with a weighted sum where the weights are not equal to 1.0.

But what if you don't want skip connections either, because they are too similar to residual networks? A residual network has one skip connection already and summing up in a different way is uninteresting. It is also too reliant on each layer being encouraged to produce an output that is matched against the label.

In other words, what if we wanted to let the inner layers not be subject to any correlation with the output data? You would need something that forces the gradients away from zero but also away from excessively high numbers. I.e. weight regularization or layer normalisation with a fixed non zero bias.

Predictive coding and especially batched predictive coding could also be a solution to this.

Predictive coding predicts the input of the next layer, so the only requirement is that the forward pass produces a non zero output. There is no requirement for the gradient to flow through the entire network.

gwern

3 months ago

My point is more that Schmidhuber is saying that the gates or the initialization are the innovation solely because they produce well-behaved gradients, which is why Hochreiter's 1991 thing is where he starts and nothing before that counts. But it's not clear to me why we should define it like that when you can solve the gradient misbehavior other ways, which is why https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf#pa... works and doesn't diverge: if I'm understanding them right, they did warmup, so the gradients don't explode or vanish. So why doesn't that count? They have shortcut layers and a solution to exploding/vanishing gradients and it works to solve their problem. Is it literally 'well, you didn't use a gate neuron or fancy initialization to train your shortcuts stably, therefore it doesn't count'? Such an argument seems carefully tailored to exclude all prior work...

CamperBob2

3 months ago

That's a cool paper. Super interesting to see how work was progressing at the time, when Convex was the machine everybody wanted on (or rather next to) their desks.

ekjhgkejhgk

3 months ago

To comment on the substance.

It seems that these two people Schimidhuber and Hochreiter were perhaps solving the right problem for the wrong reasons. They thought this was important because they expected that RNNs could hold memory indefinitely. Because of BPTT, you can think of that as a NN with infinitely many layers. At the time I believe nobody worries about vanishing gradient for deep NNs, because the compute power for networks that deep just didn't exist. But nowadays that's exactly how their solution is applied.

That's science for you.

QuadmasterXLII

3 months ago

lol

jaberjaber23

3 months ago

science repeats itself

HarHarVeryFunny

3 months ago

How about Schmidhuber actually invents the next big thing rather than waiting for it to come along then claim credit for it?

bjourne

3 months ago

I'm not a giant like Schmidhuber so I might be wrong, but imo there are at least two features that set residual connections and LSTMs apart:

1. In LSTMs skip connections help propagate gradients backwards through time. In ResNets, skip connections help propagate gradients across layers.

2. Forking the dataflow is part of the novelty, not only the residual computation. Shortcuts can contain things like batch norm, down sampling, or any other operation. LSTM "residual learning" is much more rigid.

aDyslecticCrow

3 months ago

I thought it was ResNet that invented the technique, but it's interesting to see it rooted back through LSTM which feels like a very architecture. ResNet really made massive waves in the field, and it was hard finding a paper that didn't reference it for a while.

jszymborski

3 months ago

"LSTMs brought essentially unlimited depth to supervised RNNs"

LSTMs are an incredible architecture, I use them a lot in my research. While LSTMs are useful over many more timesteps than other RNNs, LSTMs certainly don't offer 'essentially unlimited depth'.

When training LSTMs whose input were sequences of amino acids, whose length easily top 3,000 timesteps, I got huge amounts of instability... with gradients rapidly vanishing. Tokenizing the AAs, getting the number of timesteps down to more like 1,500, has made things way more stable.

scarmig

3 months ago

From the domain, I'm guessing the answer is Schmidhuber.

View full discussion on Hacker News

ID: 45567050Type: storyLast synced: 11/20/2025, 5:33:13 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN