Were RNNs All We Needed? A GPU Programming Perspective
Posted 4 months ago · Active 3 months ago
dhruvmsheth.github.io · Tech · story
calm · mixed
Debate
60/100
Key topics
RNNs
GPU Programming
Neural Networks
Parallelization
The article discusses a new approach to parallelizing RNNs on GPUs, sparking a discussion on the trade-offs between RNNs and transformers, and the potential benefits and limitations of different parallelization methods.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 1d after posting
Peak period: 12 comments in 60-66h
Avg / period: 4.4
Comment distribution: 31 data points
Based on 31 loaded comments
Key moments
- 01 Story posted: Sep 18, 2025 at 12:47 PM EDT (4 months ago)
- 02 First comment: Sep 20, 2025 at 12:36 AM EDT (1d after posting)
- 03 Peak activity: 12 comments in 60-66h (hottest window of the conversation)
- 04 Latest activity: Sep 22, 2025 at 1:04 PM EDT (3 months ago)
ID: 45291903 · Type: story · Last synced: 11/20/2025, 8:52:00 PM
Is there a deeper reason why more complicated parallelization as in the OP or the article it references is more desirable?
When you take a batch and calculate gradients, you’re effectively calculating a direction the weights should move in, and then taking a step in that direction. You can do more steps at once by doing what you say, but they might not all be exactly in the right direction, so overall efficiency is hard to compare.
I am not an expert, but if I understand correctly I think this is the answer.
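A toy way to see the trade-off described above, purely illustrative (a 1-D quadratic with made-up noise, not anything from the article): one step using the averaged gradient of a whole batch is not the same as several sequential smaller steps, because each sequential step reacts to where the weights already are.

```python
import numpy as np

# f(w) = 0.5 * w^2, so the per-example gradient is w plus some noise.
rng = np.random.default_rng(0)
lr, k, w0 = 0.3, 4, 5.0
noise = rng.normal(0.0, 0.5, size=k)   # invented per-example gradient noise

# one "wide" step: average the k gradients evaluated at w0, step once
w_batched = w0 - lr * np.mean(w0 + noise)

# k "deep" steps: each gradient is evaluated at the already-moved weights
w_sequential = w0
for g_noise in noise:
    w_sequential -= (lr / k) * (w_sequential + g_noise)

print(w_batched, w_sequential)  # similar endpoints, but not identical
```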
The way the headline is written, it sounds like Amdahl’s law was violated. It wasn’t, of course.
The algorithm with fewer data dependencies is O(N log N).
This is covered in more detail in the article.
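For anyone who wants to see the shape of that algorithm: the RNN recurrence becomes a composition of affine maps, which is associative, so it can be evaluated as a parallel scan. A rough NumPy sketch of the idea (illustrative only, not the article's CUDA kernel):

```python
import numpy as np

def parallel_linear_scan(a, b):
    """Inclusive scan for the recurrence h[t] = a[t] * h[t-1] + b[t], h[-1] = 0.

    Each element carries the affine map h -> A*h + B accumulated so far; the
    maps compose associatively, so a Hillis-Steele-style scan works. There are
    log2(N) passes and each pass touches all N elements: O(N log N) total work,
    but only O(log N) dependent steps, which is the trade-off mentioned above.
    """
    A = np.array(a, dtype=float)
    B = np.array(b, dtype=float)
    shift = 1
    while shift < len(A):
        A_new, B_new = A.copy(), B.copy()
        # compose element t with element t - shift (reads use pre-pass values)
        A_new[shift:] = A[shift:] * A[:-shift]
        B_new[shift:] = A[shift:] * B[:-shift] + B[shift:]
        A, B = A_new, B_new
        shift *= 2
    return B  # with h[-1] = 0, the accumulated offset B is exactly h

# sanity check against the plain sequential loop
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 1000)
b = rng.normal(0.0, 1.0, 1000)
h, acc = np.empty_like(b), 0.0
for t in range(len(b)):
    acc = a[t] * acc + b[t]
    h[t] = acc
assert np.allclose(parallel_linear_scan(a, b), h)
```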
I opened it wondering what RNNs were, and for anyone else wondering: they are apparently Recurrent Neural Networks. You do find that answer if you continue reading, though the lack of an up-front definition kind of stopped me in my tracks.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Another great one (just on RNNs) is:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
https://medium.com/capital-one-tech/why-you-dont-necessarily...
At the time it was clear to everyone on the team that RNNs, just like transformers later on, are general-purpose frameworks that really only require more data and size to function. In the 2018-2020 era, and probably today, they are slower to train. They are also less prone to certain pitfalls, but overall have the same characteristics.
In 2019-2020 I was convinced that transformers would give way to a better architecture. The RNNs in particular trained faster and required less data, particularly when combined with several architectural components I won’t get into. I believe that’s still true today, though I haven’t worked on it in the last 2-3 years.
That said, transformers “won” because they are better overall building blocks and don’t require the nuances of RNNs. Combined with the compute optimizations that are now in place, I don’t see that changing in the near term. Folks are even working to convert transformers to RNNs:
https://medium.com/@techsachin/supra-technique-for-linearizi...
There are also RNN-based models beating Qwen 3 8B on certain benchmarks:
https://www.rwkv.com/
I suspect that over time the other methods my team explored, and other types of networks and nodes, will continue to expand beyond transformers for state-of-the-art LLMs.
Counter consensus is where the alpha is...
Do you think RNN/RWKV have an edge in verifiable domains with tree-search at inference time? You can use cheaper GPUs and do multiple sampling.
(But of course, it's hard to beat the sunk cost of a foundation model.)
Anyway, claiming that they are equivalent to transformers, when RNNs are Turing complete and forward-only NNs are not, is such a strange take.
That seems like too strong of a position. The equilibrium model seems to be a good candidate for some activity in the brain, but it doesn't seem like it applies to everything.
For example, the inter-layer recurrence in vision/object detection processing seems to be something different.
(not trying to say it wasn't an incredible accomplishment for the authors. There are quite a few details to get right in order to get to the "obvious" advance)
Even today it's pretty obvious how such a thing might be further extended. Create a network with an input so big that it just contains, and does attention across, its entire dataset. Such a network would only need a basic understanding of language and would not hallucinate. It would also be obvious where anything came from, as the attention vectors would show what data was used by what specific part of the network. But this is a theoretical exercise, as all the compute power in the world can't do that for any decent-sized dataset.
RNNs and LSTMs are more training- and inference-efficient because they don't do this. They do compute for every token and then more or less just add together, sequentially, every thought any part of the network had.
We need to go in the opposite direction from attention. It has to be the case that attention is extremely inefficient.
In a way, the point of attention mechanisms is to bias the model towards long-range dependencies (as seen e.g. in language sentence structure) and away from the very short-term ones that a plain RNN would tend to focus on. LSTM is sort-of in the middle; it manages to persist information beyond the very short run (so "paying attention" to it in a very fuzzy sense) but not as long-term as attention does.
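For concreteness, the core attention operation that grants this direct long-range access is just a softmax over pairwise dot products; a minimal PyTorch sketch with toy shapes (not the article's code):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Minimal attention: every position reads from every other position in a
    single step, regardless of distance -- the long-range bias described above.
    An RNN, by contrast, has to carry that information through each
    intermediate hidden state."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 32)   # (batch, seq_len, dim), toy sizes
out = scaled_dot_product_attention(q, k, v)
```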
> The crux of the paper is to remove this direct dependency. The simplified models, minGRU and minLSTM, redefine the gates to depend only on the current input
The entire hypothesis of my machine learning experiments has been that we should embrace the time domain and causal dependencies. I really think biology got these elements correct. Now the question remains: which kind of computer system is best suited to run a very branchy and recursive workload?
Constantly adapting our experiments to satisfy the constraints of a single kind of compute vendor is probably not healthy for science over the long term.
An analogue one, possibly?
You could imagine after one layer of the recurrence, the tokens already have some information about each other, so the input to the following layers is dependent on previous tokens, although the dependence is indirect compared to a traditional RNN.
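A rough sketch of the simplified cell being discussed, assuming the minGRU-style update quoted above (illustrative PyTorch, not the paper's or the article's actual code):

```python
import torch
import torch.nn as nn

class MinGRUSketch(nn.Module):
    """Sketch of a minGRU-style cell: the gate z_t and the candidate h_tilde_t
    are computed from x_t alone, so the only sequential dependency left is the
    linear recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t, which can be
    evaluated with a parallel scan."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)
        self.to_h = nn.Linear(d_in, d_hidden)

    def forward(self, x):                 # x: (batch, seq_len, d_in)
        z = torch.sigmoid(self.to_z(x))   # gates depend only on x_t
        h_tilde = self.to_h(x)            # candidate state, also x_t only
        # sequential reference loop; in practice this is replaced by a
        # parallel scan over the per-step affine updates
        h = torch.zeros(x.shape[0], h_tilde.shape[-1], device=x.device)
        out = []
        for t in range(x.shape[1]):
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            out.append(h)
        return torch.stack(out, dim=1)
```

Stacking several of these layers is what gives the indirect cross-token dependence described above: the scan within a layer is parallel, but each layer's input already mixes information from earlier positions.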
It ended up optimising in a way that wasn't obvious at first, but turned out to rely on the noise of one part interacting with another.
Aha: Here's the paper https://osmarks.net/assets/misc/evolved-circuit.pdf
And a fluff article https://www.damninteresting.com/on-the-origin-of-circuits
And as per usual, Google was hopeless at finding the article from a rough description. No chance at all. ChatGPT thought for 10s and delivered the correct result, first time.
I don't see logarithmic scaling, actually. From the table for GRU performance, going from 16384 -> 65536 (namely: increasing the input by 4x) is roughly a 4x increase in time whether looking at CPU-scan or GPU-scan. Okay, maybe the inputs need to be bigger. Looking at the next plot, which goes up to 524288, we see the same behavior: the delta between CPU-scan and GPU-scan doubles as we double the input. That's a constant multiplicative factor. Same holds for LSTM performance.
Is this an artifact of the benchmark setup? Are we actually measuring the amount of time needed to load the full context into RAM? Or perhaps we're bottlenecked on memory bandwidth?
> Success: The gate extraction kernel, which was a huge bottleneck, now only takes 8% of the total time and is memory-bandwidth bound, saturating L2 bandwidth at 1.9 TB/s. This is a good place to be.
Sounds like that might be the case.
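That reading is easy to sanity-check for any of these kernels: estimate the bytes a launch has to move, divide by the measured time, and compare with the device's peak bandwidth. If the kernel already sits near peak, its runtime has to grow linearly with sequence length no matter how shallow the scan's dependency chain is. A back-of-the-envelope sketch with made-up sizes and timings (not the article's numbers):

```python
def achieved_bandwidth_gb_s(bytes_moved, seconds):
    """Achieved memory bandwidth of one kernel launch, in GB/s."""
    return bytes_moved / seconds / 1e9

# hypothetical gate-extraction-style kernel: one read and one write per element
# of an fp16 (seq_len, hidden) tensor; all numbers below are invented
seq_len, hidden, dtype_bytes = 524_288, 256, 2
bytes_moved = 2 * seq_len * hidden * dtype_bytes
measured_seconds = 300e-6

bw = achieved_bandwidth_gb_s(bytes_moved, measured_seconds)
print(f"achieved ~{bw:.0f} GB/s")   # ~1790 GB/s here; compare to the GPU's peak

# If this sits near peak (the article reports ~1.9 TB/s of L2 bandwidth for the
# gate kernel), doubling seq_len roughly doubles the runtime: the kernel is
# data-movement bound, which matches the linear scaling seen in the tables.
```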