Were RNNs All We Needed? A GPU Programming Perspective
Posted 4 months ago · Active 3 months ago
dhruvmsheth.github.io · Tech · story
calm · mixed
Debate
60/100
Key topics
RNNs
GPU Programming
Neural Networks
Parallelization
The article discusses a new approach to parallelizing RNNs on GPUs, sparking a discussion on the trade-offs between RNNs and transformers, and the potential benefits and limitations of different parallelization methods.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 1d after posting
Peak period: 12 comments in 60-66h
Avg / period: 4.4
Comment distribution: 31 data points
Based on 31 loaded comments
Key moments
- 01 Story posted: Sep 18, 2025 at 12:47 PM EDT (4 months ago)
- 02 First comment: Sep 20, 2025 at 12:36 AM EDT (1d after posting)
- 03 Peak activity: 12 comments in 60-66h (hottest window of the conversation)
- 04 Latest activity: Sep 22, 2025 at 1:04 PM EDT (3 months ago)
ID: 45291903 · Type: story · Last synced: 11/20/2025, 8:52:00 PM
Is there a deeper reason why more complicated parallelization as in the OP or the article it references is more desirable?
When you take a batch and calculate gradients, you’re effectively calculating a direction the weights should move in, and then taking a step in that direction. You can do more steps at once by doing what you say, but they might not all be exactly in the right direction, so overall efficiency is hard to compare.
I am not an expert, but if I understand correctly I think this is the answer.
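A toy way to see the trade-off described above, purely illustrative (a 1-D quadratic with made-up noise, not anything from the article): one step using the averaged gradient of a whole batch is not the same as several sequential smaller steps, because each sequential step reacts to where the weights already are.

```python
import numpy as np

# f(w) = 0.5 * w^2, so the per-example gradient is w plus some noise.
rng = np.random.default_rng(0)
lr, k, w0 = 0.3, 4, 5.0
noise = rng.normal(0.0, 0.5, size=k)   # invented per-example gradient noise

# one "wide" step: average the k gradients evaluated at w0, step once
w_batched = w0 - lr * np.mean(w0 + noise)

# k "deep" steps: each gradient is evaluated at the already-moved weights
w_sequential = w0
for g_noise in noise:
    w_sequential -= (lr / k) * (w_sequential + g_noise)

print(w_batched, w_sequential)  # similar endpoints, but not identical
```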
The way the headline is written, it sounds like Amdahl’s law was violated. It wasn’t, of course.
The algorithm with fewer data dependencies is O(N log N).
This is covered in more detail in the article.
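For anyone who wants to see the shape of that algorithm: the RNN recurrence becomes a composition of affine maps, which is associative, so it can be evaluated as a parallel scan. A rough NumPy sketch of the idea (illustrative only, not the article's CUDA kernel):

```python
import numpy as np

def parallel_linear_scan(a, b):
    """Inclusive scan for the recurrence h[t] = a[t] * h[t-1] + b[t], h[-1] = 0.

    Each element carries the affine map h -> A*h + B accumulated so far; the
    maps compose associatively, so a Hillis-Steele-style scan works. There are
    log2(N) passes and each pass touches all N elements: O(N log N) total work,
    but only O(log N) dependent steps, which is the trade-off mentioned above.
    """
    A = np.array(a, dtype=float)
    B = np.array(b, dtype=float)
    shift = 1
    while shift < len(A):
        A_new, B_new = A.copy(), B.copy()
        # compose element t with element t - shift (reads use pre-pass values)
        A_new[shift:] = A[shift:] * A[:-shift]
        B_new[shift:] = A[shift:] * B[:-shift] + B[shift:]
        A, B = A_new, B_new
        shift *= 2
    return B  # with h[-1] = 0, the accumulated offset B is exactly h

# sanity check against the plain sequential loop
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 1000)
b = rng.normal(0.0, 1.0, 1000)
h, acc = np.empty_like(b), 0.0
for t in range(len(b)):
    acc = a[t] * acc + b[t]
    h[t] = acc
assert np.allclose(parallel_linear_scan(a, b), h)
```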
I opened it wondering what RNNs were, and for anyone else wondering: they are apparently Recurrent Neural Networks. You do find that answer if you continue reading, though the lack of an up-front definition kind of stopped me in my tracks.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Another great one (just on RNNs) is:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
https://medium.com/capital-one-tech/why-you-dont-necessarily...
At the time it was clear to everyone on the team that RNNs, just like transformers later on, are general-purpose frameworks that really only require more data and size to function. In the 2018-2020 era, and probably today, they are slower to train. They are also less prone to certain pitfalls, but overall have the same characteristics.
In 2019-2020 I was convinced that transformers would give way to a better architecture. The RNNs in particular trained faster and required less data, particularly when combined with several architectural components I won’t get into. I believe that’s still true today, though I haven’t worked on it in the last 2-3 years.
That said, transformers “won” because they are better overall building blocks and don’t require the nuances of RNNs. Combined with the compute optimizations that are now in place, I don’t see that changing in the near term. Folks are even working to convert transformers to RNNs:
https://medium.com/@techsachin/supra-technique-for-linearizi...
There are also RNN-based models beating Qwen 3 8B on certain benchmarks:
https://www.rwkv.com/
I suspect that over time the other methods my team explored, and other types of networks and nodes, will continue to expand beyond transformers for state-of-the-art LLMs.
Counter consensus is where the alpha is...
Do you think RNN/RWKV have an edge in verifiable domains with tree-search at inference time? You can use cheaper GPUs and do multiple sampling.
(But of course, it's hard to beat the sunk cost of a foundation model.)
Anyway, claiming that they are equivalent to transformers, when RNNs are Turing complete and forward-only NNs are not, is such a strange take.
That seems like too strong of a position. The equilibrium model seems to be a good candidate for some activity in the brain, but it doesn't seem like it applies to everything.
For example, the inter-layer recurrence in vision/object detection processing seems to be something different.
(not trying to say it wasn't an incredible accomplishment for the authors. There are quite a few details to get right in order to get to the "obvious" advance)
Even today it's pretty obvious how such a thing might be further extended. Create a network with an input so big that it just contains, and does attention across, its entire dataset. Such a network would only need a basic understanding of language and would not hallucinate. It would also be obvious where anything came from, as the attention vectors would show what data was used by what specific part of the network. But this is a theoretical exercise, as all the compute power in the world can't do that for any decent-sized dataset.
RNNs and LSTMs are more training- and inference-efficient because they don't do this. They do compute for every token and then more or less just add together, sequentially, every thought any part of the network had.
We need to go in the opposite direction from attention. It has to be the case that attention is extremely inefficient.
In a way, the point of attention mechanisms is to bias the model towards long-range dependencies (as seen e.g. in language sentence structure) and away from the very short-term ones that a plain RNN would tend to focus on. LSTM is sort-of in the middle; it manages to persist information beyond the very short run (so "paying attention" to it in a very fuzzy sense) but not as long-term as attention does.
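For concreteness, the core attention operation that grants this direct long-range access is just a softmax over pairwise dot products; a minimal PyTorch sketch with toy shapes (not the article's code):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Minimal attention: every position reads from every other position in a
    single step, regardless of distance -- the long-range bias described above.
    An RNN, by contrast, has to carry that information through each
    intermediate hidden state."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 32)   # (batch, seq_len, dim), toy sizes
out = scaled_dot_product_attention(q, k, v)
```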
> The crux of the paper is to remove this direct dependency. The simplified models, minGRU and minLSTM, redefine the gates to depend only on the current input
The entire hypothesis of my machine learning experiments has been that we should embrace the time domain and causal dependencies. I really think biology got these elements correct. Now the question remains: which kind of computer system is best suited to run a very branchy and recursive workload?
Constantly adapting our experiments to satisfy the constraints of a single kind of compute vendor is probably not healthy for science over the long term.
An analogue one, possibly?
You could imagine after one layer of the recurrence, the tokens already have some information about each other, so the input to the following layers is dependent on previous tokens, although the dependence is indirect compared to a traditional RNN.
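A rough sketch of the simplified cell being discussed, assuming the minGRU-style update quoted above (illustrative PyTorch, not the paper's or the article's actual code):

```python
import torch
import torch.nn as nn

class MinGRUSketch(nn.Module):
    """Sketch of a minGRU-style cell: the gate z_t and the candidate h_tilde_t
    are computed from x_t alone, so the only sequential dependency left is the
    linear recurrence h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t, which can be
    evaluated with a parallel scan."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)
        self.to_h = nn.Linear(d_in, d_hidden)

    def forward(self, x):                 # x: (batch, seq_len, d_in)
        z = torch.sigmoid(self.to_z(x))   # gates depend only on x_t
        h_tilde = self.to_h(x)            # candidate state, also x_t only
        # sequential reference loop; in practice this is replaced by a
        # parallel scan over the per-step affine updates
        h = torch.zeros(x.shape[0], h_tilde.shape[-1], device=x.device)
        out = []
        for t in range(x.shape[1]):
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]
            out.append(h)
        return torch.stack(out, dim=1)
```

Stacking several of these layers is what gives the indirect cross-token dependence described above: the scan within a layer is parallel, but each layer's input already mixes information from earlier positions.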
It ended up optimising in a way that wasn't obvious at first, but turned out to rely on the noise of one part interacting with another.
Aha: Here's the paper https://osmarks.net/assets/misc/evolved-circuit.pdf
And a fluff article https://www.damninteresting.com/on-the-origin-of-circuits
And as per usual, Google was hopeless at finding the article from a rough description. No chance at all. ChatGPT thought for 10s and delivered the correct result, first time.
I don't see logarithmic scaling, actually. From the table for GRU performance, going from 16384 -> 65536 (namely: increasing the input by 4x) is roughly a 4x increase in time whether looking at CPU-scan or GPU-scan. Okay, maybe the inputs need to be bigger. Looking at the next plot, which goes up to 524288, we see the same behavior: the delta between CPU-scan and GPU-scan doubles as we double the input. That's a constant multiplicative factor. Same holds for LSTM performance.
Is this an artifact of the benchmark setup? Are we actually measuring the amount of time needed to load the full context into RAM? Or perhaps we're bottlenecked on memory bandwidth?
> Success: The gate extraction kernel, which was a huge bottleneck, now only takes 8% of the total time and is memory-bandwidth bound, saturating L2 bandwidth at 1.9 TB/s. This is a good place to be.
Sounds like that might be the case.
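That reading is easy to sanity-check for any of these kernels: estimate the bytes a launch has to move, divide by the measured time, and compare with the device's peak bandwidth. If the kernel already sits near peak, its runtime has to grow linearly with sequence length no matter how shallow the scan's dependency chain is. A back-of-the-envelope sketch with made-up sizes and timings (not the article's numbers):

```python
def achieved_bandwidth_gb_s(bytes_moved, seconds):
    """Achieved memory bandwidth of one kernel launch, in GB/s."""
    return bytes_moved / seconds / 1e9

# hypothetical gate-extraction-style kernel: one read and one write per element
# of an fp16 (seq_len, hidden) tensor; all numbers below are invented
seq_len, hidden, dtype_bytes = 524_288, 256, 2
bytes_moved = 2 * seq_len * hidden * dtype_bytes
measured_seconds = 300e-6

bw = achieved_bandwidth_gb_s(bytes_moved, measured_seconds)
print(f"achieved ~{bw:.0f} GB/s")   # ~1790 GB/s here; compare to the GPU's peak

# If this sits near peak (the article reports ~1.9 TB/s of L2 bandwidth for the
# gate kernel), doubling seq_len roughly doubles the runtime: the kernel is
# data-movement bound, which matches the linear scaling seen in the tables.
```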