We Reverse-Engineered Flash Attention 4
Key topics: Flash Attention 4, GPU Optimization, Deep Learning
The post discusses the reverse-engineering of Flash Attention 4, a GPU optimization technique, sparking a discussion on the meaning of 'reverse-engineering' and the complexity of writing optimized GPU kernels.
Snapshot generated from the HN discussion
Discussion activity
- First comment: 1h after posting
- Peak period: 14 comments in the 2-4h window
- Average per period: 3.7 comments
- Comment distribution: 48 data points (based on 48 loaded comments)
Key moments
- Story posted: Sep 27, 2025 at 5:50 PM EDT (4 months ago)
- First comment: Sep 27, 2025 at 7:15 PM EDT (1h after posting)
- Peak activity: 14 comments in the 2-4h window
- Latest activity: Sep 29, 2025 at 3:36 AM EDT (3 months ago)
ID: 45399637 · Type: story · Last synced: 11/20/2025, 3:29:00 PM
Reductively, software engineering means taking an idea and mapping it into code. So one form of "reverse" engineering would be taking the code and extracting the ideas. That's what we did here.
Because the source is public, there's quite a lot to work with from the start -- the warp specializations are named and there are helpful comments in many places.
But for many components, we didn't have much. Maybe the clearest case of "reverse engineering" explained in the post is with the cubic approximation for the rational part of the exponentiation. That required staring at some inline assembly and doing math.
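To give a flavor of what that kind of approximation looks like, here is a generic sketch of the technique (these are not the coefficients actually recovered from the inline assembly, just an illustrative least-squares fit):

```python
# Rough sketch of the general trick, not the actual FA4 coefficients:
# approximate 2^x with a cubic polynomial on a reduced range.
import numpy as np

# Least-squares cubic fit to 2^f for f in [0, 1).
f_grid = np.linspace(0.0, 1.0, 1024)
coeffs = np.polyfit(f_grid, np.exp2(f_grid), deg=3)

def exp2_approx(x):
    # Range reduction: 2^x = 2^n * 2^f with n = floor(x) and f = x - n in [0, 1),
    # so only the fractional part needs the polynomial.
    n = np.floor(x)
    f = x - n
    return 2.0 ** n * np.polyval(coeffs, f)

x = np.linspace(-8.0, 8.0, 1001)
rel_err = np.abs(exp2_approx(x) - np.exp2(x)) / np.exp2(x)
print(rel_err.max())  # worst-case relative error of this illustrative fit
```

In a kernel, evaluating a cubic like this collapses to a handful of fused multiply-adds on the already range-reduced input, which is what made the inline assembly recognizable once we stared at it long enough.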
Not trying to be uncharitable; I found your article informative. Reverse engineering has historically been reserved for cases where there is an adversarial aspect, as with binaries or server APIs. Anyhow, cheers and thank you, sincerely.
Certainly I can't get on board with reverse engineered.
If you had reverse engineered it, you would have tried to recreate something that doesn't exist in order to do the same thing.
So, if you have binary code, you recreate source code that could, in theory, let you reproduce that binary.
If you have the source code, I suppose it would only count when you're missing pieces of information needed to run that code the way others do...
For example, simple hardware reversing can just be learning what something does, how, and why it works; you don't need to "recreate" anything other than the ideas.
Thus it was natural to call the process of producing design documents from undocumented software "reverse engineering". These days coding without any formal design documents is so common that it seems the original meaning of reverse engineering has become obscured.
https://en.wikipedia.org/wiki/IBM_Rational_Rose
> cudnn kernels are closed source, so Jensen only knows what’s going on in there.
Starting from high level source code is like starting from engineering drawings or the CAD model. You've already been handed most or all of the info that reverse engineering is attempting to recover.
I would get some pretty weird looks if I changed my CV to replace "maintained legacy application that I did not write" with "reverse engineering".
Similarly, I would get instant hoots of laughter if I told my dev managers over the last 28 years that I reverse engineered the legacy applications I was hired to work on.
I mean, I get what you're saying, but when you use the term "reverse engineering" in the context of software, you're just going to confuse everyone who already knows what it means.
Because I'm pretty sure most devs would not just read the code and go "ah yes, of course".
[1]: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv...
According to Wikipedia[1], "In 1990, the Institute of Electrical and Electronics Engineers (IEEE) defined (software) reverse engineering (SRE) as "the process of analyzing a subject system to identify the system's components and their interrelationships and to create representations of the system in another form or at a higher level of abstraction" in which the "subject system" is the end product of software development." It goes on to clarify that "Reverse engineering can be performed from any stage of the product cycle, not necessarily from the functional end product."
Further, "There are two components in reverse engineering: redocumentation and design recovery."
Are you arguing that the work here does not fit the definition or that the definition is wrong? In the latter case, could you please share your definition, and maybe even explain why it is superior to IEEE's?
[1] https://en.wikipedia.org/wiki/Reverse_engineering#Software
Though password cracking is not necessarily the best example, some (very bad!) hashing algorithms can actually be reversed that way. Figuring out the reverse is reverse engineering: you would reverse engineer the algorithm to figure out how to create a collision. In the same way, superoptimizers sort of reverse engineer the behavior you want in order to come up with a very efficient implementation. I'm using the term reverse engineer a bit loosely there, but you get the point. It has nothing to do with source code really; you can just as easily reverse engineer physical objects. Or artwork. Or the psyche.
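To make the "very bad hash" point concrete, here's a toy sketch (the constants and names are made up, and no real hash is this weak): a "hash" built only from invertible 32-bit steps can simply be run backwards.

```python
# Toy example: a deliberately terrible "hash" made of invertible 32-bit steps,
# so it can be reversed step by step. Purely illustrative.
MASK = 0xFFFFFFFF

def bad_hash(x: int) -> int:
    x = (x + 0x9E3779B9) & MASK   # add a constant (invertible: subtract it)
    x = (x ^ (x >> 16)) & MASK    # xor-shift (this one is its own inverse)
    x = (x * 0x85EBCA6B) & MASK   # multiply by an odd constant (invertible mod 2^32)
    return x

def un_bad_hash(h: int) -> int:
    h = (h * pow(0x85EBCA6B, -1, 2**32)) & MASK  # modular inverse undoes the multiply
    h = (h ^ (h >> 16)) & MASK                   # undo the xor-shift
    h = (h - 0x9E3779B9) & MASK                  # undo the addition
    return h

assert un_bad_hash(bad_hash(0xDEADBEEF)) == 0xDEADBEEF
```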
So yes, you can reverse engineer source code to understand on a deeper level how it works. Sometimes reading it over once or twice is enough for this, sometimes even reading the API documentation or observing behavior is enough, but sometimes you have to do a bit of thinking and/or testing to fully understand it.
If reverse engineering is reserved for cases without source code, which I assume also means no decompilation which often is an option, then what do we call figuring out what some piece of code does and why it does what it does? And why is it sufficiently different from reverse engineering to warrant a separate term?
As a fellow Tri Dao groupie and lucky duck who gets to build on Hopper/Blackwell clusters, I find it amazing how difficult it is becoming to write kernels that saturate GPU hardware.
When I squint, there appears to be a trend emerging across work like FA4, monolithic (mega) kernels, etc. Namely, a subversion of the classic CUDA programming model in the form of fine-grained, task-based parallelism, managed entirely in "user space".
Not exactly sure what’s ahead but I’m strapping in for a wild ride…
I was also reminded of HazyResearch's MegaKernels. Didn't want to distract from the main thrust of the post, but definitely think that's a promising approach.
A couple of years ago I did some experiments using a feed-forward network (MLP) as a surrogate for attention, to avoid the quadratic explosion.
It worked but had problems at the time, and my mind wasn't really in it.
This has dug it back out again with the benefit of time and additional insights.
So now I'm thinking: you can use a lot of the insights in the work here, but also shoot for a fully linear-scaling surrogate.
The trick is to use the surrogate as a discriminator under an RL regime during training.
Instead of just applying better/faster math and optimizations, have the model learn to work with a fundamentally better inference approach during training.
If you do that, you can turn the approximation error present in the FFN surrogate inference method into a recovery signal encoded into the model itself.
I haven't tried it, but don't see a reason it shouldn't work. Will give it a go on a GPT-2 model ASAP.
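For what it's worth, here is a minimal sketch of the kind of linear-scaling surrogate I have in mind; the module name, shapes, and pooling choice are all made up for illustration, and the RL/discriminator training signal described above isn't shown:

```python
# Hypothetical sketch of a linear-scaling attention surrogate: instead of the
# O(n^2) QK^T interaction, each token is combined with a pooled summary of the
# whole sequence and passed through a small MLP. One possible shape of the
# idea, not a tested design.
import torch
import torch.nn as nn

class MLPAttentionSurrogate(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.to_summary = nn.Linear(d_model, d_model)  # per-token features to pool
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        summary = self.to_summary(x).mean(dim=1, keepdim=True)  # (batch, 1, d_model), O(n)
        summary = summary.expand(-1, x.size(1), -1)             # broadcast to every token
        return self.mlp(torch.cat([x, summary], dim=-1))        # O(n) in sequence length

x = torch.randn(2, 128, 512)
print(MLPAttentionSurrogate(512)(x).shape)  # torch.Size([2, 128, 512])
```

The point of the RL step would be to train the base model against a surrogate like this so the approximation error becomes a signal the model learns to absorb, rather than pure degradation at inference time.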
Thanks again for the awesome article.
That question aside, I'm not a fan of this blog post's content at all. It might just be me being too slow, but I don't find it easy to understand: very little concrete information and a lot of digressions, like the constant references to research articles and related topics. It reads like those low-value research papers that try to show the authors did their work by piling on references.