Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
Mood
calm
Sentiment
positive
Category
other
Key topics
The post discusses optimizing flash attention for NVIDIA's 5090 GPU using CUDA C++, sparking a discussion on GPU performance, optimization techniques, and the trade-offs between different NVIDIA GPU models.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 3h after posting
Peak period: 8 comments (Hour 15)
Avg / period: 2.6 comments
Based on 34 loaded comments
Key moments
- Story posted: Aug 23, 2025 at 8:29 AM EDT (3 months ago)
- First comment: Aug 23, 2025 at 11:37 AM EDT (3h after posting)
- Peak activity: 8 comments in Hour 15 (hottest window of the conversation)
- Latest activity: Aug 24, 2025 at 5:17 AM EDT (3 months ago)
> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.
Starting with the 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in ML training. FP8 and FP16 matmuls run at full speed if accumulating in FP16 (I've never seen anyone use this), but only at half speed when accumulating in FP32. This restriction is not present for lower-precision matmuls like FP4, and it is removed entirely on workstation-class cards like the RTX Pro 6000.
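A quick way to see the accumulation-precision throttling is to time the same FP16 GEMM under both compute types with cuBLAS. This is a minimal sketch of my own (matrix size, iteration count, and the lack of a warmup pass are arbitrary choices, not anything from the post); on a gaming card the FP32-accumulate number should come in at roughly half the FP16-accumulate one.

```cuda
// Sketch: compare FP16 GEMM throughput with FP16 vs FP32 accumulation via cuBLAS.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

static float time_gemm(cublasHandle_t handle, int n, const __half* A,
                       const __half* B, __half* C, cublasComputeType_t compute) {
    // alpha/beta must match the compute type (__half for 16F, float for 32F).
    __half h_alpha = __float2half(1.0f), h_beta = __float2half(0.0f);
    float  f_alpha = 1.0f,               f_beta = 0.0f;
    const void* alpha = (compute == CUBLAS_COMPUTE_16F) ? (const void*)&h_alpha : (const void*)&f_alpha;
    const void* beta  = (compute == CUBLAS_COMPUTE_16F) ? (const void*)&h_beta  : (const void*)&f_beta;

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                     beta,  C, CUDA_R_16F, n,
                     compute, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms / 10.0f;   // average time per GEMM in milliseconds
}

int main() {
    const int n = 8192;   // large enough to be compute-bound, not bandwidth-bound
    __half *A, *B, *C;    // contents left uninitialized; only timing matters here
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(__half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    double flops = 2.0 * n * (double)n * n;   // FLOPs per GEMM
    float ms16 = time_gemm(handle, n, A, B, C, CUBLAS_COMPUTE_16F);
    float ms32 = time_gemm(handle, n, A, B, C, CUBLAS_COMPUTE_32F);
    printf("FP16 accumulate: %.1f TFLOP/s\n", flops / (ms16 * 1e9));
    printf("FP32 accumulate: %.1f TFLOP/s\n", flops / (ms32 * 1e9));

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```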
It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models used to be cheaper to run on a 3090 than an A100). They are generous with memory bandwidth, though; nearly 2 TB/s on the 5090 is amazing!
5090: 210 TF / $2k == 105 TF/$k
B200: 2250 TF / $40k == 56 TF/$k
Getting only 2x the FLOPs per dollar probably isn't worth the hassle of having to rack 10x as many GPUs, while having no NVLink.
There's a lot more to performance computing than FLOPs. FLOPs are a good, high-level, easy-to-understand metric, but they're a small part of the story once you're in the weeds.
To help make sense of this, look at CPU frequencies. Most people on HN know that two CPUs with the same frequency can have dramatically different benchmark results, and that some of the difference comes down to things like IPC (instructions per cycle) or cache structure. There's even more to it, and it isn't easy to measure.
On a GPU all of that is true, only with more complexity. A GPU is more like a whole motherboard where the PCIe connection is a really fast network link. The analogy has plenty of faults, but it's closer to reality than just comparing TFLOPs.
NVIDIA's moat has always been "CUDA". Quotes because even that is a messier term than most think (CUTLASS, cuBLAS, cuDNN, CuTe, etc.). The new cards are simply capable of things the older ones aren't, through a mix of hardware and software.
I know this isn't a great answer, but there isn't one. You'll probably get responses that each cover part of the story, but it's hard to paint a really good picture in a comment. There's no answer that is both good and short.
What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on the 5090.
[0]: https://arxiv.org/abs/2506.08027 [1]: https://arxiv.org/abs/2502.20586
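For readers unfamiliar with the format, here is a rough sketch of what MXFP4 looks like at the data level: FP4 (E2M1) elements sharing one power-of-two scale per 32-element block. The helper names and the round-to-nearest policy are my own illustrative assumptions, not code from the post or from [1].

```cuda
// Sketch: host-side MXFP4 quantization of a single 32-element block.
#include <cmath>
#include <cstdint>
#include <cstdio>

// The 8 non-negative magnitudes representable in FP4 E2M1.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round a magnitude to the nearest E2M1 value and return its 3-bit code.
static uint8_t round_to_e2m1(float mag) {
    uint8_t best = 0;
    float best_err = fabsf(mag - kE2M1[0]);
    for (uint8_t i = 1; i < 8; ++i) {
        float err = fabsf(mag - kE2M1[i]);
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}

// Quantize one 32-element block: choose a power-of-two scale so the largest
// magnitude lands near E2M1's max (6.0), then encode sign (1 bit) + magnitude
// code (3 bits) per element. A real MX block stores the scale as a biased
// E8M0 byte; a plain exponent is kept here for readability.
static void quantize_mxfp4_block(const float in[32], uint8_t codes[32],
                                 int8_t* scale_exp) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = fmaxf(amax, fabsf(in[i]));
    int e = (amax > 0.0f) ? (int)floorf(log2f(amax)) - 2 : 0;  // floor(log2(6)) == 2
    *scale_exp = (int8_t)e;
    float inv_scale = exp2f((float)-e);
    for (int i = 0; i < 32; ++i) {
        uint8_t sign = in[i] < 0.0f ? 1 : 0;
        uint8_t mag = round_to_e2m1(fabsf(in[i]) * inv_scale);
        codes[i] = (uint8_t)((sign << 3) | mag);   // low 4 bits hold the FP4 value
    }
}

int main() {
    float block[32];
    for (int i = 0; i < 32; ++i) block[i] = 0.1f * (i - 16);
    uint8_t codes[32];
    int8_t scale_exp;
    quantize_mxfp4_block(block, codes, &scale_exp);
    printf("scale = 2^%d, first code = 0x%x\n", (int)scale_exp, codes[0]);
    return 0;
}
```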
On the contrary, older GPUs are a lot easier to hit rooflines on. Newer GPUs run so fast that NVIDIA keeps adding new tricks to remove bottlenecks. Not to discount the author's work here, but a 5090 is pretty bad on the FLOPs/memory-bandwidth ratio, so it's comparatively easy to get throttled by the tensor cores there; on datacenter hardware the tensor cores are so fast that you'll hit limits that were glossed over here.
For example, using Ampere-style "mma" instructions won't cut it, because they compute a really small MMA and force your inputs to live in registers. You'll need TMA to get data into shared memory and wgmma to do a matrix multiply out of it. At those speeds you'll also run into trouble dispatching instructions and computing addresses (and doing out-of-bounds checks) fast enough, so you have to offload that to specialized hardware to keep up with the tensor cores.
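For concreteness, this is the shape of the register-resident Ampere-style mma the comment refers to; a minimal sketch with fragment loading (ldmatrix, swizzling) omitted. Each mma.sync.m16n8k16 produces only a 16x8 accumulator tile per warp, and A, B, and the accumulator all occupy registers, which is exactly what the newer TMA plus shared-memory MMA paths are designed to avoid.

```cuda
// Sketch: one warp-wide 16x8x16 FP16 MMA with FP32 accumulation (sm_80+).
__device__ void mma_m16n8k16_f16f16_f32(const unsigned a[4],  // A fragment: 16x16 f16 tile, 4 regs/thread
                                        const unsigned b[2],  // B fragment: 16x8  f16 tile, 2 regs/thread
                                        float c[4]) {         // C/D accumulator: 16x8 f32 tile, 4 regs/thread
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])     // accumulate in place
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```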
The tip that Nsight can run on a Mac over SSH is great, too. I've basically been capturing and viewing data over RDP; I'll have to give it a shot next week.