Matmul on Blackwell: Part 2 – Using Hardware Features to Optimize Matmul
Posted 4 months ago · Active 4 months ago · modular.com · Tech story
Sentiment: skeptical/mixed · Debate: 60/100
Key topics
- GPU Optimization
- Matrix Multiplication
- Nvidia Blackwell
The article discusses optimizing matrix multiplication on NVIDIA's Blackwell architecture using hardware features, sparking a discussion on vendor-specific extensions and their implications for lock-in.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement · 13 comments
- First comment: 1 day after posting
- Peak period: 7 comments in the 30-36h window
- Average per period: 3.3 comments
Key moments
- 01 Story posted: Sep 5, 2025 at 2:16 PM EDT (4 months ago)
- 02 First comment: Sep 6, 2025 at 8:19 PM EDT (1 day after posting)
- 03 Peak activity: 7 comments in the 30-36h window
- 04 Latest activity: Sep 8, 2025 at 9:22 AM EDT (4 months ago)
Want the full context? Read the primary article or dive into the live Hacker News thread.
From the discussion:
Modular also has a paid platform for serving models called Max. I haven't used it, but I've heard good things.
There seem to be enthusiasts who have experimented a bit and like what they see, but I haven't seen much else.
I haven't used Triton/CuTe/CUTLASS, though, so I can't really compare against anything other than CUDA.
Modular has been pushing the notion that they are building technology for writing hardware-vendor-neutral solutions, so that users can break free of NVIDIA's hold on high-performance kernels.
From their own writing:
> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).
So you can support either vendor with performance on par with the vendor's own libraries. That's not lock-in, to me at least.
It's not as good as the compiler being able to just magically produce optimized kernels for arbitrary hardware, though; fully agree there. But it's a big step forward from CUDA/HIP.
Basically, you give the compiler a good description of the hardware, and it automatically generates a state-of-the-art GEMM kernel.
Maybe it's 20% worse than NVIDIA's hand-written kernels, but you can switch hardware vendors or build arbitrary fused kernels at will.
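To make that idea concrete, here is a minimal sketch in Python of what "a hardware description drives kernel generation" could look like: a hypothetical per-device descriptor, a tile-size heuristic derived from it, and the blocked loop nest a code generator would map onto those tiles. Everything here (HardwareDescriptor, pick_tile, the memory figures) is an illustrative assumption, not Modular's actual API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HardwareDescriptor:
    """Hypothetical per-device limits a codegen backend might consume."""
    shared_mem_bytes: int    # usable shared memory per SM / compute unit
    vector_width: int        # elements per vector load
    bytes_per_elem: int = 4  # float32

def pick_tile(hw: HardwareDescriptor) -> int:
    """Largest power-of-two square tile whose A and B blocks both fit in shared memory."""
    tile = hw.vector_width
    while 2 * (2 * tile) ** 2 * hw.bytes_per_elem <= hw.shared_mem_bytes:
        tile *= 2
    return tile

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int) -> np.ndarray:
    """Blocked GEMM loop nest: the structure a compiler would map onto hardware tiles."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

# Usage: two invented descriptors standing in for different vendors' parts.
big_smem = HardwareDescriptor(shared_mem_bytes=228 * 1024, vector_width=4)
small_smem = HardwareDescriptor(shared_mem_bytes=64 * 1024, vector_width=4)

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
for hw in (big_smem, small_smem):
    tile = pick_tile(hw)
    assert np.allclose(tiled_matmul(a, b, tile), a @ b, atol=1e-2)
    print(f"tile={tile} for {hw.shared_mem_bytes // 1024} KiB shared memory")
```

A real compiler would layer much more on top of this (tensor-core instruction selection, pipelining, swizzled layouts), but the shape of the search is the same: pick block sizes that saturate the memory hierarchy the hardware description declares, rather than hard-coding them per vendor.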