Matmul on Blackwell: Part 2 – Using Hardware Features to Optimize Matmul
Posted 4 months ago · Active 4 months ago · modular.com · Tech story
Sentiment: skeptical/mixed · Debate: 60/100
Key topics
- GPU Optimization
- Matrix Multiplication
- Nvidia Blackwell
The article discusses optimizing matrix multiplication on NVIDIA's Blackwell architecture using hardware features, sparking a discussion on vendor-specific extensions and their implications for lock-in.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement · 13 comments
- First comment: 1 day after posting
- Peak period: 7 comments in the 30-36h window
- Average per period: 3.3 comments
Key moments
- 01 Story posted: Sep 5, 2025 at 2:16 PM EDT (4 months ago)
- 02 First comment: Sep 6, 2025 at 8:19 PM EDT (1 day after posting)
- 03 Peak activity: 7 comments in the 30-36h window
- 04 Latest activity: Sep 8, 2025 at 9:22 AM EDT (4 months ago)
Want the full context? Read the primary article or dive into the live Hacker News thread.
From the discussion:
Modular also has a paid platform for serving models called Max. I haven't used it, but I've heard good things.
There seem to be enthusiasts who have experimented a bit and like what they see, but I haven't seen much else.
I haven't used Triton/CuTe/CUTLASS, though, so I can't really compare against anything other than CUDA.
Modular has been pushing the notion that they are building technology for writing hardware-vendor-neutral solutions, so that users can break free of NVIDIA's hold on high-performance kernels.
From their own writing:
> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).
So you can support either vendor with performance on par with the vendor's own libraries. That's not lock-in, to me at least.
It's not as good as the compiler being able to just magically produce optimized kernels for arbitrary hardware, though; fully agree there. But it's a big step forward from CUDA/HIP.
Basically, you give the compiler a good description of the hardware, and it automatically generates a state-of-the-art GEMM kernel.
Maybe it's 20% worse than NVIDIA's hand-written kernels, but you can switch hardware vendors or build arbitrary fused kernels at will.
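To make that idea concrete, here is a minimal sketch in Python of what "a hardware description drives kernel generation" could look like: a hypothetical per-device descriptor, a tile-size heuristic derived from it, and the blocked loop nest a code generator would map onto those tiles. Everything here (HardwareDescriptor, pick_tile, the memory figures) is an illustrative assumption, not Modular's actual API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HardwareDescriptor:
    """Hypothetical per-device limits a codegen backend might consume."""
    shared_mem_bytes: int    # usable shared memory per SM / compute unit
    vector_width: int        # elements per vector load
    bytes_per_elem: int = 4  # float32

def pick_tile(hw: HardwareDescriptor) -> int:
    """Largest power-of-two square tile whose A and B blocks both fit in shared memory."""
    tile = hw.vector_width
    while 2 * (2 * tile) ** 2 * hw.bytes_per_elem <= hw.shared_mem_bytes:
        tile *= 2
    return tile

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int) -> np.ndarray:
    """Blocked GEMM loop nest: the structure a compiler would map onto hardware tiles."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

# Usage: two invented descriptors standing in for different vendors' parts.
big_smem = HardwareDescriptor(shared_mem_bytes=228 * 1024, vector_width=4)
small_smem = HardwareDescriptor(shared_mem_bytes=64 * 1024, vector_width=4)

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
for hw in (big_smem, small_smem):
    tile = pick_tile(hw)
    assert np.allclose(tiled_matmul(a, b, tile), a @ b, atol=1e-2)
    print(f"tile={tile} for {hw.shared_mem_bytes // 1024} KiB shared memory")
```

A real compiler would layer much more on top of this (tensor-core instruction selection, pipelining, swizzled layouts), but the shape of the search is the same: pick block sizes that saturate the memory hierarchy the hardware description declares, rather than hard-coding them per vendor.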