Writing High-Performance Matrix Multiplication Kernels for Blackwell
Posted 3 months ago · Active 3 months ago
docs.jax.dev · Tech · story
excited · positive
Debate: 20/100
Key topics
High-Performance Computing
Matrix Multiplication
GPU Optimization
The post discusses the implementation of high-performance matrix multiplication kernels for NVIDIA's Blackwell GPUs using JAX's Pallas, sparking interest and comparisons to previous work on CUDA matrix multiplication.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 4d
Peak period: 5 comments in 96-108h
Avg / period: 3
Key moments
- 01 Story posted: Oct 2, 2025 at 11:43 AM EDT (3 months ago)
- 02 First comment: Oct 6, 2025 at 4:54 PM EDT (4d after posting)
- 03 Peak activity: 5 comments in 96-108h (hottest window of the conversation)
- 04 Latest activity: Oct 7, 2025 at 5:03 PM EDT (3 months ago)
ID: 45451217 · Type: story · Last synced: 11/20/2025, 2:49:46 PM
Seems like the Pallas of old has been completely upgraded.
What's interesting is that the MGPU (Mosaic GPU) team has achieved SOTA Blackwell GEMM performance before Triton (which, IIUC, is trying to bring up Gluon to reach the same level). All the big players are coming up with their own block-based, low-level-ish DSLs for CUDA: OpenAI, NVIDIA, and now Google.
I wonder if the same person wrote it.
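For readers who have not seen Pallas, the sketch below shows the general shape of the block-based DSL the comments refer to: a kernel operating on one output tile, launched over a grid with BlockSpecs. It is illustrative only; the block sizes, the full-K BlockSpecs, and the plain jnp.dot are assumptions, not the article's tuned Blackwell kernel.

```python
# Minimal Pallas matmul sketch, for flavor only: NOT the article's Blackwell
# kernel. Block sizes, full-K BlockSpecs, and plain jnp.dot are illustrative
# assumptions; the real kernels tile K and use Blackwell-specific features.
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def matmul_kernel(x_ref, y_ref, o_ref):
    # Each program instance computes one (block_m, block_n) output tile.
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])


def matmul(x, y, block_m=128, block_n=128):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
        grid=(m // block_m, n // block_n),
        in_specs=[
            pl.BlockSpec((block_m, k), lambda i, j: (i, 0)),  # row block of x
            pl.BlockSpec((k, block_n), lambda i, j: (0, j)),  # column block of y
        ],
        out_specs=pl.BlockSpec((block_m, block_n), lambda i, j: (i, j)),
    )(x, y)


x = jnp.ones((256, 512), jnp.float32)
y = jnp.ones((512, 256), jnp.float32)
print(matmul(x, y).shape)  # (256, 256)
```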