Matrix Core Programming on AMD GPUs
Posted 3 months ago · Active 3 months ago
salykova.github.io · Tech · story
Tone: calm, mixed
Debate: 40/100
Key topics: GPU Programming · Matrix Multiplication · AMD Hardware
The article discusses programming matrix cores on AMD GPUs, sparking a discussion on the suitability of GPUs for matrix multiplication and the complexities of parallel processing.
Snapshot generated from the HN discussion
Discussion Activity
- Light discussion
- First comment: 6h after posting
- Peak period: 2 comments in the 6-8h window
- Avg comments per period: 1.3
Key moments
- 01 Story posted: Oct 4, 2025 at 5:22 PM EDT (3 months ago)
- 02 First comment: Oct 4, 2025 at 11:42 PM EDT (6h after posting)
- 03 Peak activity: 2 comments in the 6-8h window, the hottest stretch of the conversation
- 04 Latest activity: Oct 5, 2025 at 12:41 PM EDT (3 months ago)
ID: 45476821 · Type: story · Last synced: 11/20/2025, 12:59:45 PM
You're pretending that each streaming multiprocessor can handle independent threads, when in reality you're feeding something that exists only once or twice per SM. It's like independently controlling one of 32 cars on a 32-lane highway where the cars aren't allowed to switch lanes and the controls of one car are replicated to all the others, when in reality everyone is sitting in the same bus.
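The "everyone is sitting in the same bus" point is the SIMT execution model: all 32 lanes of a warp share one program counter, and a branch is handled by running both paths with per-lane masks. Here is a toy Python model of that lockstep-with-masking behavior (an illustration of the semantics only, not of any real hardware; `WARP_SIZE` and the branch body are made up for the example):

```python
# Toy model of SIMT lockstep execution. All 32 lanes share one set of
# "controls" (one program counter); on a branch the warp executes BOTH
# paths, and a mask decides which lanes keep which result.

WARP_SIZE = 32

def simt_execute(values):
    """Run `x * 2 if x is even else x + 1` on all lanes in lockstep."""
    mask = [v % 2 == 0 for v in values]           # lanes taking the 'then' path
    then_result = [v * 2 for v in values]         # whole warp runs this path
    else_result = [v + 1 for v in values]         # ...and this one too
    # Each lane keeps the result of the path its mask bit selects.
    return [t if m else e for m, t, e in zip(mask, then_result, else_result)]

lanes = list(range(WARP_SIZE))
print(simt_execute(lanes)[:4])  # → [0, 2, 4, 4]
```

Note that both list comprehensions run over every lane regardless of the mask, which is exactly why divergent branches waste throughput on real GPUs.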
My mental model of SMs has always been "assume AVX-512 is the default ISA" and "tensor cores are another layer alongside this" (kind of like AMX), so you end up with this heterogeneous "thing" to program. I don't know if that helps. The CUDA programming model hides a lot, and looking at the PTX code in Nsight Compute is most enlightening.
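The "tensor cores as another layer" idea maps to the article's subject: a matrix-core (MFMA-style) instruction lets the whole warp/wavefront issue one tile multiply-accumulate, D = A·B + C, rather than lane-wise scalar FMAs. A toy sketch of what one such instruction computes (the 16x16x4 FP32 tile shape is one real MFMA variant on AMD hardware, but this Python model only illustrates the semantics, not the register layout or the hardware):

```python
# Toy model of one MFMA-style matrix-core "instruction":
# a full tile multiply-accumulate D = A @ B + C in a single issue.
# Tile shape M x N x K = 16 x 16 x 4; larger matmuls are built by
# looping this over tiles and accumulating into C.

M, N, K = 16, 16, 4

def mfma_tile(A, B, C):
    """D[m][n] = sum_k A[m][k] * B[k][n] + C[m][n] for one tile."""
    return [[sum(A[m][k] * B[k][n] for k in range(K)) + C[m][n]
             for n in range(N)] for m in range(M)]

A = [[1.0] * K for _ in range(M)]   # 16x4 tile of ones
B = [[1.0] * N for _ in range(K)]   # 4x16 tile of ones
C = [[0.0] * N for _ in range(M)]   # 16x16 accumulator
D = mfma_tile(A, B, C)
print(D[0][0])  # → 4.0 (K ones accumulated)
```

Seen this way, the vector units are the "AVX-512 default" layer and the matrix cores are a coarser-grained layer stacked beside them, much like AMX tiles sit beside AVX-512 on recent Intel CPUs.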