How to Tile Matrix Multiplication (2023)
Posted 3 months ago · Active 3 months ago
alvinwan.com · Tech · story
Key topics
Matrix Multiplication
Algorithm Optimization
Linear Algebra
The post discusses how to optimize matrix multiplication using tiling/blocking, with commenters sharing additional resources and insights on related algorithms.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
- First comment: 4d after posting
- Peak period: 6 comments in 96-108h
- Avg / period: 2.3
Key moments
- 01 Story posted: Oct 2, 2025 at 9:37 PM EDT (3 months ago)
- 02 First comment: Oct 6, 2025 at 6:39 PM EDT (4d after posting)
- 03 Peak activity: 6 comments in 96-108h (hottest window of the conversation)
- 04 Latest activity: Oct 8, 2025 at 9:20 PM EDT (3 months ago)
ID: 45457825 · Type: story · Last synced: 11/20/2025, 2:21:16 PM
At first, there are 16 fetches per output element (a full row of one input and a full column of the other), 1024 in total. Then it is observed that an input row only needs to be fetched once per output row, reducing this to 8 fetches per output row plus 8 per output element: 8 * 8 + 8 * 64 = 576 in total. This still requires only the same 16 numbers to be kept in registers.
But then it is claimed that by computing one quadrant of the output at a time, all that is needed is 64 fetches per quadrant, or 256 fetches in total. That assumes we can keep 4 rows and 4 columns resident, at 8 numbers per row or column = 64 numbers in registers! If we can only keep 16 numbers, as above, each row of the quadrant takes 40 fetches, giving 160 fetches per quadrant or 640 fetches in total, a pessimization from 576 fetches!
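To make the arithmetic concrete, here is a rough fetch-counting model in Python (my own sketch, not from the article; it simply assumes anything not currently held in registers must be re-fetched):

```python
# Hypothetical fetch-counting model for C = A @ B with 8x8 matrices.
# A "fetch" is any load of an input element that is not already in registers.

N = 8  # matrix dimension

# Naive: every output element re-fetches its full row of A and column of B.
naive = N * N * (N + N)                      # 64 outputs * 16 fetches = 1024

# Row reuse: fetch each row of A once per output row, columns of B per element.
row_reuse = N * N + N * N * N                # 8*8 row fetches + 8*64 = 576

# Quadrant tiling with enough registers (4 rows of A + 4 cols of B resident):
# each 4x4 output quadrant fetches its 4 rows and 4 columns exactly once.
quadrant_big_regs = 4 * (4 * N + 4 * N)      # 4 quadrants * 64 fetches = 256

# Quadrant tiling with only ~16 registers: each of the 4 rows in a quadrant
# fetches its A row (8) plus all 4 B columns (32) again.
quadrant_small_regs = 4 * (4 * (N + 4 * N))  # 4 quadrants * 160 fetches = 640

print(naive, row_reuse, quadrant_big_regs, quadrant_small_regs)
```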
The next section discusses exactly this, e.g. how to deal with finite register/shared-memory capacity by splitting the k dimension. I'll mention the shared/register memory limitation sooner to clear up the confusion.
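For anyone curious what splitting the k dimension can look like, a rough illustrative NumPy sketch (the function name and tile sizes are mine, and this is not necessarily the article's exact scheme):

```python
import numpy as np

def tiled_matmul_k_split(A, B, tile=4, k_tile=4):
    """Sketch: tile the output into tile x tile blocks and split the reduction
    (k) dimension into chunks of k_tile, so only a small slab of A and B has to
    be resident alongside the tile of accumulators at any one time."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)  # stays "in registers"
            for kk in range(0, k, k_tile):
                a_slab = A[i:i+tile, kk:kk+k_tile]       # small slab of A
                b_slab = B[kk:kk+k_tile, j:j+tile]       # small slab of B
                acc += a_slab @ b_slab                   # accumulate partial product
            C[i:i+tile, j:j+tile] = acc                  # one write-back per tile
    return C

A = np.arange(64.0).reshape(8, 8)
B = np.ones((8, 8))
assert np.allclose(tiled_matmul_k_split(A, B), A @ B)
```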
"How effective is tiling?" and "Why tiling tiling is so fast" should be at the end, while the key section "Why there's a limit to tiling" which should be front and center is in the middle, followed by a subversion of the entire concept in "How to sidestep tiling limits"
It's also incredibly jarring to read this:
"Wondering how we were able to reduce memory usage "for free"? Indeed, the reduction wasn't free. In fact, we paid for this reduction a different way — by incurring more writes."
This is, again, completely backwards. Let's assume you don't have a cache at all: you'll have to write everything out to DRAM every single time. The opposite also holds. Imagine you had an infinite number of registers: every addition would accumulate into a register, which is still a write operation. Hence, the number of write operations doesn't change.
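A toy way to see it (my own counting model, not from the article): each of the k partial-sum updates for a single output element is one write, wherever it lands.

```python
# Toy model: where do the k partial-sum updates for one element of C = A @ B go?
k = 8

# No registers at all: every partial sum goes back to DRAM.
dram_accumulator_writes = k      # 8 writes, all to DRAM

# Infinite registers: every partial sum goes into a register instead.
register_accumulator_writes = k  # still 8 writes, just to a register

# Either way, each multiply-add produces exactly one write; registers change
# where the writes land, not how many there are.
print(dram_accumulator_writes, register_accumulator_writes)
```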
Really the main points should be in this order (points 4-6 are sketched in code below):
1. Matrix multiplication works best with square or almost-square matrices.
2. Registers and SRAM (including caches) are limited, forcing you to process matrices of finite size (aka tiles).
3. The memory hierarchy means the biggest matrix you can store gets bigger at each level further from the compute.
4. You can split matrix multiplication using inner and outer products.
5. Outer products take few inputs and have many outputs/accumulators; inner products take many inputs and have few outputs/accumulators.
6. You want to calculate the biggest outer product you can get away with, since this significantly reduces the memory needed to store inputs and maximizes the number of cycles spent doing calculations; once you hit the limit, you want to reuse the accumulators, so you calculate inner products of outer products.
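To illustrate points 4-6, here is a rough NumPy sketch of the two formulations (the function names are mine):

```python
import numpy as np

def matmul_inner(A, B):
    """Inner-product formulation: each C[i, j] reads a whole row of A and a
    whole column of B (many inputs) but accumulates into a single scalar."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            C[i, j] = A[i, :] @ B[:, j]       # 2k inputs -> 1 accumulator
    return C

def matmul_outer(A, B):
    """Outer-product formulation: each step reads one column of A and one row
    of B (few inputs) but updates the entire n x m block of accumulators."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for p in range(k):
        C += np.outer(A[:, p], B[p, :])       # n + m inputs -> n*m accumulators
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(matmul_inner(A, B), A @ B)
assert np.allclose(matmul_outer(A, B), A @ B)
```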
Normal block multiplication splits each matrix into four sub-blocks (e.g. C11 = A11*B11 + A12*B21, and similarly for the other quadrants), which takes 8 matrix multiplications on the sub-blocks. But by cleverly defining only 7 different matrix multiplications on top of block additions and subtractions, you can make the C blocks out of just additions and subtractions of those 7 products: https://en.wikipedia.org/wiki/Strassen_algorithm
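For concreteness, a rough one-level Strassen sketch in NumPy (the M1..M7 definitions follow the Wikipedia article linked above):

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen: 7 block multiplications instead of 8,
    at the cost of extra block additions and subtractions."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(strassen_one_level(A, B), A @ B)
```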
As far as I know this is not useful in the major GPU libraries for saving bandwidth, but I have never bothered to spend the time to figure out why. It must have something to do with the ratio of bandwidth to FLOPs, which is way past my knowledge of GPUs.
Asymptotically, I don't think Strassen can be performing Theta(n^3) memory operations, since it runs in sub-n^3 time.
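For reference, Strassen's recurrence T(n) = 7*T(n/2) + O(n^2) solves to Theta(n^log2(7)), which is sub-cubic; a quick check of the exponent:

```python
import math

# Strassen does 7 half-size multiplies per level, so total work (and hence any
# memory traffic it can generate) grows as n^log2(7), which is below n^3.
print(math.log2(7))  # ~2.807
```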