AMD GPUs Go Brrr
Mood: excited
Sentiment: positive
Category: tech
Key topics: AMD GPUs, AI Hardware, Stanford Research
Researchers at Stanford have made significant advancements in optimizing AMD GPUs for AI workloads, achieving impressive performance gains.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3h after posting
Peak period: 74 comments (Day 1)
Avg / period: 74 comments
Based on 74 loaded comments
Key moments
- 01 Story posted: 11/15/2025, 2:06:16 AM (4d ago)
- 02 First comment: 11/15/2025, 5:31:45 AM (3h after posting)
- 03 Peak activity: 74 comments in Day 1 (hottest window of the conversation)
- 04 Latest activity: 11/15/2025, 4:57:40 PM (3d ago)
I don't get why AMD doesn't solve their own software issues. They have a lot of money now, so not having money to pay for developers is no excuse.
And data center GPUs aren't even the worst of it. Using GPU compute for things like running inference at home is a much, much better experience with Nvidia. My 5-year-old RTX 3090 is better than any consumer GPU AMD has released to date, at least for experimenting with ML and AI.
Anything specific related to DC level computing?
I must say it's been a completely positive experience. The mainline Fedora kernel just worked without any need to mess with DKMS. I just forwarded the /dev/dri/* devices to my containers, and everything worked fine with ROCm.
I needed to grab a different image (-rocm instead of -cuda) for Ollama and change the whisper build type for Storyteller. And that was it! On the host, nvtop works fine for visualizing GPU state, and VAAPI provides accelerated encoding for ffmpeg.
Honestly, it's been an absolutely pleasant experience compared to getting NVidia CUDA to work.
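For anyone who wants to reproduce that setup, here is a minimal sketch of handing the AMD devices to a containerized Ollama using its ROCm image; the image tag, port, and volume name follow Ollama's published Docker instructions, and /dev/kfd is assumed to be needed alongside /dev/dri for ROCm compute. Treat it as illustrative rather than a verified recipe.

    # pass the AMD compute (/dev/kfd) and render (/dev/dri) devices into the container
    docker run -d \
      --device /dev/kfd \
      --device /dev/dri \
      -v ollama:/root/.ollama \
      -p 11434:11434 \
      --name ollama \
      ollama/ollama:rocm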
NVidia is the exception to the rule when it comes to hardware companies paying competitive salaries for software engineers. I imagine AMD is still permeated by the attitude that software "isn't real work" and doesn't deserve more compensation, and that kind of inertia is very hard to overcome.
I’m genuinely dumbfounded by what’s up at AMD at this point.
Keeps the incentives pure.
I'm even willing to accept a 20% performance hit for this requirement, should someone bring that up.
That's self-contradictory. Their incentive is to sell more HW at higher prices using whatever shady practices they can get away with, software or no software. There's nothing pure about that; it's just business. High-end chips aren't commodity HW like lawnmowers; they can't function without the right SW.
And this isn't the '90s anymore, when Hercules or S3 would only make the silicon and system integrators would write the drivers for it, which were basically MS-DOS calls reading and writing registers over the PCI bus, written by devs working from a 300-page manual. Those days are long gone. Modern silicon is orders of magnitude more complex; nobody besides the manufacturer could write drivers that extract the most performance out of it.
>I'm even willing to accept a 20% performance hit for this requirement, should someone bring that up.
I'm also willing to accept arbitrary numbers I make up, as a tradeoff, but the market does not work like that.
And you don't think these shady practices will leak into the software?
> Modern silicon is orders of magnitude more complex; nobody besides the manufacturer could write drivers that extract the most performance out of it...
The hardware people at the manufacturer are not the software people. So there __must__ be documentation.
YES, internal documentation, full of proprietary IP.
That depends on whether OP is buying/renting AMD GPU machines.
Let's not go too far here. Reverse engineering and independent development of usable drivers are not impossible, they're 'merely' extremely challenging. Alyssa Rosenzweig in particular had great success reverse engineering the Apple M1 GPU and writing drivers for it, and that was just a few years ago.
This is just an HN fantasy that's not compatible with the business of making money. That's why everyone here makes money working in SW.
The 300-page manual would be 3,000 or 30,000 pages long, if modern ARM ISA manuals are any indication. Independent developers could totally write performant drivers if they had the documents, but those manuals do not exist - or if they do, they're proprietary.
And depending on others to write firmware for your hardware, I don’t think that’s a recipe for success.
Hardware team at AMD: "Sorry, hardware can't exist without software; we'll first have to write the software"
Software team: "But we're the software team ..."
Hardware team: "Uhm yeah ... seems we have a nasty chicken and egg problem here"
Also, the 20% would be open to further optimization by the community, so it wouldn't be that bad in practice, probably.
If Nvidia dominates because of CUDA, why can it do that but AMD shouldn't?
It should be obvious by now, though, that there's a symbiosis between software and hardware, and that support timescales are longer. Another angle is that it's about more than just AMD's own software developers: there are also the developers building products for AMD's customers, who in turn buy AMD hardware if everyone works together to make those products run well, and it's those second developers AMD needs to engage with in a way that makes their efforts welcome.
It runs great. I run all my Steam stuff through them. Those days you mention have been long gone for quite a while.
The common denominator in the crashes you mention might possibly not be AMD? Do your friends perchance play on Windows?
AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics
The solution was adding amdgpu.ppfeaturemask=0xffff7fff to the kernel command line. Before that I could reliably crash the driver with Firefox.
The future will probably see more chiplets rather than fewer, so I wonder whether dealing with the complexity here will pay more dividends in the long run.
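For anyone wanting to try that workaround, a minimal sketch of setting the parameter on a GRUB-based distro follows; the file path, existing options, and the Fedora-style regenerate command are assumptions, so adjust for your own setup.

    # /etc/default/grub -- append the amdgpu override to the existing kernel command line
    GRUB_CMDLINE_LINUX="rhgb quiet amdgpu.ppfeaturemask=0xffff7fff"

    # regenerate the GRUB config (Fedora-style path shown; adjust for your distro)
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg

    # after a reboot, confirm the parameter is active
    cat /proc/cmdline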
A few choice examples:
> Checkout part one of this series for an intro to HipKittens and checkout this post for a technical deep dive.
> Unsurprisingly, making AMD GPUs go brr boils down to keeping the “matrix cores” (tensor cores on NVIDIA) fed.
> These two patterns tradeoff programmability and performance, where 8-wave and its large tile primitives lead to compact code and 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8×, and our HK 4-wave further outperforms by 2.3×.
And I could go on. And on.
But overall, besides the overuse of cliche/memespeak, there are places where it doesn't make sense, and the entire section that deals with the hot loop describes something that should be explained with a graph but is instead explained in 100 lines of source code.