AMD's RDNA4 GPU Architecture
Posted 4 months ago · Active 4 months ago
Source: chipsandcheese.com · Tech · story
Tone: calm, mixed · Debate: 40/100
Key topics
- AMD RDNA4 GPU Architecture
- GPU Technology
- AI Computing
The article discusses AMD's RDNA4 GPU architecture, sparking discussion on its features, power consumption, and potential applications in gaming and AI computing.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 5h after posting
Peak period: 16 comments in 12-18h
Avg / period: 5.2
Comment distribution: 47 data points (based on 47 loaded comments)
Key moments
- 01 Story posted: Sep 13, 2025 at 5:04 PM EDT (4 months ago)
- 02 First comment: Sep 13, 2025 at 9:53 PM EDT (5h after posting)
- 03 Peak activity: 16 comments in 12-18h, the hottest window of the conversation
- 04 Latest activity: Sep 16, 2025 at 7:32 AM EDT (4 months ago)
ID: 45235293 · Type: story · Last synced: 11/20/2025, 2:40:40 PM
They support FP8/BF8 with F32 accumulate and also IU4 with I32 accumulate. The max matrix size is 16x16. For comparison, NVIDIA Blackwell GB200 supports matrices up to 256x32 for FP8 and 256x96 for NVFP4.
This matters for overall throughput: feeding a bigger matrix unit is actually cheaper in terms of memory bandwidth, because the number of FLOPs grows as O(n^2) when you scale up a systolic array, while the number of inputs/outputs grows only as O(n) (see the sketch after the links below).
1. https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
2. https://semianalysis.com/2025/06/23/nvidia-tensor-core-evolu...
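A rough way to see the scaling claim above (a minimal sketch, not AMD's or NVIDIA's actual figures; the tile shapes and the 1-byte FP8 element size are assumptions): for an n x n x n tile, multiply-add count grows as 2n^3 while operand traffic grows as roughly 3n^2 elements, so the FLOPs you get per byte fetched grow linearly with the tile edge.

```c
/* Back-of-envelope arithmetic intensity of an n x n x n matrix tile.
 * Illustrative only: the tile sizes and the 1-byte-per-element (FP8)
 * assumption are not vendor-confirmed shapes. */
#include <stdio.h>

int main(void) {
    const int sizes[] = {16, 32, 64, 128, 256};
    const double bytes_per_elem = 1.0; /* assume FP8 operands */

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        double n = sizes[i];
        double flops = 2.0 * n * n * n;              /* multiply-adds        */
        double bytes = 3.0 * n * n * bytes_per_elem; /* A, B and C tiles     */
        printf("n = %3.0f: %12.0f FLOPs, %8.0f bytes, %6.1f FLOPs/byte\n",
               n, flops, bytes, flops / bytes);
    }
    return 0;
}
```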
Also, the native / atomic matrix fragment size isn't relevant for memory bandwidth, because you can always build larger matrices out of multiple fragments in the register file. A single matrix fragment is read from memory once and used in multiple matmul instructions, which has the same effect on memory bandwidth as a single larger matmul instruction.
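The reuse argument can be sketched in plain C (the 16x16 "fragment" below stands in for register-resident data; FRAG, N and the function name are made up for illustration, and the loop structure is the point, not the performance):

```c
/* Sketch of the fragment-reuse argument: an A fragment is pulled from
 * "memory" into a local buffer once (standing in for the register file)
 * and reused against every B fragment along that row of tiles, so the
 * per-fragment traffic matches what one larger matmul would need.
 * FRAG, N and the helper name are illustrative; caller zero-initializes C. */
#include <string.h>

#define FRAG 16
#define N    64   /* whole matrices are N x N, with N % FRAG == 0 */

static void matmul_tiled(const float *A, const float *B, float *C) {
    float a_frag[FRAG][FRAG];  /* stand-in for a register-resident fragment */

    for (int bi = 0; bi < N; bi += FRAG) {
        for (int bk = 0; bk < N; bk += FRAG) {
            /* Load the A fragment from "memory" exactly once... */
            for (int i = 0; i < FRAG; i++)
                memcpy(a_frag[i], &A[(bi + i) * N + bk], FRAG * sizeof(float));

            /* ...then reuse it for every B fragment along this tile row. */
            for (int bj = 0; bj < N; bj += FRAG)
                for (int i = 0; i < FRAG; i++)
                    for (int k = 0; k < FRAG; k++)
                        for (int j = 0; j < FRAG; j++)
                            C[(bi + i) * N + bj + j] +=
                                a_frag[i][k] * B[(bk + k) * N + bj + j];
        }
    }
}
```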
Don't get me wrong, this is very interesting, AMD does great engineering, and I'm loath to throw shade on an engineering-focused company, but… is this going to convert into even a single net new purchase for AMD?
I'm a relatively (to myself) large AMD shareholder (colloquially: fanboy), and damn, I'd love to see more focus on hardware matmul acceleration rather than on idle monitor power draw.
Another angle I'm wondering about is the longevity of the card. Not sure AMD would particularly care in the first place, but as a user, if the card didn't have to grind as much on the idle parts and thus lasted a year or two longer, that would be pretty valuable.
The same architecture will also be used in mobile, so depending on where this comes from (architecturally) it could mean more power savings there, too.
Besides, lower power also means lower cooling/noise on idle, and shorter cooldown times after a burst of work.
And since AMD is slowly moving toward the (perpetually next-time) unified architecture, any gains there will also mean less idle power draw in other environments, like servers.
Nothing groundbreaking, sure, but I won't say no to all of that.
I imagine you wouldn't attach a display to your home server. Would the display engine draw any power in that case?
You don't want the hardware in a high-power state during the time it's not doing anything, even when the user is actively looking at the screen.
They have started to experiment with that in Mesa and Linux ("user queues", as in "user hardware queues").
I don't know how they will work around the scarce VM IDs, but here we are talking about a near-zero driver. Obviously, they will have to simplify/clean up a lot of the 3D pipeline programming and be very sure of its robustness, basically to have it ready for "default" rendering/usage right away.
Userland will get from the kernel something along these lines: command/event hardware ring buffers, data DMA buffers, a memory page with the read/write pointers and doorbells for those ring buffers, and an event file descriptor for an event ring buffer. Basically, what the kernel itself currently has.
I wonder if it will provide some significant simplification over the current way, which is giving indirect command buffers to the kernel and dealing with 'sync objects'/barriers.
The major upside is removing the context switch on a submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland (rough sketch below).
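For what a user-queue submit could look like from userland, here is a hypothetical sketch in C. The struct layout, field names and doorbell semantics are invented for illustration and do not come from the actual amdgpu/Mesa work; the point is only that, after setup, no kernel call is needed per submission.

```c
/* Hypothetical user-mode queue submit, assuming the kernel has already
 * mapped a command ring, a read/write-pointer page and a doorbell page
 * into the process. All names are made up for illustration. */
#include <stdint.h>
#include <string.h>

struct user_queue {
    uint8_t           *ring;       /* mapped command ring buffer          */
    uint32_t           ring_size;  /* bytes, power of two                 */
    volatile uint32_t *wptr;       /* write pointer advanced by userland  */
    volatile uint32_t *rptr;       /* read pointer advanced by the GPU    */
    volatile uint32_t *doorbell;   /* mapped doorbell page/register       */
};

/* Returns 0 on success, -1 if the ring is currently full. */
static int uq_submit(struct user_queue *q, const void *pkt, uint32_t len)
{
    uint32_t wptr = *q->wptr;
    uint32_t rptr = *q->rptr;

    if (q->ring_size - (wptr - rptr) < len)
        return -1;                       /* no space yet, caller retries */

    /* Copy the packet, handling wrap-around at the end of the ring. */
    uint32_t off  = wptr % q->ring_size;
    uint32_t tail = q->ring_size - off;
    if (len <= tail) {
        memcpy(q->ring + off, pkt, len);
    } else {
        memcpy(q->ring + off, pkt, tail);
        memcpy(q->ring, (const uint8_t *)pkt + tail, len - tail);
    }

    /* Publish the new write pointer, then ring the doorbell to kick the GPU. */
    __atomic_store_n(q->wptr, wptr + len, __ATOMIC_RELEASE);
    __atomic_store_n(q->doorbell, wptr + len, __ATOMIC_RELEASE);
    return 0;
}
```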
The tricky part is the Vulkan "fences", namely the GPU-to-CPU notifications. Probably hardware interrupts, which will have to be forwarded by the kernel to userland through an event ring buffer (probably a specific event file descriptor). There are alternatives, though: userland could poll/spin on some CPU-mapped device memory content for the notification, or we could go one "expensive" step further, which would "efficiently" remove the kernel for good here but would lock a CPU core (should be fine nowadays with our many-core CPUs): something along the lines of a MONITOR machine instruction, where a CPU core basically halts until some memory content is written, with the possibility for another CPU core to un-halt it (so spurious un-halting is expected). There's a rough sketch of the polling/eventfd options after this exchange.
Does nvidia handle their GPU to CPU notifications without the kernel too?
Well, polling? Erk... I guess an event file descriptor is in order, and that Nvidia is doing the same.
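A hypothetical sketch of the two notification paths discussed in this sub-thread: spin briefly on a CPU-visible fence value, then fall back to blocking on an event file descriptor that the kernel would signal from the GPU interrupt. All names are made up; the eventfd/poll semantics are the standard Linux ones, not a specific driver interface.

```c
/* Wait for a GPU fence: short userland spin on mapped memory, then block
 * on an event fd the kernel signals from the interrupt handler.
 * Field and variable names are invented for illustration. */
#include <poll.h>
#include <stdint.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <immintrin.h>
#endif

/* Spin briefly on the CPU-mapped fence value before giving up. */
static int fence_spin(volatile const uint64_t *fence, uint64_t wanted,
                      unsigned spins)
{
    while (spins--) {
        if (__atomic_load_n(fence, __ATOMIC_ACQUIRE) >= wanted)
            return 1;
#if defined(__x86_64__)
        _mm_pause();            /* be polite to the sibling hardware thread */
#endif
    }
    return 0;
}

/* Fall back to blocking until the kernel signals the event fd. */
static void fence_wait(volatile const uint64_t *fence, uint64_t wanted,
                       int event_fd)
{
    while (!fence_spin(fence, wanted, 1024)) {
        struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) > 0) {
            uint64_t drained;
            ssize_t r = read(event_fd, &drained, sizeof(drained));
            (void)r;            /* drain the eventfd counter */
        }
    }
}
```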
Your monitor configuration has always controlled idle power of a GPU (for about the past 15 years), and you need to be aware of what is "too much" for your GPU.
On RDNA4 and the 50 series, anything more than the equivalent of a single 4K 120Hz display kicks the card out of super-idle, and it sits at around ~75W.
Hm, do they? I don't think any stationary PC I've had in the past 15 years has idled that low. They have all had modest(ish) specs, and the setups were tuned for balanced power consumption rather than performance. My current one idles at 50-55W. There's a Ryzen 5 5600G and an Nvidia GTX 1650 in there. The rest of the components are unassuming in terms of wattage: a single NVMe SSD, a single 120mm fan running at half RPM, and 16 GiB of RAM (of course without RGB LED nonsense).
So, I assume it's Nvidia incompetence. It's my first and last Nvidia card in years; AMD treats users better.
Shame you don't have it around anymore, because I'd say set your desktop to like, native res, 60hz, only have one monitor installed, 8 bit SDR not 10 bit SDR, and see if the power usage goes away.
Like, on a 9800X3D w/ a 7900XTX with my 4th monitor unplugged (to get under the maximum super-idle load for the GPU), I'm sub-50W idle.
There are settings in the BIOS that can affect this, and sometimes certain manufacturers (Asus, mainly) screw with them because they're morons, so I wonder if you might be affected by it.
Depending on the generation, either the equivalent of 4K 60Hz or 4K 120Hz kicks them out of idle. For my card, an RDNA3, it's 4K 60Hz; for RDNA4 it's 4K 120Hz.
So, I have 4 1080p monitors... if they're all at 60Hz, it properly idles. If my primary one is at 120Hz, it crosses that line and sits at the lowest active clock rate, so total system wattage is ~75W instead of ~50W.
I almost considered having my 9800X3D's iGPU service the trio of 60Hz monitors, but it's like an extra 10-15W to enable the media engine and run enough shaders for WDDM... and where my 9800X3D can reach (but not cross) 80C on particularly well-optimized code under an NH-D15 G2 LBC, I don't really want to throw another 15W at that.
In general this generation of Radeon GPUs seems highly efficient. Radeon 9070 is a beast of a GPU.
It doesn't make sense that it would draw this much power. A laptop can do the same thing with ~10W.
This sort of improvement might not increase sales with one generation, but it'll make a difference if they keep focusing on this year after year. It also makes their design easier to translate into mobile platforms.
And even if it were more expensive, stuff like cooking or washing clothes would still hurt more than downloading a file with a big PC.
Expect energy costs to also go up in the USA, with the administration pushing to replace renewable energy with fossil fuels. Fossil fuels never go down in price; they may seem to go up and down, but over time they always increase.
As we keep adding more and more computers to the grid it will require more and more energy.
Second-hand computers with this energy efficiency will benefit the poor and countries where energy is still a costly commodity. I don't mind paying the initial cost.
These two statements appear to be in conflict. 150W is a high idle power consumption for a modern PC. Unless you have something like an internal RAID or have reached for the bottom of the barrel when choosing a power supply, 40W is on the high side for idle power consumption and many exist that will actually do ~10W.
There's also simply laptop longevity that would be nice.
[1] https://thinkcomputers.org/here-is-the-solution-for-high-idl...
I am optimistic: April 2026 brings a new Ubuntu LTS + ROCm 7. I guess technically I don't know if they even plan to support this, but I am hoping they do.
This might be game-changing for me; but until then, 30-50% load on my GPUs on Vulkan.