AMD Claims Arm ISA Doesn't Offer Efficiency Advantage Over x86
Source: techpowerup.com
Key topics: CPU Architecture, Instruction Set Architecture, Power Efficiency
AMD claims that Arm ISA doesn't offer an efficiency advantage over x86, sparking a debate among commenters about the role of ISA in determining CPU efficiency and the relative merits of different architectures.
Snapshot generated from the HN discussion (ID: 45168854), based on 160 loaded comments. Story posted Sep 8, 2025 at 10:36 AM EDT; first comment Sep 8, 2025 at 11:12 AM EDT; latest activity Sep 11, 2025 at 10:48 AM EDT.
In the past x86 didn't dominate in low power because Intel had the resources to care but never did, and AMD never had the resources to try. Other companies stepped in to fill that niche, and had to use other ISAs. (If they could have used x86 legally, they might well have done so. Oops?) That may well be changing. Or perhaps AMD will let x86 fade away.
Remember Atom tablets (and how they sucked)?
Care to elaborate? I had the 9" mini laptop kind of device based on Atom and don't remember Atom being the issue.
However, what I meant is Atom-based Android tablets. At about the same time as the netbook craze (late 2000s to early 2010s) there was a non-negligible number of Android tablets, and a noticeable fraction of them was not ARM- but Atom-based. (The x86 target in the Android SDK wasn’t only there to support emulators, originally.) Yet that stopped pretty quickly, and my impression is that that happened because, while Intel would certainly have liked to hitch itself to the Android train, they just couldn’t get Atoms fast enough at equivalent power levels (either at all or quickly enough). Could have been something else, e.g. perhaps they didn’t have the expertise to build SoCs with radios?
Either way, it’s not that Intel didn’t want to get into consumer mobile devices, it’s that they tried and did not succeed.
IMHO, If Intel had done another year or two of trying, it probably would have worked, but they gave up. They also canceled x86 for phone like the day before the Windows Mobile Continuum demo, which would have been a potentially much more compelling product with x86, especially if Microsoft allowed running win32 apps (which they probably wouldn't, but the potential would be interesting)
Atom wasn't about power efficiency or performance, it was about cost optimization.
I have a ten-year-old Lenovo Yoga Tab 2 8" Windows tablet, which I still use at least once every week. It is still useful. Who can say that they are still using a ten-year-old Android tablet?
(Also probably because it is a tablet, so it has reasonably fast storage instead of the HDDs that notebooks of that era had.)
DoD originally required all products to be sourced by at least three companies to prevent supply chain issues. This required Intel to allow AMD and VIA to produce products based on the ISA.
For me this is a good indicator of whether someone who talks about national security knows what they are talking about or is just spewing bullshit and playing national security theatre.
Afaik the DoD wasn't the reason behind the original AMD second-source license; it was IBM forcing Intel's hand on the chips that went into the first PC.
https://web.archive.org/web/20210622080634/https://www.anand...
Basically the gist of it is that the difference between ARM/x86 mostly boils down to instruction decode, and:
- Most instructions end up being simple load/store/conditional branch etc. on both architectures, where there's literally no difference in encoding efficiency
- Variable-length instruction decoding has been figured out well enough on x86 that it's no longer a bottleneck
Also, my personal addendum is that today's Intel efficiency cores have more transistors and better perf than the big Intel cores of a decade ago
M1 - 8 wide
M4 - 10 wide
Zen 4 - 4 wide
Zen 5 - 8 wide
https://chipsandcheese.com/i/149874010/frontend
That was pretty much the uArch/design mantra at intel.
Also, more complicated decoding and extra caches means longer pipeline, which means more price to pay when a branch is mispredicted (binary search is a festival of branch misprediction for example, and I got 3x acceleration of linear search on small arrays when I switched to the branchless algorithm).
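To make the branchless idea concrete, here is a minimal sketch of the two search shapes being compared. It is written in Python only to show the logic: the interpreter itself branches, so it will not reproduce the 3x figure, which would need a compiled language. The function names are made up for illustration.

```python
def binary_search_lower_bound(sorted_values, key):
    """Classic binary search: every iteration takes a data-dependent branch."""
    lo, hi = 0, len(sorted_values)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] < key:   # hard-to-predict branch
            lo = mid + 1
        else:
            hi = mid
    return lo


def branchless_linear_lower_bound(sorted_values, key):
    """Linear scan with no data-dependent branch: add up comparison results.

    In a compiled language the loop body can become compare + add, leaving
    only the well-predicted loop-counter branch.
    """
    index = 0
    for value in sorted_values:
        index += value < key           # a bool counts as 0 or 1
    return index
```

Both functions return the index of the first element that is not less than `key`, so they are interchangeable on a sorted array; the difference is only in how predictable the control flow is.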
Also I am not a CPU designer, but branch prediction with wide decoder also must be a pain - imagine that while you are loading 16 or 32 bytes from instruction cache, you need to predict the address of next loaded chunk in the same cycle, before you even see what you got from cache.
As for encoding efficiency, I played with little algorithms (like binary search or a slab allocator) on godbolt, and RISC-V with compressed instructions generates a similar amount of code to x86, and in rare cases even slightly less. So x86 has complex decoding that doesn't give any noticeable advantages.
x86 also has flags, which add implicit dependencies between instructions, and must make designer's life harder.
AMD isn’t saying that decoding x86 is easy. They are just saying that decoding x86 doesn’t have a notable power impact.
Well, obviously because there aren't 100 individual parallel execution units to which those instructions could be issued. And lower down the stack because a 3000 bit[1] wide cache would be extremely difficult to manage. An instruction fetch would be six (!) cache lines wide, causing clear latency and bottleneck problems (or conversely would demand your icache be 6x wider, causing locality/granularity problems as many leaf functions are smaller than that).
But also because real world code just isn't that parallel. Even assuming perfect branch prediction the number of instructions between unpredictable things like function pointer calls or computed jumps is much less than 100 in most performance-sensitive algorithms.
And even if you could, the circuit complexity of decoding variable length instructions is superlinear. In x86, every byte can be an instruction boundary, but most aren't, and your decoder needs to be able to handle that.
[1] I have in my head somewhere that "the average x86_64 instruction is 3.75 bytes long", but that may be off by a bit. Somewhere around that range, anyway.
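The "3000 bit" and "six cache lines" figures follow directly from that average length. A quick check of the arithmetic, assuming 64-byte cache lines (my assumption; the comment does not state the line size):

```python
import math

AVG_X86_INSN_BYTES = 3.75   # the commenter's remembered average, may be off
INSNS_PER_FETCH = 100       # the hypothetical 100-wide front end
CACHE_LINE_BYTES = 64       # assumption: typical x86 cache line size

fetch_bytes = AVG_X86_INSN_BYTES * INSNS_PER_FETCH
fetch_bits = fetch_bytes * 8
cache_lines = math.ceil(fetch_bytes / CACHE_LINE_BYTES)

print(f"{fetch_bytes:.0f} bytes = {fetch_bits:.0f} bits per fetch")
print(f"spans {cache_lines} cache lines of {CACHE_LINE_BYTES} bytes")
```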
SMT is about addressing the underutilization of execution resources where your 6-wide superscalar processor gets 2.0 ILP.
See eg https://my.eng.utah.edu/~cs6810/pres/6810-09.pdf
But I could be way off…
First of all. In SMT there is only one instruction decoder. SMT merely adds a second set of registers, which is why it is considered a "free lunch". The cost is small in comparison to the theoretical benefit (up to 2x performance).
Secondly. The effectiveness of SMT is workload dependent, which is a property of the software and not the hardware.
If you have a properly optimized workload that makes use of the execution units, e.g. a video game or simulation, the benefit is not that big or even negative, because you are already keeping the execution units busy and two threads end up sharing limited resources. Meanwhile if you have a web server written in python, then SMT is basically doubling your performance.
So, it is in fact the opposite. For SMT to be effective, the instruction decoder has to be faster than your execution units, because there are a lot of instructions that don't even touch them.
(But then again, do the AMD e-cores have uop caches?)
So one of the projects I've been working on and off again is the World's Worst x86 Decoder, which takes a principled approach to x86 decoding by throwing out most of the manual and instead reverse-engineering semantics based on running the instructions themselves to figure out what they do. It's still far from finished, but I've gotten it to the point that I can spit out decoder rules.
As a result, I feel pretty confident in saying that x86 decoding isn't that insane. For example, here's the bitset for the first two opcode maps on whether or not opcodes have a ModR/M operand: Mod
I haven't done a k-map on that, but... you can see that a boolean circuit isn't that complicated. Also, it turns out that this isn't dependent on presence or absence of any prefixes. While I'm not a hardware designer, my gut says that you can probably do x86 instruction length-decoding in one cycle, which means the main limitation on the parallelism in the decoder is how wide you can build those muxes (which, to be fair, does have a cost).
That said, there is one instruction where I want to go back in time and beat up the x86 ISA designers. f6/0, f6/1, f7/0, and f7/1 [1] take in an extra immediate operand whereas f6/2 et al. do not. It's the sole case in the entire ISA where this happens.
[1] My notation for when x86 does its trick of using one of the register selector fields as extra bits for opcodes.
That's some very faint praise there. Especially when you're trying to chop up several instructions every cycle. Meanwhile RISC-V is "count leading 1s. 0-1:16bit 2-4:32bit 5:48bit 6:64bit"
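For comparison, the RISC-V rule quoted above fits in a few lines. This is a sketch following the commenter's summary, not a full decoder: the 1 bits are counted from the least significant bit of the first 16-bit parcel, the function name is made up, and the status of the 48-bit and 64-bit formats in the current specification is something I have not verified here.

```python
def riscv_insn_length_bytes(first_parcel):
    """Return the instruction length in bytes from its first 16-bit parcel.

    Counts consecutive 1 bits starting at the least significant bit:
    0-1 ones -> 16-bit, 2-4 -> 32-bit, 5 -> 48-bit, 6 -> 64-bit.
    """
    ones = 0
    while ones < 16 and (first_parcel >> ones) & 1:
        ones += 1
    if ones <= 1:
        return 2
    if ones <= 4:
        return 4
    if ones == 5:
        return 6
    if ones == 6:
        return 8
    raise ValueError("longer or reserved encoding, not handled in this sketch")
```

The point of the comparison is that only the first parcel is needed, whereas an x86 length decoder has to account for prefixes, opcode maps, ModR/M, SIB, displacement and immediates.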
> e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8
With compressed instructions the theoretical maximum is 16.
> so they have to deal with the problem too.
Luckily you can determine the length from first bits of an instruction, and you can have either 2 bytes left from previous line, or 0.
It still causes issues.
> Since x86 instructions can be as small as one byte, in principle the throughput-per-cache-line can be higher on x86 than on RISC-V (e.g. a single 32-byte x86 cache line could have up to 32 instructions where the original RISC-V ISA might only have 8).
RISC-V has better code density. The handful of one byte instructions don't make up for other longer instructions.
> And in any case, there are RISC-V extensions that allow variable-length instructions now, so they have to deal with the problem too.
Now? Have to deal with the problem too?
It feels like you didn't read my previous post. I was explaining how it's much much simpler to decode length. And the variable length has been there since the original version.
That's been my understanding as well. X86 style length decoding is about one pipeline stage if done dynamically.
The simpler riscv length decoding ends up being about a half pipeline stage on the wider decoders.
The P6 is arguably the most important x86 microarch ever, it put Intel on top over the RISC workstations.
What was your favorite subsystem in the P6 arch?
Was it designed in Verilog? What languages and tools were used to design P6 and the PPro?
Tooling & Languages: IHDL, a templating layer on top of HDL that had a preprocessor for intel-specific macros. DART test template generator for validation coverage vectors. The entire system was stitched together with PERL, TCL, and shellscripts, and it all ran on three OSes: AIX, HPUX and SunOS. (I had a B&W sparcstation and was jealous of the 8514/a 1024x768 monitors on AIX.) We didn't go full Linux until Itanic and by then we were using remote computing via Exceed and gave up our workstations for generic PCs. When I left in the mid 2000's, not much had changed in the glue/automation languages, except a little less Tcl. I'm blanking on the specific formal verification tool, I think it was something by Cadence. Synthesis and timing was ... design compiler and primetime? Man. Cobwebs. When I left we were 100% Cadence and Synopsys and Verilog (minus a few custom analog tools based on SPICE for creating our SSAs). That migration happened during Bonnell, but gahd it was painful. Especially migrating all the Pentium/486/386/286/8088 test vectors.
I have no idea what it is like ~20 years later (gasp), but I bet the test vectors live on, like Henrietta Lacks' cells. I'd be interested to hear from any Intelfolk reading this?
They don't anymore; they have uop caches. But trace caches are great, and Apple uses them [1].
They allow you to collapse taken branches into a single fetch.
Which is extremely important, because the average instructions/taken-branch is about 10-15 [2]. With a 10 wide frontend, every second fetch would only be half utilized or worse.
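A toy model of that utilization claim, under simplifying assumptions of my own: every run of sequential instructions has the same length, and a fetch group cannot continue past a taken branch (that is, no trace cache).

```python
import math

def fetch_utilization(insns_per_taken_branch, fetch_width):
    """Fraction of fetch slots filled when each fetch stops at a taken branch."""
    fetches = math.ceil(insns_per_taken_branch / fetch_width)
    return insns_per_taken_branch / (fetches * fetch_width)

for run in (10, 12, 15):
    print(f"{run} insns per taken branch: {fetch_utilization(run, 10):.0%} of slots used")
```

With 15 instructions per taken branch and a 10-wide front end, the first fetch is full and the second is half empty, so three quarters of the slots are used overall; a trace cache avoids this by letting one fetch cross the taken branch.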
> extra caches
This is one thing I don't understand, why not replace the L1I with the uop-cache entirely?
I quite like what Ventana does with the Veyron V2/V3. [3,4] They replaced the L1I with a macro-op trace cache, which can collapse taken branches, do basic instruction fusion and more advanced fusion for hot code paths.
[1] https://www.realworldtech.com/forum/?threadid=223220
[2] https://lists.riscv.org/g/tech-profiles/attachment/353/0/RIS... (page 10)
[3] https://www.ventanamicro.com/technology/risc-v-cpu-ip/
[4] https://youtu.be/EWgOVIvsZt8
Fortunately flags (or even individual flag bits) can be renamed just like other registers, removing that bottleneck. And some architectures that use flag registers, like aarch64, have additional arithmetic instructions which don't update the flag register.
Using flag registers brings benefits as well. E.g. conditional jump distances can be much larger (e.g. 1 MB in aarch64 vs. 4K in RISC-V).
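Those two reach figures fall out of the immediate widths in the branch encodings. A small sketch of the arithmetic (the helper name is made up):

```python
def signed_branch_reach_bytes(imm_bits, scale_bytes):
    """Reach in each direction of a signed immediate with imm_bits bits,
    where each unit of the immediate is scale_bytes bytes."""
    return (1 << (imm_bits - 1)) * scale_bytes

# AArch64 B.cond: 19-bit signed immediate, in units of 4-byte instructions.
aarch64_bcond = signed_branch_reach_bytes(19, 4)
# RISC-V BEQ/BNE/...: 12-bit signed immediate, in units of 2 bytes.
riscv_branch = signed_branch_reach_bytes(12, 2)

print(aarch64_bcond // 1024, "KiB vs", riscv_branch // 1024, "KiB")
```

The RISC-V conditional branch spends encoding bits on two register operands for the comparison, which is where the bits for a longer offset go.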
To be fair, a lot of modern ARM cores also have uop caches. There's a lot to decode even without the variable length component, to the point that keeping a cache of uops and temporarily turning pieces of the IFU off can be a win.
Yeah, you [ideally] want to predict the existence of taken branches or jumps in a cache line! Otherwise you have cycles where you're inserting bubbles into the pipeline (if you aren't correctly predicting that the next-fetched line is just the next sequential one ..)
The P4 microarch had trace caches, but I believe that approach has since been avoided. What practically all contemporary x86 processors do have, though, is u-op caches, which contain decoded micro-ops. Note this is not the same as a trace cache.
For that matter, many ARM cores also have u-op caches, so it's not something that is uniquely useful only on x86. The Apple M* cores AFAIU do not have u-op caches, FWIW.
Personally I do not entirely buy it. Intel and AMD have had plenty of years to catch up to Apple's M-architecture and they still aren't able to touch it in efficiency. The PC Snapdragon chips AFAIK also offer better performance-per-watt than AMD or Intel, with laptops offering them often having 10-30% longer battery life at similar performance.
The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt, offering 320W of RTX 3090 performance in a 110W envelope: https://images.macrumors.com/t/xuN87vnxzdp_FJWcAwqFhl4IOXs=/...
But Apple cannot beat Intel/AMD in single-thread performance. (Apple marketing works very hard to convince people otherwise, but don't fall for it.) Apple gets very, very close, but they just don't get there. (As well, you might say they get close enough for practical matters; that might be true, but it's not the question here.)
That gap, however small it might be for the end user, is absolutely massive on the chip design level. x86 chips are tuned from the doping profiles of the silicon all the way through to their heatsinks to be single-thread fast. That last 1%? 2%? 5%? of performance is expensive, and is far far far past the point of diminishing returns in terms of efficiency cost paid. That last 20% of performance burns 80% of the power. Apple has chosen not to do things this way.
So x86 chips are not particularly well tuned to be efficient. They never have been; it's, on some level, a cultural problem. Could they be? Of course! But then the customers who want what x86 is right now would be sad. There are a lot of customers who like the current models, from hyperscalers to gamers. But they're increasingly bad fits for modern "personal computing", a use case which Apple owns. So why not have two models? When I said "doping profiles of the silicon" above, that wasn't hyperbole, that's literally true. It is a big deal to maintain a max-performance design and a max-efficiency design. They might have the same RTL but everything else will be different. Intel at their peak could have done it (but was too hubristic to try); no one else manufacturing x86 has had the resources. (You'll note that all non-Apple ARM vendor chips are pure efficiency designs, and don't even get close to Apple or Intel/AMD. This is not an accident. They don't have the resources to really optimize for either one of these goals. It is hard to do.)
Thus, the current situation: Apple has a max-efficiency design that's excellent for personal computing. Intel/AMD have aging max-performance designs that do beat Apple at absolute peak... which looks less and less like the right choice with every passing month. Will they continue on that path? Who knows! But many of their customers have historically liked this choice. And everyone else... isn't great at either.
Can you explain then, how come switching from Intel MBP to Apple Silicon MBP feels like literally everything is 3x faster, the laptop barely heats up at peak load, and you never hear the fans? Going back to my Intel MBP is like going back to stone age computing.
In other words if Intel is so good, why is it... so bad? I genuinely don't understand. Keep in mind though, I'm not comparing an Intel gaming computer to a laptop, let's compare oranges to oranges.
"let's compare oranges to oranges"
That's impossible because Apple has bought up most of TSMC's 3nm production capacity. You could try to approximate by comparing Apple M4 Max against NVIDIA B300 but that'll be a very one-sided win for NVIDIA.
Have you not heard that Intel's Lunar Lake is made on the same TSMC 3nm process as Apple's M3? It's not at all "impossible" to make a fair and relevant comparison here.
Is it possible that your workloads are bound by something other than single-threaded compute performance? Memory? Drive speed?
Is it possible that Apple did a better job tuning their OS for their hardware, than for Intel’s?
Pretty much everything else about the M-series parts is better. In particular, Apple's uncore is amazing (partly because it's a lot newer design) and you really notice that in terms of power management.
My understanding of it is that Apple Silicon's very very long instruction pipeline plays well with how the software stack in MacOS is written and compiled first and foremost.
Similarly, the same applications often take less RAM on MacOS than even on Linux, because things like garbage collection are better integrated at the OS level.
It's literally one of the main advantages of Apple's M chips over Intel/AMD. When the M chip came out, it was the only chip that managed to consume ~100GB/s of memory bandwidth with just a single thread.
https://web.archive.org/web/20240902200818/https://www.anand...
> From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself.
It does seem like for at least the last 3-5 years it's been pretty clear that Intel x86 was optimizing for the wrong target / a shrinking market.
HPC increasingly doesn't care about single core/thread performance and is increasingly GPU centric.
Anything that cares about efficiency/heat (basically all consumer now - mobile, tablet, laptop, even small desktop) has gone ARM/RISC.
Datacenter market is increasingly run by hyperscalers doing their own chip designs or using AMD for cost reasons.
I dunno. I sort of like all the vector extensions we’ve gotten on the CPU side as they chase that dream. But I do wonder if Intel would have been better off just monomaniacally focusing on single-threaded performance, with the expectation that their chips should double down on their strength, rather than trying to attack where Nvidia is strong.
Letting you limit just how much of that extra 20% power hogging perf you want.
In the real world we have our computers running JIT'ed JS, Java or similar code taking up our cpu time, tons of small branches (mostly taken the same way and easily remembered by the branch predictor) and scattering reads/writes all over memory.
Transistors not spent on larger branch prediction caches or L1 caches are badly spent. It doesn't matter if the CPU can issue a few more instructions per clock to ace a benchmark if it's waiting for branch mispredictions or cache misses most of the time.
It's no coincidence that the Apple teams iirc are partly the same people that built Pentium-M (which began the Core era by delivering very good perf on mobile chips when P4 was supposed to be the flagship).
Saying it offers a certain wattage worth of the desktop part means even less because it measures essentially nothing.
You would probably want to compare it to a mobile 3050 or 4050 although this still risks being a description of the different nodes more so than the actual parts.
It's like comparing an F150 with a Ferrari, a decision that no buyer needs to make.
... maybe a Prius? Bruh.
Do not conflate battery life with core efficiency. If you want to measure how efficient a CPU core is you do so under full load. The latest AMD under full load uses the same power as M1 and is faster, thus it has better performance per watt. Snapdragon Elite eats 50W under load, significantly worse than AMD. Yet both M1 and Snapdragon beat AMD on battery life tests, because battery life is mainly measured using activities where the CPU is idle the vast majority of the time. And of course the ISA is entirely irrelevant when the CPU isn't being used to begin with.
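A toy model of why the two measurements can disagree. Every number below is hypothetical and chosen only to illustrate the shape of the argument; none of it is measured data for any real chip.

```python
def battery_hours(battery_wh, perf, load_power_w, idle_power_w, demand):
    """Hours of battery for a workload needing `demand` units of work per second.

    `perf` is work per second at full load; the core is busy demand/perf of
    the time and idle otherwise.
    """
    busy = min(1.0, demand / perf)
    avg_power_w = busy * load_power_w + (1.0 - busy) * idle_power_w
    return battery_wh / avg_power_w

# Hypothetical chip A: faster at the same load power, but higher idle power.
# Hypothetical chip B: slower under load, but lower idle power.
perf_per_watt_a = 1.2 / 30.0
perf_per_watt_b = 1.0 / 30.0
hours_a = battery_hours(60.0, perf=1.2, load_power_w=30.0, idle_power_w=4.0, demand=0.05)
hours_b = battery_hours(60.0, perf=1.0, load_power_w=30.0, idle_power_w=2.0, demand=0.05)

print(f"A: {perf_per_watt_a:.3f} perf/W under load, {hours_a:.1f} h on a light workload")
print(f"B: {perf_per_watt_b:.3f} perf/W under load, {hours_b:.1f} h on a light workload")
```

In this made-up case chip A wins on performance per watt under full load while chip B lasts longer on battery, because the light workload is dominated by idle power rather than by the cores.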
> The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt, offering 320W of RTX 3090 performance in a 110W envelope
That chart is Apple propaganda. In Geekbench 5 the RTX 3090 is 2.5x faster, in blender 3.1 it is 5x faster. See https://9to5mac.com/2022/03/31/m1-ultra-gpu-comparison-with-... and https://techjourneyman.com/blog/m1-ultra-vs-nvidia-rtx-3090/
So, yes, if you want to look good on pointless benchmarks, an M1 Ultra run for 1 minute is more efficient than a downclocked 3090.
Look at that updated graph which has less BS. It's never close in perf/watt.
The BS part about Apple's graph was that they cut the graph short for the Nvidia card (and bent the graph a bit at the end). The full graph still shows Apple being way better per watt.
He also said per watt. An AMD CPU running at full power and then stopping will use less battery than an M1 with the same task; that's comparing power efficiency.
I've tried a Ryzen 7 that had a similar efficiency to an M1 according to some tests, and that thing ran hot like crazy. It's just marketing bullshit to me now.
I recently had to remove Windows completely from a few-years-old laptop with a 12th gen CPU and an Intel Iris / GeForce RTX 3060 Mobile combo because it was running very hot (90°C+) and the fans were constantly running. Running Linux, I have no issues. I just double-checked, since I had not for several months, and the temperature is 40°C lower on my lap than it was propped up on a book for maximum airflow. Full disclosure: I would have done this anyway, but the process was sped up because my wife was extremely annoyed with the noise my new-to-me computer was making, and it was cooking the components.
I have learned to start with the OS when things are tangibly off, and only eventually come back to point the finger at my hardware.
Though it has definitely been getting better in the last 1.5 years using Asahi Linux and in some areas it is a better experience than most laptops running Linux (sound, cameras, etc.). The developers even wrote a full fledged physical speaker simulator just so it could be efficiently driven over its "naive" limits.
My guess is ARM64 is a few percent more efficient, something AMD has claimed in the past. They're now saying it would be identical, which is probably not far from the truth.
The simple fact of the matter is that the ISA is only a small part of how long your battery lasts. If you're gaming or rendering or compiling it's going to matter a lot, and Apple battery life is pretty comparable in these scenarios. If you're reading, writing, browsing or watching then your cores are going to be mostly idle, so the only thing the ISA influences won't even have a current running through it.
In fact it’s because that manufacturer has made architectural choices that are not inherent to the x86-64 ISA.
And that’s just hardware. MacOS gets roughly 30% better battery life on M series hardware than Asahi Linux. I’m not blaming the Asahi team, they do amazing work, they don’t even work on many of the Linux features relevant to power management, and Apple has had years of head start on preparing for and optimising for the M architecture. It’s just that software matters, a lot.
So if I’m reading this right, ISA can make a difference, but it’s incremental compared to the many architectural decisions and trade offs that go into a particular design.
This is true, but only in the sense that it is very rarely correct to say “Factor Y can’t possibly make a difference.”
Performance and battery life are lived experiences. There’s probably some theoretical hyper optimization where 6502 ISA is just as good as ARM, but does it matter?
Higher wattage gives diminishing returns. Chips will run higher wattage under full load just to eke out a marginal improvement in performance. Therefore efficiency improves if the manufacturer chooses to limit the chip rather than pushing it harder.
Test efficiency using whatever task the chip will be used for. For most ultralight laptops, that will be web browsing etc. so the m1 MacBook/snapdragon results are valid for typical users. Maybe your workload hammers the CPU but that doesn't make it the one true benchmark.
To further hammer the point home, let me do a reductio ad absurdum: The chip is still "in use" when it's asleep. Sleeping your laptop is a typical use case. Therefore how much power is used while sleeping is a good measure of ISA efficiency. This is of course absurd because the CPU cores are entirely turned off when sleeping; they could draw 1kW with potato performance and nothing in this measurement would change.
To go back to your original argument, you're claiming that x86 ISA is more efficient than ARM because a certain AMD chip beat certain M1/snapdragon chips at 50w. You can't draw that conclusion because the two chips may be designed to have peak efficiency at different power levels, even if the maximum power draw is the same. Likely the Snapdragon/M1 have better efficiency at 10W with reduced clock speed even if the CPU is not idling.
Hence my response: it doesn't make sense to talk about performance per watt, without also specifying the workload. Not only will different workloads use different amounts of power, they will also rely on different instructions which may give different ISAs or CPUs the edge.
Not to mention -- who even cares about ISA efficiency? What matters is the result for the product I can buy. If M1/snapdragon are able to match AMD on performance but beat it in battery life for my workloads, I don't care if AMD has better "performance per watt" according to your metric.
Dynamic clock speeds are exactly why you need to do this testing under full load. No you're probably not getting the same power draw on each chip, but at least you're eliminating the frequency scaling algorithms from the equation. This is why it's so hard to evaluate core efficiency (let alone what effect the ISA has).
What you can do however is compare chips with wildly different performance and power characteristics. A chip that is using significantly more power to do significantly less work is less efficient than a chip using less power to do more.
> To go back to your original argument, you're claiming that x86 ISA is more efficient than ARM because a certain AMD chip beat certain M1/snapdragon chips at 50w.
I never claimed x86 is more efficient than ARM. I did claim the latest AMD cores are more efficient than M1 and snapdragon elite X, while still having worse battery life.
> You can't draw that conclusion because the two chips may be designed to have peak efficiency at different power levels, even if the maximum power draw is the same. Likely the Snapdragon/M1 have better efficiency at 10W with reduced clock speed even if the CPU is not idling.
Firstly, this is pretty silly: all chips get more efficient at lower clock speeds. Maximum efficiency is at whatever the lowest clock speed you can go is, which is primarily determined by how the chip is manufactured. Which brings me to my second point: what does any of this have to do with the ISA?
> Not to mention -- who even cares about ISA efficiency? What matters is the result for the product I can buy.
If you don't care about ISA efficiency why are you here‽ That's what this discussion is about! If you don't care about this just leave.
And to answer your question: We care about ISA efficiency because we care about the efficiency of our chips. If ARM was twice as efficient then we should be pushing to kill x86, backwards compatibility be damned. In actuality the reason ARM-based laptops have better efficiency has little or nothing to do with the ISA, so instead of asking AMD/Intel to release ARM-based chips we should be pushing them to optimize battery usage.
"The same goes for GPUs, where Apple's M1 GPU completely smoked an RTX3090 in performance-per-watt"
Gamers are not interested in performance-per-watt but fps-per-$.
If some behavior looks strange to you, most probably you don't understand the underlying drivers.
I game a decent amount on handheld mode for the switch. Like tens of millions of others.
The demographics aren't the same, nor the games.
- M-series chips have closely integrated RAM right next to the CPU, while AMD makes do with standard DDR5 far away from the CPU, which leads to a huge latency increase
- I wouldn't be surprised if Apple CPUs (which have a mobile legacy) are much more efficient/faster at 'bursty' workloads - waking up, doing some work and going back to sleep
- M series chips are often designed for a lower clock frequency, and power consumption increases quadratically (due to capacitive charge/discharge losses on FETs). Here's a diagram that shows this on a GPU:
https://imgur.com/xcVJl1h
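As a rough illustration of the last bullet, the textbook switching-power model is P = C * V^2 * f: quadratic in voltage and linear in frequency. The sketch below uses made-up values and assumes, as a simplification of my own, that voltage has to rise in proportion to clock speed.

```python
def dynamic_power_w(c_eff_farads, voltage_v, freq_hz):
    """Textbook switching power: P = C * V^2 * f."""
    return c_eff_farads * voltage_v ** 2 * freq_hz

C_EFF = 1e-9  # hypothetical effective switched capacitance, in farads

low = dynamic_power_w(C_EFF, voltage_v=0.8, freq_hz=3.0e9)
high = dynamic_power_w(C_EFF, voltage_v=1.2, freq_hz=4.5e9)

print(f"clock ratio: {4.5e9 / 3.0e9:.2f}x, power ratio: {high / low:.2f}x")
```

Under that assumption a 1.5x higher clock costs roughly 3.4x the switching power, which is why a design targeting lower frequencies can look much more efficient regardless of ISA.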
So while it's entirely possible that AArch64 is more efficient (the decode HW is simpler most likely, and encoding efficiency seems identical):
https://portal.mozz.us/gemini/arcanesciences.com/gemlog/22-0...?
It's hard to tell how much that contributes to the end result.
2/3rds the speed of light must be very slow over there
I think since on mobile CPUs, the RAM sits right on top of the SoC, very likely the CPUs are designed with a low RAM latency in mind.
I'm curious how that is; in practice it "feels" like my 9950x is much more efficient at "move tons of RAM" tasks like a duckDB workload than an M4.
But then again a 9950x has other advantages going on like AVX512 I guess?
You can get 8-RAM-channel motherboards and CPUs and have 400 GB/s of DDR5 too, but you pay a price for the modularity and capacity over it all being integrated and soldered. DIMMs will also have worse latency than soldered chips and have a max clock speed penalty due to signal degradation at the copper contacts. A Threadripper Pro 9955WX is $1649, a WRX90 motherboard is around $1200, and 8x16GB sticks of DDR5 RDIMMs are around $1200, $2300 for 8x32GB, $3700 for 8x64GB sticks, $6000 for 8x96GB.
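The 400 GB/s figure can be sanity-checked with simple arithmetic. This sketch assumes DDR5-6400 (6400 million transfers per second) and a 64-bit data path per channel; the comment above does not state a memory speed, so that part is an assumption, and the result is a theoretical peak rather than a measured throughput.

```python
def peak_bandwidth_gb_s(channels, transfers_per_s, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * transfers_per_s * bytes_per_transfer / 1e9

# Assumed memory speed: DDR5-6400, i.e. 6400e6 transfers per second.
eight_channel = peak_bandwidth_gb_s(channels=8, transfers_per_s=6400e6)
dual_channel = peak_bandwidth_gb_s(channels=2, transfers_per_s=6400e6)

print(f"8 channels: {eight_channel:.1f} GB/s")
print(f"2 channels: {dual_channel:.1f} GB/s")
```

Under those assumptions the eight-channel platform lands right around the quoted 400 GB/s, while a typical dual-channel desktop gets a quarter of that.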
That's 0.5ns - if you look at end-to-end memory latencies, which are usually around 100ns for mobile systems, that actually is negligible, and M series chips do not have particularly low memory latency (they trend higher in comparison).
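For anyone who wants to check the 0.5ns figure, here is a back-of-the-envelope sketch. It assumes a signal velocity of 2/3 the speed of light (as mentioned upthread) and an extra trace length of 10 cm, which is a hypothetical distance chosen for illustration and not a measurement of any board.

```python
SPEED_OF_LIGHT_M_S = 299_792_458

def propagation_delay_ns(distance_m, velocity_factor=2 / 3):
    """One-way signal travel time in nanoseconds over the given distance."""
    return distance_m / (velocity_factor * SPEED_OF_LIGHT_M_S) * 1e9

# Hypothetical extra distance between CPU and DIMM slot: 10 cm.
one_way_ns = propagation_delay_ns(0.10)
# Share of a roughly 100 ns end-to-end memory latency.
share = one_way_ns / 100.0

print(f"one-way delay: {one_way_ns:.2f} ns")
print(f"share of a 100 ns access: {share:.1%}")
```

Even doubling it for the round trip leaves the wire delay at around one percent of the total latency under these assumptions.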
Hardware Unboxed just did an interesting video[1] comparing gaming performance of 7600X Zen 4 and 9700X Zen 5 processors, and also the 9800X3D for reference. In some games the 9700X Zen 5 had a decent lead over the Zen 4, but in others it had exactly the same performance. But the 9800X3D would then have a massive lead over the 9700X.
For example, in Horizon Zero Dawn benchmark, the 7600X had 182 FPS while the 9700X had 185 FPS, yet the 9800X3D had a massive 266 FPS.
[1]: https://www.youtube.com/watch?v=emB-eyFwbJg
They're still good benchmarks IMO because they represent a "real workload" but to understand why the 9800X3D performs this much better you'd want some metrics on CPU cache misses in the processors tested.
It's often similar to hyperthreading -- on very efficient software you actually want to turn SMT off sometimes because it causes too many cache evictions as two threads fight for the same L2 cache space which is efficiently utilized.
So software having a huge speedup from an X3D model with a ton of cache might indicate the software has a bad data layout and needs the huge cache because it keeps doing RAM round trips. You'd presumably also see large speedups in this case from faster RAM on the same processor.
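A rough way to see why extra cache can matter this much is the usual average-access-time estimate: hit time plus miss rate times miss penalty. The latencies and miss rates below are invented for illustration only; they are not measurements of a 9700X or a 9800X3D.

```python
def average_access_ns(l3_hit_ns, l3_miss_rate, dram_penalty_ns):
    """Average cost of an access that reaches L3: hit time + miss rate * penalty."""
    return l3_hit_ns + l3_miss_rate * dram_penalty_ns

# Hypothetical numbers: same hit time and DRAM penalty, different miss rates.
small_cache = average_access_ns(l3_hit_ns=10.0, l3_miss_rate=0.20, dram_penalty_ns=80.0)
large_cache = average_access_ns(l3_hit_ns=10.0, l3_miss_rate=0.05, dram_penalty_ns=80.0)

print(f"small L3: {small_cache:.1f} ns per access")
print(f"large L3: {large_cache:.1f} ns per access")
print(f"speedup on memory-bound code: {small_cache / large_cache:.2f}x")
```

In this toy model, lowering the DRAM penalty (faster RAM) shrinks the gap between the two configurations, which matches the expectation that a cache-hungry program would also respond to faster memory.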
But as far as I can tell the 9600X and the 9800X3D are the same except for the 3D cache and a higher TDP. However they have similar peak extended power (~140W) and I don't see how the different TDP numbers explain the differences between 9600X and 7600X where the former is sometimes ahead and other times identical, while the 9800X3D beats both massively regardless.
What other factors could it be besides fewer L3 cache misses that lead to 40+% better performance of the 9800X3D?
> You'd presumably also see large speedups in this case from faster RAM on the same processor.
That was precisely my point. The Zen 5 seems to have a relatively slow memory path. If the M-series has a much better memory path, then the Zen 5 is at a serious disadvantage for memory-bound workloads. Consider local CPU-run LLMs as a prime example. The M-series crushes AMD there.
I found the gaming benchmark interesting because it included workloads that just straddled the cache sizes, and thus showed how good the Zen 5 could be if it had a much better memory subsystem.
I'm happy to be corrected though.
Why would they spend billions to "catch up" to an ecological niche that is already occupied, when the best they could do - if the argument here is right that x86 and ARM are equivalent - is getting the same result?
They would only invest this much money and time if they had some expectation of being better, but "sharing the first place" is not good enough.
I see it's measuring full system wattage with a 12900k which tended to use quite a bit of juice compared to AMD offerings.
https://gamersnexus.net/u/styles/large_responsive_no_waterma...
A big reason for this, at least for AMD, is because Apple buys all of TSMC's latest and greatest nodes for massive sums of money, so there is simply none left for others like AMD who are stuck a generation behind. And Intel is continually stuck trying to catch up. I would not say it's due to x86 itself.
It doesn't matter if most instructions have simple encodings. You still need to design your front end to handle the crazy encodings.
I doubt it makes a big difference, so until recently he would have been correct - why change your ISA when you can just wait a couple of months to get the same performance improvement. But Moore's law is dead now, so small performance differences matter way more.
The argument has been that even if you have a CISC ISA that also happens to have a subset of instructions following the RISC philosophy, the bloat and legacy instructions will hold CISC back. In other words, the weakness of CISC is that you can add, but never remove.
Jim Keller disagrees with this assessment and it is blatantly obvious.
You build a decoder that predicts that the instructions are going to have simple encodings and if they don't, then you have a slow fallback.
Now you might say that this makes the code slow under the assumption that you make heavy use of the complex instructions, but compilers have a strong incentive to only emit fast instructions.
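The "predict simple, fall back when wrong" idea can be sketched in software with a toy instruction set. Everything here is hypothetical: the rule that a byte below 0x80 is a complete one-byte instruction, the 3-byte length of "complex" instructions, and the window of 4 are made up for illustration and have nothing to do with real x86 encodings or any shipping decoder.

```python
WINDOW = 4       # instructions the fast path handles per step (assumed)
COMPLEX_LEN = 3  # toy rule: a byte >= 0x80 starts a 3-byte instruction

def decode(code: bytes):
    """Return (instructions, fast_steps, slow_steps) for the toy ISA."""
    instructions, fast, slow, pos = [], 0, 0, 0
    while pos < len(code):
        window = code[pos:pos + WINDOW]
        if len(window) == WINDOW and all(b < 0x80 for b in window):
            # Prediction holds: every byte in the window is a full instruction,
            # so all of them can be emitted in one step.
            instructions.extend(bytes([b]) for b in window)
            pos += WINDOW
            fast += 1
        else:
            # Fallback: work out one instruction's length, then try again.
            length = COMPLEX_LEN if code[pos] >= 0x80 else 1
            instructions.append(code[pos:pos + length])
            pos += length
            slow += 1
    return instructions, fast, slow

# Mostly simple code with a single complex instruction in the middle.
program = bytes([1, 2, 3, 4, 5, 6, 7, 8, 0x90, 0xAA, 0xBB, 9, 10, 11, 12])
instructions, fast, slow = decode(program)
print(f"{len(instructions)} instructions, {fast} fast steps, {slow} slow steps")
```

If compilers mostly emit the simple encodings, nearly every step takes the fast path, and the cost of the complex ones only shows up where they actually occur.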
If you can just add RISC style instructions to CISC ISAs, the entire argument collapses into irrelevance.
Obviously they've done an amazing job of working around it, but that adds a ton of complexity. At the very least it's going to mean you spend engineering resources on something that ARM & RISC-V don't even have to worry about.
This seems a little like a Java programmer saying "garbage collection is solved". Like, yeah you've made an amazingly complicated concurrent compacting garbage collector that is really fast and certainly fast enough almost all of the time. But it's still not as fast as not doing garbage collection. If you didn't have the "we really want people to use x86 because my job depends on it" factor then why would you use CISC?
Except that RISC-V had to bolt on a variable-length extension, giving the worst of all possible worlds....
The notebooks of TFA aren't really big computers.
At one time, ISA had a significant impact on predictors: variable length instructions complicated predictor design. The consensus is that this is no longer the case: decoders have grown to overcome this and now the difference is negligible.
I imagine that the difference is much greater for the tiny in-order CPUs we find in MCUs though, just because an amd64 decoder would be a comparatively much larger fraction of the transistor budget
No, not really. The advantage is Apple prioritizing efficiency, something Intel never cared enough about.