Analyzing the Memory Ordering Models of the Apple M1
Key topics
Researchers analyzed the memory ordering models of the Apple M1 chip, finding an average 10% performance degradation when using x86-style total store ordering (TSO) instead of ARM's weaker consistency model, sparking discussion of the implications and potential optimizations.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 4m after posting
- Peak period: 30 comments in the 84-96h window
- Average per period: 7.9 comments
- Based on 63 loaded comments
Key moments
- Story posted: Sep 11, 2025 at 5:33 PM EDT (4 months ago)
- First comment: Sep 11, 2025 at 5:38 PM EDT (4m after posting)
- Peak activity: 30 comments in the 84-96h window (the hottest stretch of the conversation)
- Latest activity: Sep 18, 2025 at 4:58 PM EDT (4 months ago)
They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.
It’s neat to see real numbers on it. The effect didn’t seem to be very big in most circumstances, which is roughly what I would have guessed.
Of course Apple just implemented that on the M1 and AMD/Intel had been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?
> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.
However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
https://developer.apple.com/documentation/virtualization/run...
I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged most of the gap by then?
Apple keeps trying to be a platform for games, and keeping old games running would be a step in that direction. That might include support for x86 games running through Wine, the Apple Game Porting Toolkit, etc.
Well... they'd also need to bring back 32-bit support then. That's what killed most of my Mac-compatible Steam library...
And I do not see that happening.
https://www.wikipedia.org/wiki/QuickTransit
If you have to pay the licensing fee again every time you want to release a new version of the OS, you've got a fiscal incentive to sunset Rosetta early.
Rosetta 2 was developed in-house.
Apple owns it, so there is no fiscal reason to sunset it early.
Silicon (or verification thereof) isn't free.
Except not having to pay to maintain it.
Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the TSO implementation on this ARM chip isn't optimal, giving weaker relative performance than a TSO-native platform like x86?
> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.
So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.
Yeah, ok, so the huge magnitude observed is due to some really poor program design.
> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
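The cache-line bouncing described in the 644.nab_s quote is plain false sharing, which is easy to reproduce in isolation. A minimal sketch (mine, not from the paper; assumes pthreads and the GCC/Clang alignment attribute):

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

/* Two counters packed into the same cache line: every store by one
 * thread invalidates the line in the other core's cache (the
 * "bouncing" described in the quote above). */
static struct { volatile unsigned long a, b; } packed;

/* Padding each counter out to 128 bytes (the larger of the two
 * reported line sizes) keeps the two threads on separate lines. */
struct padded { volatile unsigned long v; char pad[120]; };
static struct padded separate[2] __attribute__((aligned(128)));

static void *bump_a(void *arg)   { (void)arg; for (unsigned long i = 0; i < ITERS; i++) packed.a++; return NULL; }
static void *bump_b(void *arg)   { (void)arg; for (unsigned long i = 0; i < ITERS; i++) packed.b++; return NULL; }
static void *bump_pad(void *arg) { struct padded *p = arg; for (unsigned long i = 0; i < ITERS; i++) p->v++; return NULL; }

int main(void) {
    pthread_t t[2];

    /* Phase 1: false sharing (both counters live in one line). */
    pthread_create(&t[0], NULL, bump_a, NULL);
    pthread_create(&t[1], NULL, bump_b, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    /* Phase 2: no sharing (each counter has its own line). */
    pthread_create(&t[0], NULL, bump_pad, &separate[0]);
    pthread_create(&t[1], NULL, bump_pad, &separate[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    printf("%lu %lu %lu %lu\n", packed.a, packed.b, separate[0].v, separate[1].v);
    return 0;
}
```

Timing the two phases separately (with time(1) or clock_gettime) should show the packed version running noticeably slower on most multicore machines; per the paper, the size of the gap also depends on whether the cores run under TSO or WO.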
As TSO support is only a transitional aid for Apple, it is possible that they didn't bother to implement the full extent of the optimizations possible.
For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don’t think that is controversial. There are a lot of ways to explain it, e.g. x86 is significantly more efficient in some other unrelated areas, x86 code is tacitly designed to minimize the performance impact of TSO, or the Apple Silicon implementations nerf TSO because it isn’t worth the cost to optimize a compatibility shim. TSO must have some value in some contexts; it wasn’t chosen arbitrarily.
Apple Silicon is also an unconventional implementation of ARM64, so I wonder the extent to which this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.
I think it is great that this is being studied, I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.
Now, like with everything in life, of course, there are highly specialised datapaths like AVX-512, but then again these only contribute towards single-threaded performance, and you yourself said that "High-performance workloads are not single-threaded programs." As your compute network grows larger, the implementation details of the memory fabric (NUMA, etc.) become more pronounced. Suffice to say the SoC co-packaging of CPU and GPU cores, along with some coprocessors, did wonders for Apple Silicon. Strix Halo exists, but it's not competitive by any stretch of the imagination. You could say it's unfair, but then again, the AMD MI300A (LGA6096 socket) exists, too! Do we count 20k APUs that only come in eights, bundled up in proprietary Infinity Fabric-based chassis, towards "outperforming ARM64 in high-perf workloads"... really? Compute-bound is a far cry from high-performance, where the memory bus and the idiosyncrasies of message-passing are king as the number of cores in the compute network continues to grow.
I think their slogan could be "unlimited, coherent, persistent, encrypted high-bandwidth memory is here, and we are the only ones that really have it."
Disclaimer: proud owner of thoroughbred OpenPOWER system from Raptor
Really, 1TB/s of memory bandwidth to and from system memory?
I don't believe it, since that's impossible from a hardware-limits point of view: there is no DRAM that would allow such performance, and Apple doesn't design their memory sticks ...
Nor is their 512-, 768- or 1024-bit memory interface anything special, since it is neither designed by them nor exclusive to Apple. Intel has it. AMD has it as well.
However, regardless of that, and regardless of how you're skewing the facts, I would be happy to see a benchmark that shows, for example, a sustained load bandwidth of 1TB/s. Do you have one? I couldn't find any.
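As an aside, sustained-bandwidth claims are usually checked with a STREAM-style kernel. A single-threaded sketch of that idea (my own illustration, not a benchmark cited in the thread; one CPU thread will not get anywhere near the aggregate figure):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough STREAM-triad-style sketch: measures sustained read+write
 * bandwidth of one loop on one thread, not the SoC's aggregate
 * headline number. */
#define N (1UL << 26)   /* 64M doubles = 512 MiB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);
    printf("triad: %.1f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

Parallelised (e.g. with OpenMP) and compiled with optimisations, this kind of loop is roughly how CPU-side numbers like the 243 GB/s quoted below are measured; the 800 GB/s-class headline is the fabric's aggregate limit across CPU, GPU, and the other agents, not what a single client can pull.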
> You can get somewhat better with Turin
High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700GB/s. So not somewhat better but 3x better.
5x is false, it's more like 4x. Apple doesn't use memory sticks; they use DRAM ICs on the SoC package.
The M3 Ultra has 8 memory channels at 128 bits per channel, for a 1024-bit memory bus in total. It uses LPDDR5-6400, so peak bandwidth is 1024 bits × 6400 MT/s = 6553.6 Gbit/s ≈ 819.2 GB/s.
https://web.archive.org/web/20240902200818/https://www.anand...
> While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of.
> That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth.
> Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters)
You're realistically going to hit power/thermal limits before you saturate the memory bandwidth. Otherwise I'd like to hear about a workload that uses the CPU, GPU, NPU, etc. together heavily enough to back up Apple's marketing point.
From https://www.ibm.com/support/pages/ibm-aix-power10-performanc...
> The Power10 processor technology introduces the new OMI DIMMs to access main memory. This allows for increased memory bandwidth of 409 GB/s per socket. The 16 available high-speed OMI links are driven by 8 on-chip memory controller units (MCUs), providing a total aggregated bandwidth of up to 409 GBps per SCM. Compared to the Power9 processor-based technology capability, this represents a 78% increase in memory bandwidth.
And that is again a theoretical limit, which usually isn't that interesting; what matters is the practical number the CPU can actually hit.
Note also the substantial overprovisioning of the lanes, which handles lane-localized transmission issues without degrading observed performance.
Now, I'm not sure it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible, considering at that point you're talking about a 5K-euro CPU with a smidge under 600W TDP. This is why I brought up Siena specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum, it would still have zero GPU cores, and it would still have a little less memory bandwidth than a three-year-old M2 Ultra.
I own both a Mac Studio and a Siena-based AMD system.
There are valid reasons to go for x86, mainly its PCIe lanes, various accelerator cards, MCIO connectivity for NVMe, hardware IOMMU, SR-IOV networking and storage; in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed this is why I used Siena for the comparison, as it's at least comparable in terms of price, and not some abstract idea of redline performance, where x86 CPUs, by the way, absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice is not even whether you're getting a CPU; it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.
Update: I was unfair in my characterisation of the NVIDIA DGX Spark as a "big nothing"; despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always put a ConnectX-6 in a normal server's PCIe 5.0 slot, but that server would already set you back many thousands of euros for datacenter-grade specs.
See my reply to the adjacent comment; hardware is not marketing, and LLM inference stands as witness.
The opposite case is also possible. You can be compute limited. Or there could be bottlenecks somewhere else. This is definitely the case for Apple Silicon because you will certainly not be able to make use of all of the memory bandwidth from the CPU or GPU. As always, benchmark instead of looking at raw hardware specifications.
All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?»
Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.
[0]: https://developer.apple.com/documentation/accelerate
[1]: https://ml-explore.github.io/mlx/build/html/usage/quick_star...
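As a concrete, if trivial, illustration of that claim, an Accelerate [0] vDSP call needs no cache-line or bus-width awareness from the caller; a sketch (assumes macOS, link with -framework Accelerate):

```c
#include <Accelerate/Accelerate.h>
#include <stdio.h>
#include <stdlib.h>

/* Element-wise add of two large vectors via vDSP.  The caller never
 * tunes for cache-line size or memory-bus width; the framework picks
 * the vectorised path. */
int main(void) {
    const vDSP_Length n = 1 << 24;        /* 16M floats per array */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;

    for (vDSP_Length i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vDSP_vadd(a, 1, b, 1, c, 1, n);       /* c = a + b */

    printf("%f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}
```

The framework decides how to vectorise and stream the data, which is the sense in which no manual optimisation is required.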
Yes.
> Have they really cracked it at homogenous computing […]
Yes.
> have it emit efficient code […]
Yes. I had also written compilers and code generators for a number of platforms (all RISC) decades before Apple Silicon became a thing.
> […] for whatever target in the SoC?
You are mistaking the memory bus width that I was referring to for CPU-specific optimisations. You are also disregarding the fact that the M1-M4 Apple SoCs have the same internal CPU architecture, differing mostly in the ARM instruction sets they support (ARM64 v8.2 in the M1 through to ARM64 v8.6 in the M4).
> Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?
Yes.
Is there truly a need to be confrontational in what otherwise could have been an insightful and engaging conversation?
Others have as well. The https://lemire.me/blog/ blog has a wealth of insights across multiple architectures, including all of the current incumbents (Intel, Apple, Qualcomm, etc.).
Do you have any detailed insights? I would be eager to assess them.
They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.
> OTOH server aarch64 implementations such Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.
These are manufactured on far older nodes than Apple Silicon or Intel x86, and it's a chicken-and-egg problem once again: there is no incentive for ARM chip designers to invest in performance as long as there are no customers, and there are no customers as long as non-Apple hardware has serious performance issues and there is no software optimized to run on ARM.
That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.
> These are manufactured on far older nodes than Apple Silicon
True, but I don't think this is the main bottleneck, though perhaps it contributes. IMO it's the core design that is lacking.
> there will be no incentive for ARM chip designers to invest into performance as long as there are no customers
Well, AWS hosts a multitude of EC2 instance types on Graviton4 (Neoverse V2 cores). This implies that there are customers.
Why not? Well, form factor is an issue, but you can easily fit a few Mac Pros in a couple of rack units. Support is generally better than with some HP or Dell servers.
AWS has a bit of a different cost-benefit calculation, though. For them, similarly to Apple, ARM is a hedge against the AMD/Intel duopoly, and they can run their own services (for which they have ample money for development and testing) far more cheaply because the power efficiency of ARM systems is better than x86's. And, much as the early AWS started off as Amazon selling spare compute capacity, they expose to the open market what they don't need.
As a key exhibit, AVX-512 native code destroys Apple Silicon. To be clear, I like and use Apple Silicon but they can’t carry a workload like x86 and they seem disinterested in trying. Super-efficient for scalar code though.
> TSO must have some value in some contexts, it wasn’t chosen arbitrarily.
Ehhh. I think they might have just backed themselves into it? I believe Intel initially claimed SeqCst but the chips never implemented that and the lack was observable. TSO happened to accurately describe the existing behavior of early multicore Intel chips and they can't exactly relax it now without breaking existing binaries.
Google's AI slop claims Intel published something vague in 2007, and researchers at Cambridge came up with the TSO name and observation in 2009 ("A Better x86 Memory Model: x86-TSO").
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...
In practice Intel never took advantage of this and, given the guarantees provided by the memory barriers, it was hard to formally recover SC, so Intel slightly strengthened it to TSO, which is what was actually implemented in hardware anyway.
I don't think Intel ever claimed SC since their first CPU with built-in support for cache coherency (it was the PPro, I think?), and the memory model was not well defined before that and was left to external chips.
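The observable difference between SC and TSO is the classic store-buffering litmus test: under SC the outcome r1 == 0 && r2 == 0 is impossible, while TSO (and anything weaker) allows it, because a store can sit in the store buffer while a later load completes. A C11 sketch (mine, not from the paper) that surfaces it:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering (SB) litmus test.  With memory_order_seq_cst the
 * outcome r1 == 0 && r2 == 0 is forbidden; with relaxed ordering it is
 * allowed, and on TSO hardware it can show up because each core's
 * store may still be in its store buffer when the following load runs. */
atomic_int x, y;
int r1, r2;

static void *writer_x(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *writer_y(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int weird = 0;
    for (int i = 0; i < 10000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0) weird++;  /* the non-SC outcome */
    }
    /* The count is timing-dependent; a naive harness like this may
     * observe it only rarely. */
    printf("store->load reordering observed %d times\n", weird);
    return 0;
}
```

Replacing the relaxed orders with memory_order_seq_cst makes the compiler emit the fences needed to forbid that outcome, which is roughly the gap between SC-as-once-suggested and TSO-as-implemented discussed above.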
> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.
How would this even be possible?
- Architectural interfaces like (I think; I don't really know aarch64) DC CVAU. These don't necessarily have to reflect physical cache organization; cleaning a "line" could clean two physical lines.
- Performance. The only thing you really care about is behavior on stores and on load misses for avoiding false sharing cache line bouncing problems.
It's possible that they think 128-byte lines will help performance and hope to switch over after legacy software goes away, seeding their Mac ecosystem with 128-byte lines now; or that 128-byte-line behavior actually does offer some performance benefit today and they have a mode that basically gangs two lines together (Pentium 4 had something similar, IIRC) so it has the performance characteristics of a 128-byte line.
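For anyone wanting to reproduce the 128 B vs 64 B discrepancy from the quote above, both values are easy to query. A sketch (hw.cachelinesize is a real macOS sysctl key; the Linux path reads CTR_EL0, which the kernel normally permits from userspace):

```c
#include <stdio.h>
#include <stdint.h>

#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

int main(void) {
#ifdef __APPLE__
    /* macOS: the value sysctl reports (128 on Apple Silicon). */
    int64_t line = 0;
    size_t len = sizeof line;
    if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0)
        printf("sysctl hw.cachelinesize: %lld B\n", (long long)line);
#elif defined(__aarch64__)
    /* Linux/AArch64: CTR_EL0.DminLine is log2 of the smallest D-cache
     * line size in 4-byte words. */
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    printf("CTR_EL0 DminLine: %lu B\n", 4UL << ((ctr >> 16) & 0xF));
#endif
    return 0;
}
```

On an M1 this reproduces the 128 B / 64 B split described in the quote above.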
Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:
FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.
> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory
> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.
There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
True, that is what I assumed. Do you think they meant false sharing at the L2 level?
CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x40). CPU1 loads byte 0x50 in a different data structure which is in CL1, and its adjacent line prefetcher also loads CL0, which is what Pentium 4 did.
> Perhaps what fb experiment measured was the artifact of L2 HW prefetcher which takes advantage of 128-byte data layout:
Seems plausible.
I think the measurements are hard evidence, and if they are correct, why would Apple's sysctl return 128 B? I am actually wondering if the Apple M design really supports two different cache-line sizes, 64 B and 128 B, with the mode somehow configurable.
> Your browser is outdated. Update your browser to view ScienceDirect correctly.
Guess I won't read this.
6 more comments available on Hacker News