Analyzing the Memory Ordering Models of the Apple M1
Key topics
Researchers analyzed the memory ordering models of the Apple M1 chip, finding an average 10% performance degradation when using x86-style total store ordering (TSO) instead of ARM's weaker consistency model, sparking discussion of the implications and potential optimizations.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 4m after posting
- Peak period: 30 comments in the 84-96h window
- Average per period: 7.9 comments
- Based on 63 loaded comments
Key moments
- Story posted: Sep 11, 2025 at 5:33 PM EDT (4 months ago)
- First comment: Sep 11, 2025 at 5:38 PM EDT (4m after posting)
- Peak activity: 30 comments in the 84-96h window (the hottest stretch of the conversation)
- Latest activity: Sep 18, 2025 at 4:58 PM EDT (4 months ago)
They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.
It’s neat to see real numbers on it. The effect didn’t seem to be very big in most circumstances, which is roughly what I would have guessed.
Of course Apple just implemented that on the M1 and AMD/Intel had been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?
> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.
However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?
https://developer.apple.com/documentation/virtualization/run...
I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped the Rosetta 2 use cases that don't involve native apps. Those are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged most of the gap by then?
Apple keeps trying to be a platform for games, and keeping old games running would be a step in that direction. That might include support for x86 games running through Wine, the Apple Game Porting Toolkit, etc.
Well... they'd also need to bring back 32-bit support then. That's what killed most of my Mac-compatible Steam library...
And I do not see that happening.
https://www.wikipedia.org/wiki/QuickTransit
If you have to pay the licensing fee again every time you want to release a new version of the OS, you've got a fiscal incentive to sunset Rosetta early.
Rosetta 2 was developed in-house.
Apple owns it, so there is no fiscal reason to sunset it early.
Silicon (or verification thereof) isn't free.
Except not having to pay to maintain it.
Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the TSO implementation on this ARM chip isn't optimal, giving weaker relative performance than a TSO-native platform like x86?
> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.
So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.
> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.
Yeah, ok, so the huge magnitude observed is due to some really poor program design.
> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).
Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
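The cache-line bouncing described in the 644.nab_s quote is plain false sharing, which is easy to reproduce in isolation. A minimal sketch (mine, not from the paper; assumes pthreads and the GCC/Clang alignment attribute):

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

/* Two counters packed into the same cache line: every store by one
 * thread invalidates the line in the other core's cache (the
 * "bouncing" described in the quote above). */
static struct { volatile unsigned long a, b; } packed;

/* Padding each counter out to 128 bytes (the larger of the two
 * reported line sizes) keeps the two threads on separate lines. */
struct padded { volatile unsigned long v; char pad[120]; };
static struct padded separate[2] __attribute__((aligned(128)));

static void *bump_a(void *arg)   { (void)arg; for (unsigned long i = 0; i < ITERS; i++) packed.a++; return NULL; }
static void *bump_b(void *arg)   { (void)arg; for (unsigned long i = 0; i < ITERS; i++) packed.b++; return NULL; }
static void *bump_pad(void *arg) { struct padded *p = arg; for (unsigned long i = 0; i < ITERS; i++) p->v++; return NULL; }

int main(void) {
    pthread_t t[2];

    /* Phase 1: false sharing (both counters live in one line). */
    pthread_create(&t[0], NULL, bump_a, NULL);
    pthread_create(&t[1], NULL, bump_b, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    /* Phase 2: no sharing (each counter has its own line). */
    pthread_create(&t[0], NULL, bump_pad, &separate[0]);
    pthread_create(&t[1], NULL, bump_pad, &separate[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    printf("%lu %lu %lu %lu\n", packed.a, packed.b, separate[0].v, separate[1].v);
    return 0;
}
```

Timing the two phases separately (with time(1) or clock_gettime) should show the packed version running noticeably slower on most multicore machines; per the paper, the size of the gap also depends on whether the cores run under TSO or WO.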
As TSO support is only a transitional aid for Apple, it is possible that they didn't bother to implement the full extent of the optimizations possible.
For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don’t think that is controversial. There are a lot of ways to explain it, e.g. x86 is significantly more efficient in some other unrelated areas, x86 code is tacitly designed to minimize the performance impact of TSO, or the Apple Silicon implementations nerf TSO because it isn’t worth the cost to optimize a compatibility shim. TSO must have some value in some contexts; it wasn’t chosen arbitrarily.
Apple Silicon is also an unconventional implementation of ARM64, so I wonder the extent to which this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.
I think it is great that this is being studied, I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.
Now, like with everything in life, of course, there are highly specialised datapaths like AVX-512, but then again these only contribute towards single-threaded performance, and you yourself said that "High-performance workloads are not single-threaded programs." As your compute network grows larger, the implementation details of the memory fabric (NUMA, etc.) become more pronounced. Suffice to say the SoC co-packaging of CPU and GPU cores, along with some coprocessors, did wonders for Apple Silicon. Strix Halo exists, but it's not competitive by any stretch of the imagination. You could say it's unfair, but then again, the AMD MI300A (LGA6096 socket) exists, too! Do we count 20k APUs that only come in eights, bundled up in proprietary Infinity Fabric-based chassis, towards "outperforming ARM64 in high-perf workloads"... really? Compute-bound is a far cry from high-performance, where the memory bus and the idiosyncrasies of message-passing are king as the number of cores in the compute network continues to grow.
I think their slogan could be "unlimited, coherent, persistent, encrypted high-bandwidth memory is here, and we are the only ones that really have it."
Disclaimer: proud owner of thoroughbred OpenPOWER system from Raptor
Really, 1TB/s of memory bandwidth to and from system memory?
I don't believe it, since that's impossible from a hardware-limits point of view: there is no DRAM that would allow such performance, and Apple doesn't design their memory sticks ...
Nor is their 512-, 768- or 1024-bit memory interface anything special, since it is neither designed by them nor exclusive to Apple. Intel has it. AMD has it as well.
However, regardless of that, and regardless of how you're skewing the facts, I would be happy to see a benchmark that shows, for example, a sustained load bandwidth of 1TB/s. Do you have one? I couldn't find any.
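As an aside, sustained-bandwidth claims are usually checked with a STREAM-style kernel. A single-threaded sketch of that idea (my own illustration, not a benchmark cited in the thread; one CPU thread will not get anywhere near the aggregate figure):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough STREAM-triad-style sketch: measures sustained read+write
 * bandwidth of one loop on one thread, not the SoC's aggregate
 * headline number. */
#define N (1UL << 26)   /* 64M doubles = 512 MiB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];   /* triad: 2 loads + 1 store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);
    printf("triad: %.1f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```

Parallelised (e.g. with OpenMP) and compiled with optimisations, this kind of loop is roughly how CPU-side numbers like the 243 GB/s quoted below are measured; the 800 GB/s-class headline is the fabric's aggregate limit across CPU, GPU, and the other agents, not what a single client can pull.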
> You can get somewhat better with Turin
High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700GB/s. So not somewhat better but 3x better.
5x is false, it's more like 4x. Apple doesn't use memory sticks; they use DRAM ICs on the SoC package.
The M3 Ultra has 8 memory channels at 128 bits per channel, for a 1024-bit memory bus in total. It uses LPDDR5-6400, so peak bandwidth is 1024 bits × 6400 MT/s = 6553.6 Gbit/s ≈ 819.2 GB/s.
https://web.archive.org/web/20240902200818/https://www.anand...
> While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of.
> That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth.
> Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters)
You're realistically going to hit power/thermal limits before you saturate the memory bandwidth. Otherwise I'd like to hear about a workload that uses the CPU, GPU, NPU, etc. together heavily enough to back up Apple's marketing point.
From https://www.ibm.com/support/pages/ibm-aix-power10-performanc...
> The Power10 processor technology introduces the new OMI DIMMs to access main memory. This allows for increased memory bandwidth of 409 GB/s per socket. The 16 available high-speed OMI links are driven by 8 on-chip memory controller units (MCUs), providing a total aggregated bandwidth of up to 409 GBps per SCM. Compared to the Power9 processor-based technology capability, this represents a 78% increase in memory bandwidth.
And that is again a theoretical limit, which usually isn't that interesting; what matters is the practical number the CPU can actually hit.
Note also the substantial overprovisioning of the lanes, which handles lane-localized transmission issues without degrading observed performance.
Now, I'm not sure it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible, considering at that point you're talking about a 5K-euro CPU with a smidge under 600W TDP. This is why I brought up Siena specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum, it would still have zero GPU cores, and it would still have a little less memory bandwidth than a three-year-old M2 Ultra.
I own both a Mac Studio and a Siena-based AMD system.
There are valid reasons to go for x86, mainly its PCIe lanes, various accelerator cards, MCIO connectivity for NVMe, hardware IOMMU, SR-IOV networking and storage; in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed this is why I used Siena for the comparison, as it's at least comparable in terms of price, and not some abstract idea of redline performance, where x86 CPUs, by the way, absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice is not even whether you're getting a CPU; it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.
Update: I was unfair in my characterisation of the NVIDIA DGX Spark as a "big nothing"; despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always put a ConnectX-6 in a normal server's PCIe 5.0 slot, but that server would already set you back many thousands of euros for datacenter-grade specs.
See my reply to the adjacent comment; hardware is not marketing, and LLM inference stands as witness.
The opposite case is also possible. You can be compute limited. Or there could be bottlenecks somewhere else. This is definitely the case for Apple Silicon because you will certainly not be able to make use of all of the memory bandwidth from the CPU or GPU. As always, benchmark instead of looking at raw hardware specifications.
All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?»
Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.
[0]: https://developer.apple.com/documentation/accelerate
[1]: https://ml-explore.github.io/mlx/build/html/usage/quick_star...
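As a concrete, if trivial, illustration of that claim, an Accelerate [0] vDSP call needs no cache-line or bus-width awareness from the caller; a sketch (assumes macOS, link with -framework Accelerate):

```c
#include <Accelerate/Accelerate.h>
#include <stdio.h>
#include <stdlib.h>

/* Element-wise add of two large vectors via vDSP.  The caller never
 * tunes for cache-line size or memory-bus width; the framework picks
 * the vectorised path. */
int main(void) {
    const vDSP_Length n = 1 << 24;        /* 16M floats per array */
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    float *c = malloc(n * sizeof *c);
    if (!a || !b || !c) return 1;

    for (vDSP_Length i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    vDSP_vadd(a, 1, b, 1, c, 1, n);       /* c = a + b */

    printf("%f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}
```

The framework decides how to vectorise and stream the data, which is the sense in which no manual optimisation is required.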
Yes.
> Have they really cracked it at homogenous computing […]
Yes.
> have it emit efficient code […]
Yes. I had also written compilers and code generators for a number of platforms (all RISC) decades before Apple Silicon became a thing.
> […] for whatever target in the SoC?
You are mistaking the memory bus width that I was referring to for CPU-specific optimisations. You are also disregarding the fact that the M1-M4 Apple SoCs have the same internal CPU architecture, differing mostly in the ARM instruction sets they support (ARM64 v8.2 in the M1 through to ARM64 v8.6 in the M4).
> Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?
Yes.
Is there truly a need to be confrontational in what otherwise could have been an insightful and engaging conversation?
Others have as well. The https://lemire.me/blog/ blog has a wealth of insights across multiple architectures, including all of the current incumbents (Intel, Apple, Qualcomm, etc.).
Do you have any detailed insights? I would be eager to assess them.
They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.
> OTOH server aarch64 implementations such Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.
These are manufactured on far older nodes than Apple Silicon or Intel x86, and it's a chicken-and-egg problem once again: there is no incentive for ARM chip designers to invest in performance as long as there are no customers, and there are no customers as long as non-Apple hardware has serious performance issues and there is no software optimized to run on ARM.
That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.
> These are manufactured on far older nodes than Apple Silicon
True, but I don't think this is the main bottleneck, though perhaps it contributes. IMO it's the core design that is lacking.
> there will be no incentive for ARM chip designers to invest into performance as long as there are no customers
Well, AWS hosts a multitude of EC2 instance types on Graviton4 (Neoverse V2 cores). This implies that there are customers.
Why not? Well, form factor is an issue, but you can easily fit a few Mac Pros in a couple of rack units. Support is generally better than with some HP or Dell servers.
AWS has a bit of a different cost-benefit calculation, though. For them, similarly to Apple, ARM is a hedge against the AMD/Intel duopoly, and they can run their own services (for which they have ample money for development and testing) far more cheaply because the power efficiency of ARM systems is better than x86's. And, much as the early AWS started off as Amazon selling spare compute capacity, they expose to the open market what they don't need.
As a key exhibit, AVX-512 native code destroys Apple Silicon. To be clear, I like and use Apple Silicon but they can’t carry a workload like x86 and they seem disinterested in trying. Super-efficient for scalar code though.
> TSO must have some value in some contexts, it wasn’t chosen arbitrarily.
Ehhh. I think they might have just backed themselves into it? I believe Intel initially claimed SeqCst but the chips never implemented that and the lack was observable. TSO happened to accurately describe the existing behavior of early multicore Intel chips and they can't exactly relax it now without breaking existing binaries.
Google's AI slop claims Intel published something vague in 2007, and researchers at Cambridge came up with the TSO name and observation in 2009 ("A Better x86 Memory Model: x86-TSO").
https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...
In practice Intel never took advantage of this and, given the guarantees provided by the memory barriers, it was hard to formally recover SC, so Intel slightly strengthened it to TSO, which is what was actually implemented in hardware anyway.
I don't think Intel ever claimed SC since their first CPU with built-in support for cache coherency (it was the PPro, I think?), and the memory model was not well defined before that and was left to external chips.
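The observable difference between SC and TSO is the classic store-buffering litmus test: under SC the outcome r1 == 0 && r2 == 0 is impossible, while TSO (and anything weaker) allows it, because a store can sit in the store buffer while a later load completes. A C11 sketch (mine, not from the paper) that surfaces it:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering (SB) litmus test.  With memory_order_seq_cst the
 * outcome r1 == 0 && r2 == 0 is forbidden; with relaxed ordering it is
 * allowed, and on TSO hardware it can show up because each core's
 * store may still be in its store buffer when the following load runs. */
atomic_int x, y;
int r1, r2;

static void *writer_x(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *writer_y(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int weird = 0;
    for (int i = 0; i < 10000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, writer_x, NULL);
        pthread_create(&b, NULL, writer_y, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0) weird++;  /* the non-SC outcome */
    }
    /* The count is timing-dependent; a naive harness like this may
     * observe it only rarely. */
    printf("store->load reordering observed %d times\n", weird);
    return 0;
}
```

Replacing the relaxed orders with memory_order_seq_cst makes the compiler emit the fences needed to forbid that outcome, which is roughly the gap between SC-as-once-suggested and TSO-as-implemented discussed above.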
> Regarding cache-line size, sysctl on macOS reports a value of 128 B, while getconf and the CTR_EL0 register on Asahi Linux returns 64 B, which is also supported by our measurements.
How would this even be possible?
- Architectural interfaces like (I think; I don't really know aarch64) DC CVAU. These don't necessarily have to reflect physical cache organization; cleaning a "line" could clean two physical lines.
- Performance. The only thing you really care about is behavior on stores and on load misses for avoiding false sharing cache line bouncing problems.
It's possible that they think 128-byte lines will help performance and hope to switch over after legacy software goes away, seeding their Mac ecosystem with 128-byte lines now; or that 128-byte-line behavior actually does offer some performance benefit today and they have a mode that basically gangs two lines together (Pentium 4 had something similar, IIRC) so it has the performance characteristics of a 128-byte line.
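For anyone wanting to reproduce the 128 B vs 64 B discrepancy from the quote above, both values are easy to query. A sketch (hw.cachelinesize is a real macOS sysctl key; the Linux path reads CTR_EL0, which the kernel normally permits from userspace):

```c
#include <stdio.h>
#include <stdint.h>

#ifdef __APPLE__
#include <sys/sysctl.h>
#endif

int main(void) {
#ifdef __APPLE__
    /* macOS: the value sysctl reports (128 on Apple Silicon). */
    int64_t line = 0;
    size_t len = sizeof line;
    if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0)
        printf("sysctl hw.cachelinesize: %lld B\n", (long long)line);
#elif defined(__aarch64__)
    /* Linux/AArch64: CTR_EL0.DminLine is log2 of the smallest D-cache
     * line size in 4-byte words. */
    uint64_t ctr;
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    printf("CTR_EL0 DminLine: %lu B\n", 4UL << ((ctr >> 16) & 0xF));
#endif
    return 0;
}
```

On an M1 this reproduces the 128 B / 64 B split described in the quote above.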
Perhaps what the fb experiment measured was an artifact of the L2 HW prefetcher, which takes advantage of a 128-byte data layout:
FWIW, IA-32 SDM Vol 3A, section 10.10.6.7 still explicitly recommends 128-byte alignment for locks on NetBurst uarch (~Pentium 4) processors.
> 10.10.6.7 Place Locks and Semaphores in Aligned, 128-Byte Blocks of Memory
> When software uses locks or semaphores to synchronize processes, threads, or other code sections; Intel recommends that only one lock or semaphore be present within a cache line (or 128 byte sector, if 128-byte sector is supported). In processors based on Intel NetBurst microarchitecture (which support 128-byte sector consisting of two cache lines), following this recommendation means that each lock or semaphore should be contained in a 128-byte block of memory that begins on a 128-byte boundary. The practice minimizes the bus traffic required to service locks.
There is also an interesting table of cache line and "sector" sizes in Table 13-1 (it's all 64-byte cache lines in newer Intel CPUs).
True, that is what I assumed. Do you think they meant false sharing at the L2 level?
CPU0 stores to byte 0x10 and dirties CL0 (0x00-0x40). CPU1 loads byte 0x50 in a different data structure which is in CL1, and its adjacent line prefetcher also loads CL0, which is what Pentium 4 did.
> Perhaps what fb experiment measured was the artifact of L2 HW prefetcher which takes advantage of 128-byte data layout:
Seems plausible.
I think the measurements are hard evidence, and if they are correct, why would Apple's sysctl return 128 B? I am actually wondering if the Apple M design really supports two different cache-line sizes, 64 B and 128 B, with the mode somehow configurable.
> Your browser is outdated. Update your browser to view ScienceDirect correctly.
Guess I won't read this.
6 more comments available on Hacker News