From Rust to Reality: the Hidden Journey of Fetch_max
questdb.com · Tech story
Key topics: Rust, Compiler Optimization, Atomic Operations
The article explores how Rust's `fetch_max` function is compiled to efficient machine code, sparking a discussion on atomic operations, CPU instructions, and compiler optimizations.
Snapshot generated from the HN discussion
Discussion activity
- Engagement: moderate (first comment after 39m)
- Peak period: 8 comments in the 0-2h window
- Average per period: 4.4 comments
- Comment distribution chart omitted; based on 57 loaded comments
Key moments
- Story posted: Sep 23, 2025 at 5:24 PM EDT (3 months ago)
- First comment: Sep 23, 2025 at 6:03 PM EDT (39m after posting)
- Peak activity: 8 comments in the 0-2h window, the hottest stretch of the conversation
- Latest activity: Sep 24, 2025 at 11:27 PM EDT (3 months ago)
An example you'll see in, say, a modern C compiler: if you write the obvious loop to count how many bits are set in an int, the machine code on a brand-new CPU should be a single population-count instruction. C provides neither intrinsics (the way Rust does) nor a dedicated "popcount" feature, so you can't write that directly, but it's obviously what you want here, and yes, an optimising C compiler will do exactly that.
However, LLVM is dealing with IR generated by other compiler front ends, so I think it probably has less use for idiom recognition. Clang would do the recognition and lower to the same LLVM IR that Rust emits for its population-count intrinsic, core::intrinsics::ctpop, so the LLVM backend doesn't need to spot this. I might be wrong, but I think that's how it works.
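To make that concrete on the Rust side, here is a minimal sketch (function names are illustrative): the hand-written loop relies on LLVM's loop-idiom recognition, while `count_ones()` lowers to the ctpop intrinsic directly. On a target with a population-count instruction (say, x86-64 with popcnt available), both will typically end up as that single instruction, though the exact codegen depends on the target and optimization level.

```rust
/// The "obvious loop": clear the lowest set bit until none remain
/// (Kernighan's trick). LLVM's loop-idiom recognition can turn this
/// whole loop into a ctpop, which the backend can then lower to a
/// single popcount instruction.
fn popcount_loop(mut x: u32) -> u32 {
    let mut n = 0;
    while x != 0 {
        x &= x - 1;
        n += 1;
    }
    n
}

/// The library route: count_ones() maps to the ctpop intrinsic,
/// so no idiom recognition is needed at all.
fn popcount_builtin(x: u32) -> u32 {
    x.count_ones()
}

fn main() {
    assert_eq!(popcount_loop(0b1011_0100), 4);
    assert_eq!(popcount_builtin(0b1011_0100), 4);
}
```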
C compilers definitely have intrinsics for this; for GCC, for instance, it's `__builtin_popcount`.
And apparently there is even standard language support for it since C23: it's `stdc_count_ones` [1], and in C++ you have `std::popcount` [2].
[1] https://en.cppreference.com/w/c/numeric/bit_manip.html [2] https://en.cppreference.com/w/cpp/numeric/popcount.html
But yes, stdc_count_ones is indeed the intrinsic you'd want here, arriving only a few years after I stopped writing C, so thanks for mentioning it.
std::popcount is C++, but it's also kinda miserable that it took until C++20, and even then they only landed the unsigned integer types. That's despite C++20 also insisting that signed integers have two's-complement representation, so the signed integers do in fact have these desirable properties, but you can't use that.
And FWIW, all major C/C++ compilers have long had an intrinsic for this. In Clang it even has the same name; in Visual Studio it's something like just '_popcount'. So it has long been easy to roll your own macro that works everywhere.
Yesterday I watched that "Sea of Thieves" C++14-to-C++20 upgrade story on YouTube; that feels much more like what I've seen: code that shouldn't have worked but did, kept alive by people whose priority is a working game.
I don't think this generalization is actually true. Fast portable software compiles conditionally based on the target platform, picking the fast platform-specific intrinsic, and falls back to a slow but guaranteed portable software implementation. This pattern is widespread in numerical linear algebra, media codecs, data compressors, encryption, graphics, etc.
Edit: I could not find any pass with pattern matching to replace CAS loops. The closest thing I could find is this pass: https://github.com/llvm/llvm-project/blob/06fb26c3a4ede66755... I reckon one could write a similar pass to recognize CAS idioms, but its usefulness would probably be rather limited and not worth the effort/risks.
https://gcc.godbolt.org/z/b5s4WjnTG
(amomax is the atomic fetch-max instruction. lr and sc are load-reserved and store-conditional instructions; sc is like a regular store except it only succeeds if the address was not modified since the previous lr that accessed it. IOW the assembly is basically one-to-one with the C source.)
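To tie that back to the Rust API the article is about, here is a hedged sketch of the two shapes being compared (names are illustrative, and actual codegen depends on the target): `fetch_max` can lower to a single amomax-style instruction where the ISA has one, while the hand-written loop below is roughly what an lr/sc or compare-exchange retry sequence expresses.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Built-in atomic max: on RISC-V with the A extension this can become a
/// single amomaxu.w; on other targets it may be lowered to a retry loop.
fn max_builtin(counter: &AtomicU32, sample: u32) -> u32 {
    counter.fetch_max(sample, Ordering::Relaxed)
}

/// Hand-rolled compare-exchange loop with the same semantics: read the
/// current value, try to install max(current, sample), and retry if
/// another thread raced in between.
fn max_cas_loop(counter: &AtomicU32, sample: u32) -> u32 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        let desired = current.max(sample);
        match counter.compare_exchange_weak(
            current,
            desired,
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            Ok(previous) => return previous,
            Err(actual) => current = actual,
        }
    }
}
```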
If this kind of thing floats your boat, you might be interested in the non-reading variants of these as well. Mostly for things like add, max, etc but some recent architectures actually offer alternate operations to skip the read-back. The paper calls them “atomic reduction operations” https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...
AFAIK, on all of x86, ARM, and RISC-V, an atomic load of a word-sized datum is just a regular load.
Also am I understanding it correctly that n is the number of threads in your example? Don't you find it suspicious that the number of operations goes up as the thread count goes up?
edit: ok, you are saying that under heavy contention the check avoids having to do the store at all. This is racy, and whether it is correct or not would be very application-specific.
edit2: I thought about this a bit, and I'm not sure i can come up with a scenario where the race matters...
edit3: ... as long as all threads are only doing atomic_max operations on the memory location, which an implementation can't assume.
What assumes that?
If your early read gives you a higher number, quitting out immediately is the same as doing the max that same nanosecond. You avoid setting a variable to the same value it already has. Doing or not doing that write shouldn't affect other atomics users, should it?
In general, I should be able to add or remove as many atomic(x=x) operations as I want without changing the result, right?
And if your early read is lower then you just do the max and having an extra read is harmless.
The only case I can think of that goes wrong is the read (and elided max) happening too early in relation to accesses to other variables, but we're assuming relaxed memory order here so that's explicitly acceptable.
A load that finds a smaller value is trickier to analyze, but I think you are free to just ignore it and proceed with the atomic max. An underlying LL/SC loop implementing a max operation might spuriously fail anyway.
edit: here is another argument in favour: if your only atomic RMW is a CAS, to implement X.atomic_max(new) you would load the current value, use it as the expected value, and try to CAS in max(current, new), retrying on failure. So a CAS loop would naturally implement the same optimization (unless it starts with a random expected value), and the race is benign.

The probing load pays off when:
- the value often doesn't require an update, and
- there's contention on the cache line, i.e., at least two cores frequently read or write that cache line.

But there are important details to consider (see the sketch after this list):
1) The probing load must be atomic. Both the compiler and the processor in general are allowed to split non-atomic loads into two or more partial loads. Only atomic loads – even with relaxed ordering – are guaranteed to not return intermediate or mixed values from other atomic stores.
2) If the ordering on the read part of the atomic read-modify-write operation is not relaxed, the probing load must reflect this. For example, an acq-rel RMW op would require an acquire ordering on the probing read.
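Putting both caveats together, here is a hedged Rust sketch of the probing-load idea (the function name is made up for illustration, and whether the early exit actually pays off is workload-dependent, as discussed above):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomic max with a "probing" load in front of the read-modify-write.
/// Per point 1 above, the probe is an atomic (relaxed) load, so neither the
/// compiler nor the CPU may tear it or return a mixed value.
/// Per point 2, everything here is relaxed; if the RMW needed acquire/release
/// semantics, the probe and the early-exit path would have to provide the
/// corresponding ordering as well.
fn relaxed_max(counter: &AtomicU32, sample: u32) -> u32 {
    let current = counter.load(Ordering::Relaxed);
    if current >= sample {
        // The stored maximum is already at least `sample`: skip the RMW and
        // avoid pulling the cache line in exclusive state.
        return current;
    }
    counter.fetch_max(sample, Ordering::Relaxed)
}
```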
Well, if you target a specific architecture, then of course you can assume more guarantees than you can in general, portable code. And in general, a processor might distinguish between non-atomic and relaxed-atomic reads and writes – in theory.
But more important, and relevant in practice, is the behavior of the compiler. C, C++, and Rust compilers are allowed to assume that non-atomic reads aren't influenced by concurrent writes, so the compiler is allowed to split non-atomic reads into smaller reads (unlikely) or even optimize the reads away if it can prove that the memory location isn't written to by the local thread (more likely).
I believe this is a bit trickier than that; you would also need at least some kind of atomic barrier to preserve the ordering semantics of the successful update case.
Ah, a magician. Welcome.
It makes me feel a little better reading about the history of memory models in CPUs. If this stuff wasn't intuitive to Intel either, I'm at least in good company in being confused (https://research.swtch.com/hwmm#path_to_x86-tso)
I actually knew about fetch_max from "implementing" the corresponding instruction (risc-v amomax), but I haven't done any of the fun parts yet since my soft-CPU still only has a single core.
[0]: https://marabos.nl/atomics/
In most cases it's even better to just store a maximum per thread separately and loop over all threads once to compute the current maximum if you really need it.
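A rough sketch of that per-thread alternative, assuming each thread knows its own slot index (the type and field names are made up for illustration): every thread only ever writes its own padded slot, and a reader folds the slots together when it actually needs the global maximum.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// One slot per thread, padded so slots land on separate cache lines and
/// don't false-share (64 bytes is a common, but assumed, line size).
#[repr(align(64))]
struct PaddedMax(AtomicU32);

struct PerThreadMax {
    slots: Vec<PaddedMax>,
}

impl PerThreadMax {
    fn new(threads: usize) -> Self {
        Self {
            slots: (0..threads).map(|_| PaddedMax(AtomicU32::new(0))).collect(),
        }
    }

    /// Called only by the owning thread, so writes never contend.
    fn record(&self, thread_id: usize, sample: u32) {
        self.slots[thread_id].0.fetch_max(sample, Ordering::Relaxed);
    }

    /// Fold all slots once, only when the global maximum is really needed.
    fn current_max(&self) -> u32 {
        self.slots
            .iter()
            .map(|slot| slot.0.load(Ordering::Relaxed))
            .max()
            .unwrap_or(0)
    }
}
```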
RTFA
https://doi.org/10.1145/2486159.2486189 https://jshun.csail.mit.edu/contention.pdf
C++26 (work-in-progress) does have std::atomic<T>::fetch_max. Not implemented in any toolchains yet, though.
https://en.cppreference.com/w/cpp/atomic/atomic/fetch_max
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p04...
> PS: After conducting this journey I learned that C++26 adds fetch_max too!
Rust fetch_update uses the lowest common denominator, CAS, regardless of platform: https://godbolt.org/z/ncssGnsfx (see the call __aarch64_cas8_acq_rel). In hot loops this can mean double-digit perf loss.
It could be implemented with a CAS fallback of course, but it seems a performance trap.
You could add the logic to the compiler to detect which specific code sequences are LL/SC safe, but at that point just providing built-ins for the most common operations is simpler.
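For reference, a small sketch of the two Rust APIs being contrasted, assuming a 64-bit counter to match the cas8 helper mentioned above: both compute the same result, but `fetch_update` is specified as a compare-exchange loop, so that is what the compiler is handed, whereas `fetch_max` states the intent directly and lets the backend pick a dedicated max instruction where the target offers one.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Generic route: fetch_update runs the closure in a compare_exchange_weak
/// loop, so the compiler sees a CAS loop regardless of the target.
fn max_via_fetch_update(counter: &AtomicU64, sample: u64) -> u64 {
    counter
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            Some(current.max(sample))
        })
        .expect("closure never returns None, so fetch_update cannot fail")
}

/// Dedicated route: fetch_max expresses the operation itself, which lets the
/// backend use a single atomic-max instruction where one exists.
fn max_via_fetch_max(counter: &AtomicU64, sample: u64) -> u64 {
    counter.fetch_max(sample, Ordering::AcqRel)
}
```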
2 more comments available on Hacker News