Ditch Your Mutex, You Deserve Better
Key topics
The article discusses the limitations of mutexes in concurrent programming and introduces Software Transactional Memory (STM) as a better alternative, sparking a discussion on the merits and challenges of STM and other concurrency approaches.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 2h after posting
- Peak period: 147 comments (Day 7)
- Avg / period: 40 comments
Based on 160 loaded comments
Key moments
- Story posted: Nov 11, 2025 at 5:11 PM EST (about 2 months ago)
- First comment: Nov 11, 2025 at 7:41 PM EST (2h after posting)
- Peak activity: 147 comments in Day 7, the hottest window of the conversation
- Latest activity: Nov 24, 2025 at 10:11 PM EST (about 1 month ago)
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
https://arrow-kt.io/learn/coroutines/stm/
I remember, 15 or 20 years ago (has it been that long?), when the Haskell people like dons were banging on about: 1) Moore's law being dead, 2) future CPUs will have tons of cores, and 3) good luck wrangling them in your stone-age language! Check out the cool stuff we've got going on over in Haskell!
I would also not blame Java for the worst of OO; it would have happened without it. There were so many languages pretending to that throne. Java got there first because of the aforementioned VM advantage.
I would say Java also had a materially strong impact on the desire for server and client hardware agnosticism, and even database agnosticism.
Some of this is positive: Java demonstrates that it's better if you don't have to worry about what brand your server is, and JDBC arguably perfected ODBC.
Some of it is negative: a lot of the move to richer client experiences in the browser has to do with trying to remove client-side Java as a dependency, because it failed. It's not the only bridged dependency we removed, of course: Flash is equally important as a negative.
Really? Were there other major contenders that went as far as the public class MyApp { public static void main( nonsense?
i.e. C++ primitive types are defined to be objects but do not fit into a traditional object-oriented definition of “object”.
What I’m saying is that the discrepancy between primitives in C++ and Java is entirely one of definition. Java didn’t actually change this. Java just admitted that “objects” that don’t behave like objects aren’t.
In any case, Alan Kay has constantly made clear that object-oriented programming is about messages, which you can do in C++ in a number of ways. (I'm not sure exactly what Alan Kay means here, but it appears to exclude plain function calls while allowing Qt signals/slots.)
Nothing to do with the objects of OOP.
There are other cases though. E.g. in C++ you could always have function pointers which were not objects and functions which were not methods. In Java you had to use instances of anonymous classes in place of function pointers, and all of your functions had to be (possibly static) methods.
Perhaps you would prefer a multi-paradigm programming language?
http://mozart2.org/mozart-v1/doc-1.4.0/tutorial/index.html
Nothing came out of it unfortunately.
To me, the appealing things about STMs are the possibility of separating concerns of worst-case execution time and error handling, which are normally pervasive concerns that defeat modularity, from the majority of the system's code. I know this is not the mainstream view, which is mostly about manycore performance.
STM shines in a concurrent setting, where you know you'll have multiple threads accessing your system and you want to keep everything correct. And nothing else comes close.
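As a concrete illustration of that claim, here is a minimal sketch in GHC Haskell (using the stm package, not the Kotlin/Arrow API the article covers; `moveOne` is a made-up name): two STM queues can be combined into a single atomic operation, something two independently locked queues cannot offer.

    import Control.Concurrent.STM

    -- Atomically move one element between two STM queues. No other thread can
    -- ever observe the element as missing from both queues or present in both.
    moveOne :: TQueue a -> TQueue a -> STM ()
    moveOne src dst = do
      x <- readTQueue src      -- retries the whole transaction if src is empty
      writeTQueue dst x

    main :: IO ()
    main = do
      a <- newTQueueIO
      b <- newTQueueIO
      atomically (writeTQueue a "hello")
      atomically (moveOne a b)
      atomically (readTQueue b) >>= putStrLn   -- prints "hello"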
Also, with a queue, you've moved the shared state elsewhere, namely, into the queue.
There are lots of standard shared queues you can pick from that have been fully regression tested. That's almost always better than mixing concurrency in with your business logic.
Yes! I would even go so far as to have the type system separate the mutable from the non-mutable for this reason!
> Something actor like fits better and C# or Go channels works wonders.
Or even STM channels/queues:
This isn't an attack on the (very well written) article though. Just wanted to add my two cents.
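Since the STM channel/queue suggestion above is left as a pointer, here's a hedged sketch of the idea in GHC Haskell with the stm package's TQueue (thread count and values are purely illustrative): many producers enqueue, and a single consumer thread owns the actual work.

    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.STM
    import Control.Monad (forever, forM_)

    main :: IO ()
    main = do
      q <- newTQueueIO :: IO (TQueue Int)
      -- One consumer owns the business logic; producers only ever enqueue.
      _ <- forkIO . forever $ do
             n <- atomically (readTQueue q)   -- blocks (retries) until an item arrives
             print n
      forM_ [1 .. 5] $ \n -> atomically (writeTQueue q n)
      threadDelay 100000                      -- crude pause so the consumer can drain (demo only)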
If it's asynchronous, it's not a mutex anymore, or it's just used to synchronously setup some other asynchronous mechanism.
CPS shouldn't be able to deadlock for example?
It’s what synchronized classes wish they had been, maybe.
It's even nicer than that: you can actually access data without locking the mutex, because while you hold a mutable borrow to the mutex, Rust statically guarantees that no one else can acquire locks on the mutex.
https://doc.rust-lang.org/std/sync/struct.Mutex.html#method....
Mutex is for where you have that ability, and ensures at runtime that accesses get serialized.
This means no atomic operations or syscalls or what have you.
Are you sure? Isn't having data be local to a thread the most common situation, with data sharing being the exception?
>Or, the data is linked globally in a data structure and the only way to access it safely is by knowing exactly what you're doing and what the other threads are doing.
That's exactly what the borrow checker does. It tracks how many mutable references you have to your data structure at compile time. This means you can be sure what is local and what is shared.
Meanwhile without the borrow checker you always have to assume there is a remote probability that your mental model is wrong and that everything goes wrong anyways. That's mentally exhausting. If something goes wrong, it is better to only have to check the places where you know things can go wrong, rather than the entire code base.
I keep very little state on the stack -- mostly implicit stuff like mutex lock / mutex unlock. I've found this to be the right pattern because it enables composability. I'm now free to factor functionality into separate functions.
So yes, I have lots of locals but I share basically none of them with other threads. Also, I avoid writing any code that blocks on other threads (other than maybe locking a mutex), so there's another reason why I would not intentionally share a local with another thread. Anything that will be shared with another thread should be allocated on the heap just for the reason that we want to avoid blocking on other threads.
In that sense, the borrow checker is a tool that would allow me to write code more easily that I never wanted written in the first place.
You want the object to present its valid operations, but the object could also be constructed in single or multithreaded situations.
So you'd offer two APIs; one which requires a shared reference, and internally locks, and a second which requires a mutable reference, but does no locking.
Internally the shared reference API would just lock the required mutexes, then forward to the mutable reference API.
They're liable to deadlocks/livelocks, and sometimes not only with other explicitly Mutex-like things (it might happen that some library you use has a lock hidden deep inside).
They're also often backed by OS primitives (with big overheads) and with inconsistent behaviors between platforms (spinlocks, waiting, etc.). We ran into an issue with .NET where its Mutex didn't wake the blocked thread on Linux as quickly as on Windows, meaning we needed about 100x the time to serve a request because the thread was sleeping too long.
There are questions like when to use spinlocks and when to go to wait sleep, which unfortunately the developer has to answer.
Not assigning blame here, just pointing out that threading primitives and behaviors don't translate perfectly between OSes.
Multi-threading is hard; other solutions like queues suffer from issues like backpressure.
That's why I'm skeptical about Rust's fearless concurrency promise - none of these bugs are solved by just figuring out data races - which are a huge issue, but not the only one.
1) the mutex byte/word is more likely to be in the same cache line as the data you want to access anyway, and
2) different threads are more likely to write to mutex bytes/words in different cache lines, whereas in coarse-grained locking, different threads will fight for exclusive access over the cache line containing that one, global mutex.
And allocating the int contiguously might actually be the right solution, if the cost of sporadic false sharing is less than the cost of wasting memory.
There's no silver bullet.
It seems it's optimized for scenarios with high-performance, compute-heavy code and short critical sections.
These assumptions may let it win benchmarks, but they don't cover all users' use cases. To illustrate why this is bad, imagine you have a mutex-protected resource that becomes available after 10 µs on average. This lock spins 10 times checking whether the resource has become available (likely <1 µs), then yields the thread. The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should). But even best-case, you're left waiting 10 ms, which is a typical scheduler tick.
In contrast, OS-based solutions are expensive but not that expensive; let's say they add 1 µs to the wait. Then you would wait 11 µs for the resource.
A method call taking 10 ms versus one taking roughly 15 µs is a slowdown of several hundred times, which can potentially kill your performance.
You as the user of the library are implicitly buying into these assumptions which may not hold for your case.
There's also nothing in Rust that protects you from deadlocks with 100% certainty. You can fuzz them out, and use helpers, but you can do that in any language.
So you do need to be mindful of how your mutex works, if you want to build a system as good as the one it replaces.
How would you handle the archetypical example of a money transfer between two bank accounts, in which 100 units of money need to be subtracted from one account and atomically added to another account, after checking that the first account contains at least 100 units?
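For reference, the canonical STM answer to this question, sketched in GHC Haskell with the stm package (the article's own transfer()/withdraw(), mentioned downthread, do the Kotlin/Arrow version of this; names here are just illustrative):

    import Control.Concurrent.STM

    type Account = TVar Int

    -- Atomically move `amount` from one account to another. `check` retries the
    -- whole transaction until the source holds at least `amount`, so no other
    -- thread ever sees the money missing from one account and not yet in the other.
    transfer :: Account -> Account -> Int -> STM ()
    transfer from to amount = do
      balance <- readTVar from
      check (balance >= amount)
      writeTVar from (balance - amount)
      modifyTVar' to (+ amount)

    main :: IO ()
    main = do
      a <- newTVarIO 150
      b <- newTVarIO 0
      atomically (transfer a b 100)
      (,) <$> readTVarIO a <*> readTVarIO b >>= print   -- (50,100)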
and the way to do it is to either have some kind of single point of control (designated actor or single-threaded executor) or mark the data (i.e. use some concurrency-control primitive, either wrapping the data or in some dedicated place where the executors check [like the JVM's safepoints])
using consistent hashing these hypothetical accounts could be allocated to actors and then each transaction is managed by the actor of the source (ie. where the money is sent from, where the check needs to happen), with their own durable WAL, and periodically these are aggregated
(of course then the locking is hidden in the maintenance of the hashring as eating philosophers are added/removed)
Distributing accounts among different actors, without two-phase commit or its moral equivalent, enables check kiting.
In your case, you could have a channel where the Receiver is the only part of the code that transfers anything. It'd receive a message Transfer { from: Account, to: Account, amount: Amount } and do the required work. Any other threads would therefore only have copies of the Sender handle. Concurrent sends would be serialized through the queue's buffering.
I'm not suggesting this is an ideal way of doing it.
No single concurrency primitive covers all use cases. I was addressing your misconceptions about mutex performance and overhead, not whether mutexes are the best solution to your particular problem.
> […] it has no way of getting notified when the mutex gets unblocked […] The OS (lets assume Linux) wont wake it up the thread until the next scheduler tick, and its under no obligation to do so even then (and has no idea it should).
You've misunderstood the parking_lot implementation. When thread B tries to lock a mutex that's currently locked by thread A, then, after spinning a few cycles, thread B "parks" itself, i.e., it asks the kernel to remove it from the runnable task queue. On Linux, this is done using the futex syscall. When thread A unlocks the mutex, it detects that another thread is waiting on that mutex. Thread A takes one thread from the queue of waiting threads and "unparks" it, i.e., it asks the kernel to move it into the runnable task queue. The kernel is notified immediately and, if there's a free CPU core available, will tend to dispatch the thread to that core. On a non-realtime OS, there's no guarantee how long it takes for an unblocked thread to be scheduled again, but that's the case for all* concurrency primitives.
> they were designed to turn single-threaded code into multi-threaded
not really
> usually a single point of contention quickly becomes a problem.
not generally, no
> They're liable to deadlocks/livelocks,
deadlocks/livelocks are orthogonal to any specific primitive
> They're also often backed by OS primitives (with big overheads) and with inconsistent behaviors between platforms (spinlocks, waiting, etc.).
the mutex as a primitive is orthogonal to any specific implementation...
etc. etc.
http://brinch-hansen.net/papers/1999b.pdf
I know OOP isn't cool any more, but the above is what OOP solves.
TFA's transfer() and withdraw() functions aren't compliant with double-entry accounting anyways, so you'd mark them private and only expose Transfer to callers. Let the class handle its own details.
The article was a lengthy explanation of why the problem occurs in non-STM settings. In OO settings.
What do you propose as an OO solution?
Not anything that's not already covered in any undergraduate CS course.
Most parallel programming courses are at Masters level and require specialization, but those are much more advanced (sorting networks, distributed computing on supercomputers and GPU, consensus algorithms, parallelization of linear solvers...)
There are undergraduate courses that cover simple things like multithreading patterns in mainstream languages, but it seems they are only available if you go to an education institution with both a practical mindset and a firm grip of the fundamentals, which is unfortunately quite rare, as most institutions tend to specialize in one or the other.
I already foresaw the problem (and it gets mentioned later) that if you have many small, frequent operations, they will prevent a big, long operation from ever completing, because they will keep changing the state and causing conflicts before the big one can finish. You can easily code yourself an app that will softlock forever.
The post doesn't offer a good solution (it talks about one that involves a tradeoff, but you don't need tradeoffs for this).
The way this gets fixed is to make the lock acquisition (or in this case, the priority of the merge?) preemptive (Wound-Wait).
All transactions have a global, ever-incrementing number attached to them, their ID, which we call "seniority". If you try to get a lock, and the lock is held by a transaction with lower seniority (i.e. a higher ID), you kill that transaction, take the lock, and once you're done the transaction you killed is allowed to retry.
In the meantime, if a transaction with lower seniority tries to get the lock, it gets blocked.
This ensures that your program will always finish.
In the case of "lots of frequent small transactions" + "one big, long report", the report will get killed a few times, until your report inevitably becomes the most senior transaction asking for this resource and is allowed to complete.
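A tiny sketch of just the Wound-Wait decision rule described above, in Haskell (names are hypothetical; the abort/retry machinery is omitted):

    -- Wound-Wait: a lower ID means older, i.e. more senior.
    data Action = WoundHolder  -- requester is more senior: abort the current holder, take the lock
                | Wait         -- requester is less senior: block until the lock is free
      deriving Show

    decide :: Int -> Int -> Action
    decide requesterId holderId
      | requesterId < holderId = WoundHolder
      | otherwise              = Wait

    main :: IO ()
    main = do
      print (decide 3 7)  -- WoundHolder: the older transaction preempts the younger one
      print (decide 7 3)  -- Wait: the younger transaction blocks behind the older one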
The problem with STM (or any other concurrency-control mechanism) is that it's not their job to solve these issues (usually they require conscious product design), so they are not going to be solved by "just use STM".
They could, but are typically configured not to.
With default settings they let other processes' commits cut into the middle of your read (see READ_COMMITTED). I was flabbergasted when I read this.
When you turn MSSQL up to a higher consistency level you can get it to grind to a halt and throw Deadlock exceptions on toy code. (Moving money between accounts will do it).
I used to advertise STM as like ACID-databases for shared-memory systems, but I guess it's even more consistent than that.
If you assign exclusive ownership of all accounting data to a single thread and use CSP to communicate transfers, all of these made up problems go away.
Is there any way for an external thread to ask (via CSP) for the state, think about the state, then write back the new state (via CSP)?
If so, you're back to race conditions - with the additional constraints of a master thread and CSP.
Everyone else has multiple threads, and should replace their locks with STM for ease and safety.
You've got safe single-thread and CSP, you should try STM to gain multithreading and get/set.
Also in practice the CSP node that is providing access control is effectively implementing shared memory (in an extremely inefficient way).
The fundamental problem is not shared memory, it is concurrent access control.
There's no silver bullet.
Here's the example from a Go STM package that's based on Haskell STM. It has gotchas that you won't encounter in Haskell though, due to the nature of these languages.
https://github.com/anacrolix/stm/blob/master/cmd/santa-examp...
For the equivalent Haskell, check out the link at the top of the file.
The "lock" instruction is what the article is telling you to ditch.
> If the programmer forgets to lock the mutex the system won't stop them from accessing the data anyways
If the programmer forgets to "lock"
> and even then there's no actual link between the data being locked and the lock itself
    lock (thing) {
        return thing.contents;  // shared, mutable array given out freely to the world
    }
'contents' has no notion that it has anything to do with this "lock"ing thing.
edit: it could theoretically livelock, but I believe most if not all STM implementations also do not guarantee forward progress.
Its normative force seems to be founded on claims about performance, but it would be very surprising if the performance cost of guaranteed forward progress or obstruction-freedom were too high for me to want to pay it, since what I'm most interested in is latency and fault-tolerance, not parallel speedups.
>"LP" is not obstruction-free but is wait-free
As far as I know, wait-free is a stronger guarantee than lock-free, and lock-free is stronger than obstruction-free, so every wait-free algorithm should also be obstruction-free. How can LP be wait-free but not obstruction-free?
In any case a wait-free algorithm can't live-lock by definition (progress and fairness of all threads is guaranteed), but the catch is that while the STM runtime itself might have this property, it doesn't necessarily translate to an algorithm implemented in the runtime (which makes sense, you should be able to implement a lock with an STM).
So, yes, the paper is interesting, but probably not relevant for this discussion and I shouldn't have brought it up.
Now the question again remains: do concrete STM implementations actually provide the guarantee you mentioned earlier, i.e. does a transaction aborting guarantee that another succeeds? I would think it doesn't, as I think it would be very costly to implement: both transactions might end up aborting when detecting a conflict.
Maybe what concrete runtimes actually guarantee is that there is an upper bound on the spurious aborts and restarts, since worst case you can fall back on a single global lock for serialization when the upper bound is reached.
On the other hand, there are many other things that could be done to avoid wasting all the extra power gained over the years which don't even require any parallelism boost.
The article says a couple of things about STMs that are not true of STMs in general, just true of the Haskell STM the author is familiar with, like a small Brazilian child confidently telling you that almost everyone speaks Portuguese.
One of these is, "STM is an optimistic concurrency system." The possibility of making your concurrency 100% lock-free is one of the most appealing things about STMs, and I think it could be a key to solving the UI latency problem, which just keeps getting worse and worse. Actors and CSP don't normally help here; an Actor is just as "blocking" as a lock. But you can implement an STM with partly pessimistic concurrency, or purely pessimistic, and it might even be a good idea.
Another is, "One last benefit of STM which we haven't yet discussed is that it supports intelligent transaction retries based on conditions of the synchronized data itself." This was an innovation introduced by "Composable Memory Transactions", and many STMs do not support it, including Keir Fraser's awesomely fast version. I am even less certain that it is the correct tradeoff for all uses than I am about purely optimistic synchronization.
But all of this is why I'm rereading Gray and Reuter's Transaction Processing right now after 25 years. With the benefit of 35 years of hindsight, it's a frustrating mix of inspiring long-range vision and myopic boneheadedness. But it shares a lot of hard-won wisdom about problems like long transactions that pop up in a new guise in STMs.
Are you planning for this comment to be relevant for long enough that the leading zeros will be helpful for disambiguation?
Remember, Brawndo: It's What Plants Crave.
From what I understand of the C++ memory model (shared by C and Rust), this is not the definition of data race – a data race occurs when two or more threads access memory concurrently where at least one access is a write and the accesses are unsynchronized. However, synchronized accesses may not have a deterministic ordering, in which case a race condition occurs.
(Confusing as it may be, I believe this is standard terminology.)
Not every language is like that. I would not use a language that allows this for parallel programming. You can use other approaches, but how can you guarantee that everyone who will ever edit the code base won't use an unenforced mutex?
https://en.wikipedia.org/wiki/MESI_protocol
So it has most of the hardware to support transactional memory, only it's not exposed to the user.
Intel had their own version, but it turned out to be buggy, so they disabled it and never put it in any subsequent CPU, and that was that.
40 more comments available on Hacker News