The Death of Thread Per Core
Posted 2 months ago · Active 2 months ago
Source: buttondown.com · Tech story · High profile
Key topics
Concurrency
Multithreading
Performance Optimization
The article discusses the shift away from the traditional 'thread per core' model, and the discussion revolves around the trade-offs and complexities of different concurrency approaches.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 7h after posting
Peak period: 34 comments (24-36h window)
Average per period: 9.7 comments
Comment distribution: 68 data points (based on 68 loaded comments)
Key moments
01. Story posted: Oct 20, 2025 at 5:19 PM EDT (2 months ago)
02. First comment: Oct 21, 2025 at 12:11 AM EDT (7h after posting)
03. Peak activity: 34 comments in the 24-36h window (hottest stretch of the conversation)
04. Latest activity: Oct 27, 2025 at 8:06 AM EDT (2 months ago)
HN story ID: 45649510
That being said, there are some things that are generally true for the long term: use a pinned thread per core, maximize locality (of data and code, wherever relevant), use asynchronous programming if performance is necessary. To incorporate the OP, give control where it's due to each entity (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared-state, here) or who controls the data are specific to the circumstances.
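To make the "pinned thread per core, shared-nothing" shape concrete, here is a minimal Rust sketch. It assumes the third-party `core_affinity` crate (not part of std), and the per-core workload is just a placeholder:

```rust
// Minimal sketch: one pinned worker thread per core, each owning its own
// state (shared-nothing). Assumes core_affinity = "0.8" (or similar) in
// Cargo.toml; the per-core work is a placeholder loop.
use std::thread;

fn main() {
    // Enumerate the cores the OS exposes to this process.
    let core_ids = core_affinity::get_core_ids().expect("no cores found");

    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|core_id| {
            thread::spawn(move || {
                // Pin this worker to one core so its caches stay warm and
                // the OS scheduler cannot migrate it.
                core_affinity::set_for_current(core_id);

                // Per-core state lives here; nothing is shared across cores.
                let mut local_counter: u64 = 0;
                for _ in 0..1_000_000u64 {
                    local_counter = local_counter.wrapping_add(1);
                }
                local_counter
            })
        })
        .collect();

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total work items: {total}");
}
```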
This was on a wide variety of Intel, AMD, NUMA, and ARM processors with different architectures, OSes, and memory configurations.
Part of the reason is hyper-threading (or Threadripper-type architectures), but even locking to core groups wasn't usually faster.
This was even more so the case when you had competing workloads stealing cores from the OS scheduler.
[0] https://news.ycombinator.com/item?id=45651183
A really interesting niche is all of the performance considerations around the design and use of VPP (Vector Packet Processing) in networking. It's just one niche, but it gives a good idea of how "changing the way the computation works" and "changing the locality and pinning" can come together at the same time. I forget the username, but the person behind VPP is actually on HN often, and a pretty cool guy to chat with.
Or, as vacuity put it, "there are no hard rules; use principles flexibly".
The memory-bandwidth-bound case is where thread-per-core tends to shine. It was the problem in HPC that thread-per-core was invented to solve, and it empirically had significant performance benefits. Today we use it in high-scale databases and other I/O-intensive infrastructure when performance and scalability are paramount.
That said, it is an architecture that does not degrade gracefully. I've seen more thread-per-core implementations in the wild that were broken by design than ones that were implemented correctly. It requires a commitment to rigor and thoroughness in the architecture that most software devs are not used to.
You see this most obviously (visually) in places like game engines. In Unity, the difference between non-Burst and Burst-compiled code is very extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of CPU time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be run on the fleet has a lot of overhead. It has to be worth that one-time 100x latency cost both ways.
The GPU is the ultimate example of this. There are some workloads that benefit dramatically from the incredible parallelism. Others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.
The paper quotes 76,800 bits per template (less compressed), and with 64-bit words that's about 1,200 64-bit bitwise ops per comparison. At 4.5 GHz that's 4.5B ops per second / 1,200 ops per comparison, which is ~3.75 million recognitions per second. Give or take some overhead, it's definitely possible.
[0] https://www.christoph-busch.de/files/Gomez-FaceBloomFilter-I...
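For concreteness, here is a rough Rust sketch of that arithmetic: a 76,800-bit template is 1,200 64-bit words, and one comparison is a XOR plus popcount per word. The template size comes from the paper quoted above; the function names, threshold, and probe/gallery data are illustrative:

```rust
// Assumes the 76,800-bit template size quoted above; everything else
// (names, sample data) is illustrative.
const TEMPLATE_WORDS: usize = 76_800 / 64; // = 1,200 u64 words

/// Hamming distance between two packed bit templates: one XOR + popcount
/// per 64-bit word, ~1,200 word operations per comparison.
fn hamming(a: &[u64; TEMPLATE_WORDS], b: &[u64; TEMPLATE_WORDS]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let probe = [0u64; TEMPLATE_WORDS];
    let gallery_entry = [u64::MAX; TEMPLATE_WORDS];
    // Roughly 2,400 cheap ALU ops per call, which is where the
    // ~3.75M comparisons/sec per core estimate above comes from.
    println!("distance = {}", hamming(&probe, &gallery_entry));
}
```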
Cache locality is a thing. Like in ray tracing and the old Confucian adage that says "Primary rays cache, secondary trash".
Modern CPUs don't quite work this way. Many instructions can be retired per clock cycle.
> Second of all L1 cache is at most in the hundreds of kilobytes, so the faces aren't in L1 but must be retrieved from elsewhere...??
Yea, from L2 cache. It's caches all the way down. That's how we make it go really fast. The prefetcher can make this look like magic if the access patterns are predictable (linear).
As others have mentioned, they're probably doing some kind of embedding-like search primarily, and then 500 cycles per face makes more sense, but it's not a full comparison.
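A toy Rust sketch of the prefetcher point above: the same summation over the same data, once in order and once through a shuffled index array. Everything here is illustrative, and the exact gap depends on the machine:

```rust
// Illustrative only: linear (prefetcher-friendly) vs. shuffled
// (prefetcher-hostile) traversal of the same array.
use std::time::Instant;

fn main() {
    let n: usize = 1 << 24; // ~16M elements, far larger than L2
    let data: Vec<u64> = (0..n as u64).collect();

    // Linear walk: the hardware prefetcher can stream this in ahead of use.
    let t = Instant::now();
    let linear_sum: u64 = data.iter().sum();
    println!("linear: sum={linear_sum} in {:?}", t.elapsed());

    // Shuffled walk: identical work, but most loads are likely cache misses.
    let mut idx: Vec<usize> = (0..n).collect();
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // cheap deterministic "rng"
    for i in (1..n).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        idx.swap(i, (state % (i as u64 + 1)) as usize);
    }
    let t = Instant::now();
    let random_sum: u64 = idx.iter().map(|&i| data[i]).sum();
    println!("random: sum={random_sum} in {:?}", t.elapsed());
}
```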
It doesn't help that the GPU beats the CPU in raw compute if a plain SIMD approach wins on total execution time.
Pre-work time + pack up time + send time + unpack time + work time + pack up time + send time + unpack time + post-work time.
All remote work has these properties, even something 'simple' like a remote REST call. If 'remote work time' plus all that other stuff is less than your local call, then it is worth sending the work remote, time-wise. If not, the local CPU wins.
It just happens that, in many cases right now, the GPU is 'winning' that race.
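A back-of-the-envelope Rust sketch of that cost model, with made-up numbers: offloading only wins when the remote work time plus all the packing and transfer overhead beats doing the work locally.

```rust
/// Costs (in milliseconds, purely illustrative) for shipping work to a
/// remote execution resource such as a GPU or another host.
struct OffloadCost {
    pre_work: f64,  // local setup before anything is sent
    pack: f64,      // serialize / pack up
    send: f64,      // transfer one way
    unpack: f64,    // deserialize on the other side
    work: f64,      // the actual remote computation
    post_work: f64, // local cleanup after results come back
}

impl OffloadCost {
    /// Mirrors the breakdown above: pack/send/unpack is paid in both
    /// directions, once for inputs and once for results.
    fn total(&self) -> f64 {
        self.pre_work
            + 2.0 * (self.pack + self.send + self.unpack)
            + self.work
            + self.post_work
    }
}

fn main() {
    let local_work_ms = 5.0;
    let gpu = OffloadCost {
        pre_work: 0.1,
        pack: 0.2,
        send: 1.0,
        unpack: 0.2,
        work: 0.5,
        post_work: 0.1,
    };

    if gpu.total() < local_work_ms {
        println!("offload wins: {:.1} ms vs {:.1} ms local", gpu.total(), local_work_ms);
    } else {
        println!("stay local: {:.1} ms vs {:.1} ms offloaded", local_work_ms, gpu.total());
    }
}
```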
I mean... that's kind of a pathological case, no?
Say you're making a four-course meal. In the abstract, each course is independent of the other three, but internally the steps of its preparation have exactly this kind of dependence, where step 3 is scheduled after step 2 because doing those steps in the other order will ruin the food.
If you ever want to make just one of those courses -- maybe you're going to a potluck -- now you've got an almost fully sequential workflow.
(And in practice, the full four-course meal is much more sequential than it appears in the abstract, because many of the steps of each course must contend for scarce resources, such as the stove, with steps of other courses.)
That's not to say it's perfect. The problem is in anticipating how much workload is about to arrive and deciding how many worker threads to spawn. If you overestimate and have too many worker threads running, you will get wasteful stealing; if you're overly conservative and slow to respond to growing workload (to avoid over-stealing), you'll wait for threads to spawn and hurt your latencies just as the workload begins to spike.
In other words, you could probably easily do 10m op/s per core on a thread per core design but struggle to get 1m op/s on a work stealing design. And the work stealing will be total throughput for the machine whereas the 10m op/s design will generally continue scaling with the number of CPUs.
Not always. For differential equations with large enough matrices, the independent work each core can do outweighs the communication overhead of core-to-core latency.
if your workload is majority cpu-bound then this is true, sometimes, and at best
most workloads are io (i.e. syscall) bound, and io/syscall overhead is >> cross-core communication overhead
* I prefer the term "work-sharding" over "thread-per-core", because work-stealing architectures usually also use one thread per core, so it tends to confuse people.
You can adjust some settings for how schedulers work with respect to balancing load, but AFAIK work stealing cannot be disabled... when a scheduler has no runnable processes, it will look at the run queue of another scheduler and steal a runnable process if any are available (in priority order).
It does default to one 'CPU scheduler' per CPU thread, plus some I/O schedulers and maybe some dirty schedulers.
The architectures from circa 2010 were a bit rough. While the article has some validity for architectures from 10+ years ago, the state-of-the-art for thread-per-core today looks nothing like those architectures and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.
Writing a series of articles about the history and theory of thread-per-core software architecture has been on my eternal TODO list. HPC in particular is famously an area of software that does a lot of interesting research but rarely publishes, in part due to its historical national security ties.
The original thought exercise was “what if we treated every core like a node in a supercomputing cluster” because classical multithreading was scaling poorly on early multi-core systems once the core count was 8+. The difference is that some things are much cheaper to move between cores than an HPC cluster and so you adapt the architecture to leverage the things that are cheap that you would never do on a cluster while still keeping the abstraction of a cluster.
As an example, while moving work across cores is relatively expensive (e.g. work stealing), moving data across cores is relatively cheap and low-contention. The design problem then becomes how to make moving data between cores maximally cheap, especially given modern hardware. It turns out that all of these things have elegant solutions in most cases.
There isn’t a one-size-fits-all architecture but you can arrive at architectures that have broad applicability. They just don’t look like the architectures you learn at university.
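As a rough illustration of the "moving data between cores is relatively cheap" idea above (not the commenter's actual architecture), here is a Rust sketch where each pinned core owns its state and other cores hand it messages. It uses `std::sync::mpsc` for brevity where a real thread-per-core runtime would use lock-free SPSC rings, and the `core_affinity` crate is again an assumption:

```rust
// Sketch of "move data, not work": each core owns a pinned worker and an
// inbox; other cores hand off messages instead of sharing mutable state.
use std::sync::mpsc;
use std::thread;

enum Msg {
    Work(u64),
    Shutdown,
}

fn main() {
    let core_ids = core_affinity::get_core_ids().expect("no cores found");

    // One inbox per core; senders are cheap to clone and hand around.
    let mut senders = Vec::new();
    let mut handles = Vec::new();
    for core_id in core_ids {
        let (tx, rx) = mpsc::channel::<Msg>();
        senders.push(tx);
        handles.push(thread::spawn(move || {
            core_affinity::set_for_current(core_id);
            let mut acc = 0u64;
            while let Ok(msg) = rx.recv() {
                match msg {
                    Msg::Work(x) => acc += x, // all mutation stays core-local
                    Msg::Shutdown => break,
                }
            }
            acc
        }));
    }

    // Route each item to a fixed owner core (trivial modulo here), so the
    // same key always lands on the same core.
    for item in 0..1_000u64 {
        let owner = (item as usize) % senders.len();
        senders[owner].send(Msg::Work(item)).unwrap();
    }
    for tx in &senders {
        tx.send(Msg::Shutdown).unwrap();
    }

    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("total = {total}");
}
```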
* Use a multi-threaded tokio runtime that's allocated a thread per core
* Focus on application development, so that tasks are well scoped / skewed and don't _need_ stealing in the typical case
* Over time, the smart people working on Tokio will apply research to minimize the cost of work-stealing that's not actually needed
* At the limit, where long-lived tasks can be distributed across cores and all cores are busy, the performance will be near-optimal as compared with a true thread-per-core model
What's your hot take? Are there fundamental optimizations to a modern thread-per-core architecture which seem _impossible_ to capture in a work-stealing architecture like Tokio's?
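For reference, a minimal sketch of the tokio setup described in the list above, assuming the tokio crate with its multi-threaded runtime enabled. Note this is still a work-stealing runtime sized to one worker per core, not a true shared-nothing thread-per-core design:

```rust
// Assumes tokio = { version = "1", features = ["full"] } in Cargo.toml.
fn main() {
    // Size the runtime to the core count the OS reports.
    let workers = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers) // one worker thread per core
        .enable_all()            // timer + I/O drivers
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        // Spawn coarse-grained, well-scoped tasks so stealing is rarely needed.
        let handles: Vec<_> = (0..workers)
            .map(|i| tokio::spawn(async move { i * 2 }))
            .collect();
        for h in handles {
            let _ = h.await;
        }
    });
}
```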
At some point, people started using the thread-per-core style while delegating scheduling to a third-party runtime, which almost completely defeats the purpose. If you let tokio et al. do that for you, you are leaving a lot of performance and scale on the table. Scheduling is an NP-hard problem; the point of solving it at compile time is that it is computationally intractable for generic code to construct a good schedule at runtime except in trivial cases. We need schedulers to consistently make excellent decisions extremely efficiently. I think this point is often lost in discussions of thread-per-core. In the old days we didn't have runtimes; it was just assumed you would be designing an exotic scheduler. The lack of discussion around this may have led people to believe it wasn't a critical aspect.
The reality is that designing excellent workload-optimized I/O and execution schedulers is an esoteric, high-skill endeavor. It requires enormous amounts of patience and craft, and it doesn't lend itself to quick-and-dirty prototypes. If you aren't willing to spend months designing the many touch points for the scheduler throughout your software, the algorithms for how events across those touch points interact, and analyzing the scheduler at a systems level for equilibria and boundary conditions, then thread-per-core might not be worth the effort.
That said, it isn’t rocket science to design a reasonable schedule for software that is e.g. just taking data off the wire and doing something with it. Most systems are not nearly as complex as e.g. a full-featured database kernel.
Your past has already been super interesting, so if you ever do get around to writing this, I’d be very excited to read it!
Libraries like .NET's Task Parallel Library or Intel's Threading Building Blocks pretty much cemented these work-stealing task architectures. It's not that they didn't work well enough, but Intel Core came along, single-threaded perf scaling was possible again, and these libraries became less of a focus.
It seems multi-core interest is back.
"At that time, ensuring maximum CPU utilization was not so important, since you’d typically be bound by other things, but things like disk speed has improved dramatically in the last 10 years while CPU speeds have not."
I have absolutely no idea why anyone would think breaking the thread per core model is better and I seriously question the knowledge of anyone proposing another model without some VERY good explanation. The GP isn't even close to this in any way.
Per-core threads and not much else are fairly required for NYSE, trading, OMSes, and I bet things like switches. A web browser might be their polar opposite.
Never thought of it that way, but it’s indeed true — a new task does get enqueued in that case. Thanks for the insight!
Java, .NET, Delphi, and C++ coroutines all provide mechanisms to supply our own scheduler, which can then be used to say what goes where.
Maybe cool languages should look more into the ideas from these not-so-cool, "our parents' ecosystems" kinds of languages. There are some interesting ideas there.