Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union
Posted 3 months ago · Active 3 months ago
Source: ashvardanian.com
Key topics: Parallel Computing, C++, Rust
The article compares task parallelism libraries in C++ and Rust, including Taskflow, Rayon, and Fork Union, sparking discussion on performance, design, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 6m after posting
Peak period: 12 comments in 0-3h · Average per period: 4.4
Based on 31 loaded comments
Key moments
1. Story posted: Sep 28, 2025 at 4:53 AM EDT (3 months ago)
2. First comment: Sep 28, 2025 at 4:59 AM EDT (6m after posting)
3. Peak activity: 12 comments in the 0-3h window, the hottest period of the conversation
4. Latest activity: Sep 29, 2025 at 5:02 PM EDT (3 months ago)
ID: 45402820 · Type: story · Last synced: 11/20/2025, 5:39:21 PM
The threads “busy-wait” by running an infinite loop in a lower energy state on modern CPUs.
And yes, there are more details in the actual implementation in the repository itself. This section, for example, describes the atomic variables needed to control all of the logic: https://github.com/ashvardanian/fork_union?tab=readme-ov-fil...
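The repository has the full details; as a rough illustration of the idea (not Fork Union's actual code, all names here are hypothetical), a "low-energy" busy-wait typically spins on an atomic flag while issuing a CPU hint such as x86's `pause` between polls:

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h> // for _mm_pause
#endif

// Hypothetical sketch: hint the core into a lower-power state
// on each iteration of the spin loop.
inline void cpu_relax() noexcept {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause(); // x86 PAUSE: eases pipeline pressure and saves energy while spinning
#else
    std::this_thread::yield(); // portable fallback
#endif
}

// Spin until `ready` becomes true, relaxing the CPU between polls.
void busy_wait(std::atomic<bool> const &ready) {
    while (!ready.load(std::memory_order_acquire))
        cpu_relax();
}

// Tiny demo: a worker flips the flag; the caller busy-waits on it.
int busy_wait_demo() {
    std::atomic<bool> ready{false};
    std::thread worker([&] { ready.store(true, std::memory_order_release); });
    busy_wait(ready); // returns once the worker has published the flag
    worker.join();
    return ready.load() ? 1 : 0;
}
```

The key property is that no system call is made on the hot path; the waiting thread never leaves user space.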
Doesn't that still consume part of the process's time slices from the OS scheduler's point of view?
None of them involve allocation or system calls after initialization. Queues are pre-allocated, as they should be.
BTW, the for-case can be supported simply by setting a pool-wide/global boolean and using it to decide how to wait for a new task: during the parallel-for the boolean is true; otherwise, workers sleep on mutexes in the worst case, for energy savings.
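A minimal sketch of that suggestion, with hypothetical names: a pool-wide flag selects between busy-polling (lowest latency, used while a parallel-for is running) and sleeping on a condition variable (energy-friendly, used when the pool is idle):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch: a pool-wide flag decides how workers wait.
class wait_policy {
  public:
    std::atomic<bool> spin_mode{false}; // true during a parallel-for

    // Called by a worker that has no task yet.
    void wait_for_task(std::atomic<bool> const &task_ready) {
        if (spin_mode.load(std::memory_order_relaxed)) {
            // Hot path: busy-poll for the lowest wake-up latency.
            while (!task_ready.load(std::memory_order_acquire)) { /* spin */ }
        } else {
            // Cold path: sleep under a mutex; worst-case latency, best energy use.
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [&] { return task_ready.load(std::memory_order_acquire); });
        }
    }

    // Called by the producer after publishing a task.
    void notify() {
        std::lock_guard<std::mutex> lock(mutex_);
        cv_.notify_all();
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Demo: both paths return promptly once a task is already published.
int wait_policy_demo() {
    wait_policy policy;
    std::atomic<bool> ready{true}; // task already published

    policy.spin_mode.store(true);  // parallel-for in flight: spin
    policy.wait_for_task(ready);

    policy.spin_mode.store(false); // pool idle: sleep path (predicate already true)
    policy.wait_for_task(ready);
    return 1;
}
```

Flipping the flag costs one relaxed atomic store per mode change, so the hot path stays free of locks.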
That said, closed-source solutions for local use aren’t quite the same as an open-source project with wider validation. With more third-party usage, reviews, and edge cases, you often discover issues you’d never hit in-house. Some of the most valuable improvements I’ve seen have come from external bug reports or occasional PRs from people using the code in very different environments.
Casey Muratori, while a great programmer in his own right, often disregards the use cases, which leads to apples-to-oranges comparisons. E.g., why is this ASCII editor from 40 years ago much faster than this Unicode text editor (with the full zero-width-joiner emoji suite)?
The parts I'm specifically interested in:
1. What the 300-line pool and allocator look like.
2. What this means: "BTW, the for-case can be supported simply by setting a pool-wide/global boolean and using it to decide how to wait for a new task (during the parallel-for the boolean is true; otherwise sleep on mutexes, in the worst case, for energy savings)"
Thank you!
Arena allocation on Windows, for example, is basically calling VirtualAlloc for a couple of gigabytes on a 64-bit system (you have terabytes of virtual memory available) and then slicing that region into sub-ranges passed as parameters to threads, grouped hierarchically: one range per CPU, then one per group of cores that share a cache, then one per core for its own cache-local memory. Lock the software threads to their hardware threads and you're done. Within each arena, use bump and maybe pool allocators for most things. Very basic and very little code, and much higher performance than most software out there. It's also why a lot of diehard C programmers find Rust lifetime management overengineered and boring, by the way: you don't have as many lifetimes as in modern C++ code, for example.
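A portable sketch of that scheme, under stated assumptions: a plain heap allocation stands in for the Windows-specific VirtualAlloc reservation, and the `bump_arena`/`arena_pool` names are hypothetical. One big block is reserved up front and sliced into per-thread sub-arenas, each served by a trivial bump allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch: a trivial bump allocator over a fixed byte range.
class bump_arena {
  public:
    bump_arena(std::byte *begin, std::size_t size) : next_(begin), end_(begin + size) {}

    // Bump-allocate `size` bytes with `align` alignment; nullptr if exhausted.
    void *allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        auto p = reinterpret_cast<std::uintptr_t>(next_);
        auto aligned = (p + align - 1) & ~(align - 1);
        if (aligned + size > reinterpret_cast<std::uintptr_t>(end_)) return nullptr;
        next_ = reinterpret_cast<std::byte *>(aligned + size);
        return reinterpret_cast<void *>(aligned);
    }

  private:
    std::byte *next_;
    std::byte *end_;
};

// Reserve one big block (VirtualAlloc on Windows; plain new[] here for
// portability) and slice it into equal per-thread arenas.
struct arena_pool {
    std::unique_ptr<std::byte[]> backing;
    std::vector<bump_arena> arenas;

    arena_pool(std::size_t total_bytes, std::size_t thread_count)
        : backing(new std::byte[total_bytes]) {
        std::size_t slice = total_bytes / thread_count;
        for (std::size_t i = 0; i < thread_count; ++i)
            arenas.emplace_back(backing.get() + i * slice, slice);
    }
};
```

Since each thread owns its slice, allocations need no locks, and tearing everything down is a single deallocation of the backing block.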
For the boolean approach, look at the Better Software Conference YouTube talk about that video-game physics engine, for example (sorry, I'm on my phone, on the go). Again, old ideas being rediscovered.
From a first glance, they seem to be tackling a slightly different problem: focusing on efficient handling of nested parallelism. Fork Union doesn’t support nested parallelism at all.
I think performance is a very critical property for Rust infrastructure. One can only hope that newer Tokio versions will address the overheads that make everyone slower than necessary.
I completely agree that tuning is needed for better CPU sleep scheduling. I’m hoping to look into it this October, ideally on some Alder Lake CPUs with mixed Performance/Efficiency cores and NUMA enabled.
And how big a factor is the busy-loop locking? Yes, the code tells the CPU to spin energy-efficiently, but it's not going to beat the OS if the loop is waiting for work that isn't coming for a while. Is it doing this on every core?
One factor could be that when a subprocess dies, it doesn't need to release any memory, as the OS deals with it in one go, versus a thread teardown, where you need to be neat. Though I suppose this would not be a lot of work.
Hard to rank the rest without a proper breakdown. If I ever tried, I’d probably end up writing a paper — and I’d rather write code :)
There are still a few blind spots I’m working on hardening, and if you have suggestions, I’d be glad to see them in issues or PRs :)
I wish it were benchmarked against some "real" work rather than summing integers, though. I find such nano-benchmarks incredibly unreliable and unhelpful.