Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union
Posted 3 months ago · Active 3 months ago
Source: ashvardanian.com
Key topics: Parallel Computing, C++, Rust
The article compares task parallelism libraries in C++ and Rust, including Taskflow, Rayon, and Fork Union, sparking discussion on performance, design, and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 6m after posting
Peak period: 12 comments in 0-3h · Average per period: 4.4
Based on 31 loaded comments
Key moments
1. Story posted: Sep 28, 2025 at 4:53 AM EDT (3 months ago)
2. First comment: Sep 28, 2025 at 4:59 AM EDT (6m after posting)
3. Peak activity: 12 comments in the 0-3h window, the hottest period of the conversation
4. Latest activity: Sep 29, 2025 at 5:02 PM EDT (3 months ago)
ID: 45402820 · Type: story · Last synced: 11/20/2025, 5:39:21 PM
The threads “busy-wait” by running an infinite loop in a lower energy state on modern CPUs.
And yes, there are more details in the actual implementation in the repository itself. This section, for example, describes the atomic variables needed to control all of the logic: https://github.com/ashvardanian/fork_union?tab=readme-ov-fil...
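The repository has the full details; as a rough illustration of the idea (not Fork Union's actual code, all names here are hypothetical), a "low-energy" busy-wait typically spins on an atomic flag while issuing a CPU hint such as x86's `pause` between polls:

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h> // for _mm_pause
#endif

// Hypothetical sketch: hint the core into a lower-power state
// on each iteration of the spin loop.
inline void cpu_relax() noexcept {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause(); // x86 PAUSE: eases pipeline pressure and saves energy while spinning
#else
    std::this_thread::yield(); // portable fallback
#endif
}

// Spin until `ready` becomes true, relaxing the CPU between polls.
void busy_wait(std::atomic<bool> const &ready) {
    while (!ready.load(std::memory_order_acquire))
        cpu_relax();
}

// Tiny demo: a worker flips the flag; the caller busy-waits on it.
int busy_wait_demo() {
    std::atomic<bool> ready{false};
    std::thread worker([&] { ready.store(true, std::memory_order_release); });
    busy_wait(ready); // returns once the worker has published the flag
    worker.join();
    return ready.load() ? 1 : 0;
}
```

The key property is that no system call is made on the hot path; the waiting thread never leaves user space.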
Doesn't that still consume part of the process's time slices from the OS scheduler's point of view?
None of them involve allocation or system calls after initialization. Queues are pre-allocated, as they should be.
BTW, the for-case can be supported simply by setting a pool-wide/global boolean and using it to decide how to wait for a new task: during the parallel-for the boolean is true; otherwise, workers sleep on mutexes in the worst case, for energy savings.
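A minimal sketch of that suggestion, with hypothetical names: a pool-wide flag selects between busy-polling (lowest latency, used while a parallel-for is running) and sleeping on a condition variable (energy-friendly, used when the pool is idle):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch: a pool-wide flag decides how workers wait.
class wait_policy {
  public:
    std::atomic<bool> spin_mode{false}; // true during a parallel-for

    // Called by a worker that has no task yet.
    void wait_for_task(std::atomic<bool> const &task_ready) {
        if (spin_mode.load(std::memory_order_relaxed)) {
            // Hot path: busy-poll for the lowest wake-up latency.
            while (!task_ready.load(std::memory_order_acquire)) { /* spin */ }
        } else {
            // Cold path: sleep under a mutex; worst-case latency, best energy use.
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [&] { return task_ready.load(std::memory_order_acquire); });
        }
    }

    // Called by the producer after publishing a task.
    void notify() {
        std::lock_guard<std::mutex> lock(mutex_);
        cv_.notify_all();
    }

  private:
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Demo: both paths return promptly once a task is already published.
int wait_policy_demo() {
    wait_policy policy;
    std::atomic<bool> ready{true}; // task already published

    policy.spin_mode.store(true);  // parallel-for in flight: spin
    policy.wait_for_task(ready);

    policy.spin_mode.store(false); // pool idle: sleep path (predicate already true)
    policy.wait_for_task(ready);
    return 1;
}
```

Flipping the flag costs one relaxed atomic store per mode change, so the hot path stays free of locks.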
That said, closed-source solutions for local use aren’t quite the same as an open-source project with wider validation. With more third-party usage, reviews, and edge cases, you often discover issues you’d never hit in-house. Some of the most valuable improvements I’ve seen have come from external bug reports or occasional PRs from people using the code in very different environments.
Casey Muratori, while a great programmer in his own right, often disregards the use cases, which leads to apples-to-oranges comparisons. E.g., why is this ASCII editor from 40 years ago much faster than this Unicode text editor (with the full zero-width-joiner emoji suite)?
The parts I'm specifically interested in:
1. What the 300-line pool and allocator look like.
2. What this means: "BTW, the for-case can be supported simply by setting a pool-wide/global boolean and using it to decide how to wait for a new task (during the parallel-for the boolean is true; otherwise sleep on mutexes, in the worst case, for energy savings)"
Thank you!
Arena allocation on Windows, for example, is basically calling VirtualAlloc for a couple of gigabytes on a 64-bit system (you have terabytes of virtual memory available) and then slicing that region into sub-ranges passed as parameters to threads, grouped hierarchically: one range per CPU, then one per group of cores that share a cache, then one per core for its own cache-local memory. Lock the software threads to their hardware threads and you're done. Within each arena, use bump and maybe pool allocators for most things. Very basic and very little code, and much higher performance than most software out there. It's also why a lot of diehard C programmers find Rust lifetime management overengineered and boring, by the way: you don't have as many lifetimes as in modern C++ code, for example.
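A portable sketch of that scheme, under stated assumptions: a plain heap allocation stands in for the Windows-specific VirtualAlloc reservation, and the `bump_arena`/`arena_pool` names are hypothetical. One big block is reserved up front and sliced into per-thread sub-arenas, each served by a trivial bump allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch: a trivial bump allocator over a fixed byte range.
class bump_arena {
  public:
    bump_arena(std::byte *begin, std::size_t size) : next_(begin), end_(begin + size) {}

    // Bump-allocate `size` bytes with `align` alignment; nullptr if exhausted.
    void *allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        auto p = reinterpret_cast<std::uintptr_t>(next_);
        auto aligned = (p + align - 1) & ~(align - 1);
        if (aligned + size > reinterpret_cast<std::uintptr_t>(end_)) return nullptr;
        next_ = reinterpret_cast<std::byte *>(aligned + size);
        return reinterpret_cast<void *>(aligned);
    }

  private:
    std::byte *next_;
    std::byte *end_;
};

// Reserve one big block (VirtualAlloc on Windows; plain new[] here for
// portability) and slice it into equal per-thread arenas.
struct arena_pool {
    std::unique_ptr<std::byte[]> backing;
    std::vector<bump_arena> arenas;

    arena_pool(std::size_t total_bytes, std::size_t thread_count)
        : backing(new std::byte[total_bytes]) {
        std::size_t slice = total_bytes / thread_count;
        for (std::size_t i = 0; i < thread_count; ++i)
            arenas.emplace_back(backing.get() + i * slice, slice);
    }
};
```

Since each thread owns its slice, allocations need no locks, and tearing everything down is a single deallocation of the backing block.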
For the boolean approach, look at the Better Software Conference YouTube talk about that video-game physics engine, for example (sorry, I'm on my phone, on the go). Again, old ideas being rediscovered.
From a first glance, they seem to be tackling a slightly different problem: focusing on efficient handling of nested parallelism. Fork Union doesn’t support nested parallelism at all.
I think performance is a very critical property for Rust infrastructure. One can only hope that newer Tokio versions will address the overheads that make everyone slower than necessary.
I completely agree that tuning is needed for better CPU sleep scheduling. I’m hoping to look into it this October, ideally on some Alder Lake CPUs with mixed Performance/Efficiency cores and NUMA enabled.
And how big a factor is the busy-loop locking? Yes, the code tells the CPU to spin energy-efficiently, but it's not going to beat the OS if the loop is waiting for work that isn't coming for a while. Is it doing this on every core?
One factor could be that when a subprocess dies, it doesn't need to release any memory, as the OS deals with it in one go, versus a thread teardown, where you need to be neat. Though I suppose this would not be a lot of work.
Hard to rank the rest without a proper breakdown. If I ever tried, I’d probably end up writing a paper — and I’d rather write code :)
There are still a few blind spots I’m working on hardening, and if you have suggestions, I’d be glad to see them in issues or PRs :)
I wish it were benchmarked against some "real" work rather than summing integers, though. I find such nano-benchmarks incredibly unreliable and unhelpful.