Cancellations in Async Rust
Posted 3 months ago · Active 3 months ago
sunshowers.io · Tech · story · High profile
Key topics
Rust
Async Programming
Cancellation
The article discusses the challenges of cancellation in async Rust, and the discussion revolves around the nuances of cancel safety and correctness, as well as the trade-offs of async programming in Rust.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment: 2h after posting. Peak period: 52 comments in the 0-6h window. Average per period: 8.1. Based on 89 loaded comments.
Key moments
- Story posted: Oct 3, 2025 at 12:18 PM EDT (3 months ago)
- First comment: Oct 3, 2025 at 2:06 PM EDT (2h after posting)
- Peak activity: 52 comments in the 0-6h window (hottest period of the conversation)
- Latest activity: Oct 7, 2025 at 8:52 AM EDT (3 months ago)
ID: 45464632 · Type: story · Last synced: 11/20/2025, 4:35:27 PM
It's really not about "cancelling async Rust", which is what I expected, even if it didn't make much sense.
Or am I missing context?
Oxide looks to be superb engineering up and down the whole stack, and if it drives more Rust code into Linux, all the better.
Now that Linode has been consumed by Akamai, we need an alternative.
IMHO async is an anti-pattern, and probably the final straw that will prevent me from ever finishing learning Rust. Once one learns pass-by-value and copy-on-write semantics (Clojure, PHP arrays), the world starts looking like a spreadsheet instead of spaghetti code. I feel that a Rust-like language could be built with no borrow checker, simply by allocating twice the memory. Since that gets ever-less expensive, I'm just not willing to die on the hill of efficiency anymore. I predict that someday Rust will be relegated to porting scripting languages to a bare-metal runtime, but will not be recommended for new work.
That said, I think that Rust would make a great teaching tool in an academic setting, as the epitome of imperative languages. Maybe something great will come of it, like Swift from Objective-C or Kotlin from Java. And having grown up on C++, I have a soft spot in my heart for solving the hard problems in the fastest way possible. Maybe a voxel game in Rust, I dunno.
1) I learned about pin in Rust to prevent values from moving in memory.
2) I learned about the HTML <summary> tag (the turndown arrows in your article that work with JavaScript disabled), hah.
I can see how dealing with stream and resource cleanup in async code could be a chore. It sounds like you were able to do that in a fairly declarative manner, which is what I always strive for as well.
I think my hesitation with async is that I already went down that road early in my programming life with cooperative threads/multitasking on Mac OS 9 and earlier. There always seems to be yet another brittle edge case to deal with, so it can feel infuriating playing whack-a-mole until they're all nailed down.
For example, pinning memory looks a lot like locking handles in Mac OS. Handles were pointers to pointers, so it was a bare hands way to implement a memory defragmenter before runtimes were smart enough to handle it. If apps used handles, then blocks of data could be unlocked, moved somewhere else in memory, and then re-locked. Code had to do an extra hop through each handle to get to the original pointer, which was a frequent source of bugs because one async process might be working on a block, yield, and then have another async process move the handle out from under it.
The lock's state was stored in a flag in the memory manager, basically a small bit of metadata. I haven't investigated, but I suspect that Rust may be able to handle locking more efficiently, perhaps more like reference counting or the borrow checker where it can infer whether a pointer is locked without storing that flag somewhere (but I could be wrong).
Apple abandoned handles when it migrated to Mac OS X and Darwin inherited protected memory and better virtual memory from FreeBSD. Although now that I write this out, I'm not sure that they solved in-process fragmentation. I think they just gave apps the full 32- or 64-bit address space so that effectively there is always another region available for the next allocation, and let the virtual memory subsystem consolidate 4k memory blocks into contiguous strips internally. The memory-dereferencing step became implicit rather than explicit, as well as hidden from apps, so that whole classes of bugs became unreachable.
Anyway, that's why I prefer the runtime to handle more of this. I want strong guarantees that I can terminate a process and all locks inside it will get freed as well. I can pretty much rely on that even in hacky languages like PHP.
My frustration with all of this is that we could/should have demanded better runtimes. We could have had realtime unixes where task switching and memory allocation were effectively free. Unfortunately the powers that be (Mac OS and Windows) had runtimes that were too entrenched with too many users relying on quirks and so they dragged their feet and never did better. Languages like Rust were forced to get very clever and go to the ends of the earth to work around that. Then when companies like Google and Facebook won the internet lottery, they pulled the ladder up behind them by unilaterally handing down decrees from on high that developers should use bare hands techniques, rather than putting real resources into reforming the fundamentals so that we wouldn't have to.
What I'm trying to say is that your solution is clever and solves a common pattern in about the simplest way possible, but is not as simple as synchronous-blocking unix pipes to child processes in shell scripts. That's in no way a criticism. I have similar feelings about stuff like Docker and Kubernetes after reading about Podman. If we could magically go back and see the initial assumptions that led us down the road we're on, we might have tried different approaches. It's all of those roads not taken that haunt me, because they represent so much of my workload each day.
It is not as simple as synchronous pipes, but it also has far better edge case and error handling.
For example, on Unix, if you press ctrl-Z to pause execution, nextest will send SIGTSTP to test processes and also pause its internal timers (resuming them when you type in fg or bg). That kind of bookkeeping is pretty hard to do with linear code, and especially hard to coordinate across subprocesses.
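A minimal sketch of that kind of signal bookkeeping (not nextest's actual code; it assumes tokio's Unix signal API and the libc crate):

```rust
use tokio::signal::unix::{signal, SignalKind};

// Sketch: catch SIGTSTP/SIGCONT in a select loop so a runner could forward
// the signals to child processes and pause/resume its own timers.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut tstp = signal(SignalKind::from_raw(libc::SIGTSTP))?;
    let mut cont = signal(SignalKind::from_raw(libc::SIGCONT))?;
    loop {
        tokio::select! {
            _ = tstp.recv() => {
                // A real runner would send SIGTSTP to each test process here
                // and record the pause instant for its internal timers.
            }
            _ = cont.recv() => {
                // On fg/bg: forward SIGCONT and shift timer deadlines by the
                // time spent paused.
            }
        }
    }
}
```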
State machines with message passing (as seen in GUI apps) are very helpful at handling this, but they're quite hard to write by hand.
The async keyword in Rust allows you to write state machines that look somewhat like linear code (though with the big cancellation asterisk).
Right, and that is one of the absolute worst things about the Rust ecosystem. Most programs don't benefit from async, and should use plain old threads because they are much easier to work with.
https://kushallabs.com/understanding-concurrency-in-go-green...
So lots of concepts are worth learning, like atomicity, ACID compliance, write-ahead logs (WALs), statically detecting livelocks and deadlocks (or making them unreachable), consensus algorithms like Raft and Paxos, state transfer algorithms like software transactional memory (STM), connectionless state transfer like hash trees and Merkle trees, etc.
The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction. For example, declarative programming works in terms of goals/specifications/tests, so that the runner has more freedom to cancel and restart/retry tasks arbitrarily. That way the user can fire off a workload and wait until all of the tasks match a success criteria, and even treat that process as idempotent so it can all be run again without harm. In this way, trees of success criteria can be composed to manage a task pool.
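A minimal sketch of that idea; the names are illustrative, not from any particular library:

```rust
// Treat the task as idempotent and rerun it until its success criterion
// holds. Safe to cancel and restart at any await point, precisely because
// each run is harmless to repeat.
async fn run_until_satisfied<F, Fut, C, CFut>(mut task: F, mut satisfied: C)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = ()>,
    C: FnMut() -> CFut,
    CFut: std::future::Future<Output = bool>,
{
    while !satisfied().await {
        task().await;
    }
}
```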
I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
Not really. The talk describes problems that can show up in any environment where you have concurrency and cancellation. To adapt some examples: a thread that consumes a message from a channel but is killed before it can process it, has still resulted in that message being lost. A synchronous task that needs to temporarily violate invariants in some data structure that can't be updated atomically, has still left that data structure in an invalid state when it gets killed part way through.
> Arguably the Go language's goroutines strike a good balance between cooperative and preemptive threads/multitasking.
Goroutines are pretty nice. It's especially nice that Go has avoided the function colouring problem. I'm not convinced that having to litter your code with selects to make your goroutines cancellable is good, though. And if you don't care about being able to cancel tasks, you can write async Rust in a way that ensures they won't be cancelled by accident fairly easily. Unless there's some better way to write cancellable goroutines that I'm not familiar with.
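For comparison, a minimal sketch of explicit cancellation in async Rust, assuming tokio plus tokio_util's CancellationToken; the select! here plays the same role as Go's select on ctx.Done():

```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let child = token.child_token();

    let task = tokio::spawn(async move {
        loop {
            tokio::select! {
                // Cooperative cancellation point, like selecting on ctx.Done().
                _ = child.cancelled() => break,
                _ = tokio::time::sleep(Duration::from_millis(50)) => {
                    // Do one unit of work per iteration.
                }
            }
        }
    });

    token.cancel(); // request cancellation
    task.await.unwrap(); // task exits at its next cancellation point
}
```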
> The key insight is that manual management of tasks is, for the most part, not tenable by humans. It's better to take a step back and work at a higher level of abstraction.
Of course it's always important to look at systems as a whole. But to build larger systems out of smaller components you need to actually build the small components.
> I'd probably point to CockroachDB as one of the best task-cancellers, since it doesn't have a shutdown procedure. Its process can simply be terminated by the user with control-c, then it reconciles any outstanding transactions the next time it's booted, which just adds some latency. If an entire database can do that, then "this is the way".
I'm not familiar with CockroachDB specifically, but I do think a database should generally have a more involved happy-path shutdown procedure than that. In particular, I would like the database not to begin processing new transactions if it is not going to be able to finish them before it needs to shut down, even if not finishing them wouldn't violate ACID or any of my invariants.
That kind of thinking made sense in the 90s when things followed Moore's law. But DRAM was one of the first things to fail to keep up: https://ourworldindata.org/grapher/historical-cost-of-comput... and barely gets cheaper anymore. That's why mobile phones still only have 16 GB of memory despite having 4 GB a decade ago.
And there's all sorts of problems that Rust doesn't necessarily make a great fit for. But Rust's target market is where you'd otherwise use a low-level language like C or C++. If you can just heap-allocate everything and aggressively create copies all over the place, then why would you ever use those languages in the first place?
And for what it’s worth Rust is finding a lot of success even replacing all the tooling in other language ecosystems like Ruby, Python, and JS precisely because the tools in those ecosystems written in the native language end up being horribly slow. And memory allocation and randomly deep copying arrays are the kinds of things that add up and make things slow (in addition to GC pauses, slow startups, interpreter costs etc).
And you can always choose not to do async in Rust although personally I’m a huge fan as it makes it really clear where you have sprinkled in I/O in places you shouldn’t have.
I used to write web backends in Clojure, and justified it with the fact that the JVM has some of the best profiling tools available (I still believe this), and the JVM itself exposes lots of knobs to not only fine-tune the GC, but even choose a GC! (This cannot be overstated; garbage collectors tend to be deeply integrated into a language's runtime, and it's amazing to me that the Java platform manages to ship several garbage collectors, each of which is optimal in its own specific situations.)
After rewriting an NLP-heavy web app in Rust, I saw massive performance gains over the original Clojure version, even though both aggressively copy data and the Rust version is full of atomic refcounts (atomic refcounting is not the fastest GC out there...)
The binary emitted by rustc is also much smaller. ~10 MB static binary vs. GraalVM's ~80 MB native images (and longer build times, since classpath analysis and reflection scanning require a lot of work)
What surprised me the most is how high-level Rust feels in practice. I can use pattern matching, async/await, functional programming idioms, etc., and it ends up being fast anyway. Coming from Clojure, Rust syntax trying its best to be expression-oriented is a key differentiator from other languages in its target domain (notably, C++). I sometimes miss TypeScript's anonymous enums, but Rust's type system can express a lot of runtime behavior, and it's partly why many jokingly state "if it compiles, it's likely correct". Then there's the little things, like how Rust's Futures don't immediately start in the background. In contrast, JavaScript Promises are immediately pushed to a microtask queue, so cancelling a Promise is impossible by design.
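A minimal sketch of that difference, assuming tokio:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    // Rust futures are lazy: this block does nothing until awaited.
    let work = async {
        sleep(Duration::from_secs(1)).await;
        println!("done");
    };

    // Dropping the future cancels it; "done" is never printed.
    drop(work);

    // Racing against a timeout cancels the loser by dropping it.
    let res = timeout(Duration::from_millis(10), async {
        sleep(Duration::from_secs(1)).await;
        42
    })
    .await;
    assert!(res.is_err()); // timed out; the inner future was dropped
}
```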
Overall, it's the little things like this -- and the toolchain (cargo, clippy, rustfmt) -- that have kept me using Rust. I can write high-level code and still compile down to a ~5 MB binary and outperform idiomatic code in other languages I'm familiar with (e.g. Clojure, Java, and TypeScript).
I think that Rust is making an admirable attempt to attack challenges that have already been solved better in other ways. I just don't have much use for its arsenal.
For example, I wasted 2 years of my life trying to write a NAT-punching peer to peer networking framework for games around 2005, but was first exposed to synchronous blocking vs asynchronous nonblocking networking in the late 90s when I read Beej's Guide to Network Programming:
https://beej.us/guide/bgnet/
I was hopelessly trying to mimic the functionality of libraries like RakNet and Zoidcom without knowing some fundamentals that I wouldn't fully understand for years:
https://www.reddit.com/r/gamedev/comments/93kr9h/recommended...
20 years later, Rust has iroh:
https://github.com/n0-computer/iroh
I realize there is some irony in pointing to a Rust library as a final solution.
But my point is that when developers reached high levels of financial success and power, they didn't go back to address the fundamentals. NAT was always an abomination to me. And as far as I know, they kept it in IPv6. Someone like Google should have provided a way to get around it that's not as heavy as WebRTC. So many developer years of work have been wasted due to the mistakes of the status quo. So that we wander in the desert for years using lackluster paradigms because we don't know that better stuff exists.
Knowing what I know now, I would have created open source C (portable) libraries to solve NAT punching, state transfer with a software transactional memory (STM) or Raft, entity state machines (like in Unity), movement prediction/dead reckoning, etc etc etc to form the basis of a distributed computing network for virtual worlds and let the developer community solve that. Someone will do that in a year or two with AI now I assume.
Ok you kinda got me. I realize after writing this out that I wouldn't use Rust for new work, but it's not so much about the language itself as building upon proven layers to "get real work done". The lower the level of abstraction, the harder that is to do. So it's hard for me to see the problem which Rust is trying to solve.
I'm a big fan of the type system and how expressive I feel with Rust. The compiler is incredibly helpful too. rust-analyzer is a superpower. Just yesterday I embarked on a pretty big refactor and all it took was changing a couple of types—and then fixing the 500 problems vscode was pointing out.
Being able to jump in at the deep end like this in a ~90kloc codebase is only feasible (to me) because I know the tooling has my back.
It's not the perfect tool for every project. But it's a really great choice for a really large number of projects. I encourage you to try it a little more in a variety of domains to see if it clicks.
If we imagine a function passing a block of memory to sub functions which may write bytes to it randomly, then each of those writes may allocate another block. If those allocations are similar in size to the VM block size, then each invocation can potentially double the amount of memory used.
A do-one-thing-and-do-it-well (DOTADIW?) program works in a one-shot fashion where the main process fires off child processes that return and free the memory that was passed by value. Surrounded by pipes, so that data is transmuted by each process and sent to the next one. VM usage may grow large temporarily per-process, but overall we can think of each concurrent process as roughly doubling the amount of memory.
Writing this out, I realized that the worst case might be more like every byte changing in a 4k block, so a 4096 times increase in memory. Which still might be reasonable, since we accept roughly a 200x speed decrease for scripting languages. It might be worth profiling PHP to see how much memory increases when every byte in a passed array is modified. Maybe they use a clever tree or refcount strategy to reduce the amount of storage needed when arrays are modified. Or maybe they just copy the entire array?
Another avenue of research might be determining whether a smarter runtime could work with "virtual" VMs (VVMs?) to use a really small block size, maybe 4 or 8 bytes to match the memory bus. I'd be willing to live with a 4x or 8x increase in memory to avoid borrow checkers, refcounts or garbage collection.
-
Edit: after all these years, I finally looked up how PHP handles copy-on-write, and it does copy the whole array on write unfortunately:
http://hengrui-li.blogspot.com/2011/08/php-copy-on-write-how...
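For what it's worth, Rust's standard library offers the same whole-structure copy-on-write via Arc::make_mut; a minimal sketch:

```rust
use std::sync::Arc;

fn main() {
    let a: Arc<Vec<u8>> = Arc::new(vec![0; 4096]);
    let mut b = Arc::clone(&a); // cheap: just a refcount bump

    // make_mut deep-copies the Vec here because `a` still shares it:
    // the same copy-the-whole-array-on-write behavior as PHP.
    Arc::make_mut(&mut b)[0] = 1;

    assert_eq!(a[0], 0);
    assert_eq!(b[0], 1);
}
```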
If I were to write something like this today, I'd maybe use "smart" associative arrays of some kind instead of contiguous arrays, so that only the modified section would get copied. Internally that might be a B-Tree with perhaps 8 bytes per leaf to hold N primitives like 1 double, 2 floats, etc. In practice, a larger size like 16-256 bytes per leaf might improve performance at the cost of memory.
Looks like ZFS deduplication only copies the blocks within the file that changed, not the entire file. Their strategy could be used for a VM so that copy-on-write between processes only copies the 4k blocks that change. Then if it was a realtime unix, functions could be synchronous blocking processes that could be called with little or no overhead.
This is the level of work that would be required to replace Rust with simpler metaphors, and why it hasn't happened yet.
It analyses code. If it finds RAII/linearity/single ownership, it manages memory exactly like Rust does.
But if it does not, it falls back to reference counting.
So it does what Rust does, but automagically, without polluting the code.
So copy-on-write, pass-by-value, or doubled memory are not the only options for improving on Rust.
If that's what you're looking for, have you considered OCaml?
Of course... It's obviously not as simple as "just give me a way to turn it off", but more importantly, I just don't see this concern being addressed by the Powers That Be. Am I just not looking hard enough? Did I miss the Rust blog post titled "hey - so you didn't want to use async but the libraries that you did want to use ship with async so you're up shit creek... Here's what our plan for that is"?
I'm sorry. I generally lurk because I don't consider myself up to the caliber of others on this website, but nonetheless the few posts I make do end up being about async because it does make me feel quite hopeless at times. Hopefully someone can look past my ignorance/incompetence/selfishness/immaturity and tell me it's all going to be okay.
This abstraction has served me well and facilitates stepping through code in a debugger, though I jump out of thinking it at that level when I need to think of it at a lower level.
I really hope we get async drop soon.
So on cancellation, the transaction times out and nothing is written. Bad but safe.
The problem is the same on other platforms. For example, what if writing to the DB throws an exception if you’re on Python? Your app just dies, the transaction times out. Unfortunate but safe.
If it does not run transactionally you have a problem in any execution scenario.
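A minimal sketch of the transactional happy path, assuming sqlx with Postgres (the table and column names are made up): if the future is dropped at any await point before commit, the transaction rolls back.

```rust
use sqlx::PgPool;

async fn transfer(pool: &PgPool) -> sqlx::Result<()> {
    let mut tx = pool.begin().await?;

    sqlx::query("UPDATE accounts SET balance = balance - 1 WHERE id = 1")
        .execute(&mut *tx)
        .await?;
    sqlx::query("UPDATE accounts SET balance = balance + 1 WHERE id = 2")
        .execute(&mut *tx)
        .await?;

    // If this function's future is dropped before the next line runs,
    // `tx` is dropped and the transaction rolls back: bad but safe.
    tx.commit().await
}
```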
(This is related to the fact that Rust doesn't have async drop — you can't run async code on drop, other than spawning a new task to do the cleanup.)
This is prong 3 of my cancel correctness framework (that the cancellation violates a system property, in this case a cleanup property.) The solution here is to ensure the connection is in a pristine state before handing it out the next time it's used.
If you want to tie multiple actions together as an atomic unit, you need the other side to have some concept of transactions, and you need to use it.
Let's say my code looks like this
Where does an issue occur which causes `d` not to be called? Is it some sort of cancellation in `c`? Or some upstream action in `a`?

`d` not being called would happen because of actions in `a`.
If `a` were rewritten as
Then if `c` ends up failing in the `try_join!`, progress on `b` will be halted and thus the `d` in `b` won't be executed.

Because Rust is ultimately constructing a state machine which is run by the caller, the execution of that state machine can be interrupted or partially executed at any of the `await` points. Or, more accurately, the caller can simply not advance the state machine.
So, the `try_join` macro can start work on the various functions and if any of them fail, the others are ultimately cancelled. Which can happen before those functions finish fully executing.
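A hypothetical reconstruction (assuming tokio) of the shape being described, where the `d` step in `b` never runs because `c` fails first:

```rust
use std::time::Duration;
use tokio::time::sleep;

async fn c() -> Result<(), &'static str> {
    sleep(Duration::from_millis(10)).await;
    Err("c failed")
}

async fn b() -> Result<(), &'static str> {
    sleep(Duration::from_millis(100)).await; // b is parked at this await...
    println!("d"); // ...so this `d` step never runs
    Ok(())
}

// a: when c fails, try_join! returns the error and drops b mid-await.
#[tokio::main]
async fn main() {
    let res = tokio::try_join!(b(), c());
    assert_eq!(res.unwrap_err(), "c failed");
}
```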
This is particularly bad if there's a partial state change.
I'm not entirely sure what that means for memory allocation.
Glad to see it converted to a blog post. Talks are great, but blogs are much easier to share and reference.
I don't like the "cancel safety" term. Not only is it unrelated to Rust's concept of safety, it's also unnecessarily judgemental.
Safe/unsafe implies there's a better or worse behavior, but what is desirable for cancellation to do is highly context-dependent.
Futures awaiting spawned tasks are called "cancellation safe", because they won't stop the task when dropped. But that's not an inherently safe behavior – leaving tasks running after their spawner has been cancelled could be a bug: piling up work that won't be used, and even interfering with the rest of the program by keeping locks locked or ports used. OTOH a spawn handle that stops the task when dropped would be called "cancellation unsafe", despite being a very useful construct specifically for propagating cleanup to dependent tasks.
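A minimal sketch of that second construct, assuming tokio (the wrapper type is hypothetical):

```rust
use tokio::task::JoinHandle;

// A handle that stops its task when dropped, useful for propagating
// cleanup to dependent tasks. By the criticized terminology this is
// "cancellation unsafe", even though it's often exactly what you want.
struct AbortOnDrop<T>(JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        // Cancels the task at its next await point.
        self.0.abort();
    }
}
```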
What's he trying to do? Get a clean program shutdown? That's moderately difficult in threaded programs, and async has problems, too. The use case here is unclear.
The real use cases involve when you're sending messages back and forth to a remote site, and the remote site goes away. Now you need to dispose of the state on your end.
To this day I'm not aware of a better way to express what's become a set of increasingly complex state machines (the most recent improvement being to make the state machines responsive to user input). Nextest's runner loop is structured mostly like a GUI event loop, but without explicit state machines. It's quite nice being able to write code that's this complex in a bug-free manner.
Maybe it's a bit contrived, but it's also the kind of code you'd sprinkle through your system in response to "nothing seems to be happening and I don't know why".
The note on mpsc::Sender::send losing the message on drop [1] was actually added by me [2], after I wrote the Oxide RFD on cancellations [3] that this talk is a distilled form of. So even the great folks on the Tokio project hadn't documented this particular landmine.
[1] https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.h...
[2] https://github.com/tokio-rs/tokio/pull/5947
[3] https://rfd.shared.oxide.computer/rfd/0400
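A minimal sketch of the landmine, assuming tokio, with cancellation simulated by a timeout:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(1);
    tx.send("fills the buffer".into()).await.unwrap();

    // The channel is full, so this send must wait. The timeout cancels it
    // by dropping the future, and the message inside is dropped with it.
    let res = timeout(Duration::from_millis(10), tx.send("lost".into())).await;
    assert!(res.is_err());

    assert_eq!(rx.recv().await.unwrap(), "fills the buffer");
    // The second message is gone; it was never placed in the channel.
}
```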
They go by they/she
https://sunshowers.io/about/
The examples presented for "cancel unsafe" futures seem to me like the root of the problem is some sort of misalignment between expectations and reality:
Example 1: one future cancelled on error in the other
let res = tokio::try_join!(
    do_stuff_async(),
    more_async_work(),
);
Example 2: data not written out on cancellation
let buffer: &[u8] = /* ... */;
writer.write_all(buffer).await?;
Both of these cases are claimed to not be cancel-safe, because the work gets interrupted and so not driven to completion. But again, what else is supposed to happen? If you want the work to finish regardless of the async context being cancelled, then don't put it in the same async context but spawn a task instead.
I feel like I must be missing something obvious that keeps me from understanding the author's issue here. I thought work getting dropped on cancellation is exactly how futures are supposed to work. What's the nuance that I'm missing?
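For reference, a minimal sketch of the "spawn a task instead" pattern, assuming tokio:

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    // The spawned work is owned by the runtime, not by this future.
    let handle = tokio::spawn(async {
        sleep(Duration::from_millis(50)).await;
        println!("work finished despite the caller timing out");
    });

    // Cancelling this await drops only the JoinHandle; dropping a
    // JoinHandle detaches the task rather than aborting it.
    let _ = timeout(Duration::from_millis(10), handle).await;

    // Give the detached task time to finish before the runtime shuts down.
    sleep(Duration::from_millis(100)).await;
}
```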
I am asking because I've noticed that many developers with previous experience from "task-based" languages (specifically the JS/TS world) tend to grasp the basics of Rust async quickly enough, but then run into expectation-misalignment problems similar to the examples that you used in your post. That in turn has made me want to understand whether it is the Rust futures that are themselves difficult or strange, or whether it's a case of the Rust futures appearing simple and familiar, even though they are completely different in very subtle ways. I suppose that it's a combination of both.
Also, as another comment on the thread points out [1], languages where futures are active by default can have the opposite problem.
[1] https://news.ycombinator.com/item?id=45467188
I'm sure experienced async Rust programmers always have these things in mind, but Rust is also about preventing these kinds of missable behaviour, be it via the type system or otherwise.
It also wouldn't help when you have no valid state to restore to, as in the mutex example in the post.
Is this not enough? What could go wrong? If the network connection dies or the task is cancelled, I'm assuming the database server cleans up the connection state and does a rollback automatically.
And adding async Drop will probably add a whole new set of footguns.
LoL, an insane amount of things. TCP connections are an illusion of safety; for the purpose of database commits, use UDP packets as a model instead, it'll be much closer to reality.
List a couple
> TCP connections are an illusion of safety
Why?
- Proposal from 2020 about async functions which are forced to run to completion (and thereby would use graceful cancellation if necessary). Quite old, but I still feel that no better idea has come up so far. https://github.com/Matthias247/rfcs/pull/1
- Proposal for unified cancellation between sync and async Rust ("A case for CancellationTokens" - https://gist.github.com/Matthias247/354941ebcc4d2270d07ff0c6...)
- Exploration of an implementation of the above: https://github.com/Matthias247/min_cancel_token
is the title like that on purpose?
There was only one threaded web server, https://lib.rs/crates/rouille . It has 1.1M lines of code (including deps). Its hello-world example reaches only 26Krps on my machine (Apple M4 Pro). It also has a bug that makes it problematic to use in production: https://github.com/tiny-http/tiny-http/issues/221 .
I wrote https://lib.rs/crates/servlin threaded web server. It uses async internally. It has 221K lines of code. Its hello-world example reaches 102Krps on my machine.
https://lib.rs/crates/ehttpd is another one but it has no tests and it seems abandoned. It does an impressive 113Krps without async, using only 8K lines of code.
For comparison, the popular Axum async web server has 4.3M lines of code and its hello-world example reaches 190Krps on my machine.
The popular threaded Postgres client uses Tokio internally and has 1M lines of code: http://lib.rs/postgres .
Recently a threaded Postgres client was released. It has 500K lines of code: https://lib.rs/crates/postgres_sync .
There was no ergonomic way to signal cancellation to threads, so I wrote one: https://crates.io/crates/permit .
Rust's threaded libraries are starting to catch up to the async libraries!
---
I measured lines of code with `rm -rf deps.filtered && cargo vendor-filterer --platform=aarch64-apple-darwin --exclude-crate-path='*#tests' deps.filtered && tokei deps.filtered`.
I ran web servers with `cargo run --release --example hello-world` and measured throughput with `rewrk -c 1000 -d 10s -h http://127.0.0.1:3000/`.
These are all common traps. And now cancellation in async Rust is a new complement to state management in async Rust (Futures).
While developing the mea (Make Easy Async) [1] library, I document the cancel-safety behavior whenever it's non-trivial.
Additionally, I recall [2] an instance where a careless async cancellation disrupted the IO stack.
[1] https://github.com/fast/mea
[2] https://www.reddit.com/r/rust/comments/1gfi5r1/comment/luido...