Key Takeaways
What we need is an opinionated framework that doesn't allow you to do anything except durable workflows, so your junior devs stop doing two POSTs in a row thinking things will be OK.
It'd be an interesting experiment to take memory snapshots after each step in a workflow, which an API like Firecracker might support, but it would likely add even more overhead than current engines in terms of cost and storage. I think some durable execution engines have experimented with this kind of system before, but I can't find a source now - perhaps someone has a link to one.
There's also been some work, for example in the Temporal Python SDK, to override the asyncio event loop so that regular calls like `sleep` work as durable calls instead, reducing the risk to developers. I'm not sure how well this generalizes.
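For context, a minimal sketch of that idea with Temporal's Python SDK (names are illustrative, not from any real codebase): workflow code runs on the SDK's own event loop, so a plain `asyncio.sleep` inside a workflow becomes a durable timer rather than a real blocking wait.

```python
# Sketch only: Temporal's Python SDK runs workflow code on its own asyncio
# event loop, so asyncio.sleep() inside a workflow is recorded as a durable
# timer in the workflow's event history.
import asyncio
from datetime import timedelta
from temporalio import workflow


@workflow.defn
class RenewalReminder:
    @workflow.run
    async def run(self, user_id: str) -> str:
        # Looks like a normal sleep; if the worker crashes, the workflow
        # resumes elsewhere and the remaining wait is honored.
        await asyncio.sleep(timedelta(days=30).total_seconds())
        return f"reminded {user_id}"
```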
> take memory snapshots after each step in a workflow
Don't do this. Just give people explicit boundaries for where their snapshots occur, and what is snapshotted, so they have control over both durability and performance. Make it clear to people that everything should be in the chain of command of the snapshotting framework: e.g. no file-local or global variables. This is already how people program web services, but somehow nobody leans into it.
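A hypothetical sketch of what "explicit boundaries" could look like (not any real library's API, just the idea): the only durable state is what the framework owns, and snapshots happen only where you ask for them.

```python
# Hypothetical API, not a real library: durable state lives in the
# framework-owned dict, and a snapshot happens only at explicit checkpoint()
# calls, so the developer controls both what is persisted and how often.
import json


class Checkpointer:
    def __init__(self, path: str):
        self.path = path
        self.state: dict = {}  # the ONLY thing that gets snapshotted

    def checkpoint(self) -> None:
        # Persist exactly the framework-owned state; globals and file-local
        # variables are deliberately out of the picture.
        with open(self.path, "w") as f:
            json.dump(self.state, f)


def process_order(ctx: Checkpointer, order_id: str) -> None:
    ctx.state["charge_id"] = f"charge-{order_id}"  # stand-in for a real side effect
    ctx.checkpoint()                               # explicit boundary #1
    ctx.state["label_id"] = f"label-{order_id}"    # stand-in for a real side effect
    ctx.checkpoint()                               # explicit boundary #2
```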
The thing is, if you want people to understand durability but you also hide it from them, the framework actually becomes much more complicated to understand and work with.
The real golden ticket I think is to make readable intuitive abstractions around durability, not hide it behind normal-looking code.
Please steal my startup.
We're investing heavily in separating out some of these primitives that are separately useful and come together in a DE system: tasks, idempotency keys and workflow state (i.e. event history). I'm not sure exactly what this API will look like in its end state, but idempotency keys, durable tasks and event-based histories are independently useful. This is only true of the durable execution side of the Hatchet platform, though; I think our other primitives (task queues, streaming, concurrency, rate limiting, retries) are more widely used than our `durableTasks` feature because of this very problem you're describing.
Indeed, I'm happy to hear you say this.
I think it should be the other way around: if durable tasks are properly understood, it's actually the queues/streaming/concurrency/rate limits/retries that can be abstracted away and ignored.
Funny, I never realised this before.
> The real golden ticket I think is to make readable intuitive abstractions around durability, not hide it behind normal-looking code.
It's a tradeoff. People tend to want to use languages they're familiar with, even at the cost of being constrained within them. A naive DSL wouldn't be expressive enough for the Turing completeness one needs, so effectively you'd need a new language/runtime. It's far easier to constrain an existing language than to write a new one, of course.
Some languages/runtimes are easier to apply durable/deterministic constraints to (e.g. WASM, which is deterministic by design, and JS, which has a tiny stdlib that just needs a few things like time and rand replaced), but they still don't take the ideal step you mention - put the durable primitives and their benefits/constraints in front of the dev clearly.
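As a rough, framework-agnostic illustration of the usual trick behind replacing things like time and rand (names here are made up): record the result of each nondeterministic call on first execution and replay it on re-execution, so the surrounding workflow code stays deterministic.

```python
# Sketch of record-and-replay for nondeterministic calls: first execution
# records each result; a re-execution replays the recorded values in order.
import random
import time


class DeterministicRuntime:
    def __init__(self, history=None):
        self.history = history if history is not None else []  # recorded results, in call order
        self.cursor = 0

    def _record_or_replay(self, fn):
        if self.cursor < len(self.history):
            value = self.history[self.cursor]  # replaying: reuse the recorded value
        else:
            value = fn()                       # first execution: record it
            self.history.append(value)
        self.cursor += 1
        return value

    def now(self) -> float:
        return self._record_or_replay(time.time)

    def rand(self) -> float:
        return self._record_or_replay(random.random)
```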
Trigger.dev currently uses CRIU, but I recall reading on HN (https://news.ycombinator.com/item?id=45251132) that they're moving to MicroVMs. Their website (https://feedback.trigger.dev/p/make-runs-startresume-faster-...) suggests that they're using Firecracker specifically, but I haven't seen anything beyond that. It would definitely be interesting to hear how it works out, because I'm not aware of another Durable Execution platform that has done this yet.
Yes, but what that means depends on your durability framework. For example, the one that my company makes can use the same database for both durability and application data, so updates to application data can be wrapped in the same database transaction as the durability update. This means "the work" isn't done unless "recording the work" is also done. It also means they can be undone together.
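A generic illustration of that point (plain SQL, not any particular vendor's API): with the application table and the durability record in the same database, one transaction covers both writes.

```python
# The application update and the "step completed" record share one
# transaction, so the work and the recording of the work commit or roll
# back together.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE step_outputs (workflow_id TEXT, step INTEGER, output TEXT,
                               PRIMARY KEY (workflow_id, step));
    INSERT INTO accounts VALUES ('alice', 100);
""")

with conn:  # one transaction: both writes commit, or neither does
    conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 'alice'")
    conn.execute("INSERT INTO step_outputs VALUES ('wf-1', 0, 'debited 10')")
```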
If you just mean the workflow logic, then as the article mentions it has to be deterministic, which implies idempotency; that's fine because workflow logic doesn't have side effects. The side-effecting functions invoked from a workflow (what Temporal dubs "activities") of course _should_ be idempotent so they can be retried upon failure, as is the case for all retryable code, but this is not a requirement. These side-effecting functions can be configured at the call site to have at-most-once semantics.
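For example, with Temporal's Python SDK the retry behavior is configured at the call site; a `RetryPolicy` with `maximum_attempts=1` gives roughly the at-most-once behavior described (sketch only, names like `charge_card` are illustrative).

```python
# Sketch: an activity invoked with no automatic retries (at-most-once),
# configured at the call site via RetryPolicy.
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def charge_card(order_id: str) -> str:
    return f"charged {order_id}"  # stand-in for a real, side-effecting call


@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        return await workflow.execute_activity(
            charge_card,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=1),  # no automatic retries
        )
```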
In addition to many other things like observability, the value of durable execution is persisting advanced logic like loops, try/catch, concurrent async ops, sleeping, etc., and making all of that crash-proof (i.e. it resumes from where it left off on another machine).
I agree that determinism/idempotency and the complexities of these systems are a tough pill to swallow. They certainly need to be suited to the task.
At first glance it looks more complicated than DBOS, not easier. DBOS is just a library and doesn't require a special DSL.
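For reference, the "just a library" style looks roughly like this with DBOS's Python decorators (written from memory, so check the docs for exact setup; there's also a TypeScript SDK):

```python
# Rough sketch of library-style durable execution: plain Python functions,
# decorated so each completed step is checkpointed and a crashed run
# resumes after the last finished step.
from dbos import DBOS


@DBOS.step()
def reserve_inventory(order_id: str) -> str:
    return f"reserved for {order_id}"  # stand-in for a real side effect


@DBOS.step()
def charge_card(order_id: str) -> str:
    return f"charged {order_id}"


@DBOS.workflow()
def checkout(order_id: str) -> str:
    # Ordinary Python control flow, no special DSL.
    reserve_inventory(order_id)
    return charge_card(order_id)
```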
Also, we use Node, which LittleHorse doesn't seem to support.
Please do refute. I'm genuinely interested in this problem as I deal with it daily.
Let's just take the application state side. If your entire async job can be modeled as a single database transaction, you don't need a durable execution platform; you need a task queue with retries. Our argument at Hatchet is that this covers many (perhaps most) async workloads, which is why the durable task queue is the primary entrypoint to Hatchet, and durable execution is only a feature for more complex workloads.
But once you start to distribute your application state - for example, different teams building microservices which don't share the same database - you have a new set of problems. The most difficult edge case here isn't the happy path with multiple successful writes; it's distributed rollbacks: a downstream step fails and you need to undo the upstream step in a different system. In these systems, you usually introduce an "orchestrator" task which catches failures and figures out how to undo the upstream work in the right way.
It turns out these orchestrator functions are hard to build, because the number of failure scenarios is large. That's why durable execution platforms place some constraints on the orchestrator function, like determinism, to reduce the number of failure scenarios to something that's easy to reason about.
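A framework-agnostic sketch of such an orchestrator with compensations (all service calls here are made-up stand-ins): on failure, completed steps are undone in reverse order; a durable execution engine would persist the orchestrator's position so the undo loop itself survives crashes.

```python
# Saga-style orchestrator sketch: pair each step with a compensation and
# roll back completed steps in reverse order when a downstream step fails.
def book_flight(trip_id: str) -> str: return f"flight-{trip_id}"
def cancel_flight(ref: str) -> None: print("cancelled", ref)
def book_hotel(trip_id: str) -> str: raise RuntimeError("no rooms")  # simulated downstream failure
def cancel_hotel(ref: str) -> None: print("cancelled", ref)


def book_trip(trip_id: str) -> None:
    completed = []  # (undo_fn, ref) for each successful step
    try:
        flight = book_flight(trip_id)   # service A
        completed.append((cancel_flight, flight))
        hotel = book_hotel(trip_id)     # service B
        completed.append((cancel_hotel, hotel))
    except Exception:
        for undo, ref in reversed(completed):  # distributed rollback
            undo(ref)
        raise
```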
There are scenarios other than distributed rollbacks that lead to durable execution; it turns out to be a useful and flexible model for program state in general. But this is a common one.
A bit off-topic, but I recently switched from Celery to Hatchet. I haven't even fully explored everything it can do, but the change has already made me a fan. Overall simplified my stack and made several features easier to implement.
Some takeaways from my experience:
1. Streaming — My application provides real-time log streaming to users (similar to GitHub Actions or AWS CodeBuild). With Celery, I had to roll my own solution using Supabase Realtime. Hatchet’s streaming is straightforward: my frontend now connects to a simple SSE endpoint in my API that forwards the Hatchet stream (rough sketch after this list).
2. Dynamic cron scheduling — Celery requires a third-party tool like RedBeat for user-defined schedules. Hatchet supports this natively.
3. Logs — Hatchet isolates logs per task out of the box, which is much easier to work with.
4. Worker affinity — Hatchet’s key-value tags on workers and workflows allow dynamic task assignment based on worker capabilities. For example, a customer requiring 10 Gbps networking can have tasks routed to workers tagged {'network_speed': 10}. This would have required custom setup in Celery.
5. Cancellation — Celery has no graceful way to cancel in-flight tasks without risking termination of the entire worker process (Celery docs note that terminate=True is a “last resort” that sends SIGTERM to the worker). Hatchet handles cancellation more cleanly.
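A rough sketch of the SSE forwarding in item 1 (the subscription call below is a placeholder, not the Hatchet SDK's real method name):

```python
# FastAPI SSE endpoint forwarding a run's stream to the browser.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def subscribe_to_run_stream(run_id: str):
    # Placeholder: in reality this would iterate the Hatchet run's stream
    # events and yield each chunk as it arrives.
    yield f"starting run {run_id}"
    yield "step 1 complete"


@app.get("/runs/{run_id}/logs")
async def stream_logs(run_id: str):
    async def sse():
        async for chunk in subscribe_to_run_stream(run_id):
            yield f"data: {chunk}\n\n"  # SSE framing
    return StreamingResponse(sse(), media_type="text/event-stream")
```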