Stategraph: Terraform State as a Distributed Systems Problem
Key topics
The article proposes Stategraph, a solution to manage Terraform state as a distributed systems problem, sparking discussion on its potential benefits and challenges, as well as alternative approaches to managing infrastructure.
Snapshot generated from the HN discussion
Discussion activity
Very active discussion. First comment: 1h after posting. Peak period: 57 comments in Day 1. Average per period: 15.3. Based on 61 loaded comments.
Key moments
- Story posted: Sep 17, 2025 at 4:38 AM EDT
- First comment: Sep 17, 2025 at 5:49 AM EDT (1h after posting)
- Peak activity: 57 comments in Day 1
- Latest activity: Sep 29, 2025 at 10:40 AM EDT
I don’t see the state file as a complete downside. It is very simple and very easy to understand. It makes it easy to tell or predict what terraform will do given the current state and desired state.
Its simplicity makes troubleshooting easier: the state files are easy to read, manipulate, or repair in the event of drift, a mismatch, or a botched provider update.
With the solution proposed it feels like the state becomes a black box I shouldn’t put my hands in. I wonder how the troubleshooting scenarios change with it.
Personally, I haven’t run into the scaling issue described; at any given time there is usually only one entity working with the state file. We do use terragrunt for larger systems but it is manageable. ~1000 engineer org.
Not every Terraform setup runs into scaling pain. The trouble tends to show up in larger repos with thousands of resources where teams share big chunks of infra. That is where global locks and full refreshes become a bottleneck and where we think graph semantics help.
This is a bit worrying, though. Do you mean through regular tools like cat or vim, or do we have to install a stategraph-manager tool (and upgrade it ad nauseam) just to look at the state?
predict is the operative word there, because Terraform is so disconnected from the underlying provider's mental model that it is the expression "no plan survives first contact with the enemy" made manifest
Now, I am one million percent open to the pushback that "well, that's a provider's problem" but I also can't easily tell if they are operating within the bounds of TF's mental model, or is it literally that every provider ever is just that lazy?
First of all: Very cool project! I have spent the last couple months studying this problem space and arrived at the exact same conclusions as you. So Stategraph would be very interesting to us. However, we use Pulumi (with Azure blob storage as "DIY storage backend", i.e. rather similar to a TF state file) or are in the process of migrating to it. Do you think it would be feasible to write a storage backend (or a "meta" provider) for Pulumi which uses Stategraph behind the scenes?
I might be wrong regarding more sophisticated infra though.
Splitting state files is the common workaround but that only creates new problems like cross state dependencies and orchestration glue. The real issue is the storage model which is a single JSON blob with a global lock. Treating state as a graph with proper concurrency control avoids contention while keeping a cohesive view of infrastructure.
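To make the contrast with a single global lock concrete, here is a minimal sketch of locking per resource address instead of per state file. The `ResourceLockTable` class and its methods are hypothetical names invented for illustration; nothing here is Stategraph's actual API, and a real implementation would live in a database rather than in a Python dict.

```python
class ResourceLockTable:
    """Toy lock table keyed by resource address (hypothetical, for illustration only)."""

    def __init__(self):
        self._owners = {}  # resource address -> owner currently holding the lock

    def try_acquire(self, owner, addresses):
        """Lock every address for `owner`, or lock nothing if any is held by someone else."""
        wanted = set(addresses)
        if any(self._owners.get(a, owner) != owner for a in wanted):
            return False
        for a in wanted:
            self._owners[a] = owner
        return True

    def release(self, owner):
        """Drop every lock held by `owner`."""
        self._owners = {a: o for a, o in self._owners.items() if o != owner}


locks = ResourceLockTable()
# Two changes touching disjoint resources can both proceed.
network_ok = locks.try_acquire("team-a", ["aws_vpc.main", "aws_security_group.web"])
compute_ok = locks.try_acquire("team-b", ["aws_autoscaling_group.app"])
# A change that overlaps with team-a's resources has to wait.
conflict_ok = locks.try_acquire("team-b", ["aws_security_group.web"])
print(network_ok, compute_ok, conflict_ok)  # True True False
```

With one lock on the whole file, the second call would already have been blocked; here only the genuinely overlapping change is.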
We have about 30 services with each managing their own terraform state. We also have a shared infra repo managing some top level items. We haven’t run into any issues (with any regularity at least) that I can think of but I’m wondering if this could be a good tool for us as we grow and things become even more complex?
With Stategraph, you'll get all the benefits and isolation of separate state files, but when you change resources, you'll get meaningful plans around all of the infrastructure they impact, not just the statically defined boundaries of a state file.
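One way to picture "all of the infrastructure they impact" is a walk over the dependency graph from the changed resources to everything downstream. The sketch below assumes a plain dict mapping each resource to the resources that depend on it; the `impacted` function and the sample addresses are made up for illustration and say nothing about how Stategraph computes plans internally.

```python
from collections import deque


def impacted(dependents, changed):
    """Return the changed resources plus everything that transitively depends on them."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical graph: key -> resources that depend on the key.
dependents = {
    "aws_vpc.main": ["aws_subnet.a", "aws_security_group.web"],
    "aws_subnet.a": ["aws_instance.app"],
    "aws_security_group.web": ["aws_instance.app"],
    "aws_s3_bucket.logs": [],
}

print(sorted(impacted(dependents, ["aws_security_group.web"])))
# ['aws_instance.app', 'aws_security_group.web']
```

The unrelated bucket and the VPC itself stay out of the result, so a plan scoped this way would not need to refresh them.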
We were all experienced with Go but at the time the Go SDK was very awkward, although I think some of that has been resolved with generics now. TF is less expressive but I think that’s actually better for 99% of cases.
Sounds very neat if you’re a big enough org.
I'm curious if this will be compatible with tools like Spacelift or Env Zero, or if they are going to build their own runner/agent to compete in that space.
0: https://terrateam.io
At scale the choices are pretty simple. You split state and live with orchestration glue. You move to a controller model and take on the operational overhead (see Crossplane). Or you keep a cohesive graph and fix the state layer. Those are the real options (imo). It's not about outgrowing Terraform.
I'm not really a fan of crossplane, it's much simpler to roll your own custom operator, especially now that things like the Azure Service Operator exist (I think there's something equivalent for aws as well). This gives you a lot more flexibility for writing unit tests for your business logic.
Team A manages VPCs and Security groups, for example.
Team B manages autoscaling groups, EC2, etc.
It's great that now the two teams can look after their own things and not be too worried about resource contention with the other team. But if it's a centralized Postgres database (as you seem to be suggesting?) and both teams have write access to it...
How do we prevent teams from making changes to stuff that isn't "theirs"?
And if the answer is "well this team only has IAM access to resources xyz", well then might it be a little tricky to represent the Stategraph DAG permission boundaries in IAM policy?
(ps: huge fan of terrateam's offerings -- Alex from tfstate.com)
Additionally, Stategraph should Just Work with your existing TF codebase. You import the state and you're off to the races.
Once all of your state is in Stategraph, though, moving state around really becomes a question of what names those pieces of state should have. So, if you want to merge two root modules, you can check "do my resource names overlap?" If not, you can tell Stategraph to merge the states, copy your code into a single root module, and go. Otherwise, you need to do some resource renaming.
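The "do my resource names overlap?" check can be sketched as a set intersection over resource addresses. This assumes simplified state dicts with a `resources` list whose entries carry `type`, `name`, and an optional `module`, loosely modeled on the Terraform state JSON; the helper names `addresses` and `merge_conflicts` are hypothetical, and data sources and instance keys are ignored.

```python
def addresses(state):
    """Collect addresses such as 'aws_vpc.main' from a simplified state dict."""
    result = set()
    for res in state.get("resources", []):
        prefix = res["module"] + "." if res.get("module") else ""
        result.add(f"{prefix}{res['type']}.{res['name']}")
    return result


def merge_conflicts(state_a, state_b):
    """Return the addresses present in both states, which would need renaming first."""
    return sorted(addresses(state_a) & addresses(state_b))


network = {"resources": [
    {"type": "aws_vpc", "name": "main"},
    {"type": "aws_security_group", "name": "web"},
]}
compute = {"resources": [
    {"type": "aws_instance", "name": "app"},
    {"type": "aws_security_group", "name": "web"},
]}

print(merge_conflicts(network, compute))  # ['aws_security_group.web']
```

An empty list would mean the two root modules can be merged without renaming anything.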
While we don't have all the details in place, I think it is quite likely that Stategraph will support metadata on your resources, perhaps with a new block. This way you could provide namespaces to collections of resources, and that could make merging even easier. But, there is a bit to figure out before that is a reality or determined to be the best way to go.
While the intermediary states might form a graph when considering different subsystems, the desire to move towards a given state should probably appear serial over time. When you make a terraform transaction, you have to coordinate in a serializable way with everyone else managing the desired state, even though the actual state will be messy along the way.
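A common way to get this kind of serializable coordination is optimistic concurrency: every commit names the version it was based on and is rejected if someone else committed in between. The sketch below is only an illustration of that general technique; `DesiredStateStore` and `StaleWriteError` are invented names, and nothing in the thread confirms that Stategraph works this way.

```python
class StaleWriteError(Exception):
    """Raised when a commit is based on an outdated version of the desired state."""


class DesiredStateStore:
    """Toy versioned store for desired state (hypothetical, in-memory only)."""

    def __init__(self):
        self.version = 0
        self.desired = {}

    def read(self):
        return self.version, dict(self.desired)

    def commit(self, based_on, changes):
        # Compare-and-swap: only accept changes built on the latest version.
        if based_on != self.version:
            raise StaleWriteError(f"based on v{based_on}, store is at v{self.version}")
        self.desired.update(changes)
        self.version += 1
        return self.version


store = DesiredStateStore()
v_alice, _ = store.read()
v_bob, _ = store.read()

store.commit(v_alice, {"aws_instance.app": {"size": "large"}})
try:
    store.commit(v_bob, {"aws_s3_bucket.logs": {"versioning": True}})
    bob_result = "committed"
except StaleWriteError:
    bob_result = "rejected"
    v_bob, _ = store.read()  # re-read, then retry on top of Alice's commit
    store.commit(v_bob, {"aws_s3_bucket.logs": {"versioning": True}})

print(bob_result, store.version)  # rejected 2
```

The two commits end up in a single serial order even though both actors started from the same snapshot.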
A long time ago, I was simply doing stuff like this:
> resources = describe_resources_by_tag(env_name=env, some_tag=tag)
> if resource_doesnt_exist(resources, some_resource):
>     create_resource(some_resource)
This was very robust and easy to explain. You look around the system, using some tag-based filtering (in AWS, GCP, Azure) and then perform actions to bring the system to the desired state.
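The stateless, tag-based approach described above can be sketched as a small runnable loop. `FakeCloud` is an in-memory stand-in invented for this example, not a real cloud SDK, and the method names simply mirror the pseudocode in the comment.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration only)."""

    def __init__(self):
        self.resources = []  # each resource: {"name": ..., "tags": {...}}

    def describe_resources_by_tag(self, **tags):
        return [r for r in self.resources
                if all(r["tags"].get(k) == v for k, v in tags.items())]

    def create_resource(self, name, **tags):
        self.resources.append({"name": name, "tags": dict(tags)})


def reconcile(cloud, desired_names, env):
    """Create whatever is desired but not yet present; return the names created."""
    existing = {r["name"] for r in cloud.describe_resources_by_tag(env_name=env)}
    created = []
    for name in desired_names:
        if name not in existing:
            cloud.create_resource(name, env_name=env)
            created.append(name)
    return created


cloud = FakeCloud()
first = reconcile(cloud, ["web", "db"], env="staging")
second = reconcile(cloud, ["web", "db", "cache"], env="staging")
print(first, second)  # ['web', 'db'] ['cache']
```

There is no state file here: the provider's own tags are the record of what exists, so a second run only creates what is missing.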
You portray this as a design flaw, but it's just the Hashicorp marketing funnel towards hosted Terraform, which solves the arbitration problems that you encounter at scale while allowing Hashicorp to give the cli tooling away for free.
The first is that the scope being managed by a single Terraform application is too broad (e.g., thousands of resources instead of tens or hundreds). File-level locking is fine for small databases with few to no concurrent writers, but as more users come in, and the database gets bigger, you need record-level locking. For Terraform state files, it raises the question of why the database got so big and why there were so many concurrent users in the first place.
Second, Terraform state files are a cache but they're being mistreated as a source of truth. This isn't the user's fault but it is the result of (understandable) impatience which results in inevitable shortcut-seeking. It's been a risk since Terraform's inception, and it won't go away as long as people complain that collecting current actual state from the resource provider is too slow.
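Treating state as a cache means checking it against what the provider reports before trusting it. A minimal sketch of that comparison follows, assuming both sides are plain dicts keyed by resource address; `detect_drift` and the sample attributes are hypothetical, and a real refresh would call the provider's API to build `actual`.

```python
def detect_drift(cached, actual):
    """Compare cached attributes with freshly read ones, per resource address."""
    drift = {}
    for address in sorted(cached.keys() | actual.keys()):
        if address not in actual:
            drift[address] = "missing in provider"
        elif address not in cached:
            drift[address] = "not in state"
        elif cached[address] != actual[address]:
            drift[address] = "changed"
    return drift


cached = {
    "aws_instance.app": {"size": "small"},
    "aws_s3_bucket.logs": {"versioning": True},
}
actual = {
    "aws_instance.app": {"size": "large"},  # resized by hand in the console
    "aws_s3_bucket.logs": {"versioning": True},
    "aws_s3_bucket.tmp": {"versioning": False},  # created outside Terraform
}

print(detect_drift(cached, actual))
# {'aws_instance.app': 'changed', 'aws_s3_bucket.tmp': 'not in state'}
```

The slow part in practice is building `actual`, which is why skipping the refresh is such a tempting shortcut.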
I keep wondering though if split state files aren’t a good thing in some sense anyway. They isolate a lot of problems. Access, obviously. Another common theme is that applying tends to be a game of roulette, even if you haven’t changed anything. Cloud vendor added or renamed something in their backend. Provider update made some field deprecated. Expected state drift that wasn’t properly marked as "ignore_changes". Unexpected state drift by your over-excited intern. When I as an app developer apply a simple config file change, I really don’t want to be bothered about dirty state in the networking backbone that I understand nothing about.
It’s also not entirely clear how the solution keeps track of the desired state if multiple actors are changing it at the same time. Wouldn’t one successful apply make the next person, who doesn’t have your local changes, revert it on their first apply? Does this expect heavy use of resource targeting?