K8s with 1M Nodes
Posted 3 months ago · Active 2 months ago · Source: bchess.github.io · Tech story · High profile
Key topics
Kubernetes
Scalability
Distributed Systems
The article explores scaling Kubernetes to 1 million nodes, sparking a discussion on the trade-offs between reliability, performance, and the need for such large-scale clusters.
Snapshot generated from the HN discussion
Discussion activity
Very active discussion. First comment after 14h; peak of 33 comments in the 48-60h window; average of 10 comments per period, based on 80 loaded comments.
Key moments
- Story posted: Oct 16, 2025 at 6:04 PM EDT (3 months ago)
- First comment: Oct 17, 2025 at 7:42 AM EDT (14h after posting)
- Peak activity: 33 comments in the 48-60h window
- Latest activity: Oct 21, 2025 at 1:30 AM EDT (2 months ago)
[1] What is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing; then we are talking about on the order of 100k cores to be scheduled.
https://openai.com/index/scaling-kubernetes-to-7500-nodes/
I'm sure they mean actual servers, not just cores. Even in traditional HPC it usually isn't abstracted down to individual cores, since most HPC jobs care about memory bandwidth; even with InfiniBand or other techniques, throughput and latency are much worse than on a single machine. Of course, multiple machines are connected (usually via MPI over InfiniBand), but it's important to minimize communication between nodes where possible.
For AI workloads they are running GPUs, with 10K+ cores on a single device, so it's even less likely they're talking about cores here.
This assumption is completely out of touch, and is especially funny when the goal is to build an extra large cluster.
https://github.com/k3s-io/kine is a reasonably adequate substitute for etcd; SQLite, MySQL, or PostgreSQL can also be substituted in. etcd is built from the ground up to be reliable in a scale-out setting, and it rocks to have that baked in. But given how easy it is to substitute etcd out, I feel like we're at least a little off if we're trying to say "etcd is the entire point of k8s" (the API server is).
Ideally, I'd love to see a database-specific offering: use Postgres async replication (ideally sharded somehow so there isn't a single consumer node) feeding some fan-out system that does all the watching.
But etcd mostly does the job and seems unlikely to be going anywhere. It'd be cool, though.
Holy Raft protocol is the blockchain of cloud.
k3s doesn't require etcd, and I'm pretty sure GKE uses Spanner and Azure uses Cosmos under the hood.
and it's pretty modular too, so it can even serve as the host for the bespoke whatever that's needed
though I remember reading the fly.io blog post about their custom scheduler/allocator, which illustrates nicely how much of a difference a custom in-house solution makes if it works well
Have you looked at the etcd keys and values in a Kubernetes cluster? It's a _remarkably_ simple schema you could do in pretty much any database with fast prefix or path scans.
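A minimal sketch of that prefix scan using the official Go etcd client (the endpoint is an assumption and auth is omitted):

```go
// List the keys the API server writes under /registry/ with a plain prefix scan.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local, unauthenticated etcd
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Keys look like /registry/pods/<namespace>/<name>; values are serialized objects.
	resp, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}
```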
Sorry, this is just BS. etcd is a fifth wheel in most k8s installations. Even the largest clusters are better off with something like a large-ish instance running a regular DB for the control plane state storage.
Yes, etcd theoretically protects against any kind of node failures and network partitions. But in practice, well, nobody really cares about the control plane being resilient against meteorite strikes and Cthulhu rising from the deeps.
But regarding the article's suggestion of turning off fsync and expecting to lose only a few ms of updates: I've tried to recover etcd on volumes that lied about fsync and then experienced a power outage, and I don't think we managed to recover it. There might be more options now to recover and ignore corrupted WAL entries, but at the time it was very difficult and I think we ended up reinstalling from scratch. For clusters where this doesn't matter, or where the recovery SLOs account for this, I'm totally on board, but only if you know what you're doing.
And similarly, the article's point that "full control plane data loss isn't catastrophic in some environments" is correct, in the sense of what the author means by some environments. But I don't think it's limited to those managed by GitOps, as suggested; it applies wherever there is enough resiliency and time to redeploy and do all the cleanup.
Anyways, like much advice on the internet, it's not good or bad, just highly situational, and some of the suggestions should only be applied if the implications are fully understood.
Once in maybe 10 years?
(For the record I don't really see Erlang clusters as a replacement for k8s)
At 1M nodes I’d still expect an average of a dozen or so pods per node.
From what I know, basically everyone approaching this scale with k8s has different problems to solve, namely multi-tenancy (shared hosting / internal platform providers) and compatibility with legacy or standard software.
I would like to see how moving to a database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient, as the author illustrates that the API server does a fair bit of its own watching and state keeping. I also think there's a benefit, at least for blast radius, in sharding the server by API group or namespace.
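A hypothetical sketch of that sharding idea: route each etcd-style storage key to one of N backends by its namespace segment (the key layout follows the /registry/<resource>/<namespace>/<name> convention; the shard count and hashing are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

const numShards = 8 // illustrative

// shardFor maps a namespaced storage key like /registry/pods/team-a/web-7f9c
// to a backend shard by hashing the namespace segment.
func shardFor(storageKey string) int {
	parts := strings.Split(strings.TrimPrefix(storageKey, "/registry/"), "/")
	ns := ""
	if len(parts) >= 2 {
		ns = parts[1] // parts[0] is the resource, parts[1] the namespace
	}
	h := fnv.New32a()
	h.Write([]byte(ns))
	return int(h.Sum32()) % numShards
}

func main() {
	fmt.Println(shardFor("/registry/pods/team-a/web-7f9c"))
	fmt.Println(shardFor("/registry/deployments/team-b/api"))
}
```

Cluster-scoped resources (nodes, CRDs, and so on) have no namespace segment and would need their own routing rule, which is part of why this is more than a drop-in shim.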
I think years ago this would have been a non-starter with the community, but given that AWS has replaced etcd (or at least aspects of it) with their internal log service for their large-cluster offering, I bet there's some appetite for making this interchangeable and bringing an open-source solution to market.
I share the author's viewpoint that for modern cloud-based deployments you're probably best off avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "Borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that, though!
To be honest, I was building it with the purpose of matching etcd's scale, while making FoundationDB a multi-tenant data store.
But with the recent craze around scalability, I'll be investing time into understanding how far FoundationDB can be pushed as a K8s data store. Stay tuned.
It would be great to see where the limits are with this approach.
I think at some point, you need to go deeper into the apiserver for scale than an API compatible shim, but this is just conjecture and not real data.
https://github.com/bchess/k8s-1m/tree/main/mem_etcd
https://github.com/bchess/k8s-1m/blob/main/RUNNING.adoc#mem_...
From what I remember, GKE has implemented an etcd shim on top of Spanner as a way to get around the scalability issues, but unfortunately for the rest of us who do not have Spanner there aren't any great options.
I feel like, at a fundamental level, pod affinity, anti-affinity, and topology spread constraints are not compatible with very large clusters due to the complexity explosion they cause.
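A back-of-the-envelope sketch of that explosion, with purely illustrative numbers and a deliberately naive filter (real schedulers precompute topology maps per scheduling cycle and score only a sample of nodes, but the combinatorics are the point):

```go
package main

import "fmt"

func main() {
	nodes := 1_000_000
	podsPerNode := 12
	totalPods := nodes * podsPerNode

	// Naive anti-affinity filter: for each candidate node, scan every existing
	// pod to see whether placing the new pod there violates a rule.
	naiveChecks := nodes * totalPods

	fmt.Printf("pods in cluster:                ~%d\n", totalPods)
	fmt.Printf("naive checks to place one pod:  ~%d\n", naiveChecks)
}
```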
Another thing to consider is that the larger a cluster becomes, the larger the blast radius is. I have had clusters of 10k nodes fail spectacularly due to code bugs within k8s. Sharding total compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug will take down everything, as you can carefully upgrade a single cell at a time with bake periods between cells.
[1]: https://aws.amazon.com/blogs/containers/under-the-hood-amazo...
[2]: https://cloud.google.com/blog/products/containers-kubernetes...
[3]: https://azure.microsoft.com/en-us/blog/a-cosmonaut-s-guide-t...
[1] https://learn.microsoft.com/en-us/answers/questions/154061/a...
Reference: https://www.rondb.com/post/100m-key-lookups-sec-with-rest-ap...
All in all, it was a poor choice for Kubernetes to use this as its backend in the first place. Apparently, Google uses its own shim, but there is also kine, which was created a long time ago for k3s and allows you to use an RDBMS. k3s used SQLite as its default originally, but any API-equivalent database would work.
We should keep in mind etcd was meant to literally be the distributed /etc directory for CoreOS, something you would read from often but perform very few writes to. It's a configuration store. Kubernetes deciding to also use it for /var was never a great idea.
If you just turned off file system syncs in etcd you could probably get an order of magnitude better performance as well.
> Sharding total compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug will take down everything, as you can carefully upgrade a single cell at a time with bake periods between cells.
Yeah, I've been meaning to try out something like Armada to simplify things on the cluster-user side. Cluster-providers have lots of tools to make managing multiple clusters easier but if it means having to rewrite every batch job..
More clusters means one more layer of things that can crash your (very expensive) training.
You also still need to write tooling to manage cross-cluster training runs correctly, such as starting/stopping roughly at the same time, resuming from checkpoints, node health monitoring, etc.
Nothing dealbreaking, but if it could just work in a single cluster that would be nicer.
In retrospect, at my previous company, what we really needed in the early days was something Heroku-like (don't make me think about infra!) that could be easily added to and scaled up over time as our service grew. We eventually grew to about 10M monthly users, and it took a huge effort to migrate to Kubernetes.
Canine's philosophy is: full Kubernetes, with a deployment layer on top. If you ever outgrow it, just dump Canine entirely and work directly with the Kubernetes system it's operating. It even gives you all the K8s YAML config needed to offboard.
It's also similar to how the dev infra works at Airbnb (where I worked before that) -- Kubernetes underneath, a user friendly interface on top.
[1]: https://sirupsen.com/napkin
A few thoughts:
*On watch streams and caching*: Your observation about the B-Tree vs hashmap cache tradeoff is fascinating. We hit similar contention issues with our agent's context manager: we switched from a simple dict to a more complex indexed structure for faster "list all relevant context" queries, but update performance suffered. The lesson that trading O(1) writes for O(log n) ordered reads is the wrong call for a write-heavy workload is universal.
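A small Go sketch of that tradeoff, assuming the github.com/google/btree package: every write pays an extra ordered-index update in exchange for cheap prefix LISTs, which is exactly the cost that hurts write-heavy workloads:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/google/btree"
)

type entry struct{ key, val string }

func (e entry) Less(than btree.Item) bool { return e.key < than.(entry).key }

type cache struct {
	byKey map[string]string // O(1) reads/writes by exact key
	index *btree.BTree      // O(log n) writes, but ordered prefix scans for LIST
}

func (c *cache) put(k, v string) {
	c.byKey[k] = v
	c.index.ReplaceOrInsert(entry{k, v}) // the extra O(log n) step every write pays
}

func (c *cache) listPrefix(prefix string) []string {
	var out []string
	c.index.AscendGreaterOrEqual(entry{key: prefix}, func(i btree.Item) bool {
		e := i.(entry)
		if !strings.HasPrefix(e.key, prefix) {
			return false // past the prefix range; stop scanning
		}
		out = append(out, e.val)
		return true
	})
	return out
}

func main() {
	c := &cache{byKey: map[string]string{}, index: btree.New(32)}
	c.put("/registry/pods/default/a", "pod-a")
	c.put("/registry/pods/default/b", "pod-b")
	fmt.Println(c.listPrefix("/registry/pods/default/"))
}
```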
*On optimistic concurrency for scheduling*: The scatter-gather scheduler design is elegant. We use a similar pattern for our dual-agent system (TARS planner + CASE executor) where both agents operate semi-independently but need coordination. Your point about "presuming no conflicts, but handling them when they occur" is exactly what we learned - pessimistic locking kills throughput far worse than occasional retries.
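A minimal sketch of that optimistic pattern with client-go's conflict-retry helper (the ConfigMap name and field are made up; the read-modify-write with retry on a 409 Conflict is the standard pattern):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

// claim records an owner in a shared ConfigMap, presuming no conflict but
// redoing the read-modify-write if the resourceVersion turned out to be stale.
func claim(ctx context.Context, client kubernetes.Interface, owner string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "shard-claims", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["owner"] = owner
		_, err = client.CoreV1().ConfigMaps("kube-system").Update(ctx, cm, metav1.UpdateOptions{})
		return err // a 409 Conflict here triggers a fresh Get and another attempt
	})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := claim(context.Background(), kubernetes.NewForConfigOrDie(cfg), "scheduler-shard-3"); err != nil {
		panic(err)
	}
}
```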
*The spicy take on durability*: "Most clusters don't need etcd's reliability" is provocative but I suspect correct for many use cases. For our Django development agent, we keep execution history in SQLite with WAL mode (no fsync), betting that if the host crashes, we'd rather rebuild from Git than wait on every write. Similar philosophy.
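A short sketch of that durability bet, assuming the mattn/go-sqlite3 driver: WAL journaling keeps readers unblocked during writes, and synchronous=OFF skips fsync entirely, so a host crash can lose the tail of recent writes:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "history.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// WAL mode plus synchronous=OFF: fast writes, rebuildable-from-Git durability.
	for _, pragma := range []string{
		"PRAGMA journal_mode=WAL;",
		"PRAGMA synchronous=OFF;",
	} {
		if _, err := db.Exec(pragma); err != nil {
			log.Fatal(err)
		}
	}

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, note TEXT)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`INSERT INTO runs (note) VALUES (?)`, "rebuild from Git if lost"); err != nil {
		log.Fatal(err)
	}
}
```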
The mem_etcd implementation in Rust is particularly interesting - curious if you considered using FoundationDB's storage engine or something similar vs rolling your own? The per-prefix file approach is clever for reducing write amplification.
Fantastic work - this kind of empirical systems research is exactly what the community needs more of. The "what are the REAL limits" approach vs "conventional wisdom says X" is refreshing.
Or the amount of funding a startup has.
The bottom line is, you are not OpenAI or Google.
I was about to say that Nomad did something similar, but that was 2 million Docker containers across 6100 nodes, https://www.hashicorp.com/en/c2m
Anyone familiar with the space will tell you this is the biggest blocker in production.
You will have to pay for an "enterprise" CNI to make it work.