The Netflix Simian Army (2011)
Key topics
The Netflix Simian Army, a suite of chaos testing tools introduced years ago, still sparks debate about its effectiveness and about the level of engineering maturity it assumes. Some commenters, like closeparen, argue that it's more of a social technology that keeps developers on their toes, while others, like oooyay, suggest it rarely uncovers significant issues beyond what thorough reviews already surface. The discussion also touches on the evolution of terminology in software development, with some wishing that 'monkey' had stuck as a term, giving projects a "Planet of the Apes" feel, as mbb70 quipped. As belter and htrp note, Netflix's innovation likely stemmed from necessity, paving the way for third-party tools to follow suit.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 1h after posting
- Peak period: 24 comments in 0-12h
- Avg / period: 6.8
Key moments
- Story posted: Jan 2, 2026 at 9:17 AM EST (9 days ago)
- First comment: Jan 2, 2026 at 10:24 AM EST (1h after posting)
- Peak activity: 24 comments in 0-12h (hottest window of the conversation)
- Latest activity: Jan 8, 2026 at 8:06 PM EST (2d ago)
I think the companies I worked for were prioritizing no-issue deployments (built from a series of documented and undocumented manual processes!) rather than making services resilient through chaos testing. As a younger dev this priority struck me as heresy (come on guys, follow the herd!); as a more mature dev I understand time & effort are scarce resources and the daily toil tax needs to be paid to make forward progress… it’s tough living in a non-ideal world!
I think that's why most companies don't do it. It's a lot of tedium, and the main benefit is really just getting your ducks in a row.
If you know things will break when you start making non-deterministic configuration changes, you aren't ready for chaos engineering. Most companies never get out of this state.
I remember this getting a lot of buzz at the time, but few orgs are at the level of sophistication to implement chaos testing effectively.
Companies all want a robust DR strategy, but most outages are self-inflicted and time spent on DR would be better spent improving DX, testing, deployment and rollback.
Today, many of these ideas map directly to AWS managed services such as AWS Fault Injection Simulator and AWS Resilience Hub, along with AWS Config, Amazon Inspector, Security Hub, GuardDuty, and IAM Access Analyzer.
There is also a big third-party ecosystem (Gremlin, LitmusChaos, Chaos Mesh, Steadybit, etc.) offering similar capabilities, often with better multi-cloud or CI/CD integration.
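For a sense of what the managed path looks like, here is a minimal sketch of a Chaos Monkey-style experiment expressed as an AWS FIS template via boto3. The role ARN, tag values, and descriptions are placeholders, and the particular target/action choices are just one plausible setup under those assumptions, not a recommendation.

```python
# Sketch: define and run a small AWS FIS experiment that stops one tagged EC2
# instance, roughly what Chaos Monkey did with custom tooling. The role ARN and
# tag values are placeholders and must already exist in your account.
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="chaos-monkey-style-demo",
    description="Stop one instance tagged chaos=optin, restart after 5 minutes",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{"source": "none"}],  # or an aws:cloudwatch:alarm ARN
    targets={
        "one-tagged-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos": "optin"},
            "selectionMode": "COUNT(1)",  # pick a single instance from the tag match
        }
    },
    actions={
        "stop-instance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "one-tagged-instance"},
        }
    },
)

# Kick the experiment off; FIS handles selection, stop, and restart.
fis.start_experiment(
    clientToken="chaos-monkey-style-demo-run-1",
    experimentTemplateId=template["experimentTemplate"]["id"],
)
```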
Some of these Netflix tools don't seem to get much maintenance now, but as free options they can be cheaper to run than AWS managed services or Marketplace offerings.
I was reading this the other day looking for ideas on how to test query retries in our app. I suppose we could go at it from the network side by introducing latency and such.
However, it'd be great if there were also a proxy or something that could inject pg error codes.
https://developer.chrome.com/docs/chromedriver/mobile-emulat...
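On the retry-testing idea above: short of a wire-level proxy, one lightweight option is to wrap the query call itself and probabilistically raise errors carrying the Postgres SQLSTATE codes the retry logic keys on. Below is a minimal, library-agnostic sketch; FlakyDB, TransientDBError, and run_with_retries are invented names for illustration, not part of any driver.

```python
# Hypothetical sketch: inject Postgres-style SQLSTATE errors at the call site
# to exercise retry logic without a wire-level proxy. FlakyDB, TransientDBError,
# and run_with_retries are illustrative names, not part of any real driver.
import random

RETRYABLE_SQLSTATES = {"40001", "40P01", "57P01"}  # serialization failure, deadlock, admin shutdown

class TransientDBError(Exception):
    def __init__(self, sqlstate):
        super().__init__(f"injected error, SQLSTATE {sqlstate}")
        self.sqlstate = sqlstate

class FlakyDB:
    """Wraps anything with an execute(sql, params) method and injects faults."""
    def __init__(self, real_db, failure_rate=0.2, seed=None):
        self.real_db = real_db
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def execute(self, sql, params=None):
        if self.rng.random() < self.failure_rate:
            raise TransientDBError(self.rng.choice(sorted(RETRYABLE_SQLSTATES)))
        return self.real_db.execute(sql, params)

def run_with_retries(db, sql, params=None, attempts=5):
    for attempt in range(1, attempts + 1):
        try:
            return db.execute(sql, params)
        except TransientDBError as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed with SQLSTATE {exc.sqlstate}, retrying")
```

Seeding the RNG keeps the injected fault sequence reproducible, which helps when a retry path only misbehaves on particular error codes.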
Currently we do shadow shifts for a month or two first, but still eventually drop people into the deep end with whatever experience production gifts them in that time. That experience is almost certainly going to be a subset of the types of issues we see in a year, and the quantity isn’t predictable. Even if the shadowee drives the recovery, the shadow is still available for support & assurance. I don’t otherwise have a good solution for getting folks familiar with actually solving real-world problems with our systems, by themselves, under severe time pressure, and I was thinking controlled chaos could help bridge the gap.
Hazing is a cycle of abuse in which each round magnifies the abuse inflicted beyond what was suffered in the previous cycle.
Maybe you are optimizing your personnel.
In the stateful world, chaos testing is useful, but you really want to be testing every possible combination of failures at every possible application state, theoretically with something like TLA+ or experimentally with something like Antithesis. The scenarios you can enumerate and configure manually are just scratching the surface.
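To make that "scratching the surface" point concrete, a rough back-of-the-envelope count (with made-up fault and state numbers) shows how quickly ordered fault scenarios outgrow anything configurable by hand:

```python
# Rough illustration with made-up numbers: how many ordered fault scenarios
# exist if any subset of faults can fire, in any order, in any application state.
from math import comb, factorial

faults = ["node_crash", "disk_full", "network_partition", "clock_skew",
          "slow_disk", "dropped_packet", "oom_kill", "dns_failure"]
app_states = 50  # e.g. distinct points in a workflow or replication state machine

scenarios = 0
for k in range(1, len(faults) + 1):
    # choose which k faults fire, times their possible orderings,
    # times the application state they hit
    scenarios += comb(len(faults), k) * factorial(k) * app_states

print(f"{scenarios:,} ordered fault scenarios")  # 5,480,000 for 8 faults and 50 states
```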
It doesn’t test nearly as much as the real tools can, but it did find some bugs in our workflow engine where it wouldn’t properly resume failed tasks.
So ad-hoc, home-grown chaos testing is still a useful exercise!
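In that spirit, even a harness as small as the following sketch can shake out resume-after-failure bugs; the ToyWorkflow engine is invented for illustration and stands in for whatever engine you actually run. The chaos loop kills a random task, re-runs, and asserts that completed work isn't redone and pending work isn't dropped.

```python
# Hypothetical harness: a toy checkpointing workflow runner plus a chaos loop
# that kills a random task and checks that a re-run resumes correctly.
import random

class InjectedCrash(Exception):
    pass

class ToyWorkflow:
    def __init__(self, tasks):
        self.tasks = tasks          # ordered task names
        self.completed = set()      # persisted checkpoint in a real engine
        self.run_counts = {t: 0 for t in tasks}

    def run(self, crash_on=None):
        for task in self.tasks:
            if task in self.completed:
                continue            # resume: skip checkpointed work
            if task == crash_on:
                raise InjectedCrash(task)
            self.run_counts[task] += 1
            self.completed.add(task)

def chaos_check(tasks, trials=100, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        wf = ToyWorkflow(tasks)
        victim = rng.choice(tasks)
        try:
            wf.run(crash_on=victim)        # first run dies on the injected crash
        except InjectedCrash:
            pass
        wf.run()                           # second run should resume cleanly
        assert wf.completed == set(tasks), "dropped a task after failure"
        assert all(n == 1 for n in wf.run_counts.values()), "re-ran completed work"

chaos_check(["extract", "transform", "load", "notify"])
print("resume-after-failure invariants held")
```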
It's been 15 years. AWS still sucks compared to your own hardware on so many levels, and total ROI is dropping.