Thundering Herd Problem: Preventing the Stampede
Key topics
The article discusses the 'thundering herd problem' in distributed systems, where multiple clients simultaneously request the same resource, causing a surge in load; commenters share various solutions and experiences, including caching strategies and synchronization techniques.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement: 20 loaded comments, averaging 6.7 per period; first comment after 19h; peak of 9 comments in the 18-21h window.
Key moments
- Story posted: Sep 22, 2025 at 7:43 PM EDT
- First comment: Sep 23, 2025 at 2:48 PM EDT (19h after posting)
- Peak activity: 9 comments in the 18-21h window
- Latest activity: Sep 24, 2025 at 12:54 PM EDT
OrbitCache is one example, described in this paper: https://www.usenix.org/system/files/nsdi25-kim.pdf
It should solve the thundering herd problem, because the switch would "know" what outstanding cache misses it has pending, and the switch would park subsequent requests for the same key in switch memory until the reply comes back from the backend server. This has an advantage compared to a multi-threaded CPU-based cache, because it avoids performance overheads associated with multiple threads having to synchronize with each other to realize they are about to start a stampede.
A summary of OrbitCache will be published to my blog tomorrow. Here is a "draft link": https://danglingpointers.substack.com/p/4967f39c-7d6b-4486-a...
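For contrast with the switch doing the parking, here is roughly the per-key coordination a CPU-based cache has to perform in software to get the same effect. This is a minimal single-flight sketch, not OrbitCache's implementation; the class and fetch names are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class CoalescingCache {
    // One in-flight future per key; later requests for the same key
    // "park" on that future instead of issuing their own backend fetch.
    private final ConcurrentHashMap<String, CompletableFuture<String>> inFlight =
            new ConcurrentHashMap<>();

    CompletableFuture<String> get(String key) {
        CompletableFuture<String> f = inFlight.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> fetchFromBackend(k)));
        // Drop the entry once the shared fetch completes, so later misses refetch.
        f.whenComplete((value, error) -> inFlight.remove(key, f));
        return f;
    }

    private String fetchFromBackend(String key) {
        return "value-for-" + key;  // placeholder for the backend round trip
    }
}
```

Every thread that misses on the same key awaits the one shared future, so only a single backend fetch is issued, but the threads still have to agree on that future through the shared map, which is exactly the synchronization cost the switch-based design avoids.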
The switch presumably also has multiple cores which still need to do this work, no? Or is the claim that moving this synchronization to the router behind a network hop saves CPU cycles on the app server?
I wrote a brief description of RMT here: https://danglingpointers.substack.com/p/scaling-ip-lookup-to...
These days you'd have to assume someone somewhere has a neural net based router.
It also doesn't mention the most obvious solution to this problem: adding a random factor (jitter) to retry timing during backoff. A major cause of the stampede is everyone coming back at the precise instant a service becomes available again, only to knock it offline.
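A minimal sketch of retrying with exponential backoff plus full jitter; the helper name, bounds, and `Callable` wrapper are illustrative, not from the thread:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class JitteredRetry {
    // Retry with exponential backoff plus full jitter, so clients that
    // failed at the same moment do not all come back at the same moment.
    static <T> T callWithRetry(Callable<T> op, int maxAttempts,
                               long baseDelayMs, long maxDelayMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt + 1 >= maxAttempts) throw e;
                long cap = Math.min(maxDelayMs, baseDelayMs << attempt);
                // Full jitter: sleep a uniformly random time in [0, cap].
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
    }
}
```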
I would think that in the rare instance of multiple concurrent requests for the same key, where none of the caches have it, it might just be worth it to take the slightly increased hit (if any) of going to the db, instead of complicating it further and slowing down everyone else with the same mechanism.
Well, there's also the 'overhead' of connection pooling. I put it that way because I've definitely run into the case of a 'hot' key (i.e. imagine hundreds of users that all need to download the same set of data because they are in the same group). Next thing you know, your connection pool is getting saturated with these requests.
To your point however, I've also had cases where frankly querying the database is always fast enough (i.e. simple lookup on a table small enough that the DB engine practically always has it in memory anyway) so a cache would just be wasted dev time.
To make matters worse, due to the budget for this lab, we had just three servers that the testing computers could download from. In the worst case the horrible snarl-up would cause computers to wait for as much as two hours before they got the materials needed to run the tests.
My solution was to use peer-to-peer BitTorrent (no trackers involved), with HTTP seeding. So the .torrent files had no trackers listed, but the three servers were listed as HTTP seeds, and the clients were all started with local peer discovery enabled. So the first couple of computers to get the job would pull most/all of the file contents from our servers, and then the rest of the computers would wind up getting the file chunks mostly from their peers.
I did need to do some work so that the clients would first try a URL on the servers that would check for the .torrent file, and if it did not exist, build it (sending the clients a 503 code, causing them to wait a minute or two before retrying).
There are lots of things I would do differently if I rebuilt the system (like writing my own peer-to-peer code), but the result meant that we rarely had systems waiting more than a few minutes to get full files. It took the thundering herd and made it its own solution.
If you can, it's easier to have every client fetch from the cache, and have a cron job (e.g. every second) refresh the cache.
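A minimal sketch of that refresh-ahead pattern, assuming a hypothetical `loadFromDb()` query; the names and one-second period are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RefreshAheadCache {
    // Clients only ever read this reference; a background task refreshes it,
    // so no client request ever triggers the expensive query itself.
    private final AtomicReference<String> cached = new AtomicReference<>("initial");
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(
                () -> cached.set(loadFromDb()), 0, 1, TimeUnit.SECONDS);
    }

    public String get() {
        return cached.get();  // always served from memory
    }

    private String loadFromDb() {
        return "value@" + System.currentTimeMillis();  // placeholder for the expensive query
    }
}
```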
In CDNs, the feature to prevent this is called "Collapse Forwarding".
This query will probably find loads already: https://github.com/search?q=language%3Atypescript+%22new+Map...
I wonder if you could extend the `In-process synchronization` example so that the `CompletableFuture.supplyAsync()` thunk first does a random sleep (where the sleep time is bounded by an informed value based on the expensive query's execution time), then checks the cache again, and only proceeds with the rest of the example code if the cache is still empty.
That way you (stochastically) get some of the benefits of distributed locking w/o actually having to do distributed locking.
Of course that only works if you are ok adding in a bit of extra latency (which should be ok; you're already on the non-hot path), and that there still may be more than 1 query issued to fill the cache.
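A minimal sketch of that extension, assuming a hypothetical in-memory `cache` map and an `expensiveQuery()` stand-in for the article's example (which isn't reproduced in the thread):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

public class StochasticCacheFill {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    // Upper bound for the random sleep, informed by how long the query usually takes.
    private static final long MAX_JITTER_MS = 200;

    CompletableFuture<String> get(String key) {
        String hit = cache.get(key);
        if (hit != null) return CompletableFuture.completedFuture(hit);

        return CompletableFuture.supplyAsync(() -> {
            try {
                // Random sleep so concurrent misses spread out instead of stampeding.
                Thread.sleep(ThreadLocalRandom.current().nextLong(MAX_JITTER_MS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Re-check: another caller may have filled the cache while we slept.
            String again = cache.get(key);
            if (again != null) return again;

            String value = expensiveQuery(key);  // may still run more than once
            cache.put(key, value);
            return value;
        });
    }

    private String expensiveQuery(String key) {
        return "value-for-" + key;  // placeholder for the slow backend call
    }
}
```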