How to Make Things Slower So They Go Faster
Posted 4 months ago · Active 4 months ago
gojiberries.io · Tech · story
Key topics
Performance Optimization
Queueing Theory
System Design
The article discusses how introducing delays or slowing down certain processes can ultimately lead to faster overall performance, and the discussion revolves around various examples and applications of this concept.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 1d after posting · Peak period: 27 comments (Day 2) · Avg per period: 7.8
Based on 39 loaded comments
Key moments
- Story posted: Aug 24, 2025 at 1:10 AM EDT (4 months ago)
- First comment: Aug 25, 2025 at 11:19 AM EDT (1d after posting)
- Peak activity: 27 comments in Day 2 (hottest window of the conversation)
- Latest activity: Sep 1, 2025 at 2:08 PM EDT (4 months ago)
ID: 45001556 · Type: story · Last synced: 11/20/2025, 6:27:41 PM
For example, the paragraphs around the paragraph with "compute the exact Poisson tail (or use a Chernoff bound)" and that paragraph itself could be better illustrated with lines of math instead of mostly language.
I think you do need some math if you want to approach this probabilistically, but I agree that might not be the most accessible approach; a hard threshold calculation is easier to follow and maybe just as good.
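For the curious, the lines of math would look something like this. If the synchronized spike is modeled as Poisson arrivals with mean lambda over the window and the headroom in that window is c requests (my notation, not necessarily the article's), the exact tail and the usual Chernoff bound are

  P(N \ge c) = \sum_{k \ge c} e^{-\lambda} \frac{\lambda^{k}}{k!}
  \qquad\text{and}\qquad
  P(N \ge c) \le e^{-\lambda} \left( \frac{e\lambda}{c} \right)^{c} \quad (c > \lambda)

The sum is the "compute the exact Poisson tail" part; the inequality is the standard Chernoff bound for the Poisson upper tail.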
Particularly because distributed computer systems aren't pure math problems to be solved. Load often comes from usage, which is closer to random inputs than to predictable variables. Further, how load is processed depends on a bunch of things, from the OS scheduler to the current load on the network.
It can be hard to intuitively understand that a bottlenecked system processes the same load more slowly than an unconstrained one.
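One way to make it concrete, assuming the simplest M/M/1 queue model (a textbook simplification, not necessarily what the article uses): with service rate \mu and arrival rate \lambda, the mean time a request spends in the system is

  W = \frac{1}{\mu - \lambda}

At 50% utilization that's 2/\mu, at 90% it's 10/\mu, at 99% it's 100/\mu. Same work, served dramatically more slowly as the bottleneck tightens.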
I feel that I'm missing something obvious. Isn't this doc reinventing the wheel in terms of what very basic task queue systems do? It describes task queues and task prioritization, and how it supports tasks that cache user data. What am I missing?
I just call nanosleep(2) based upon the amount of data processed. This is set by a parameter file that contains the sleep time and the amount of data that determines when to sleep.
In programs I know will execute for a very long time, the parameters are re-read and adjusted during the run if the parameter file changes. Plus I catch cancel signals to create a restart file should the program be cancelled.
Why? On a laptop I have a program that reads an 8 billion record text file, matches it against a 1 billion record text file, and does some calculations based upon the data found between the two records.
So slowing it down prevents my laptop from overheating; it just runs quietly via a cron job.
I need something 100% portable between systems, what I do meets that requirement :)
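A minimal sketch of that kind of throttle, assuming the parameter file has already been read into two values (the names and numbers here are placeholders, not the actual program's):

  #define _POSIX_C_SOURCE 199309L
  #include <time.h>

  /* Hypothetical throttle parameters; in the setup described above they
     come from a parameter file and can be re-read during a long run. */
  static long sleep_ns = 50L * 1000 * 1000;   /* pause length: 50 ms            */
  static long records_per_sleep = 100000;     /* pause after this many records  */

  static void maybe_throttle(long records_done)
  {
      if (records_done % records_per_sleep == 0) {
          struct timespec ts = { sleep_ns / 1000000000L,
                                 sleep_ns % 1000000000L };
          nanosleep(&ts, NULL);   /* give the CPU (and the fans) a break */
      }
  }

  int main(void)
  {
      const long total_records = 1000000;   /* placeholder; the real job is billions */
      for (long i = 1; i <= total_records; i++) {
          /* ... read and match one record here ... */
          maybe_throttle(i);
      }
      return 0;
  }

The parameter re-reading and the restart file are omitted; the point is just that the sleep is proportional to the work done, which keeps the thermal load flat.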
I've mostly heard it in the context of building and construction videos where they are approaching a new skill or technique and have to remind themselves to slow down.
Going slowly and being careful leads to fewer mistakes, which will be a "smoother" process and ends up taking less time, whereas going too fast and making mistakes means work has to be redone and ultimately takes longer.
On rereading it, I see some parallels: When one is trying to go too fast, and is possibly becoming impatient with their progress, their mental queue fills up and processing suffers. If one accepts a slower pace, one's natural single-tasking capability will work better, and they will make better progress as a result.
And maybe it's just my selection bias working hard to confirm that he actually is saying what I want him to say!
Common to hear this in auto racing and probably a lot of other fields
There is a saying: “You don’t raise your level when performing. You fall to your level of practice.”
To the author of the article: I stopped reading after the first two sentences. I have no idea what you are talking about.
Imagine everyone in a particular timezone browsing Amazon as they sit down for their 9 to 5; or an outage occurring, and a number of automated systems (re)trying requests just as the service comes back up. These clients are all "acting almost together".
"In a service with capacity mu requests per second and background load lambda_0, the usable headroom is H = mu - lambda_0 > 0"
Subtract the typical, baseline load (lambda_0) from the max capacity (mu), and that gives you how much headroom (H) you have.
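With made-up numbers: if the service can handle 1000 requests per second and the steady background load is 800 requests per second, then

  H = \mu - \lambda_0 = 1000 - 800 = 200 \ \text{requests/second}

and a synchronized spike of much more than 200 extra requests in the same second is what tips the service over.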
The signal processing definition of headroom: the "space" between the normal operating level of a signal and the point at which the system can no longer handle it without distortion or clipping.
So headroom here can be thought of as "wiggle room", if that is a more intuitive term to you.
Or, if possible, make latency a feature (embrace the queue!). For service-to-service internal stuff, e.g. a request to hard delete something, this can always go through a queue.
And obviously you can scale up as the queue backs up.
I do love the maths tho!
https://en.wikipedia.org/wiki/Braess%27_paradox
https://en.wikipedia.org/wiki/Jevons_paradox
I guess it's the same underlying principle for both paradoxii.
In fact, increasing capacity can make the problems worse due to the new capacity being thought of as available by many people at the same time.
Also, the plural should be quantified when possible: one paradox, two tridox, three quatrodox...
A few fun videos covering this. I first saw Steve Mould's. He links to Up and Atom. Both are fun.
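For anyone who wants the numbers behind the Braess link, the standard textbook example (it's on the Wikipedia page) goes like this: 4000 drivers go from Start to End; one route is a T/100-minute bridge (T = cars on it) followed by a 45-minute road, the other is a 45-minute road followed by a T/100 bridge.

  \text{before: } \frac{2000}{100} + 45 = 65 \ \text{minutes on either route (drivers split evenly)}
  \text{after a free shortcut between the bridges: } \frac{4000}{100} + 0 + \frac{4000}{100} = 80 \ \text{minutes for everyone}

Everyone individually prefers the shortcut, and everyone ends up 15 minutes worse off.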
In a simple, ideal world, your developers can issue the same number of jobs as you have CPUs available. Until you run into jobs that take more memory than is available. Or that access more disk/network IO than is available.
So you set up temporary storage, or in-memory storage, or stagger the jobs so only a couple of them hit the disks at a time, and then you measure performance in groups of 4 or 8 to see when it falls off, or stand up an external caching server, or whatever else you can come up with to work within your budget and available resources.
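A sketch of the "only a couple of them hit the disks at a time" idea, using a counting semaphore to gate the disk-heavy phase (the limit of 2 and the job bodies are placeholders, not anything from this thread):

  #include <pthread.h>
  #include <semaphore.h>
  #include <stdio.h>

  #define NUM_JOBS      8   /* jobs ready to run               */
  #define MAX_DISK_JOBS 2   /* concurrent disk-heavy jobs allowed */

  static sem_t disk_slots;  /* counting semaphore gating disk access */

  static void *run_job(void *arg)
  {
      long id = (long)arg;

      /* CPU-bound part of the job can run freely here ... */

      sem_wait(&disk_slots);            /* at most MAX_DISK_JOBS get past this */
      printf("job %ld: doing disk I/O\n", id);
      /* ... disk-heavy phase ... */
      sem_post(&disk_slots);

      return NULL;
  }

  int main(void)
  {
      pthread_t jobs[NUM_JOBS];

      sem_init(&disk_slots, 0, MAX_DISK_JOBS);

      for (long i = 0; i < NUM_JOBS; i++)
          pthread_create(&jobs[i], NULL, run_job, (void *)i);
      for (int i = 0; i < NUM_JOBS; i++)
          pthread_join(jobs[i], NULL);

      sem_destroy(&disk_slots);
      return 0;
  }

Picking the limit is the "measure performance in groups of 4 or 8 to see when it falls off" part.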
If you do that, you're likely to have a latency on the order of almost a millisecond; putting the previous tokens in one end would get you the logits for the next at a rate of, let's say, 1000 tokens per second... impressive at current rates.
You could also take that same array and program in several latches along the way to synchronize data at selected points, enabling pipelining. This might produce a slight (10%) increase in latency, so a 10% or so loss in throughput for a single stream. However, it would allow you to have multiple independent streams flowing through the FPGAs. Instead of serving 1 customer at 1000 tokens/second, you might have 10 or more customers, each with their own 900 tokens/second.
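Rough arithmetic behind those numbers, using the figures from the comment above rather than anything measured:

  \text{one unpipelined stream: } 1/(1\ \text{ms}) = 1000 \ \text{tokens/s}
  \text{per pipelined stream: } 1000/1.1 \approx 900 \ \text{tokens/s}
  \text{10 streams in flight: } 10 \times 900 = 9000 \ \text{tokens/s aggregate}

A roughly 10% latency hit per stream buys about a 9x gain in total throughput.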
Parallelism and pipelining are the future of compute.