Beating the L1 Cache with Value Speculation (2021)
mazzo.li · Tech story · Posted 3 months ago, last active 3 months ago
Key topics: CPU Optimization, Value Speculation, L1 Cache
The article discusses how value speculation can be used to optimize CPU performance beyond L1 cache limitations, with commenters exploring the implications and potential applications of this technique.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement · First comment: 18h after posting · Peak period: 8 comments in the 96-108h window · Average per period: 4.7 · Based on 14 loaded comments
Key moments
- Story posted: Oct 10, 2025 at 5:54 PM EDT
- First comment: Oct 11, 2025 at 12:00 PM EDT (18h after posting)
- Peak activity: 8 comments in the 96-108h window, the hottest stretch of the conversation
- Latest activity: Oct 15, 2025 at 2:57 PM EDT
ID: 45544235 · Type: story · Last synced: 11/20/2025, 1:48:02 PM
However I imagine you'd also get the same great performance using an array?
Idea: what if we implement something that resembles CDR coding, but doesn't compact the cells together (it's not a space-saving device)? The idea is that when we have two cells A and B such that A->cdr == B, and such that A + 1 == B, we replace A->cdr with a special constant that says the same thing: it indicates that A->cdr is equivalent to A + 1.
Then, I think, we could have a very simple, stable and portable form of the trick in the article:
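A minimal sketch of what that could look like, assuming a sentinel named NEXT_IS_ADJACENT and a simple cons-cell struct (both names are illustrative, not from the original code):

    #include <stdint.h>

    /* Assumed sentinel meaning "the cdr is the cell right after this one". */
    #define NEXT_IS_ADJACENT ((struct node *)1)

    struct node {
        uint64_t value;
        struct node *next;   /* a real pointer, NULL, or NEXT_IS_ADJACENT */
    };

    uint64_t sum(struct node *node) {
        uint64_t total = 0;
        while (node) {
            total += node->value;
            struct node *next;
            if (node->next == NEXT_IS_ADJACENT)  /* predicted taken under a bump allocator */
                next = node + 1;                 /* does not depend on the load result */
            else
                next = node->next;               /* misprediction: use the real pointer */
            node = next;
        }
        return total;
    }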
The branch predictor can predict that the branch is taken (our bump allocator ensures that is frequently the case) and go straight to next = node + 1. When the load of node->next completes on the speculatively executed alternative path and turns out not to equal the magic value, the predicted path is canceled and we take node->next instead. This doesn't look like something that can be optimized away, because we are never comparing node->next to node + 1; there is no tautology there.
Someone with an M >= 2 might try the code and find no speedup from the "improved" version, because the plain loop already iterates faster than the L1 load-to-use latency.
But the effect on the main sequence of instructions in the backend will be quite similar. In neither case is it a "prefetch" as such: the CPU actually executes the load with the predicted value, and the result is consumed by other instructions, decoupling address generation from the dependency on the previous load's result.
Address prediction for loads and stores, by detecting various kinds of strides and access patterns, has been done by most CPUs designed during the last 25 years, and it is used to prefetch the corresponding data. This is as important for loads and stores as branch prediction is for branches.
On the other hand, value prediction for loads is done by very few CPUs and only for very restricted use cases, because in general it is too costly in comparison with its meager benefits. Unlike branch direction prediction and branch target prediction, where the set from which the predicted value must be chosen is small, the set from which to choose the value that a load will return is huge, except in very specific applications, e.g. ones that repeatedly load values from a small table.
The application in the parent article is exactly such a special case: the value returned by the load can be computed without performing the load at all, except in exceptional cases, which are detected when the loaded value differs from the pre-computed value.
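In code, the pattern looks roughly like this (a sketch in the spirit of the article, not its exact listing; as written, the comparison is a tautology the compiler can collapse into a plain pointer load, which is why the article ends up pinning the speculation down with inline assembly):

    #include <stdint.h>

    struct node {
        uint64_t value;
        struct node *next;
    };

    /* Nodes come from a bump allocator, so node->next is usually node + 1. */
    uint64_t sum_speculative(struct node *node) {
        uint64_t total = 0;
        while (node) {
            struct node *next = node + 1;  /* pre-computed guess, ready immediately */
            total += node->value;
            if (next != node->next)        /* exceptional case: the guess was wrong */
                next = node->next;         /* fall back to the loaded pointer */
            node = next;
        }
        return total;
    }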
Both are classes of data prediction, and Apple CPUs do both.
> Address prediction for loads and stores, by detecting various kinds of strides and access patterns, is done by most CPUs designed during the last 25 years and it is used for prefetching the corresponding data. This is as important for loads and stores as branch prediction for branches.
That is not what is known as load/store address prediction. That is cache prefetching, which of course has to predict addresses in some manner too.
> On the other hand, value prediction for loads is done by very few CPUs and for very restricted use cases, because in general it is too costly in comparison with meager benefits. Unlike for branch direction prediction and branch target prediction, where the set from which the predicted value must be chosen is small, the set from which to choose the value that will be returned by a load is huge, except for very specific applications, e.g. which repeatedly load values from a small table.
I'm talking about load address prediction specifically. Apple has both, but load value prediction would not trigger here, because I don't think it does pattern/stride detection the way the load address predictor does; rather, it is value-based, so you'd have to see the same values coming back from the load. Their load address predictor does do strides, though.
I don't know whether it needs cache misses or other long-latency sources to kick in and start training, so I'm not entirely sure it would capture this pattern. But it can capture similar ones for sure. I have an M4 somewhere; I should dig it out and try.
> The application from the parent article is such a very special case, because the value returned by the load can be computed without loading it, except for exceptional cases, which are detected when the loaded value is different from the pre-computed value.
no.