Can We Know Whether a Profiler Is Accurate?
Posted 3 months ago · Active 3 months ago
stefan-marr.de · Tech · story
calm · mixed · Debate · 60/100
Key topics
- Profiling
- Performance Optimization
- CPU Architecture
The article asks whether we can know that a profiler is accurate; the discussion revolves around the challenges and limitations of profiling, including observer effects and misattributed costs.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 1h after posting
Peak period: 19 comments in Day 1
Avg / period: 5.5
Comment distribution: 22 data points
Based on 22 loaded comments
Key moments
- 01 Story posted: Oct 14, 2025 at 10:04 PM EDT (3 months ago)
- 02 First comment: Oct 14, 2025 at 11:20 PM EDT (1h after posting)
- 03 Peak activity: 19 comments in Day 1, the hottest window of the conversation
- 04 Latest activity: Oct 23, 2025 at 1:55 PM EDT (3 months ago)
ID: 45587289 · Type: story · Last synced: 11/20/2025, 12:35:35 PM
If you encounter a slowdown using RTIT or IPT (the old and new names for Intel's hardware trace), it's usually a single-digit percentage. (The sources here are Intel's vague documentation claims plus anecdotes: Magic Trace, Hagen Paul Pfeifer, Andi Kleen, Prelude Research.)
Decoding happens later and is significantly slower, and this is where the article's focus, JIT compilation, can be problematic for hardware trace: instruction data might change or disappear, and mapping the machine-code output back to each Java instruction can be tricky.
Intuitively this works because the hardware can just spend some extra area to stream the information off to the side of the datapath; it doesn't need to be in the critical path.
The function that triggers GC is typically not the function that made the mess.
The function that stalls on L2 cache misses often did not cause the miss.
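A minimal sketch of that scapegoating effect (my own illustration, not code from the thread; the function names are invented): summarize() does identical work on every call, but after pollute() has evicted its data from cache it stalls on the misses, so a profiler tags it as the expensive function even though another function made the mess.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SMALL_N (16 * 1024)          /* 64 KB of ints: fits in cache */
#define BIG_N   (64 * 1024 * 1024)   /* 64 MB: much larger than the LLC */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* The real culprit: sweeps 64 MB and evicts everything else from cache. */
static long pollute(const char *big) {
    long s = 0;
    for (long i = 0; i < BIG_N; i += 64) s += big[i];
    return s;
}

/* The scapegoat: identical work every time it is called. */
static long summarize(const int *small) {
    long s = 0;
    for (long i = 0; i < SMALL_N; i++) s += small[i];
    return s;
}

int main(void) {
    char *big  = calloc(BIG_N, 1);
    int *small = calloc(SMALL_N, sizeof(int));
    long sink = 0;
    double t0, warm, cold;

    sink += summarize(small);        /* warm the small array */
    t0 = now_sec();
    sink += summarize(small);        /* data is cached: fast */
    warm = now_sec() - t0;

    sink += pollute(big);            /* evict it */
    t0 = now_sec();
    sink += summarize(small);        /* same code, now pays for the misses */
    cold = now_sec() - t0;

    printf("warm %.6fs  cold %.6fs  (sink=%ld)\n", warm, cold, sink);
    free(big);
    free(small);
    return 0;
}
```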
Just using the profiler can easily leave 2-3x performance on the table, and in some cases 4x. And in a world where autoscaling exists and computers run on batteries, that's a substantial delta.
And the fact is that, with few exceptions, nobody after 2008 really knows me as the optimization guy, because I don't glorify it. I'm the super-clean-code guy. If you want fast gibberish, one of those guys can come after me for another 2x if you or I don't shoo them away. Now you're creeping into order-of-magnitude territory. And all after the profiler stopped feeding you easy answers.
It turns out that the tester had not been looking closely at the output, other than to verify that the output consisted of numbers. He didn't have any ideas about how to test it, so he opted for mere aesthetics.
This is one of many incidents that convinced me to look closely and carefully at the work of testers I depend upon. Testing is so easy to fake.
Profilers alter the behavior of the system. Nothing has high enough clock resolution or fidelity to make them accurate. Intel tried to solve this by building profiling into the processor, and that only helped slightly.
Big swaths of my career, and the resulting wins, started with the question,
“What if the profiler is wrong?”
One of the first things I noticed is that no profilers make a big deal out of invocation count, which is a huge source of information for continuing past tall tent poles or hotspots into productive improvement. I have seen one exception to this, but that tool became defunct sometime around 2005 and nobody has copied it since.
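A hedged sketch of what the commenter is pointing at (my own code, not any particular tool; the function names are hypothetical): when the profiler won't report invocation counts, a hand-rolled counter recovers them and often explains why a "hot" function is hot in the first place.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_FUNCS 64

/* Tiny per-function call-count table, keyed by the __func__ string. */
static struct { const char *name; unsigned long calls; } counts[MAX_FUNCS];
static int n_counts;

static void count_call(const char *name) {
    for (int i = 0; i < n_counts; i++)
        if (counts[i].name == name) { counts[i].calls++; return; }
    if (n_counts < MAX_FUNCS) {
        counts[n_counts].name = name;
        counts[n_counts].calls = 1;
        n_counts++;
    }
}
#define COUNT_CALL() count_call(__func__)

static void report_counts(void) {
    for (int i = 0; i < n_counts; i++)
        fprintf(stderr, "%-16s %lu calls\n", counts[i].name, counts[i].calls);
}

/* Hypothetical workload: the count reveals parse_item runs per record. */
static int parse_item(int x)  { COUNT_CALL(); return x * 2; }
static int load_record(int x) { COUNT_CALL(); return parse_item(x) + 1; }

int main(void) {
    atexit(report_counts);
    long sum = 0;
    for (int i = 0; i < 100000; i++)
        sum += load_record(i);
    printf("sum=%ld\n", sum);
    return 0;
}
```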
Because of CPU caches, branch prediction, and amortized activities in languages or libraries (memory defrag, GC, flushing), many things get tagged by the profiler as expensive that are really being scapegoated: they get stuck paying someone else's bill. They exist at the threshold where actions can no longer be deferred and have to be paid for now.
So what you’re really looking for in the tools is everything that looks weird. And that often involves ignoring the fancy visualization and staring at the numbers. Which are wrong. “Reading the tea leaves” as they say.
Knowing that no profiler is perfectly accurate isn't a very useful piece of information. However, knowing which types of profilers are inaccurate and in which cases is very useful, and that is exactly what this article is about. Well worth 15 minutes.
> And that often involves ignoring the fancy visualization and staring at the numbers.
Visualisations are incredibly important. I've debugged a large number [1] of performance issues and production incidents highlighted by async-profiler producing Brendan Gregg's flame graphs [2]. Sure, things could be presented as numbers, but what I really care about most of the time when I take a CPU profile from a production instance is: what part of the system was taking most of the CPU cycles?
[1]: https://x.com/SerCeMan/status/1305783089608548354
[2]: https://www.brendangregg.com/flamegraphs.html
That’s perfectly inaccurate.
Most of the people who seem to know how to actually tune code are in gaming, and in engine design in particular. And the fact that they don’t spend all day every day telling us how silly the rest of us are is either a testament to politeness or a shame. I can’t decide which.
That's a very strong claim, and it's not true in my experience, as I've shown above.
The last time I had to deal with it, which was eons ago, higher-end CPUs like Xeons had more counters and more useful ones.
I'm sure there are plenty of situations where they're insufficient, but it's absurd to paint the situation as always completely hopeless.
It doesn't mean that you get wrong measurements; it means there's a level of inaccuracy that has to be accepted.
BTW, I am aware that Intel PCM is not a profiler but more of a measurement tool; however, you CAN use it to 'profile' your program and see how it behaves in terms of compute and memory utilization, with deep analysis of cache behavior (cache hits, cache misses, etc.).
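Intel PCM sits on top of the same hardware performance counters the thread keeps circling around. As a hedged, Linux-only sketch of the underlying mechanism (my own code using perf_event_open(2), not PCM's own API), here is a region of code wrapped with a cache-miss counter:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a per-process hardware counter, initially disabled. */
static int open_counter(unsigned long long config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this process), cpu = -1 (any), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    enum { N = 1 << 24 };
    int *a = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = i;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    long long sum = 0;                 /* region under measurement */
    for (int i = 0; i < N; i += 16)    /* strided walk: plenty of misses */
        sum += a[i];

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long misses = 0;
    if (read(fd, &misses, sizeof(misses)) != sizeof(misses)) misses = -1;

    printf("sum=%lld cache misses=%lld\n", sum, misses);
    close(fd);
    free(a);
    return 0;
}
```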
Getting a non-sampling profiler is pretty expensive. Nobody has worked out how to create one that audits cause and effect. That's up to you to suss out on your own.
There's a fiction about how computation works that has been built into processors for more than thirty years. And when that fiction breaks down the code doesn't just get slower. It gets 10 times, 100 times, 1000 times slower.
Profilers mostly play into that fiction, and the gap between the fiction and reality has only grown bigger over time. See also the 'memory access is O(n^1/3) ' thread a few weeks ago for hints.
This is where we get into sampling vs. tracing profilers. Tracing is even more disruptive to the runtime, but gives you more useful information. It can point you at places where your O-notation is not what you expected it to be. This is a common cause of things that grind to a halt after great performance on small examples.
It gets even worse in distributed systems, which is partly why microservice-oriented things "scale" at the expense of a lot more hardware than you'd expect.
It's definitely a specialized discipline, whole-system optimization, and I wish I got to do it more often.
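To make the tracing side of that distinction concrete, here is a minimal sketch (assuming GCC and its -finstrument-functions hooks; this is my illustration, not a production tool). Linked into the code under test, it sees every call, so blown-up call counts from an unexpected O(n²) path show up directly instead of being flattened into one "hot function" by a sampler.

```c
/* Build the code under test with:  gcc -finstrument-functions app.c trace.c
 * GCC then calls these hooks on every function entry and exit. */
#include <stdio.h>
#include <stdint.h>

#define SLOTS 1024
static struct { void *fn; unsigned long enters; } slots[SLOTS];

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    (void)call_site;
    uintptr_t h = ((uintptr_t)fn >> 4) % SLOTS;   /* tiny open-addressed table */
    for (unsigned i = 0; i < SLOTS; i++) {
        unsigned idx = (unsigned)((h + i) % SLOTS);
        if (slots[idx].fn == fn || slots[idx].fn == NULL) {
            slots[idx].fn = fn;
            slots[idx].enters++;
            return;
        }
    }
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)fn; (void)call_site;                    /* exits ignored in this sketch */
}

__attribute__((no_instrument_function, destructor))
static void dump_counts(void) {
    for (unsigned i = 0; i < SLOTS; i++)
        if (slots[i].fn)
            fprintf(stderr, "%p called %lu times\n", slots[i].fn, slots[i].enters);
    /* Map the addresses to names afterwards, e.g. with addr2line. */
}
```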
One of my first big counterexamples to the standard "low-hanging fruit" philosophy was a profiler telling me a function was called 2x as often as the sequence diagram implied and was occupying 10% of the overall CPU time for the operation. So I removed half of the calls by inverting a couple of calls to expose the data, which should have been a 5% gain (0.1 / 2). Total time reduction: more than 20%. A couple of jobs later I managed a 10x improvement on one page transition from a similar change with an intersection test between two lists. Part of that was algorithmic, but most was memory pressure.
Remember: if microbenchmarks lie, a profiler is just an inversion of the benchmark idea. Two sides of the same coin.
We devs could take turns claiming to have improved our own code by a huge margin simply by rearranging the order of calls so ours was not first. Then pass the baton: global performance utterly unchanged, and everyone gets to "optimize" their own code.
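A minimal sketch of that ordering game (my own illustration, not from the thread; the two function names are invented): the functions are byte-for-byte identical, but whichever one runs first pays the page-fault and cold-cache bill, so the second one always "wins" the measurement.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (32 * 1024 * 1024)

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Two identical implementations "competing" in a benchmark. */
static long version_a(const char *p) { long s = 0; for (long i = 0; i < N; i++) s += p[i]; return s; }
static long version_b(const char *p) { long s = 0; for (long i = 0; i < N; i++) s += p[i]; return s; }

int main(void) {
    char *data = calloc(N, 1);       /* untouched pages: first reader faults them in */
    long sink = 0;
    double t0, ta, tb;

    t0 = now_sec();
    sink += version_a(data);         /* goes first, pays for faults and cold caches */
    ta = now_sec() - t0;

    t0 = now_sec();
    sink += version_b(data);         /* identical code, everything is already warm */
    tb = now_sec() - t0;

    printf("version_a %.4fs  version_b %.4fs  (sink=%ld)\n", ta, tb, sink);
    /* Swap the call order and version_a "improves" by the same margin. */
    free(data);
    return 0;
}
```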
Something like https://github.com/plasma-umass/coz might be the way forward.
https://www.youtube.com/watch?v=r-TLSBdHe1A
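For context, coz is a causal profiler: instead of asking "where is time spent?", it virtually speeds up one line at a time and reports how much end-to-end throughput would actually improve. A hedged usage sketch (macro names taken from the coz README; treat them and the workload functions here as assumptions, not a verified build): mark a unit of useful work with a progress point, build with -g, and run the binary under `coz run --- ./app`.

```c
#include <stdio.h>
#include <coz.h>   /* provides COZ_PROGRESS per the coz README */

/* Hypothetical two-stage workload; coz estimates which stage is worth speeding up. */
static long produce(long i) { return (i * i) % 97; }
static long consume(long x) { long s = 0; for (int k = 0; k < 50; k++) s += x ^ k; return s; }

int main(void) {
    long sink = 0;
    for (long i = 0; i < 5 * 1000 * 1000; i++) {
        long x = produce(i);
        sink += consume(x);
        COZ_PROGRESS;        /* one unit of useful work completed */
    }
    printf("sink=%ld\n", sink);
    return 0;
}
```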