JIT: So You Want to Be Faster Than an Interpreter on Modern CPUs
Key topics
The article discusses the challenges of, and techniques for, building a Just-In-Time (JIT) compiler that outperforms an interpreter on modern CPUs, sparking a discussion of the trade-offs between JITs and interpreters and of potential optimizations.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1d after posting
Peak period: 30 comments in the 36-48h window
Avg / period: 10.5 comments
Based on 63 loaded comments
Key moments
- Story posted: Oct 12, 2025 at 3:08 PM EDT (3 months ago)
- First comment: Oct 13, 2025 at 4:50 PM EDT (1d after posting)
- Peak activity: 30 comments in 36-48h, the hottest window of the conversation
- Latest activity: Oct 17, 2025 at 11:46 PM EDT (3 months ago)
My take is that you can get pretty far these days with a simple bytecode interpreter. Food for thought if your side project could benefit from a DSL!
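For anyone curious what the bare minimum looks like, here is a toy sketch in C (the opcodes and layout are invented for illustration, not taken from the article): a flat array of opcodes plus a switch-based dispatch loop is often all a small DSL needs.

    #include <stdio.h>

    /* Toy opcode set, invented for this sketch. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const int *code) {
        int stack[64], sp = 0;
        for (const int *pc = code; ; ) {
            switch (*pc++) {
            case OP_PUSH:  stack[sp++] = *pc++;            break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);   break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        /* Computes and prints 2 + 3. */
        int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);
        return 0;
    }

When dispatch overhead starts to matter, the switch can be swapped for computed goto, as discussed further down in the thread.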
But something I hadn't expected: it also improved compilation time by 40 percent (fewer virtual registers made for much faster register allocation).
[1] https://github.com/ZQuestClassic/ZQuestClassic/commit/68087d...
Back when Parrot was a thing and the Perl 6 people were targeting it, I profiled the prelude of Perl 6 to optimize startup time and discovered two things:
- the first basic block of the prelude was thousands of instructions long (not surprising)
- the compiler had to allocate thousands of registers, because the prelude instructions used virtual registers
The prelude emitted two instructions, one right after another: load a named symbol from a library, then make it available. I forget all of the details, but each of those instructions used one string register and one PMC register. Because register allocation used the dominance-frontier method, the size of the basic block and the total number of symbolic registers dominated the algorithm's running time.
I suggested a change to the prelude emitter to reuse actual registers and avoid virtual registers and compilation sped up quite a bit.
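A sketch of the shape of that fix (the instruction and function names here are hypothetical, not Parrot's real API): reuse one fixed register pair for every import instead of minting two fresh virtual registers per import, so the allocator sees two short live ranges instead of thousands.

    #include <stdio.h>

    static int next_sreg = 0, next_preg = 0;

    /* Before: each import burns two fresh virtual registers, so a prelude
     * with thousands of imports hands the dominance-frontier allocator
     * thousands of live ranges in one giant basic block. */
    static void emit_import_fresh(const char *sym) {
        int s = next_sreg++;    /* fresh string register */
        int p = next_preg++;    /* fresh PMC register    */
        printf("load_lib   S%d, \"%s\"\n", s, sym);
        printf("make_avail P%d, S%d\n", p, s);
    }

    /* After: the loaded value is dead as soon as it has been made
     * available, so a single fixed register pair can be reused safely. */
    static void emit_import_reused(const char *sym) {
        printf("load_lib   S0, \"%s\"\n", sym);
        printf("make_avail P0, S0\n");
    }

    int main(void) {
        emit_import_fresh("Foo");    /* S0/P0 */
        emit_import_fresh("Bar");    /* S1/P1 ... grows without bound */
        emit_import_reused("Baz");   /* always S0/P0 */
        return 0;
    }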
I wrote a toy copy-and-patch JIT before, and I don't remember being impressed with the performance, even compared to a naive dispatch loop, even on my ~11-year-old processor.
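For context, copy-and-patch works by memcpying pre-compiled machine-code "stencils" into executable memory and patching holes (immediates, branch targets) in the copies. A bare-bones illustration of the patch step, assuming x86-64 Linux (this is a toy, not the paper's stencil machinery):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Stencil: "mov eax, <imm32>; ret", with a 4-byte hole at offset 1. */
    static const unsigned char stencil[] = { 0xb8, 0, 0, 0, 0, 0xc3 };

    int main(void) {
        /* Writable+executable page for the demo only; real JITs flip
         * W^X permissions instead of mapping RWX. */
        unsigned char *buf = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        /* Copy the stencil, then patch the immediate hole. */
        memcpy(buf, stencil, sizeof stencil);
        int32_t value = 42;
        memcpy(buf + 1, &value, sizeof value);

        /* Object-to-function-pointer cast: a common JIT idiom, though
         * not strict ISO C. */
        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());    /* prints 42 */
        return 0;
    }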
What's odd about the "JIT vs interpreter" debate is that it keeps coming up, given that the performance difference is fairly easy to see even in toy examples.
And if you care about performance that much, why aren't you writing that code in native code to begin with?
AutoFDO has since been ported to Android and adopted by Yandex [3].
[0] https://lwn.net/Articles/995397/
[1] https://news.ycombinator.com/item?id=40868224
[2] https://news.ycombinator.com/item?id=42896716
I have no clue what this means - you can pre-compile for the target platforms and therefore "fully" use whichever CPU an Apple device has.
However, the EU decreed that Apple must allow for fair competition, leading it to claim that it will enable JIT for authorized developers: https://developer.apple.com/support/alternative-browser-engi...
But I'm not sure that they have done so...
Mozilla: https://github.com/mozilla/platform-tilt/issues/3
Chrome: https://issues.chromium.org/issues/42203058
https://developer.apple.com/documentation/browserenginekit/p...
Why not do the same thing the CPU does and fetch N jump addresses at once?
Now the overhead is gone and you just need to figure out how to let the CPU fetch the chain of instructions that implement the opcodes.
You simply copy the interpreter N times, store N opcode jump addresses in N registers, and hardcode each interpreter copy to access its own register during the computed goto.
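A minimal sketch of that idea with N = 2, assuming GCC/Clang's computed-goto extension (the opcodes and label names are invented for illustration): two interleaved copies of the handlers, each owning one local slot that holds a prefetched jump target. While copy A runs, slot b already holds the next opcode's target, and A refills slot a with the target two opcodes ahead, so the indirect branch never waits on the table load that was just issued.

    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    static long run(const unsigned char *pc) {
        /* One dispatch table per interpreter copy. */
        static void *tab_a[] = { &&inc_a, &&dec_a, &&halt };
        static void *tab_b[] = { &&inc_b, &&dec_b, &&halt };
        long acc = 0;

        /* Prime the pipeline: fetch the first two targets up front. */
        void *slot_a = tab_a[pc[0]];    /* opcode 0, handled by copy A */
        void *slot_b = tab_b[pc[1]];    /* opcode 1, handled by copy B */
        goto *slot_a;

    inc_a: acc++; slot_a = tab_a[pc[2]]; pc++; goto *slot_b;
    dec_a: acc--; slot_a = tab_a[pc[2]]; pc++; goto *slot_b;
    inc_b: acc++; slot_b = tab_b[pc[2]]; pc++; goto *slot_a;
    dec_b: acc--; slot_b = tab_b[pc[2]]; pc++; goto *slot_a;
    halt:  return acc;
    }

    int main(void) {
        /* Padded with extra HALTs so the two-ahead prefetch stays
         * inside the program array. */
        const unsigned char prog[] =
            { OP_INC, OP_INC, OP_DEC, OP_HALT, OP_HALT, OP_HALT };
        printf("%ld\n", run(prog));    /* prints 1 */
        return 0;
    }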
No, what you just described is speculative execution. Branch prediction was implemented long before out-of-order CPUs were a thing, because you need branch prediction to make the most of pipelining (e.g. fetching and decoding a new instruction while you're still executing the previous one; if you predict branches, you're more likely to keep the pipeline full).
Static branch prediction like "predict taken if negative branch offset" doesn't leak anything, but just about any dynamically updated tables will (almost tautologically) contain statistical information about what was executed recently.
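To make that concrete, here is a textbook two-bit saturating-counter predictor (a generic model, not any specific CPU's design): the table is updated on every branch outcome, so its state is a record of which branches ran recently and which way they went. Real attacks read that state back indirectly through mispredict timing rather than by loading it.

    #include <stdint.h>
    #include <stdio.h>

    /* Counters indexed by (hashed) branch address: 0-1 predict
     * not-taken, 2-3 predict taken. */
    #define TABLE_SIZE 1024
    static uint8_t bht[TABLE_SIZE];    /* all counters start at 0 */

    static int predict(uintptr_t branch_pc) {
        return bht[branch_pc % TABLE_SIZE] >= 2;
    }

    static void update(uintptr_t branch_pc, int taken) {
        uint8_t *c = &bht[branch_pc % TABLE_SIZE];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }

    int main(void) {
        /* After a "victim" branch is taken a few times... */
        uintptr_t victim_pc = 0x4008c0;
        for (int i = 0; i < 3; i++) update(victim_pc, 1);

        /* ...anything sharing the table can observe that history. */
        printf("prediction for 0x%lx: %s\n", (unsigned long)victim_pc,
               predict(victim_pc) ? "taken" : "not taken");
        return 0;
    }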
[1] https://dl.acm.org/doi/10.1145/3563311
Compare the running speed of the two binaries built with different options:
Also I recommend reading the previous blog post first, then this one, for additional context: https://www.pinaraf.info/2024/03/look-ma-i-wrote-a-new-jit-c...
SQL Server Hekaton punted on this problem in a seemingly effective way by requiring the client to use stored procedures to get full native compilation. Not sure, though, whether it recompiles if the table statistics indicate a different query plan is needed.