Tracing JITs in the Real World: CPython Core Dev Sprint
Key topics
The article discusses a CPython Core Dev Sprint focused on tracing JITs, sparking a discussion on the potential benefits and limitations of JITs in Python and comparisons with other languages and implementations like PyPy and Julia.
Snapshot generated from the HN discussion
Discussion Activity
- Engagement: moderate
- First comment: 35m after posting
- Peak period: 10 comments in 0-6h
- Average per period: 5.4 comments
- Based on 27 loaded comments
Key moments
- Story posted: Sep 25, 2025 at 2:40 PM EDT
- First comment: Sep 25, 2025 at 3:15 PM EDT (35m after posting)
- Peak activity: 10 comments in the 0-6h window, the hottest stretch of the conversation
- Latest activity: Sep 28, 2025 at 3:00 PM EDT
JS JITs (the production ones, like JSC's) have no such thing as trace blockers that prevent the surrounding code from being optimized. You might have an operation (like a call to some wacky native function) that is itself not optimized, but that won't have any impact on the JIT's ability to optimize the code surrounding that operation.
Tracing is just too much of a benchmark hack overall, IMO. Tracing would only be a good idea in a world where it's too expensive to run a real optimizing JIT. But the JS, Java, and .NET experiences show that a real optimizing JIT, with all of its compile-time costs, is exactly what you want, because it results in predictable speed-ups.
When we talk about JS or Java JITs working well, we are making statements based on intense industry competition where if a JIT had literally any shortcoming then a competitor would highlight it in competitive benchmarking and blog posts. So, the competition forced aggressive improvements and created a situation where the top JITs deliver reliable perf across lots of workloads.
OTOH PyPy is awesome but just hasn’t had to face that kind of competitive challenge. So we probably can’t know how far off from JS JITs it is.
One thing I can say is when I compared it to JSC by writing the same benchmark in both Python and JS, JSC beat it by 4x or so.
For example, static initialization on classes. The JDK has a billion different classes, and on startup a not-insignificant fraction of those end up getting loaded for all but the simplest applications.
Essentially, Java and the JS JITs both initially run everything interpreted, and when a hot method is detected they progressively hand those methods and their profiling statistics to more aggressive JIT compilers.
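A minimal Python sketch of that tiered shape (the class, threshold, and method names here are hypothetical, not HotSpot or V8 internals): run everything in a slow tier, count invocations, and promote hot functions to an optimized tier.

```python
# Hypothetical sketch of tiered execution: count invocations and promote
# hot functions, the way tiered JITs hand hot methods to progressively
# more aggressive compilers.
HOT_THRESHOLD = 1000  # arbitrary; real VMs tune this per tier

class TieredRunner:
    def __init__(self):
        self.counts = {}    # function -> invocation count
        self.compiled = {}  # function -> "optimized" version

    def call(self, func, *args):
        if func in self.compiled:
            return self.compiled[func](*args)  # fast tier
        self.counts[func] = self.counts.get(func, 0) + 1
        if self.counts[func] >= HOT_THRESHOLD:
            # Stand-in for handing the hot method (and its profiling
            # statistics) to an optimizing compiler.
            self.compiled[func] = self.optimize(func)
        return func(*args)                     # interpreter tier

    def optimize(self, func):
        # A real JIT would specialize using the collected profile;
        # here we just return the function unchanged.
        return func
```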
A non-insignificant amount of effort is being spent trying to make Java start faster, and a key portion of that is resolving the class-loading problem.
That's similar to how JS does things.
Java does have a "client" optimization mode for shorter-lived programs (like GUIs, for example), but AFAIK it's basically unused at this point. The more aggressive "server" optimizations are faster than ever and get triggered pretty aggressively now. The nature of the JVM is also changing: with fast scaling and containerization, a slow start and a long warmup aren't good. That's why part of JDK development has been dedicated to resolving that.
All commercial JVMs have had JIT caches for quite some time, and this is finally also available as free beer on OpenJDK, so code can execute right away as if it were an AOT-compiled language.
In some of those implementations, the JIT cache gets updated after each execution, taking profiling data into account, so we have the possibility of reaching an optimal state across the lifetime of the executable.
The .NET and ART cousins also have similar mechanisms in place.
Which I guess is what your last sentence refers to, but I wasn't sure.
Yup, the CDS and now the AOT stuff in OpenJDK is what I was referring to. Project Leyden.
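As a loose illustration of the cache-between-runs idea (a minimal sketch; the file name and JSON format are made up, and real JIT caches store compiled code and much richer profiles): persist profiling counts at exit and reload them at startup, so the next run treats previously hot code as hot immediately.

```python
import json
import os

PROFILE_PATH = "jit_profile.json"  # hypothetical cache file

def load_profile():
    # On startup, reload counts saved by a previous run, so code that
    # was hot last time can be treated as hot right away.
    if os.path.exists(PROFILE_PATH):
        with open(PROFILE_PATH) as f:
            return json.load(f)
    return {}

def save_profile(counts):
    # After each execution, write the updated profile back, so the
    # cache keeps improving across the lifetime of the executable.
    with open(PROFILE_PATH, "w") as f:
        json.dump(counts, f)

counts = load_profile()
counts["hot_function"] = counts.get("hot_function", 0) + 1
save_profile(counts)
```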
It’s true that there are some necessary pessimizations, but nothing as severe as failing to optimize the code at all.
But LuaJIT is also a tracing JIT, which seems to work well enough.
I’ve heard that LuaJIT has more stable perf than Mozilla’s tracing JIT had, and I’ve heard plenty of stories about how flaky LuaJIT’s performance is. But we can’t know how good it really is due to the lack of competitors.
Can cross-fertilization between the PyPy and CPython JIT efforts help the already-fast PyPy get even faster? Like, did the CPython JIT team try something PyPy developers didn't attempt before?
PyPy is awesome, btw.
The biggest differences between the two JITs are:
1. PyPy is meta-tracing; CPython is tracing.
2. PyPy has "standard" code generation backends; CPython uses copy-and-patch.
3. CPython so far uses "trace projection" while PyPy uses "trace recording".
(1+2) make the CPython JIT much faster to compile and warm up than PyPy, although I suspect that most of the gain comes from (1). However, this comes at the expense of generality, because in PyPy you can automatically trace across all builtins, whereas in CPython you are limited to the bytecode.
Trace projection looked very interesting to me because it automatically solves a problem which I found everywhere in real-world code: if you do trace recording, you don't know whether you will actually be able to close the loop, so you must decide to give up after a certain threshold ("trace too long"). The problem is that there doesn't seem to be a threshold which is generally good, so you always end up tracing too much (big warmup costs, plus you are literally doing unnecessary work) or not enough (the generated code is less optimal, sometimes by up to 5-10x).
With trace projection you decide which loop to optimize "in retrospect", so you don't have that specific problem. However, you have OTHER problems (in particular, you don't know the actual values used in the trace), which makes it harder to optimize, so the CPython JIT plans to switch to trace recording.
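A rough sketch of the trace-recording trade-off described above (the limit and names are hypothetical, not PyPy's or CPython's actual machinery): record executed operations until the loop closes, and give up with "trace too long" past a threshold.

```python
# Illustrative trace recorder with a "trace too long" abort. Ops are
# (pc, name) pairs standing in for executed bytecodes.
TRACE_LIMIT = 50  # arbitrary; no single threshold fits all programs

class TraceTooLong(Exception):
    pass

def record_trace(ops, loop_start):
    """Record ops until execution returns to loop_start, or give up."""
    trace = []
    for pc, op in ops:
        if pc == loop_start and trace:
            return trace  # back at the loop head: the trace is closed
        trace.append(op)
        if len(trace) > TRACE_LIMIT:
            # Tracing too much wastes warmup time; aborting too early
            # leaves the loop running unoptimized in the interpreter.
            raise TraceTooLong(f"gave up after {len(trace)} ops")
    raise TraceTooLong("never returned to the loop head")

# A loop body that closes after three recorded ops:
ops = [(0, "load"), (1, "add"), (2, "jump_back"), (0, "load")]
print(record_trace(ops, loop_start=0))  # ['load', 'add', 'jump_back']
```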
The problem of cpyext is that it's super slow, for good reasons: https://pypy.org/posts/2018/09/inside-cpyext-why-emulating-c...
There are efforts to create a new C API which is more friendly to alternative implementations (including CPython itself, when they want to change how they do things internally): https://hpyproject.org/ https://github.com/py-ni
Diagram: https://docs.julialang.org/en/v1/devdocs/img/compiler_diagra...
Documentation: https://docs.julialang.org/en/v1/devdocs/eval/
From what I understand, Julia doesn’t do any tracing at all; it just compiles each function based on the types it receives. Obviously Python doesn’t have multiple dispatch, but that might actually make compilation easier. Swap out the LLVM step for Python's IR and they could probably expect a pretty substantial performance improvement. That said, I don’t know anything about compilers; I just use both Python and Julia.
I'm not sure exactly how it differs from most JavaScript JITs, but I believe it just compiles each method once for each set of function argument types - for example, it doesn't try to dynamically determine the types of local variables.
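A toy Python analogue of that per-signature specialization (the decorator is hypothetical; a real system would generate type-specialized code rather than reuse the original function): cache one "compiled" version per tuple of argument types.

```python
# Toy model of caching one "compiled" specialization per tuple of
# argument types, roughly how Julia keeps one compiled method per
# concrete signature.
def specialize(func):
    cache = {}  # (type, ...) -> specialized callable

    def dispatch(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            # Stand-in for compilation: a real system would emit code
            # assuming these concrete argument types.
            cache[key] = func
        return cache[key](*args)

    return dispatch

@specialize
def add(x, y):
    return x + y

add(1, 2)      # fills the (int, int) entry
add(1.0, 2.0)  # fills a separate (float, float) entry
```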
One big advantage of tracing JITs is that they are generally easier to write and to maintain. For the specific case of PyPy, it's actually a "meta-tracing JIT": you trace the interpreter, not the underlying program, which TL;DR means that you can write the interpreter (which is "easy") and you get a JIT compiler for free. The basic assumption of a tracing JIT is that you have one or more "hot loops" in which one (or a few) fast paths are taken most of the time.

If the assumption holds, tracing has big advantages, because you eliminate most of the dynamism and you automatically inline across multiple layers of function calls, which in turn makes it possible to eliminate the allocation of most temporary objects. The problem is that the assumption doesn't always hold, and that's where you start to get problems.
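As a rough illustration of that fast-path-plus-guards shape (hypothetical names; a real trace is straight-line machine code, not Python): the optimized trace assumes every element is an int and bails out to a generic path when a guard fails.

```python
# Sketch of a compiled trace: a straight-line fast path protected by
# type guards; any guard failure deoptimizes to the generic path.
class GuardFailed(Exception):
    pass

def traced_sum(values):
    # Fast path recorded under the assumption that every item is an int.
    total = 0
    for v in values:
        if type(v) is not int:  # guard
            raise GuardFailed
        total += v              # specialized int add, no dynamic dispatch
    return total

def generic_sum(values):
    return sum(values)          # stand-in for falling back to the interpreter

def run_sum(values):
    try:
        return traced_sum(values)
    except GuardFailed:
        return generic_sum(values)  # deoptimize when the assumption breaks

run_sum([1, 2, 3])    # stays on the fast path
run_sum([1, 2.5, 3])  # guard fails, falls back
```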
But method JITs are not THE solution either. Meta has a whole team developing Cinder, which is a method JIT for Python, but they had to introduce what they call "static python", an opt-in set of constraints that removes some of Python's dynamism to make the JIT's job easier.
Finally, as soon as you call any C extension, any JIT is out of luck and must deoptimize to present a "state of the world" which is compatible with what the C extension expects.