Python 3.15’s Interpreter for Windows X86-64 Should Hopefully Be 15% Faster
Key topics
The quest for a faster Python interpreter is heating up, with a recent experiment showing a 15% speed boost for Python 3.15's Windows x86-64 interpreter. Commenters are buzzing about the potential of tail-calling interpreters, and some suggest that AI-powered optimization tools like AlphaEvolve could further accelerate the Python core loop. Opinions differ on the underlying reason for the speedup (better register use is one candidate), but commenters are relieved that CPython maintainers plan to keep multiple interpreter versions as fallbacks, mitigating long-term risk. As the discussion unfolds, the prospect of a significantly faster Python interpreter is generating excitement among developers.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 57m after posting
- Peak period: 106 comments (0-12h)
- Avg / period: 22.3
Based on 156 loaded comments
Key moments
- 01 Story posted: Dec 25, 2025 at 8:02 AM EST (8 days ago)
- 02 First comment: Dec 25, 2025 at 8:59 AM EST (57m after posting)
- 03 Peak activity: 106 comments in 0-12h, the hottest window of the conversation
- 04 Latest activity: Dec 31, 2025 at 1:24 PM EST (1d ago)
Looks like it refers to this:
https://youtu.be/pUj32SF94Zw
(I wish it were a link in the article)
I've asked Ken. He said he'll update the article.
> I used to believe that tail-calling interpreters get their speedup from better register use. While I still believe that now, I suspect that is not the main reason for the speedups in CPython.
> My main guess now is that tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.
> Let me show an example, at the time of writing, CPython 3.15’s interpreter loop is around 12k lines of C code. That’s 12k lines in a single function for the switch-case and computed goto interpreter.
> […] In short, this overly large function breaks a lot of compiler heuristics.
> One of the most beneficial optimisations is inlining. In the past, we’ve found that compilers sometimes straight up refuse to inline even the simplest of functions in that 12k loc eval loop.
Also the interpreter loop's dispatch is autogenerated and can be selected via configure flags. So there's almost no additional maintenance overhead. The main burden is the MSVC-specific changes we needed to get this working (amounting to a few hundred lines of code).
> Impact on debugging/profiling
I don't think there should be any, at least for Windows. Though I can't say for certain.
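To make the shape of the problem concrete, here is a toy sketch (invented opcodes, not CPython's actual code) of the classic single-function switch-case interpreter. CPython's real eval loop has hundreds of cases in one ~12k-line function, which is what overwhelms compiler heuristics:

```c
#include <stdio.h>

/* Toy sketch of a one-big-function switch-case interpreter.
 * Opcodes and structure are invented for illustration; CPython's
 * real eval loop has hundreds of cases in one ~12k-line function. */
enum { OP_INC, OP_HALT };

static int run(const unsigned char *ip) {
    int acc = 0;
    for (;;) {
        switch (*ip++) {          /* one shared dispatch point */
        case OP_INC:
            acc += 1;
            break;                /* loop back to the top to dispatch */
        case OP_HALT:
            return acc;
        }
    }
}

int main(void) {
    const unsigned char code[] = { OP_INC, OP_INC, OP_INC, OP_HALT };
    printf("%d\n", run(code));    /* prints 3 */
    return 0;
}
```

Scale that one switch up to hundreds of cases, each dozens of lines long, and it becomes plausible that a compiler's per-function inlining and register-allocation heuristics simply give up.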
I'd have expected it to be hand-rolled assembly for the major ISAs, with a C backup for less common ones.
How much energy has been wasted worldwide because of a relatively unoptimized interpreter?
For comparison: when JavaScript was first designed, performance wasn't a goal. Later on, people who had performance as a goal worked on JavaScript implementations. Thanks to heroic efforts, nowadays JavaScript is one of the languages with a decently fast implementation around. The base design of the language hasn't changed much (though how people use it might have changed a bit).
Python could do something similar.
So the problem is basically that a simple JIT is not beneficial for Python. You have to invest a lot of time and effort to get a few percent faster on a typical workload. Or you have to tighten up the language and/or break the C ABI, but then you break many existing popular libraries.
Most of the time, people don't use any of these customisations, do they?
So you'd need machinery that makes the common path go fast, but can fall back onto the customised path, if necessary?
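One common shape for such machinery is a guarded fast path with an inline cache; the sketch below is a hypothetical illustration of the general technique (in the spirit of CPython 3.11's adaptive specialization, but none of these names or structures are CPython's):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of a guarded fast path with a dynamic fallback.
 * This shows the general shape of inline caching, not CPython's
 * actual data structures. */

typedef struct {
    uint32_t layout_version;   /* bumped whenever the object is customised */
    double   fields[4];
} Obj;

typedef struct {
    uint32_t expected_version; /* layout the cache was specialised for */
    int      field_index;      /* precomputed result of the slow lookup */
} InlineCache;

static double slow_lookup(Obj *o, const char *name, InlineCache *ic) {
    (void)name;
    /* Stand-in for the fully dynamic path (dict lookups, __getattr__,
     * descriptors, ...). On success, refill the cache so the next
     * access with the same layout takes the fast path. */
    int idx = 2; /* pretend the dynamic machinery resolved "x" to slot 2 */
    ic->expected_version = o->layout_version;
    ic->field_index = idx;
    return o->fields[idx];
}

static double load_attr(Obj *o, const char *name, InlineCache *ic) {
    if (o->layout_version == ic->expected_version)
        return o->fields[ic->field_index];   /* common, uncustomised path */
    return slow_lookup(o, name, ic);         /* customised path fallback */
}

int main(void) {
    Obj o = { .layout_version = 1, .fields = { 0.0, 0.0, 42.0, 0.0 } };
    InlineCache ic = { 0, 0 };
    printf("%g\n", load_attr(&o, "x", &ic)); /* slow path, fills the cache */
    printf("%g\n", load_attr(&o, "x", &ic)); /* guarded fast path */
    return 0;
}
```

The guard is a single integer comparison on the common path; only code that actually customises the object's layout pays for the dynamic lookup.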
For all its dynamism, Python doesn't have anything close to `becomes:`.
I would say that by now what is holding Python back is the C ABI, and the culture that treats C code as Python.
For frequent, short-running scripts: start-up time! Every import has to scan a billion different directories for where the module might live, even for standard modules included with the interpreter.
The new `uv` is making good progress there.
Apparently people that care about performance do run Windows.
Eh, what about users? Games are made for Windows, because that's where the users (= players) are.
That's even more true for mobile and console games.
MSVC's support for musttail is hot off the press:
> The [[msvc::musttail]] attribute, introduced in MSVC Build Tools version 14.50, is an experimental x64-only Microsoft-specific attribute that enforces tail-call optimization. [1]
MSVC Build Tools version 14.50 was released last month, so it took less than 30 days for the CPython crew to turn that around into a performance improvement.
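For a sense of what this buys, here is a hedged sketch of tail-call dispatch (opcodes and handler names are invented; the attribute spelling and supported language modes vary by compiler):

```c
#include <stdio.h>

/* Sketch of tail-call dispatch: one small function per opcode instead
 * of one giant switch. Not CPython's actual code. */
#if defined(_MSC_VER) && !defined(__clang__)
#  define MUSTTAIL [[msvc::musttail]]   /* MSVC 14.50+, x64-only, experimental */
#elif defined(__clang__)
#  define MUSTTAIL [[clang::musttail]]  /* Clang; older C modes use __attribute__((musttail)) */
#else
#  define MUSTTAIL /* plain return: tail-call optimisation is then only a hope */
#endif

typedef int (*Handler)(const unsigned char *ip, int acc);
static int op_inc(const unsigned char *ip, int acc);
static int op_halt(const unsigned char *ip, int acc);

static const Handler table[] = { op_inc, op_halt };

/* The guaranteed tail call keeps each handler a small, separately
 * optimisable function instead of part of a 12k-line mega-function. */
#define DISPATCH() MUSTTAIL return table[*ip](ip, acc)

static int op_inc(const unsigned char *ip, int acc) {
    acc += 1;
    ip += 1;
    DISPATCH();
}

static int op_halt(const unsigned char *ip, int acc) {
    (void)ip;
    return acc;
}

int main(void) {
    const unsigned char code[] = { 0, 0, 0, 1 };  /* inc, inc, inc, halt */
    printf("%d\n", table[code[0]](code, 0));      /* prints 3 */
    return 0;
}
```

Because every handler ends in a guaranteed tail call, each opcode lives in its own small function that the optimizer can reason about sanely.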
Yay to getting undocumented MSVC features disclosed if Microsoft thinks you’re important enough :/
Python is one of the Microsoft-blessed languages on their devblogs.
Generally, not that much has happened in 5 years; sometimes 10-15% improvements are posted that are later offset by bloat.
I think the project started in 3.10, so 3.9 is the last version to compare to. The improvements aren't that great; I don't think any other language would get so much positive feedback for so little.
https://thenewstack.io/guido-van-rossums-ambitious-plans-for...
Agree with the sentiment; Python is the only dynamic language that seems like a graveyard of optimization efforts.
And nope, it isn't the dynamism per se: Smalltalk, Self, and Common Lisp are just as dynamic, with lots of possibilities to reboot the world and mess up JIT efforts, as any change impacts the whole image.
Naturally, those don't have internals exposed to C where anything goes, nor a culture where C libraries are seen as the language's libraries.
Python has some semantics and behaviors that are particularly hostile to optimization, but as the Faster Python and related efforts have suggested, the main challenge is full compatibility including extensions plus the historical desire for a simple implementation within CPython.
There are limits to retrofitting truly high performance to any of these languages. You want enough static, optional, or gradual typing to make it fast enough in the common case. That's why you also saw the V8 folks give up and make Dart, the Facebook ones made Hack, etc. It's telling that none of those gained truly broad adoption though. Performance isn't all that matters, especially once you have an established codebase and ecosystem.
V8 still got substantially faster after the first team left to do Dart. A lot of runtime optimizations (think object model optimizations), several new compilers, and a lot of GC work.
It's a huge investment to make a dynamic language go as fast as JS these days.
Yes, and on the other hand, other language implementations like CPython can learn from everything people figured out for JS.
And this is no small part of why Java and JS have frequently been pushing VM performance forward: there's enough code people very much care about to justify continued work on performance. (Though the two care about different things, mostly: Java cares much more about long-term performance, and JS cares much more about short-term performance.)
It doesn't hurt that they're both relatively static languages compared with, e.g., Python, either.
Sorry but unless your workload is some C API numpy number cruncher that just does matmuls on the CPU, that's probably false.
In 3.11 alone, CPython sped up by around 25% over 3.10 on pyperformance for x86-64 Ubuntu. https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-fas...
3.14 is 35-45% faster than CPython 3.10 for pyperformance x86-64 Ubuntu https://github.com/faster-cpython/benchmarking-public
These speedups have been verified by external projects. For example, a Python MLIR compiler that I follow has found a geometric mean 36% speedup moving from CPython 3.10 to 3.11 (page 49 of https://github.com/EdmundGoodman/masters-project-report)
Another academic benchmark here observed an around 1.8x speedup on their benchmark suite for 3.13 vs 3.10 https://youtu.be/03DswsNUBdQ?t=145
CPython 3.11 sped up enough that PyPy in comparison looks slightly slower. I don't know if anyone still remembers this, but back in the CPython 3.9 days, PyPy had over a 4x speedup over CPython on the PyPy benchmark suite; now it's 2.8x on their website (https://speed.pypy.org/) for 3.11.
Yes CPython is still slow, but it's getting faster :).
Disclaimer: I'm just a volunteer, not an employee of Microsoft, so I don't have a perf report to answer to. This is just my biased opinion.
System python 3.9.6:     26.80s user, 0.27s system, 99% cpu, 27.285 total
MacPorts python 3.9.25:  23.83s user, 0.32s system, 98% cpu, 24.396 total
MacPorts python 3.13.11: 15.17s user, 0.28s system, 98% cpu, 15.675 total
MacPorts python 3.14.2:  15.31s user, 0.32s system, 98% cpu, 15.893 total
Wish I'd known to try this test sooner!
Historically CPython performance has been so bad, that massive speedups were quite possible, once someone seriously got into it.
> By 1977[2][3] the phrase had entered American usage as slang for the cum shot in a pornographic film
https://en.wikipedia.org/wiki/Money_shot
https://news.ycombinator.com/item?id=46385526
Also, this time I'm pretty confident because there are two perf improvements here: the dispatch logic, and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter; I suspect the interpreter loop is just too complicated to meet those conditions. The key point is also that we would be relying on MSVC again to do its magic, whereas this tail-calling approach gives more control to the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or changing things to use macros [2]. However, we can't just mark every function as forceinline in CPython, as it might negatively affect other compilers.
[1]: https://github.com/faster-cpython/ideas/issues/183 [2]: https://github.com/python/cpython/issues/121263
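For contrast, here is a minimal computed-goto ("threaded code") sketch using the GCC/Clang labels-as-values extension, which MSVC lacks; that is why, on MSVC, threaded code has to be synthesized from a switch as in [1]. Toy opcodes again, not CPython's loop:

```c
#include <stdio.h>

/* Computed-goto dispatch (GCC/Clang labels-as-values extension,
 * not standard C): each handler ends in its own indirect jump. */
static int run(const unsigned char *ip) {
    static const void *labels[] = { &&op_inc, &&op_halt };
    int acc = 0;
    goto *labels[*ip];
op_inc:
    acc += 1;
    ip += 1;
    goto *labels[*ip];   /* dispatch is replicated in every handler */
op_halt:
    return acc;
}

int main(void) {
    const unsigned char code[] = { 0, 0, 0, 1 };  /* inc, inc, inc, halt */
    printf("%d\n", run(code));                    /* prints 3 */
    return 0;
}
```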
Also, I'm not that familiar with the whole process, but I just wanted to say that I think you were too hard on yourself during the last performance drama. So thank you again, and remember not to hold yourself to an impossible standard that no one else holds you to.
That was a very niche error, and one you promptly corrected; no need to be so apologetic about it! And thanks for all the hard work making Python faster!
If getting optimal code relies on a pile of heuristics going in your favor, you can't exactly call it a bug when the heuristics go the other way. Tail duplication is what we want in this case, but in other cases it might not be desired because of the increased code size. With the new design, the code can express the desired pattern more directly, leaving it less vulnerable to the whims of the optimizer.
(I actually spent most of Sep/Oct working on optimizing the Immer JS immutable update library, and used a benchmarking tool called `mitata`, so I was doing a lot of this same kind of work: https://github.com/immerjs/immer/pull/1183 . Would love to add some new tools to my repertoire here!)
It's in essence a histogram for the distribution, with smoothing, and mirrored on each side.
It looks nice, but it is not without well-deserved opposition, because 1) the use of smoothing can hide the actual distribution, 2) mirroring adds no extra information while taking up space and implying that the extra space contains information, and 3) when shown vertically, it too often causes people to exclaim that it looks like a vulva.
In an HN discussion on the topic, medstrom at https://news.ycombinator.com/item?id=40766519 points to a half-violin plot at https://miro.medium.com/v2/1*J3Q4JKXa9WwJHtNaXRu-kQ.jpeg with the histogram on the left, and the half-violin on the right, which gives you a chance to see side-by-side presentation of the same data.
But that also means we are used to seeing histograms, and can read their bin counts and widths to estimate possible deviations from the true distribution, while it's much harder to do the same with violin plots.
First is Google's manpower. Google somehow succeeds in writing fast software; most Google products I use are fast in contrast to the rest of the ecosystem. It's possible that Google simply did a better job.
The second is CPython's legacy. There are faster implementations of Python that completely implement the language (PyPy comes to mind), but there's a huge ecosystem of C extensions written against CPython's bindings, which makes it virtually impossible to break compatibility. It is possible that this legacy prevents many potential optimizations. V8, on the other hand, only needs to keep compatibility at the source-code level, which allows them to practically swap out the entire internals in an incremental search for a faster version.
I might be wrong, so take what I said with a grain of salt.
see https://en.wikipedia.org/wiki/Unladen_Swallow
V8 was a much higher priority - Google hired many of the world’s best VM engineers to develop it.
Anything goes regarding changing object shapes; it is one step beyond Smalltalk in language plasticity.
JavaScript is JIT'ed where CPython is not. PyPy has a JIT and is faster, but I think it is incompatible with C extensions.
I think Python's threading model also adds complexity to optimizing, where JavaScript's single thread is easier to optimize.
I would also say there’s generally less impetus to optimize CPython. At least until WASM, JavaScript was sort of stuck with the performance the interpreter had. Python had more off-ramps. You could use pypy for more pure Python stuff, or offload computationally heavy stuff to a C extension.
I think there are some language differences that make JavaScript easier to optimize, but I’m not super qualified to speak on that.
Nonetheless, Microsoft employed a whole "Faster CPython" team for 4 years - they targeted a 5x speedup but could only achieve ~1.5x. Why couldn't they make a significantly faster Python implementation, especially given that PyPy exists and proves it's possible?
Not an expert here, but my understanding is that Python is dynamic to the point that optimizing is hard. Like allowing one namespace to modify another; last I used it, the Stackdriver logging adapter for Python would overwrite the stdlib logging library. You import stackdriver, and it changes logging to send logs to stackdriver.
All package level names (functions and variables) are effectively global, mutable variables.
I suspect a dramatically faster Python would involve disabling some of the more unhinged mutability, e.g. package functions and variables cannot be mutated, only wrapped in a new variable.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Even "simple" stuff like field access in python may refer to multiple dynamically-mapped method resolution.
Also, the FFI bindings of Python, while offering a way to extend it with libraries written in C/C++/Fortran/..., limit how freely the internals can be changed (see, for example, the bug-for-bug compatibility work done by PyPy, with constraints that rule out some optimizations).
Very true, but IMO the existence of PyPy proves that this doesn't necessarily prevent a fast implementation. I think the reason for CPython's poor performance must be your other point:
> the ffi-bindings of python [...] limit how freely the internals can be changed
PyPy pays for this by having slower C interaction.
Most people that parrot Python's dynamism as the root cause never used Smalltalk, Self, or Common Lisp, or even PyPy for that matter.
Which can change on the fly anything that is currently executing in the image.
Also after breaking into the debugger, the world can be totally different after resuming execution at the trap location.
Then there are nice primitives like `a becomes: b`, where all occurrences of a get swapped with b.
Because JS's centrality to the web, and V8's speed's centrality to Google's push to keep other platform owners from controlling the web via platform-default browsers, meant virtually unlimited resources were spent optimizing V8 at a time when the JS language itself was basically static. Python has never had the same level of investment, and has always spent some of its smaller resources on advancing the language rather than optimizing the implementation.
Also, Python has a de facto stable(ish) C ABI for extensions that is 1) heavily used by popular libraries, and 2) makes life more difficult for the JIT, because the native code has all the same expressive power over Python objects, but the JIT can't analyze that code to ensure it doesn't use it.
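To make that concrete, here is a minimal, hypothetical extension module using real CPython C API calls; from a JIT's point of view, any call into code like this is a black box that may rebind attributes on any reachable object:

```c
#include <Python.h>

/* Hypothetical module "blackbox": native code has the same expressive
 * power over Python objects as Python code, but a JIT cannot analyse it. */
static PyObject *shuffle_attr(PyObject *self, PyObject *obj) {
    (void)self;
    PyObject *val = PyLong_FromLong(42);
    if (val == NULL)
        return NULL;
    /* Rebind obj.x behind the JIT's back, invalidating any assumption
     * the optimiser made about obj's attributes. */
    int rc = PyObject_SetAttrString(obj, "x", val);
    Py_DECREF(val);
    if (rc < 0)
        return NULL;
    Py_RETURN_NONE;
}

static PyMethodDef methods[] = {
    { "shuffle_attr", shuffle_attr, METH_O, "Set obj.x = 42 from C." },
    { NULL, NULL, 0, NULL }
};

static struct PyModuleDef moddef = {
    PyModuleDef_HEAD_INIT, "blackbox", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_blackbox(void) {
    return PyModule_Create(&moddef);
}
```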
If you have a separate switch at the end of each instruction, then the branch predictor will be right any time an instruction is followed by the same instruction as last time, which can probably happen quite a lot in short loops.
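Concretely, "a separate switch at the end of each instruction" can look like the toy sketch below (hand-written tail duplication with invented opcodes; standard C, no extensions). Each DISPATCH() expansion is its own dispatch site, so the predictor can learn per-opcode patterns like "INC is usually followed by INC":

```c
#include <stdio.h>

enum { OP_INC, OP_HALT };

/* One switch per instruction: every expansion of this macro is a
 * distinct branch site instead of a single shared dispatch point. */
#define DISPATCH()              \
    switch (*ip++) {            \
    case OP_INC:  goto inc;     \
    case OP_HALT: goto halt;    \
    }

static int run(const unsigned char *ip) {
    int acc = 0;
    DISPATCH();
inc:
    acc += 1;
    DISPATCH();
halt:
    return acc;
}

int main(void) {
    const unsigned char code[] = { OP_INC, OP_INC, OP_HALT };
    printf("%d\n", run(code));  /* prints 2 */
    return 0;
}
```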
"apology peice" and "tail caling"
I’m sure with enough cajoling you can make the LLM spit out a technical blog post that isn’t discernibly slop - wanton emoji usage, clichés, self-aggrandizement, relentlessly chipper tone, short “punchy” paragraphs, an absence of depth, “it’s not just X—it’s a completely new Y” - but it must be at least a little tricky what with how often people don’t bother.
[ChatGPT, insert a complaint about how people need to ram LLMs into every discussion no matter how irrelevant here.]
You can ask the AI to make typos for you.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
So far I think using clang instead of the MSVC compiler is a strict win? Not a huge difference, mind you. But a win nonetheless.