Python Numbers Every Programmer Should Know
Key topics
Diving into the nuances of Python performance, a recent article sparked debate over whether certain numerical benchmarks are essential knowledge for programmers. While some commenters, like jerf, argued that understanding these numbers is crucial for "serious" Python programmers, others, such as ktpsns and Demiurge, countered that they're often irrelevant in real-world applications. A consensus emerged that when performance becomes critical, Python might not be the best tool for the job, with some suggesting that pushing performance-critical code to lower-level languages like C or Rust is a more effective optimization strategy. The discussion highlights the tension between Python's high-level abstraction and the need for low-level optimization.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 2h after posting
- Peak period: 79 comments in the 0-6h window
- Average per period: 14.5
Based on 160 loaded comments
Key moments
- Story posted: Jan 1, 2026 at 9:39 AM EST (8 days ago)
- First comment: Jan 1, 2026 at 11:28 AM EST (2h after posting)
- Peak activity: 79 comments in the 0-6h window, the hottest stretch of the conversation
- Latest activity: Jan 5, 2026 at 11:22 AM EST (4 days ago)
Though IMHO it suffices just to know that "Python is 40-50x slower than C and is bad at using multiple CPUs" is not just some sort of anti-Python propaganda from haters, but a fairly reasonable engineering estimate. If you know that, you don't really need that chart. If your task can tolerate that sort of performance, you're fine; if not, figure out early how you are going to solve that problem, be it through the several ways of binding faster code to Python, using PyPy, or by not using Python in the first place, whatever is appropriate for your use case.
More contentiously: don't fret too much over performance in Python. It's a slow language (except for some external libraries, but that's not the point of the OP).
When this starts to matter, python stops being the right tool for the job.
But I agree with the spirit of what you wrote - these numbers are interesting but aren't worth memorizing. Instead, instrument your code in production to see where it's slow in the real world, profile your code (with py-spy, it's the best tool for this if you're looking for CPU-hogging code), and if you find yourself worrying about how long it takes to add something to a list in Python, you really shouldn't be doing that operation in Python at all.
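For reference, a minimal stdlib-only profiling sketch; py-spy, mentioned above, is a separate command-line sampling profiler, so this uses cProfile as a quick in-process stand-in (function and data names here are illustrative):

```python
# Minimal stdlib profiling sketch (cProfile/pstats) for finding CPU-hogging
# functions. py-spy is a separate sampling profiler run from the command line;
# this is just a quick in-process alternative.
import cProfile
import io
import pstats

def slow_membership(items, queries):
    # Deliberately use a list membership test in a loop (O(n) per lookup).
    return sum(1 for q in queries if q in items)

items = list(range(10_000))
queries = list(range(0, 50_000, 5))

profiler = cProfile.Profile()
profiler.enable()
slow_membership(items, queries)
profiler.disable()

# Print the top 5 functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```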
You don't see any value in knowing those numbers?
People usually approach this the other way: use something like pandas or numpy from the beginning if it solves your problem. Do not write matrix multiplications or joins in Python at all.
If there is no library that solves your problem, it's a great indication that you should avoid python. Unless you are willing to spend 5 man-years writing a C or C++ library with good python interop.
If you are writing performance sensitive code that is not covered by a popular Python library, don't do it unless you are a megacorp that can put a team to write and maintain a library.
Many problems can be solved performantly in pure Python, especially via the growing set of tools like the JIT libraries I cited. Even more will be solvable when things like free-threaded Python land. It will be a minority of problems that can't be, if it isn't already.
That is exactly how we approach it though. We didn't start out with turbodbc + pandas; it started as an SQLAlchemy and pandas service. Then when it was too slow, I got involved, found and dealt with the bottlenecks. I'm not sure how you would find and fix such things without knowing the efficiency, or lack thereof, of different parts of Python. Also, as you'll notice, we didn't write our own stuff, we simply used more efficient Python libraries.
Most of the time starting up is spent searching the filesystem for thousands of packages.
I think as they said: when dynamically building a shell input prompt, it starts to become very noticeable if you have 3 or more of these and you use the terminal a lot.
Yes, after 2-3 I agree you'd start to notice if you were really fast. I suppose at that point I'd just have Gemini rewrite the prompt-building commands in Rust (it's quite good at that) or merge all the prompt-building commands into a single one (to amortize the startup cost).
Some of those numbers are very important:
- Set membership check is 19.0 ns, list is 3.85 μs. Knowing what data structure to use for the job is paramount.
- Write 1KB file is 35.1 μs but 1MB file is only 207 μs. Knowing the implications of I/O trade off is essential.
- sum() 1,000 integers is only 1,900 ns: Knowing to leverage the stdlib makes all the difference compared to a manual loop.
Etc.
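To make the membership and sum() points concrete, here is a minimal timeit sketch (not from the article; absolute numbers vary by machine and CPython version, but the relative gaps should hold):

```python
# Rough illustration of the numbers above: list vs set membership for 1,000
# items, and sum() over the same data.
import timeit

data_list = list(range(1_000))
data_set = set(data_list)
missing = -1  # worst case for the list: scans all 1,000 elements

list_time = timeit.timeit(lambda: missing in data_list, number=100_000)
set_time = timeit.timeit(lambda: missing in data_set, number=100_000)
sum_time = timeit.timeit(lambda: sum(data_list), number=100_000)

print(f"list membership: {list_time / 100_000 * 1e9:8.1f} ns/op")
print(f"set membership:  {set_time / 100_000 * 1e9:8.1f} ns/op")
print(f"sum() of 1,000:  {sum_time / 100_000 * 1e9:8.1f} ns/op")
```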
A few years ago I did a Python rewrite of a big client's code base. They had a massive calculation process that took 6 servers 2 hours.
We got it down to 1 server, 10 minutes, and it was not even the goal of the mission, just the side effect of using Python correctly.
In the end, quadratic behavior is quadratic behavior.
IMO this kind of analysis is the weakest part of this post. The key detail hidden here is that this is for a 1k-element set/list. I'd want to know what the tipping point is. When does the list become cheaper? Although I guess it's probably < 5 elements, given that the list is 200x slower.
Relevant if your problem demands instantiation of a large number of objects. This reminds me of a post where Eric Raymond discusses the problems he faced while trying to use Reposurgeon to migrate GCC. See http://esr.ibiblio.org/?p=8161
Your cognition of it is either implicit or explicit.
Even if you didn't know, for example, that list appends were linear and not quadratic, and fairly fast.
Even if you didn't give a shit if simple programs were for some reason 10000x slower than they needed to be because it meets some baseline level of good enough.
Library authors beneath you would know and the APIs you interact with and the pythonic code you see and the code LLMS generate will be affected by that leaky abstraction.
If you think that n^2 naive list appends is a bad example, it's not, btw: Python string appends are n^2, and that has affected and still affects how people do things; f-strings, for example, are lazy.
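As a concrete illustration of the string-append point, here is a small sketch (assuming CPython; note that recent CPython versions sometimes optimize in-place str concatenation, so the size of the gap varies):

```python
# Naive repeated concatenation copies the growing string over and over
# (quadratic in the worst case), while str.join builds the result in one pass.
import timeit

parts = ["x"] * 10_000

def concat_plus():
    s = ""
    for p in parts:
        s += p          # may trigger repeated copies of the growing string
    return s

def concat_join():
    return "".join(parts)

print("+= loop :", timeit.timeit(concat_plus, number=200))
print("join    :", timeit.timeit(concat_join, number=200))
```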
The list of floats is larger, despite also being simply an array of 1000 8-byte pointers. I assume that it's because the int array is constructed from a range(), which has a __len__(), and therefore the list is allocated to exactly the required size; but the float array is constructed from a generator expression and is presumably dynamically grown as the generator runs and has a bit of free space at the end.
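That hypothesis is easy to probe on your own interpreter; a minimal sketch (sizes are CPython- and platform-specific, and sys.getsizeof counts only the list object plus its pointer array, not the elements themselves):

```python
# Compare the container size of a list built from range() (known length,
# allocated exactly) with one built from a generator (grown dynamically,
# usually with spare capacity at the end).
import sys

ints_from_range = list(range(1_000))
floats_from_gen = list(float(i) for i in range(1_000))

print("list from range():  ", sys.getsizeof(ints_from_range), "bytes")
print("list from generator:", sys.getsizeof(floats_from_gen), "bytes")
```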
"Performance is something we should only think about when it's too late" is basically doomed to fail unless your problem is trivial and the perf never matters at all.
P50 latency for a fastapi service’s endpoint is 30+ seconds. Your ingestion pipeline, which has a data ops person on your team waiting for it to complete, takes more than one business day to run.
Your program is obviously unacceptable. And, your problems are most likely completely unrelated to these heuristics. You either have an inefficient algorithm or more likely you are using the wrong tool (ex OLTP for OLAP) or the right tool the wrong way (bad relational modeling or an outdated LLM model).
If you are interested in shaving off milliseconds in this context then you are wasting your time on the wrong thing.
All that being said, I’m sure that there’s a very good reason to know this stuff in the context of some other domains, organizations, company size/moment. I suspect these metrics are irrelevant to disproportionately more people reading this.
At any rate, for those of us who like to learn, I still found this valuable but by no means common knowledge
In my experience writing computer vision software, people really struggle with the common sense of how fast computers really are. Some knowledge like how many nanoseconds an add takes can be very illuminating to understand whether their algorithm's runtime makes any sense. That may push loose the bit of common sense that their algorithm is somehow wrong. Often I see people fail to put bounds on their expectations. Numbers like these help set those bounds.
This page is a nice reminder of the fact, with numbers. For a while, at least, I will Know, instead of just feel, like I can ignore the low level performance minutiae.
I completely understand why it's frustrating or confusing by itself, though.
From what I've been able to glean, it was basically created in the first few years Jeff worked at Google, on indexing and serving for the original search engine. For example, the comparison of cache, RAM, and disk determined whether data was stored in RAM (the index, used for retrieval) or on disk (the documents, typically not used in retrieval, but used in scoring). Similarly, the California-Netherlands time comparison: I believe Google's first international data center was in NL, and they needed to make decisions about copying over the entire index in bulk versus serving backend queries in the US with frontends in NL.
The numbers were always going out of date; for example, the arrival of flash drives changed disk latency significantly. I remember Jeff came to me one day and said he'd invented a compression algorithm for genomic data "so it can be served from flash" (he thought it would be wasteful to use precious flash space on uncompressed genomic data).
To use a trivial example, using a set instead of a list to check membership is a very basic replacement, and can dramatically improve your running time in Python. Just because you use Python doesn't mean anything goes regarding performance.
The case is among the example numbers given in TFA:
"Dict lookup by key", "List membership check"
Does it have to spell out the difference is algorithmic in this case for the comparison to be useful?
Or, inversely, is the difference between e.g. memory and disk access times insignificant, because it's not algorithmic?
After skimming over all of them, it seems like most "simple" operations take on the order of 20ns. I will leave with that rule of thumb in mind.
Last time I benchmarked a VPS it was about the performance of an Ivy Bridge generation laptop.
I have a number of Intel N95 systems around the house for various things. I've found them to be a pretty accurate analog for small VPS instances. The N95s are Intel E-cores, which are effectively Sandy Bridge/Ivy Bridge cores.
Stuff can fly on my MacBook but then drag on a small VPS instance, so validating against an N95 (which I already have) is helpful. YMMV.
I usually prefer classic %-formatting for readability when the arguments are longer and f-strings when the arguments are shorter. Knowing there is a material performance difference at scale, might shift the balance in favour of f-strings for some situations.
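A quick sketch of that comparison (illustrative message; results differ across CPython versions, though f-strings are typically the faster of the two because the formatting is compiled into bytecode):

```python
# Compare %-formatting with an f-string for a simple two-argument message.
import timeit

name, count = "widget", 42

pct_time = timeit.timeit(lambda: "item %s has %d units" % (name, count),
                         number=1_000_000)
fstr_time = timeit.timeit(lambda: f"item {name} has {count} units",
                          number=1_000_000)

print(f"%-formatting: {pct_time:.3f} s per 1M ops")
print(f"f-string:     {fstr_time:.3f} s per 1M ops")
```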
Thanks for the feedback everyone. I appreciate you posting it, @woodenchair, and @aurornis for pointing out the intent of the article.
The idea of the article is NOT to suggest you should shave 0.5ns off by choosing some dramatically different algorithm or that you really need to optimize the heck out of everything.
In fact, I think a lot of what the numbers show is that overthinking the optimizations often isn't worth it (e.g. caching len(coll) into a variable rather than calling it over and over is less useful than it might seem conceptually).
Just write clean Python code. So much of it is way faster than you might have thought.
My goal was only to create a reference to what various operations cost to have a mental model.
For example, from the post "Maybe we don’t have to optimize it out of the test condition on a while loop looping 100 times after all."
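A sketch of that while-loop case (the loop body is a placeholder; the point is simply that the len() call in the test condition costs little at this scale):

```python
# Call len() in the loop condition versus caching it once; for a 100-iteration
# loop the difference is tiny.
import timeit

coll = list(range(100))

def loop_with_len():
    i = 0
    while i < len(coll):   # len() evaluated every iteration
        i += 1

def loop_with_cached_len():
    i, n = 0, len(coll)
    while i < n:           # len() evaluated once
        i += 1

print("len() each time:", timeit.timeit(loop_with_len, number=10_000))
print("cached len():   ", timeit.timeit(loop_with_cached_len, number=10_000))
```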
It is helpful to know the relative value (costs) of these operations. Everything else can be profiled and optimized for the particular needs of a workflow in a specific architecture.
To use an analogy, turbine designers no longer need to know the values in the "steam tables", but they do need to know efficient geometries and trade-offs among them when designing any Rankine cycle to meet power, torque, and Reynolds regimes.
1) Measurements are faulty. A list of 1,000 ints can be 4x smaller. Most time measurements depend on circumstances that are not mentioned, and therefore can't be reproduced.
2) Brainrot AI style; a hashmap is not 200x faster than a list, that's not how complexity works.
3) orjson/ujson are faulty, which is one of the reasons they don't replace the stdlib implementation. Expect crashes, broken JSON, anything from them.
4) What actually will be used in number-crunching applications - numpy or similar libraries - is not even mentioned.
I disagree. A lot of important and large codebases were grown and maintained in Python (Instagram, Dropbox, OpenAI) and it's damn useful to know how to reason your way out of a Python performance problem when you inevitably hit one without dropping out into another language, which is going to be far more complex.
Python is a very useful tool, and knowing these numbers just makes you better at using the tool. The author is a Python Software Foundation Fellow. They're great at using the tool.
In the common case, a performance problem in Python is not the result of hitting the limit of the language but the result of sloppy un-performant code, for example unnecessarily calling a function O(10_000) times in a hot loop.
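A hedged sketch of what that pattern looks like, with a deliberately trivial predicate (names are illustrative, not from any cited codebase):

```python
# A function call has fixed overhead in CPython; paying it tens of thousands
# of times inside a hot loop adds up even when the function itself is trivial.
import timeit

def is_valid(x):
    return x >= 0

data = list(range(10_000))

def per_item_call():
    return [x for x in data if is_valid(x)]   # one call per element

def inlined_check():
    return [x for x in data if x >= 0]        # no call overhead

print("call per item:", timeit.timeit(per_item_call, number=1_000))
print("inlined check:", timeit.timeit(inlined_check, number=1_000))
```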
I wrote up a more focused "Python latency numbers you should know" as a quiz here https://thundergolfer.com/computers-are-fast
How does this happen? Is it just inertia that causes people to write large systems in an essentially type-free, interpreted scripting language?
while I'm on the soapbox I'll give java a special mention: a couple years ago I'd have said java was easy even though it's tedious and annoying, but I've become reacquainted with it for a high school program (python wouldn't work for what they're doing and the school's comp sci class already uses java.)
this year we're switching to c++.
10 years later "ok it's too slow; our options are a) spend $10m more on servers, b) spend $5m writing a faster Python runtime before giving up later because nobody uses it, c) spend 2 years rewriting it and probably failing, during which time we can make no new features. a) it is then."
What makes that language not strictly superior to Python?
I'd say they are almost strictly superior to Python, but there are some minor factors why you might still choose Python over those. E.g. arbitrary precision integers, or the REPL. Go is a bit tedious and Rust is harder to learn (but productive once you have).
But overall they would all be a better choice than Python. Yes even for startups who need to move fast.
Kotlin owns the mobile development market with 80% Android market share.
Scala was the AI language before Python, with Hadoop, Spark, and friends.
Lisps might be niche, yet they have had Python's flexibility, with machine-code compilers, since 1958.
Some startups end up in between the two extremes above. I was at one of the Python-based ones that ended up in the middle. At $30M in annual revenue, Python was handling 100M unique monthly visitors on 15 cheap, circa-2010 servers. By the time we hit $1B in annual revenue, we had Spark for both heavy batch computation and streaming computation tasks, and Java for heavy online computational workloads (e.g., online ML inference). There were little bits of Scala, Clojure, Haskell, C++, and Rust here and there (with well over 1K developers, things creep in over the years). 90% of the company's code was still in Python and it worked well. Of course there were pain points, but there always are. At $1B in annual revenue, there was budget for investments to make things better (cleaning up architectural choices that hadn't kept up, adding static types to core things, scaling up tooling around package management and CI, etc.).
But a key to all this... the product that got to $30M (and eventually $1B+) looked nothing like what was pitched to initial investors. It was unlikely that enough things could have been tried to land on the thing that worked without excellent developer productivity early on. Engineering decisions are not only about technical concerns, they are also about the business itself.
If I told you that we were going to be running a very large payments system, with customers from startups to Amazon, you'd not write it in ruby and put the data in MongoDB, and then using its oplog as a queue... but that's what Stripe looked like. They even hired a compiler team to add type checking to the language, as that made far more sense than porting a giant monorepo to something else.
I think the list itself is super long winded and not very informative. A lot of operations take about the same amount of time. Does it matter that adding two ints is very slightly slower than adding two floats? (If you even believe this is true, which I don’t.) No. A better summary would say “all of these things take about the same amount of time: simple math, function calls, etc. these things are much slower: IO.” And in that form the summary is pretty obvious.
I agree. I have to compliment the author for the effort put in. However, it misses the point of the original "Latency Numbers Every Programmer Should Know", which is to build an intuition for making good ballpark estimations of the latency of operations, e.g. that A is two orders of magnitude more expensive than B.
Python’s issue is that it is incredibly slow in use cases that surprise average developers. It is incredibly slow at very basic stuff, like calling a function or accessing a dictionary.
If Python didn’t have such an enormous number of popular C and C++ based libraries it would not be here. It was saved by Numpy etc etc.
3 times.
This is the naive version of that code, because "I will parallelize it later" and I was just getting the logic down.
Turns out, when you use programming languages that are fit for purpose, you don't have to obsess over every function call, because computers are fast.
I think people vastly underestimate how slow python is.
We are rebuilding an internal service in Java, going from python, and our half assed first attempts are over ten times faster, no engineering required, exactly because python takes forever just to call a function. The python version was dead, and would never get any faster without radical rebuilds and massive changes anyway.
It takes python 19ns to add two integers. Your CPU does it in about 0.3 ns..... in 2004.
That those ints take 28 bytes each to hold in memory is probably why the new Java version of the service takes 1/10th the memory as well.
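The 28-byte figure is easy to verify directly (CPython-specific; exact sizes can differ slightly by version and platform):

```python
# CPython ints are full heap objects; even a small int carries object-header
# overhead on top of its digits.
import sys

print(sys.getsizeof(1))        # typically 28 bytes on 64-bit CPython
print(sys.getsizeof(10**30))   # larger ints grow beyond that
```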
Except when it’s not I/O.
I have said something very wrong. Java may be fast but it isn't magic.
What I claimed is not possible and I should have realized that.
I can not correct my previous claim. I wish HN had longer edit windows and I wish HN allowed you to downvote older comments.
I cannot erase this wrong info
O(10_000) is a really weird notation.
I think these kinds of numbers are everywhere and not just specific to Python.
In Zig, I sometimes take a brief look at the CPU-cycle counts of various operations to cut down on cache misses, and I need to be aware of the alignment and size of a data type to debloat a data structure. If their logic applies, too bad, I should quit programming, since all languages have their own latencies on certain operations we should be aware of.
There are reasons to not use Python, but that particular reason is not the one.
For me, it will help with selecting what language is best for a task. I think it won't change my view that python is an excellent language to prototype in though.
Benchmark Iteration Process
Core Approach:
- Warmup Phase: 100 iterations to prepare the operation (default)
- Timing Runs: 5 repeated runs (default), each executing the operation a specified number of times
- Result: Median time per operation across the 5 runs
Iteration Counts by Operation Speed:
- Very fast ops (arithmetic): 100,000 iterations per run
- Fast ops (dict/list access): 10,000 iterations per run
- Medium ops (list membership): 1,000 iterations per run
- Slower ops (database, file I/O): 1,000-5,000 iterations per run
Quality Controls:
- Garbage collection is disabled during timing to prevent interference
- Warmup runs prevent cold-start bias
- Median of 5 runs reduces noise from outliers
- Results are captured to prevent compiler optimization elimination
Total Executions: For a typical benchmark with 1,000 iterations and 5 repeats, each operation runs 5,100 times (100 warmup + 5×1,000 timed) before reporting the median result. This methodology ensures reliable, reproducible measurements across different operation speeds.
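A minimal sketch of that methodology, reconstructed from the description above rather than taken from the article's actual harness (parameter names and defaults here are illustrative):

```python
# Illustrative re-creation of the described methodology: warm up, disable GC,
# time several repeated runs, and report the median per-operation cost.
import gc
import statistics
import time

def benchmark(op, iterations=1_000, repeats=5, warmup=100):
    for _ in range(warmup):          # warmup phase to avoid cold-start bias
        op()
    gc_was_enabled = gc.isenabled()
    gc.disable()                     # keep the collector out of the timings
    try:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            for _ in range(iterations):
                op()
            runs.append((time.perf_counter() - start) / iterations)
    finally:
        if gc_was_enabled:
            gc.enable()
    return statistics.median(runs)   # median of the repeated runs

# Example: median cost of a set membership check.
s = set(range(1_000))
print(f"{benchmark(lambda: 500 in s) * 1e9:.1f} ns per operation")
```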
It's -5 to 256, and these have very tricky behavior for programmers who confuse identity and equality.
```
>>> a = -5
>>> b = -5
>>> a is b
True
>>> a = -6
>>> b = -6
>>> a is b
False
```
Nobody is going to remember any of the numbers on this new list.
I remember refactoring some code to improve readability, then observing something that was previously a few microseconds take tens of seconds.
The original code created a large list of lists. Each child list had 4 fields each field was a different thing, some were ints and one was a string.
I created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting. Modern Python developers would use a data class for this.
The new code was very slow. I’d love it if the author measured the time taken to instantiate a class.
The doctor said, “don’t do that”.
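For anyone curious, here is a rough sketch of the measurement the commenter asked for, comparing a plain list row, a regular class, and a dataclass (field names are made up; absolute numbers vary by machine and CPython version):

```python
# Rough comparison of building a "row" as a plain list, a regular class, and a
# dataclass; instance creation is where the extra cost shows up.
import timeit
from dataclasses import dataclass

class Row:
    def __init__(self, id, qty, price, name):
        self.id = id
        self.qty = qty
        self.price = price
        self.name = name

@dataclass
class RowDC:
    id: int
    qty: int
    price: int
    name: str

print("list row:     ", timeit.timeit(lambda: [1, 2, 3, "abc"], number=1_000_000))
print("class row:    ", timeit.timeit(lambda: Row(1, 2, 3, "abc"), number=1_000_000))
print("dataclass row:", timeit.timeit(lambda: RowDC(1, 2, 3, "abc"), number=1_000_000))
```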
customers[3][4]
is a lot less readable than
customers[3].balance
But hidden in this is the failing of every SQL bridge ever - it's definitely easier for a programmer to read customers[3].balance, but the trade-off now is I have to provide class-based semantics for all operations - and that tends to hide (oh you know, impedance mismatch).
I would far prefer "store the records as plain as we can" and add functions to operate over it (think pandas, which stores basically just ints, floats, and strings, as it is numpy underneath).
(Yes you can store pyobjects somehow but the performance drops off a cliff.)
Anyway - keep the storage and data structure as raw and simple as possible and write functions to run over it. And move to pandas or SQLite pretty quickly :-)
When you’re in a project with a few million lines of code and 10 years of history it can get confusing.
Your data will have been handled by many different functions before it gets to you. If you do this with raw lists then the code gets very confusing. In one data structure the customer name might be [4] and another structure might have it in [9]. Worse, someone adds a new field at [5]; then, when two lists get concatenated, name moves to [10] in downstream code which consumes the concatenated lists.
Please post your code snippet on StackOverflow ([Python] tag) or CodeReview.SE so people can help you fix it.
> created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting.
Then again, if you're worried about any of the numbers in this article maybe you shouldn't be using Python at all. I joke, but please do at least use Numba or Numpy so you aren't paying huge overheads for making an object of every little datum.
https://rushter.com/blog/python-strings-and-memory/
[1] https://hex.pm/packages/snakepit [2] https://hex.pm/packages/snakebridge
- If slotted attribute reads and regular attribute reads are the same latency, I suspect that either the regular class may not have enough "bells on" (inheritance/metaprogramming/dunder overriding/etc) to defeat simple optimizations that cache away attribute access, thus making it equivalent in speed to slotted classes. I know that over time slotting will become less of a performance boost, but--and this is just my intuition and I may well be wrong--I don't get the impression that we're there yet.
- Similarly "read from @property" seems suspiciously fast to me. Even with descriptor-protocol awareness in the class lookup cache, the overhead of calling a method seems surprisingly similar to the overhead of accessing a field. That might be explained away by the fact that property descriptors' "get" methods are guaranteed to be the simplest and easiest to optimize of all call forms (bound method, guaranteed to never be any parameters), and so the overhead of setting up the stack/frame/args may be substantially minimized...but that would only be true if the property's method body was "return 1" or something very fast. The properties tested for these benchmarks, though, are looking up other fields on the class, so I'd expect them to be a lot slower than field access, not just a little slower (https://github.com/mikeckennedy/python-numbers-everyone-shou...).
- On the topic of "access fields of objects" (properties/dataclasses/slots/MRO/etc.), benchmarks are really hard to interpret--not just these benchmarks, all of them I've seen. That's because there are fundamentally two operations involved: resolving a field to something that produces data for it, and then accessing the data. For example, a @property is in a class's method cache, so resolving "instance.propname" is done at the speed of the methcache. That might be faster than accessing "instance.attribute" (a field, not a @property or other descriptor), depending on the inheritance geometry in play, slots, __getattr[ibute]__ overrides, and so on. On the other hand, accessing the data at "instance.propname" is going to be a lot more expensive for most @properties (because they need to call a function, use an argument stack, and usually perform other attribute lookups/call other functions/manipulate locals, etc); accessing data at "instance.attribute" is going to be fast and constant-time--one or two pointer-chases away at most.
- Nitty: why's pickling under file I/O? Those benchmarks aren't timing pickle functions that perform IO, they're benchmarking the ser/de functionality and thus should be grouped with json/pydantic/friends above.
- Asyncio's no spring chicken, but I think a lot of the benchmarks listed tell a worse story than necessary, because they don't distinguish between coroutines, Tasks, and Futures. Coroutines are cheap to have and call, but Tasks and Futures have a little more overhead when they're used (even fast CFutures) and a lot more overhead to construct since they need a lot more data resources than just a generator function (which is kinda what a raw coroutine desugars to, but that's not as true as most people think it is...another story for another time). Now, "run_until_complete()" and "gather()" initially take their arguments and coerce them into Tasks/Futures--that detection, coercion, and construction takes time and consumes a lot of overhead. That's good to know (since many people are paying that coercion tax unknowingly), but it muddies the boundary between "overhead of waiting for an asyncio operation to complete" and "overhead of starting an asyncio operation". Either calling the lower-level functions that run_until_complete()/gather() use internally, or else separating out benchmarks into ones that pass Futures/Tasks/regular coroutines might be appropriate.
- Benchmarking "asyncio.sleep(0)" as a means of determining the bare-minimum await time of a Python event loop is a bad idea. sleep(0) is very special (more details here: https://news.ycombinator.com/item?id=46056895) and not representative. To benchmark "time it takes for the event loop to spin once and produce a result"/the python equivalent of process.nextTick, it'd be better to use low-level loop methods like "call_soon" or defer completion to a Task and await that.
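Returning to the field-access bullets above, here is a sketch one could use to probe plain attributes, __slots__ attributes, and @property reads on a given interpreter; note that these classes are deliberately trivial, which is exactly the "easy to optimize" case the commenter flags:

```python
# Compare reads of a plain attribute, a __slots__ attribute, and a @property
# that itself reads another attribute.
import timeit

class Plain:
    def __init__(self):
        self.x = 1

class Slotted:
    __slots__ = ("x",)
    def __init__(self):
        self.x = 1

class WithProperty:
    def __init__(self):
        self._x = 1
    @property
    def x(self):
        return self._x

p, s, w = Plain(), Slotted(), WithProperty()

print("plain attr:  ", timeit.timeit(lambda: p.x, number=1_000_000))
print("slotted attr:", timeit.timeit(lambda: s.x, number=1_000_000))
print("property:    ", timeit.timeit(lambda: w.x, number=1_000_000))
```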
For example, my M4 Max running Python 3.14.2 from Homebrew (built, not poured) takes 19.73MB of RAM to launch the REPL (running `python3` at a prompt).
The same Python version launched on the same system with a single invocation for `time.sleep()`[1] takes 11.70MB.
My Intel Mac running Python 3.14.2 from Homebrew (poured) takes 37.22MB of RAM to launch the REPL and 9.48MB for `time.sleep`.
My number for "how much memory it's using" comes from running `ps auxw | grep python`, taking the value of the resident set size (RSS column), and dividing by 1,024.
1: python3 -c 'from time import sleep; sleep(100)'
Makes me wonder if the cpython devs have ever considered v8-like NaN-boxing or pointer stuffing.
> Collection Access and Iteration
> How fast can you get data out of Python’s built-in collections? Here is a dramatic example of how much faster the correct data structure is. item in set or item in dict is 200x faster than item in list for just 1,000 items!
It seems to suggest that an iteration `for x in mylist` is 200x slower than `for x in myset`. It's the membership test that is much slower, not the iteration.
Also the overall title “Python Numbers Every Programmer Should Know” starts with 20 numbers that are merely interesting.
23 more comments available on Hacker News