Introducing Architecture Variants
Mood: calm
Sentiment: positive
Category: other
Key topics: Ubuntu introduces architecture variants, specifically x86-64-v3, to optimize performance on modern CPUs, sparking discussion of its implications for package management, compatibility, and future developments.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 17m after posting
Peak period: 126 comments (Day 2)
Avg / period: 24.2
Based on 145 loaded comments
Key moments
- Story posted: Oct 30, 2025 at 6:35 AM EDT (28 days ago)
- First comment: Oct 30, 2025 at 6:51 AM EDT (17m after posting)
- Peak activity: 126 comments in Day 2 (hottest window of the conversation)
- Latest activity: Nov 6, 2025 at 3:16 PM EST (20 days ago)
x86-64-v3 essentially means AVX2-capable CPUs.
> As a result, we’re very excited to share that in Ubuntu 25.10, some packages are available, on an opt-in basis, in their optimized form for the more modern x86-64-v3 architecture level
> Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.
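For readers unfamiliar with the microarchitecture levels, here is a minimal sketch (assuming GCC or Clang on x86-64; this is an illustration only, not how Ubuntu's packaging selects variants) that probes a few of the features x86-64-v3 requires, such as AVX2, FMA, and BMI2:

```c
/* Minimal sketch: probe a few of the features required by x86-64-v3.
   Assumes GCC or Clang on x86-64; illustration only, not Ubuntu's mechanism. */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();  /* initialize the compiler's CPU-feature data */

    int avx2 = !!__builtin_cpu_supports("avx2");
    int fma  = !!__builtin_cpu_supports("fma");
    int bmi2 = !!__builtin_cpu_supports("bmi2");

    printf("avx2=%d fma=%d bmi2=%d\n", avx2, fma, bmi2);
    printf("likely x86-64-v3 capable: %s\n", (avx2 && fma && bmi2) ? "yes" : "no");
    return 0;
}
```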
1. https://riscv.atlassian.net/wiki/spaces/HOME/pages/16154732/...
2. https://developer.arm.com/documentation/dui0801/h/A64-Floati...
and key point: "Previous benchmarks we have run (where we rebuilt the entire archive for x86-64-v3) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that."
> Previous benchmarks (...) show that most packages show a slight (around 1%) performance improvement and some packages, mostly those that are somewhat numerical in nature, improve more than that.

An unobservable benefit is not a benefit.
If you aren't convinced by your Ubuntu being 1% faster, consider how many servers, VMs, and containers run Ubuntu. Millions of servers each using a fraction of a percent less energy multiplies out to a lot of energy.
Because typically these modern extensions are a way of making the CPU do things faster by eating more power. There may be savings from having fewer servers etc., but savings in _speed_ are not the same as savings in _power_ (and sometimes they even work the opposite way).
It's not for nothing that some time ago "write once, run everywhere" was a selling proposition (not that it actually worked in all cases, but it definitely worked better than the alternatives).
If you're negotiating deals worth billions of dollars, or even just millions, I'd strongly suggest not doing so with a hangover.
...have you met salespeople? Buying lap dances is a legitimate business expense for them. You'd be surprised how much personal rapport matters and facts don't.
In all fairness, I only know about 8 and 9 figure deals, maybe at 10 and 11 salespeople grow ethics...
Standard advice: You are not Google.
I'm surprised and disappointed that 1% is the best they could come up with; with numbers that small I would expect experimental noise to be much larger than the improvement. If you tell me you've managed a 1% improvement, you have to do a lot to convince me you haven't actually made things 5% worse.
At scale marginal differences do matter and compound.
> where that 1% is worth any hassle
You'll need context to answer your question, but yes, there are cases. Let's say you have a process that takes 100 hours to run and costs $1k/hr. Save 1% and you save an hour and $1k every time you run the process. You don't just save the compute cost; you save literal time and everything that that time costs (customers, engineering time, support time, etc.).
Now let's say you have a process that takes 100ns and similarly costs $1k/hr. You now run in 99ns. Even running the process 36 million times only saves about 36ms, roughly a cent at that rate, so it's insignificant. In this setting even a 50% optimization probably isn't worthwhile (unless you're a high-frequency trader or something).
This is where the saying "premature optimization is the root of all evil" comes from! The "premature" part is often disregarded, and the rest of the context goes with it. Here's the fuller context of Knuth's quote[0]:
> There is no doubt that the holy grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
>
> Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
Knuth said: "Get a fucking profiler and make sure that you're optimizing the right thing". He did NOT say "don't optimize".

So yes, there are plenty of times where that optimization will be worthwhile. The percentages don't mean anything without the context. Your job as a programmer is to determine that context. And not just in the scope of your program, but in the scope of the environment you expect a user to be running on (i.e. their computer probably isn't entirely dedicated to your program).
[0] https://dl.acm.org/doi/10.1145/356635.356640 (alt) https://sci-hub.se/10.1145/356635.356640
If you would only accept 10x improvements, I would argue progress would be very small.
"Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?"
I put a lot of effort into chasing wins of that magnitude. Over a huge userbase, something like that has a big positive ROI. These days it also affects important things like heat and battery life.
The other part of this is that the wins add up. Maybe I manage to find 1% every couple of years. Some of my coworkers do too. Now you're starting to make a major difference.
Well, one example could be llama.cpp. It's critical for them to use every single extension the CPU has to move more bits at a time. When I installed it I had to compile it.
This might make it more practical to start offering OS packages for things like llama.cpp
I guess people that don't have newer hardware aren't trying to install those packages. But maybe the idea is that packages should not break on certain hardware.
Blender might be another one like that which really needs the extensions for many things. But maybe you do want to allow it to be used on some oldish hardware anyway, because it still has uses that are valid on those machines.
Perhaps if you're doing CPU-bound math you might see an improvement?
I don't think this is a valid argument to make. If you were doing the optimization work then you could argue tradeoffs. You are not, Canonical is.
Your decision is which image you want to use, and Canonical is giving you a choice. Do you care about which architecture variant you use? If you do, you can now pick the one that works best for you. Do you want to win an easy 1% performance gain? Now you have that choice.
This takes me back to arguing with Gentoo users 20 years ago who insisted that compiling everything from source for their machine made everything faster.
The consensus at the time was basically "theoretically, it's possible, but in practice, gcc isn't really doing much with the extra instructions anyway".
Then there's stuff like glibc which has custom assembly versions of things like memcpy/etc, and selects from them at startup. I'm not really sure if that was common 20 years ago but it is now.
It's cool that after 20 years we can finally start using the newer instructions in binary packages, but it definitely seems to not matter all that much, still.
Since then we've slowly started accumulating optional extensions again; newer SSE versions, AVX, encryption and virtualization extensions, probably some more newfangled AI stuff I'm not on top of. So very slowly it might have started again to make sense for an approach like Gentoo to exist**.
* usual caveats apply; if the compiler can figure out that using the instruction is useful etc.
** but the same caveats as back then apply. A lot of software can't really take advantage of these new instructions, because newer instructions have been getting increasingly more use-case-specific; and applications that can greatly benefit from them will already have alternative code paths to take advantage of them anyway. Also, a lot of the stuff happening in hardware acceleration has moved to GPUs, which have a feature discovery process independent of the CPU instruction set anyway.
I would guess that these are domain-specific enough that they can also mostly be enabled by the relevant libraries employing function multiversioning.
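A minimal sketch of what that looks like in practice, assuming GCC's target_clones attribute (the example function and values are made up): the compiler emits a baseline clone and an AVX2 clone of the same function, and an ifunc resolver picks one at load time, the same mechanism glibc uses to select its memcpy variants.

```c
/* Function multiversioning sketch (GCC target_clones): one source function,
   multiple compiled clones, resolved once at load time via an ifunc. */
#include <stddef.h>
#include <stdio.h>

__attribute__((target_clones("default", "avx2")))
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)  /* the avx2 clone may auto-vectorize this loop */
        sum += a[i] * b[i];
    return sum;
}

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    printf("%f\n", dot(x, y, 4));  /* 1*5 + 2*6 + 3*7 + 4*8 = 70 */
    return 0;
}
```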
Why is this "clever"? This is pretty much how "fat" binaries are supposed to work, no? At least, such packaging is the norm for Android.
I used Gentoo a lot, jeez, between 20 and 15 years ago, and the install guide guiding me through partitioning disks, formatting disks, unpacking tarballs, editing config files, and running grub-install etc, was so incredibly valuable to me that I have trouble expressing it.
I'd agree that the manual Gentoo install process, and those tinkering years in general, gave me experience and familiarity that's come in handy plenty of times when dealing with other distros, troubleshooting, working on servers, and so on.
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
There's lots of software applications out there whose official Docker images or pip wheels or whatever bundle everything under the sun to account for all the optional integrations the application has, and it's difficult to figure out which packages can be easily removed if we're not using the feature and which ones are load-bearing.
The extra issue here is that SIMD (the main optimization) simply sucks to use. Auto-vectorization has been mostly a pipe dream for decades now as the sufficiently-smart compiler simply hasn't materialized yet (and maybe for the same reason the EPIC/Itanium compiler failed -- deterministically deciding execution order at compile time isn't possible in the abstract and getting heuristics that aren't deceived by even tiny changes to the code is massively hard).
Doing SIMD means delving into x86 assembly and all its nastiness/weirdness/complexity. It's no wonder that devs won't touch it unless absolutely necessary (which is why the speedups are coming from a small handful of super-optimized math libraries). ARM vector code is also rather byzantine for a normal dev to learn and use.
We need a simpler assembly option that normal programmers can easily learn and use. Maybe it's way less efficient than the current options, but some slightly slower SIMD is still going to generally beat no SIMD at all.
As in, are there any common libraries or parts of the system that typically slow things down, or was this more targeting a time when hardware was more limited, so improving everything would have made things feel faster in general?
I bet you there is some use case of some app or library where this is like a 2x improvement.
Would be nice to know the per app metrics.
Further, maybe it has not been a focus for compiler vendors to generate good code for these higher-level archs if few are using the feature. So Ubuntu's move could improve that.
x86-64-v3 essentially means AVX2-capable CPUs.
Which unfortunately extends all the way to Intel's newest client CPUs, since they're still struggling to ship their own AVX-512 instructions, which are required for v4. Meanwhile AMD has been on v4 for two generations already.
Having a non-uniform instruction set for one package was a baffling decision.
It's pretty clear that Alder Lake was simply a rush job, and had to be implemented with the E cores they already had, despite never having planned for heterogenous cores to be part of their product roadmap.
They had two teams designing the two types of cores.
If you're doing that sort of work, you also shouldn't use pre-compiled PyPI packages, for the same reason: you leave a ton of performance on the table by not targeting the micro-architecture you're running on.
That said, most of those packages will just read the hardware capability from the OS and dispatch an appropriate codepath anyway. You maybe save some code footprint by restricting the number of codepaths it needs to compile.
There’s also the difference between being able to run and being able to run optimised. At least 5 years ago, the Ubuntu/Debian builds of FFTW didn’t include the parallelised OpenMP library.
In a past life I did HPC support and I recommend the Spack package manager a lot to people working in this area because you can get optimised builds with whatever compiler tool chain and options you need quite easily that way.
If you do a typical "cmake . && make install" then you will often miss compiler optimisations. There's no standard across different packages, so you often have to dig into the internals of the build system, look at the options provided, and experiment.
Typically, if you compile a C/C++/Fortran .cpp/.c/.fXX file by hand, you have to supply arguments to instruct the use of specific instruction sets. -march=native typically means "compile this binary to use the maximum set of SIMD instructions that my current machine supports", but you can get quite granular with flags like -msse4.2, -mavx and -mavx2, for either compatibility reasons or to try out subsets.
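As a small illustration of what those flags actually change (a sketch; the compile commands in the comment are hypothetical), the selected instruction sets show up as predefined macros, which is what code and libraries typically key their optimized paths on:

```c
/* Sketch: -march / -m<feature> flags surface as predefined macros.
   Hypothetical compile commands:
     gcc -O2 -march=x86-64    probe.c   -> prints "AVX2: off"
     gcc -O2 -march=x86-64-v3 probe.c   -> prints "AVX2: on"
     gcc -O2 -march=native    probe.c   -> whatever the build machine supports */
#include <stdio.h>

int main(void) {
#ifdef __AVX2__
    puts("AVX2: on");
#else
    puts("AVX2: off");
#endif
#ifdef __FMA__
    puts("FMA: on");
#else
    puts("FMA: off");
#endif
    return 0;
}
```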
(There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)
No, because those are different ABIs (and a Debian architecture is really an ABI).
> the issue of i486 vs. i586 vs. i686 vs. the many varieties of MMX and SSE extensions for 32-bit?
It could be used for this but it's about 15 years too late to care surely?
> (There is some older text in the Debian Wiki https://wiki.debian.org/ArchitectureVariants but it's not clear if it's directly related to this effort)
Yeah that is a previous version of the same design. I need to get back to talking to Debian folks about this.
Does anyone know what the plans are to accomplish this?
But that does not sound like a simple solution for non-technical users.
Anyway, non-technical users using an installation on another, lower-end computer? That sounds weird.
https://github.com/jart/cosmopolitan/blob/master/ape/specifi...
To me hwcaps feels like very unfortunate feature creep in glibc now. I don't see why it was ever added, given that it's hard to compile only shared libraries for a specific microarch, and it does not benefit executables. Distros seem to avoid it. All it does is cause unnecessary stat calls when running an executable.
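For context on where those stat calls come from, here is a sketch of the hwcaps lookup (the library name is hypothetical; the paths follow the Debian/Ubuntu multiarch layout): for each shared library, glibc 2.33+ probes per-level subdirectories before falling back to the plain path.

```
# hypothetical lookup of libfoo.so.1 on an x86-64-v3 capable machine
/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libfoo.so.1   # tried first, used if present
/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libfoo.so.1   # tried next
/usr/lib/x86_64-linux-gnu/libfoo.so.1                          # baseline fallback
```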
apt (3.1.7) unstable; urgency=medium

  [ Julian Andres Klode ]
  * test-history: Adjust for as-installed testing

  [ Simon Johnsson ]
  * Add history undo, redo, and rollback features
The same will apply to different arm64 or riscv64 variants.
very odd choice of words. "better utilize/leverage" is perhaps the right thing to say here.
And if I have it right, the main advantage should come via the package manager and open-source software, where the compiled binaries can be built in variants that benefit from and are optimized for newer CPU features.
Still, this would be most noticeable for apps that benefit from those features, such as audio DSP or, as mentioned, SSL and crypto.
Especially the binaries for the newest variant, since they can entirely drop the runtime conditionals/branching for all older variants.
All the fuss about Ubuntu 25.10 and later being RVA23 only was about nothing?
Intel has reduced its number of employees, and has lost lots of software developers.
So we lost Clear Linux, their Linux distribution that often showcased performance improvements due to careful optimization and utilization of microarchitectural enhancements.
I believe you can still use the Intel compiler, icc, and maybe see some improvements in performance-sensitive code.
"It was actively developed from 2/6/2015-7/18/2025."
Not the same thing, or perhaps an augmentation of Intel performance libraries (which required C++, I believe).
Sure, harmonizing all of this may have suggested that there were too many software teams. But device drivers don't write themselves, and without feedback from internal software developers, you can't validate your CPU designs.
Would love to see which packages benefited the most in terms of percentile gain and install base. You could probably back out a kWh/tons of CO2 saved metric from it.
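A back-of-the-envelope of the kind being asked for, as a sketch; every number below is an illustrative assumption, not something taken from Ubuntu's benchmarks:

```c
/* Illustrative back-of-the-envelope only: every input is an assumption,
   not a measured result. */
#include <stdio.h>

int main(void) {
    double servers          = 1e6;     /* assumed number of affected Ubuntu servers      */
    double watts_per_server = 200.0;   /* assumed average draw per server (W)            */
    double saving_fraction  = 0.01;    /* assumes a 1% speedup translates 1:1 to energy  */
    double hours_per_year   = 24.0 * 365.0;

    double kwh_saved  = servers * watts_per_server * saving_fraction * hours_per_year / 1000.0;
    double tco2_saved = kwh_saved * 0.4 / 1000.0;  /* assumed ~0.4 kg CO2 per kWh */

    printf("~%.0f MWh/year, ~%.0f t CO2/year under these assumptions\n",
           kwh_saved / 1000.0, tco2_saved);
    return 0;
}
```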
"Changes/Optimized Binaries for the AMD64 Architecture v2" (2025) https://fedoraproject.org/wiki/Changes/Optimized_Binaries_fo... :
> Note that other distributions use higher microarchitecture levels. For example RHEL 9 uses x86-64-v2 as the baseline, RHEL 10 uses x86-64-v3, and other distros provide optimized variants (OpenSUSE, Arch Linux, Ubuntu).
> Description: official repositories compiled with LTO, -march=x86-64-vN and -O3.
Packages: https://status.alhp.dev/
Naively, I believe it might be more appropriate to offer x86-64-v1 and x86-64-vN variants only for specific software and leave the rest as x86-64-v1.
AVX seemed to give the biggest boost to things.
Regarding those who are making fun of Gentoo users, it really did make a bigger difference in the past, but with the refinement of compilers, the difference has diminished. Today, for me, who still uses Gentoo/CRUX for some specific tasks, what matters is the flexibility to enable or disable what I want in the software, and not so much the extra speed anymore.
As an example, currently I use -Os (x86-64-v1) for everything, and only for things related to video/sound/cryptography (I believe for things related to mathematics in general?) I use -O2 (x86-64-v3) with other flags to get a little more out of it.
Interestingly, in many cases -Os with -mtune=nocona generates faster binaries even though I'm only using hardware from Haswell to today's hardware (who can understand the reason for this?).
I couldn't run something from npm on an older NAS machine (HP Microserver Gen 7) recently because of this.