The Universal Weight Subspace Hypothesis
Key topics
A groundbreaking paper has sparked intense discussion around the "universal weight subspace hypothesis," which proposes that AI models can be compressed into a shared representational structure. Commenters are abuzz, with some likening it to discovering the "French Mother Sauce" of underlying patterns, while others draw connections to the Platonic Representation Hypothesis, sensing a convergence on profound insights. As the conversation unfolds, a consensus emerges that this hypothesis might be distilling universal knowledge or "common sense" in AI, with some even speculating about a broader cultural zeitgeist shift towards Platonic and neo-Platonic thinking. The thread is electric with curiosity, as experts and non-experts alike grapple with the implications of this potentially paradigm-shifting idea.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 34m after posting
- Peak period: 77 comments in 0-6h
- Avg / period: 16.8 comments
Based on 134 loaded comments
Key moments
- 01 Story posted: Dec 8, 2025 at 7:16 PM EST (27 days ago)
- 02 First comment: Dec 8, 2025 at 7:50 PM EST (34m after posting)
- 03 Peak activity: 77 comments in 0-6h (hottest window of the conversation)
- 04 Latest activity: Dec 11, 2025 at 12:30 AM EST (25 days ago)
Not a technical person, just trying to put it in other words.
Now imagine you discover that all 500 are really just the same 11 base ingredients plus something extra.
What they've done here is use SVD (which is also commonly used for image compression and noise reduction) to find that "base recipe". Now we can reproduce those other recipes by recording only the one ingredient that differs.
More interestingly, it might tell us something new about smoothies in general to know that they all share a common base. Maybe we can even build a simpler base using this info.
At least in theory. The code hasn't actually been released yet.
https://toshi2k2.github.io/unisub/#key-insights
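For the technically inclined, here is a minimal numpy sketch of that "base recipe" idea. This is my own toy illustration, not the paper's actual method or data: stack flattened weights from many models, run SVD, and check how few directions explain nearly all the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the "500 recipes": each model's flattened weights are a mix of
# a few shared "base ingredient" directions plus a little that is unique to it.
n_models, dim, shared_rank = 500, 4096, 16
base = rng.standard_normal((shared_rank, dim))         # hidden shared directions
coeffs = rng.standard_normal((n_models, shared_rank))  # per-model mixture
weights = coeffs @ base + 0.01 * rng.standard_normal((n_models, dim))

# SVD across the stacked models recovers that shared subspace.
U, S, Vt = np.linalg.svd(weights, full_matrices=False)
explained = (S**2) / np.sum(S**2)
print("variance captured by top 16 directions:", explained[:16].sum())  # ~0.99+

# Any one model is then just 16 coefficients against the shared basis.
basis = Vt[:16]                          # the "base recipe"
c = weights[0] @ basis.T                 # this model's 16 numbers
recon = c @ basis
print("relative error:", np.linalg.norm(weights[0] - recon) / np.linalg.norm(weights[0]))
```

In the toy case the low-rank structure is planted by construction; the paper's claim is that real trained networks exhibit it without being built that way.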
If they all need just 16 dimensions, then if we ever make one that needs 17, we know we are making progress instead of running in circles.
Apparently it doesn't, at least not in our models, with our training, applied to our tasks.
So if we expand one of those three things and notice that a 17th vector makes a difference, then we are making progress.
As a really stupid example: the sets of integers less than 2, 8, 5, and 30 can all be embedded in the set of integers less than 50, but that doesn't require that the set of integers is finite. You can always get a bigger one that embeds the smaller.
It's known that large neural networks can even memorize random data. The number of random datasets is unfathomably large, and the weight space of neural networks trained on random data would probably not live in a low dimensional subspace.
It's only the interesting-to-human datasets, as far as I know, that drive the neural network weights to a low dimensional subspace.
it's interesting that this was discovered by JHU, not some group from OAI/Google/Apple, considering that the latter have probably spent 1000x more resources on "rediscovering"
But I always want Genetic Algorithms to show up in any discussion about neural networks...
I just stumbled upon a very nice description of the core of it, right here: https://www.youtube.com/watch?v=AyzOUbkUf3M&t=133s
Almost all talks by Geoffrey Hinton (left side on https://www.cs.toronto.edu/~hinton/) are very approachable if you're passingly familiar with some ML.
For example, evolving program tapes is not something you can back propagate. Having a symbolic, procedural representation of something as effective as ChatGPT currently is would be a holy grail in many contexts.
On a basic level, it's kind of like if you had a calculation for aiming a cannon, and someone was giving you targets to shoot at 1 by 1, and each time you miss the target, they tell you how much you missed by and what direction. You could tweak your calculation each time, and it should get more accurate if you do it right.
Backpropagation is based on a mathematical solution for how exactly you make those tweaks, taking advantage of some calculus. If you're comfortable with calculus you can probs understand it. If not, you might have some background knowledge to pick up first.
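A toy version of that cannon loop (my own sketch, nothing to do with the paper): one adjustable number, and each miss tells you how far off you were and in which direction, which is exactly the signal gradient descent uses.

```python
import numpy as np

# Toy "cannon calculation": predicted landing distance is a linear function of the
# aim setting, with one unknown gain we keep tweaking after every miss.
true_gain = 3.7      # the physics we don't know
gain = 0.0           # our current guess
lr = 0.01            # how strongly we react to each miss

rng = np.random.default_rng(1)
for shot in range(200):
    aim = rng.uniform(1.0, 10.0)      # target handed to us, one by one
    target = true_gain * aim
    landed = gain * aim
    miss = landed - target            # how far we missed, and in which direction
    # Gradient of the squared error 0.5 * miss**2 with respect to gain is miss * aim,
    # so this is exactly the tweak gradient descent (and backprop) prescribes:
    gain -= lr * miss * aim

print(round(gain, 3))                 # close to 3.7
```

Backpropagation does the same thing, just with the calculus worked out so the "miss" signal can be passed backwards through many layers at once.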
Then I had a multi-layer network - I don't remember how many layers.
Then I was using a simple Genetic Algorithm to try to set the weights.
Essentially, it was like breeding up a winner for the snake game - but you always know where all of the food is, and the ant always started in the same square. I was trying to maximize the score for how many food items the ant would eventually find.
In retrospect, it was pretty stupid. Too much of it was hard-coded, and I didn't have near enough middle layers to do anything really interesting. And I was essentially coming up with a way to not have to do back-propagation.
At the time, I convinced myself I was selecting for instinctive knowledge...
And I was very excited by research that said that, rather than having one pool of 10,000 ants...
It was better to have 10 islands of 1,000 ants, and to occasionally let genetic information travel from one island to another island. The research claimed the overall system would converge faster.
I thought that was super cool, and made me excited that easy parallelism would be rewarded.
I daydream about all of that, still.
Something I've been interested in recently is - I wonder if it'd be possible to encode a known-good model - some massive pretrained thing - and use that as a starting point for further mutations.
Like some other comments in this thread have suggested, it would mean we could distill the weight patterns of things like attention, convolution, etc., and not have to discover them by mutation, making use of the many PhD-hours it took to develop those patterns and using them as a springboard. If papers like this are to be believed, more advanced mechanisms might be discoverable that way.
¹ https://www.rogeralsing.com/2008/12/07/genetic-programming-e...
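A bare-bones sketch of that island setup (my own toy, with a made-up fitness function standing in for the ant's food score): several isolated populations evolving a weight vector, with the best individual occasionally migrating to a neighbouring island.

```python
import numpy as np

rng = np.random.default_rng(2)

def fitness(w):
    # Stand-in for "how much food the ant eventually finds": higher is better,
    # maximized when every weight equals 1.0.
    return -np.sum((w - 1.0) ** 2)

n_islands, pop, dim = 10, 100, 20
islands = [rng.standard_normal((pop, dim)) for _ in range(n_islands)]

for gen in range(200):
    for i, popn in enumerate(islands):
        scores = np.array([fitness(w) for w in popn])
        parents = popn[np.argsort(scores)[-pop // 2:]]             # keep the top half
        kids = parents + 0.1 * rng.standard_normal(parents.shape)  # mutated offspring
        islands[i] = np.vstack([parents, kids])
    if gen % 25 == 0:
        # Occasional migration: each island's best individual hops to a neighbour.
        bests = [popn[np.argmax([fitness(w) for w in popn])].copy() for popn in islands]
        for i, b in enumerate(bests):
            islands[(i + 1) % n_islands][0] = b

best = max((w for popn in islands for w in popn), key=fitness)
print(round(fitness(best), 4))   # climbs toward 0 as the weights approach 1.0
```

The islands-plus-migration loop is the part that parallelizes trivially, which is what made the research result mentioned above so appealing.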
What does this mean? Probably not nothing, but probably not “the cosmos is the mind of god.” It probably means that we live in a universe that tends to produce repeating nested patterns at different scales.
But maybe that’s part of what makes it possible to evolve or engineer brains that can understand it. If it had no regularity there’d be no common structural motifs.
Basically, what if we're not actually "training" the model, but rather the model was randomly initialized and the learning algorithm is just selecting the vectors that happen to point in the right direction? A left multiplication of the form DA with a diagonal matrix is equivalent to multiplying each row in A by the corresponding diagonal element. A low value means the vector in question was a lottery blank and unnecessary. A high value means it turns out to be a correct vector, yay!
But this trivial explanation doesn't work for the full SVD, because you now have a right multiplication U*D. This means each column gets multiplied against the corresponding diagonal element. Both the column in U and row vector in V^T have to perfectly coincide to make the "selection" theory work, which is unlikely to be true for small models, which happen to work just fine.
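A quick numpy check of the two facts that argument leans on (nothing model-specific here): a diagonal matrix on the left scales rows, so a zero entry "deselects" a row, while on the right it scales columns.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 5))
d = np.array([0.0, 1.0, 2.0, 3.0])
D = np.diag(d)

# Left multiplication D @ A scales each row of A by the matching diagonal entry,
# so a zero entry simply "deselects" that row (the lottery blank).
assert np.allclose(D @ A, d[:, None] * A)
assert np.allclose((D @ A)[0], 0.0)

# Right multiplication U @ D scales each *column* of U instead, which is why the
# selection story needs the column of U and the row of V^T to line up.
U = rng.standard_normal((6, 4))
assert np.allclose(U @ D, U * d[None, :])
print("both identities hold")
```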
E.g. https://youtu.be/Qp0rCU49lMs?si=UXbSBD3Xxpy9e3uY
https://thoughtforms.life/symposium-on-the-platonic-space/
e.g. see this paper on Universal Embeddings: https://arxiv.org/html/2505.12540v2
"The Platonic Representation Hypothesis [17] conjectures that all image models of sufficient size have the same latent representation. We propose a stronger, constructive version of this hypothesis for text models: the universal latent structure of text representations can be learned and, furthermore, harnessed to translate representations from one space to another without any paired data or encoders.
In this work, we show that the Strong Platonic Representation Hypothesis holds in practice. Given unpaired examples of embeddings from two models with different architectures and training data, our method learns a latent representation in which the embeddings are almost identical"
Could someone clarify what this means in practice? If there is a 'commonality' why would substituting it do anything? Like if there's some subset of weights X found in all these models, how would substituting X with X be useful?
I see how this could be useful in principle (and obviously it's very interesting), but not clear on how it works in practice. Could you e.g. train new models with that weight subset initialized to this universal set? And how 'universal' is it? Just for like models of certain sizes and architectures, or in some way more durable than that?
1. As John Napier, who freely, generously, gifted his `Mirifici' for the benefit of all.
2. Here we go, patent trolls, have at it. OpenAI, et al burning midnight oil to grab as much real estate on this to erase any (even future?) debt stress, deprecating the AGI Philosopher's Stone to first owning everything conceivable from a new miraculous `my precious' ring, not `open', closed.
Key point being: the parameters might be picked off a lower dimensional manifold (in weight space), but this doesn't imply that lower-rank activation space operators will be found. So translation to inference-time isn't clear.
Let's say you finetune a Mistral-7B. Now, there are hundreds of other fine-tuned Mistral-7B's, which means it's easy to find the universal subspace U of the weights of all these models combined. You can then decompose the weights of your specific model using U and a coefficient matrix C specific to your model. Then you can convert any operation of the type `out=Wh` to `out=U(C*h)`. Both U and C are of much smaller dimension than W, and so the number of matrix operations as well as the memory required is drastically lower.
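A rough sketch of that substitution with made-up shapes (written as I read the comment, so the paper's exact factorization may differ): if W decomposes as U·C with a small inner dimension, then out = W·h can be computed as U·(C·h) with far fewer multiply-adds and much less storage.

```python
import numpy as np

rng = np.random.default_rng(4)
d_out, d_in, k = 4096, 4096, 40        # hypothetical layer size and subspace rank

U = rng.standard_normal((d_out, k))    # shared basis, stored once for the whole family
C = rng.standard_normal((k, d_in))     # coefficients specific to this one fine-tune
W = U @ C                              # the full matrix we would normally store and apply

h = rng.standard_normal(d_in)
assert np.allclose(W @ h, U @ (C @ h))                 # identical output

full_macs = d_out * d_in                               # ~16.8M multiply-adds per token
factored_macs = k * d_in + d_out * k                   # ~0.33M multiply-adds per token
print(f"~{full_macs / factored_macs:.0f}x fewer multiply-adds")
```

The catch raised elsewhere in the thread still applies: this only pays off if real fine-tuned weights actually decompose this way with a shared U.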
or is it just that 16 was arbitrarily chosen by them as close enough to the actual minimal number of dimensions necessary?
I also don’t understand what they write under figure 2, since resnet50 has 50 layers, not 31.
You can show for example that siamese encoders for time-series, with MSE loss on similarity, without a decoder, will converge to the same latent space up to orthogonal transformations (as MSE is kinda like a Gaussian prior, which doesn't distinguish between different rotations).
Similarly I would expect that transformers trained on the same loss function for predicting the next word, if the data is at all similar (like human language), would converge to approx the same space, up to some, likely linear, transformations. And to represent that same space probably weights are similar, too. Weights in general seem to occupy low-dimensional spaces.
All in all, I don’t think this is that surprising, and I think the theoretical angle should be (have been?) to find mathematical proofs like this paper https://openreview.net/forum?id=ONfWFluZBI
They also have a previous paper ("CEBRA") published in Nature with similar results.
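A tiny numpy check of the "up to orthogonal transformations" point (just the underlying linear-algebra fact, not the paper's proof): a loss built on latent distances cannot distinguish a latent space from a rotated copy of it.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.standard_normal((100, 8))       # pretend these are learned latent embeddings

# Random orthogonal matrix via QR: a rotation/reflection of the latent space.
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
z_rot = z @ Q.T

# All pairwise distances (and hence any distance/similarity-based MSE loss)
# are identical, so training cannot prefer one orientation over the other.
d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
d_rot = np.linalg.norm(z_rot[:, None, :] - z_rot[None, :, :], axis=-1)
assert np.allclose(d, d_rot)
print("distances unchanged under an orthogonal transform")
```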
- Training costs: We might discover these universal subspaces without training thousands of models
- Storage requirements: Models could share common subspace representations
https://grok.com/share/bGVnYWN5_463d51c8-d473-47d6-bb1f-6666...
*Caption for the two images:*
Artistic visualization of the universal low-parameter subspaces discovered in large neural networks (as described in “The Unreasonable Effectiveness of Low-Rank Subspaces,” arXiv:2512.05117).
The bright, sparse linear scaffold in the foreground represents the tiny handful of dominant principal directions (often ≤16 per layer) that capture almost all of the signal variance across hundreds of independently trained models. These directions form a flat, low-rank “skeleton” that is remarkably consistent across architectures, tasks, and random initializations.
The faint, diffuse cloud of connections fading into the dark background symbolizes the astronomically high-dimensional ambient parameter space (billions to trillions of dimensions), almost all of whose directions carry near-zero variance and can be discarded with negligible loss in performance. The sharp spectral decay creates a dramatic “elbow,” leaving trained networks effectively confined to this thin, shared, low-dimensional linear spine floating in an otherwise vast and mostly empty void.
Now I’ve argued that the bot would very likely have thought of the same question you did, and my original assertion stands.
Here's a very cool analogy from GPT 5.1 which hits the nail on the head in explaining the role of the subspace in learning new tasks, by analogy with 3D graphics.
Are there novel tasks? Inside the limits of physics, tasks are finite, and most of them are pointless. One can certainly entertain tasks that transcend physics, but that isn't necessary if one merely wants an immortal and indomitable electronic god.
Isn't it obvious?
The surprising thing is inter-modality shared variation, not intra-modality, which is quite obvious and expected.
I would like to see interpretability work into whether these high variance components are low level or high level abstractions.
The "human" part of that matters. This is all human-made data, collected from human technology, which was created to assist human thinking and experience.
So I wonder if this isn't so much about universals or Platonic ideals. More that we're starting to see the outlines of the shapes that define - perhaps constrict - our own minds.
It isn’t obvious that these parameters are universal across all models.
Beyond the practical implications of this (i.e. reduced training and inference costs), I'm curious if this has any consequences for "philosophy of the mind"-type of stuff. That is, does this sentence from the abstract, "we identify universal subspaces capturing majority variance in just a few principal directions", imply that all of these various models, across vastly different domains, share a large set of common "plumbing", if you will? Am I understanding that correctly? It just sounds like it could have huge relevance to how various "thinking" (and I know, I know, those scare quotes are doing a lot of work) systems compose their knowledge.
I see now that they did one experiment with trained from scratch models. They trained five Resnet-50s on five disjoint datasets of natural images, each quite small. And they were able to, without further training, combine them into one "universal" model that can be adapted to have only somewhat worse performance on any one of the five datasets (actually one of them is pretty bad) using only ~35 parameters. Which is kind of cool I guess but I also don't find it that surprising?
I don't expect that you'd get the same finding at large scale in LLMs trained from scratch on disjoint and dissimilar data with different optimizers etc. I would find that surprising. But it would be very expensive to do that experiment so I understand why they weren't able to.
The LLMs are finetuned on very disjoint data. I checked: some are on Chinese and others are for math. The pretrained model provides a good initialization. I'm convinced.
2) The architecture and training methods matter. As a simple scenario, to make things a bit easier to understand, let's say we have two models with identical architectures and we'll use identical training methods (e.g. optimizer, learning rate, all that jazz) but learn on different data. Also, so you can even reproduce this on your own, let's train one on MNIST (numbers) and the other on FashionMNIST (clothing).
Do you expect these models to have similar latent spaces? You should! This is because, despite the data being very different visually, there is a ton of implicit information that's shared (this is a big reason we do tuning in the first place!). One of the most obvious things you'll see is subnetworks that do edge detection (there's a famous paper showing this with convolutions, but transformers do this too, just in a slightly different way). The more similar the data (order shouldn't matter too much with modern training methods, but it definitely influences things), the more similar this will be too. So if we trained on LAION we should expect it to do really well on ImageNet, because even if there aren't identical images (there are some) there are the same classes (even if labels are different)[0].
If you think a bit here you'll actually realize that some of this will happen even if you change architectures because some principles are the same. Where the architecture similarity and training similarity really help is that they bias features being learned at the same rate and in the same place. But this idea is also why you can distill between different architectures, not just by passing the final output but even using intermediate information.
To help, remember that these models converge. Accuracy jumps a lot in the beginning then slows. For example you might get 70% accuracy in a few epochs but need a few hundred to get to 90% (example numbers). So ask yourself "what's being learned first and why?" A lot will make more sense if you do this.
[0] I have a whole rant about saying "zero shot" on ImageNet (or COCO) when trained on things like LAION or JFT. It's not zero shot because ImageNet is in distribution! We wouldn't say "we zero shotted the test set" smh
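If you do run that MNIST/FashionMNIST experiment, one hedged way to compare what the two first layers learned is to measure the principal angles between the two filter subspaces. The `.npy` file names below are hypothetical placeholders for weights you would export yourself after training.

```python
import numpy as np

def flatten_filters(path):
    w = np.load(path)                 # assumed shape: (out_channels, in_channels, kh, kw)
    return w.reshape(w.shape[0], -1)  # one row per filter

# Hypothetical files: first-layer conv weights exported after training the two models,
# e.g. np.save("mnist_conv1.npy", model.conv1.weight.detach().numpy()).
w_mnist   = flatten_filters("mnist_conv1.npy")
w_fashion = flatten_filters("fashion_conv1.npy")

def top_directions(W, k=4):
    # Orthonormal basis for the top-k directions the filter bank spans.
    _, _, Vt = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
    return Vt[:k].T

A, B = top_directions(w_mnist), top_directions(w_fashion)
# Singular values of A^T B are the cosines of the principal angles between the two
# filter subspaces; values near 1 mean the two models reuse almost the same
# low-level directions (edge/blob detectors) despite the very different datasets.
cosines = np.linalg.svd(A.T @ B, compute_uv=False)
print(np.round(cosines, 3))
```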
In any case, my impression is that this is not immediately more useful than a LoRA (and is probably not intended to be), but is maybe an avenue for further research.
The ResNet results hold from scratch because strict local constraints (e.g., 3x3 convolutions) force the emergence of fundamental signal-processing features (Gabor/Laplacian filters) regardless of the dataset. The architecture itself enforces the subspace.
The Transformer/ViT results rely on fine-tunes because of permutation symmetry. If you trained two ViTs from scratch, "Attention Head 4" in Model A might be functionally identical to "Head 7" in Model B, but mathematically orthogonal.
Because the authors' method (SVD) lacks a neuron-alignment step, scratch-trained ViTs would not look aligned. They had to use pre-trained models to ensure the weights shared a coordinate system. Effectively, I think they proved that CNNs converge due to their architecture, but for Transformers they mostly just confirmed that fine-tuning doesn't drift far from the parent model.
And this critique is likely not aimed at academics so much as the systems and incentives of academia. This is partially on the parties managing grants (caring much more about impact and visibility than actually moving science forwards, which means everyone is scrounging for or lying about low hanging fruit). It is partially on those who set (or rather maintain) the culture at academic institutions of gathering clout by getting 'impactful' publications. And those who manage journals also share blame, by trying to defend their moat, very much hamming up "high impact", and aggressively rent-seeking.
Perhaps we need to revisit the concept and have a narrow abstract and a lay abstract, given how niche science has become.
The ViT models are never really trained from scratch - they are always finetuned, as they require large amounts of data to converge nicely. The pretraining just provides a nice initialization. Why would one expect two ViTs finetuned on two different things (image and text classification) to end up in the same subspace, as they show? I think this is groundbreaking.
I don't really agree with the drift far from the parent model idea. I think they drift pretty far in terms of their norms. Even the small LoRA adapters drift pretty far from the base model.
Given 500 fine tune datasets, we could expect the 500 drag directions to span a 500 dimensional space. After all, 500 random vectors in a high dimensional space are likely to be mutually orthogonal.
The paper shows, however, that the 500 drag directions live in a ~40 dimensional subspace.
Another way to say it is that you can compress fine tune weights into a vector of 40 floats.
Imagine if, one day, fine tunes on huggingface were not measured in gigabytes, megabytes, or even kilobytes. Suppose you started to see listings like 160 bytes. Would that be surprising?
I’m leaving out the detail that the basis direction vectors themselves would have to be on your machine and each basis direction is as big as the model itself. And I’m also taking for granted that the subspace dimension will not increase as the number of fine tune datasets increases.
I agree that the authors' decision to use random models on Hugging Face is unfortunate. I'm hopeful that this paper will inspire follow-up works that train large models from scratch.
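To make the 160-byte scenario above concrete, here is a toy sketch (made-up sizes, and it assumes, as already noted, that the shared basis is sitting on your machine): project a fine-tune's weight delta onto ~40 shared directions and ship only the coefficients.

```python
import numpy as np

rng = np.random.default_rng(6)
dim, k = 100_000, 40        # toy numbers: flattened weight count, shared subspace rank

# The shared basis is as big as the model itself, but it is stored once for everyone.
basis, _ = np.linalg.qr(rng.standard_normal((dim, k)))

# Pretend my fine-tune's weight delta really does live in that shared subspace.
my_delta = basis @ rng.standard_normal(k)

coeffs = (basis.T @ my_delta).astype(np.float32)   # the whole fine-tune as 40 floats
print(coeffs.nbytes)                               # 160 bytes: the hypothetical listing

recon = basis @ coeffs                             # rebuilt locally from the shared basis
print(np.linalg.norm(my_delta - recon) / np.linalg.norm(my_delta))  # ~1e-8
```

The open question, as the comment says, is whether real deltas stay in a ~40-dimensional subspace as the number of fine-tune datasets grows.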
They're using SVD to throw away almost all of the "new information" and apparently getting solid results anyhow. Which of course raises interesting questions if replicable. The code doesn't seem to have been released yet though.
What I don’t get is what is meant by a universal shared subspace, because there is some invariance regarding the specific values in weights and the directions of vectors in the model. For instance, if you were doing matrix multiplication with a weight tensor, you could swap two rows/columns (depending on the order of multiplication) and all that would do is swap two values in the resulting product, and whatever uses that output could undo the effects of the swap so the whole model has identical behavior, yet you’ve changed the direction of the principal components. There can’t be fully independently trained models that share the exact subspace directions for analogous weight tensors because of that.
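A concrete version of that swap argument in plain numpy (not from the paper): permute one layer's rows and undo it in the next layer; the composed model's behavior is identical, yet the per-matrix singular directions change.

```python
import numpy as np

rng = np.random.default_rng(7)
W1 = rng.standard_normal((16, 8))    # first layer
W2 = rng.standard_normal((4, 16))    # the layer that consumes its output
x  = rng.standard_normal(8)

P = np.eye(16)[rng.permutation(16)]  # permutation: relabel the 16 hidden units

W1_perm = P @ W1                     # rows of W1 swapped...
W2_perm = W2 @ P.T                   # ...and the swap undone by the next layer

# The composed model is unchanged,
assert np.allclose(W2 @ (W1 @ x), W2_perm @ (W1_perm @ x))

# ...but W1's left singular vectors (its "principal directions" in weight space)
# are now permuted, so the two equivalent models would not share exact subspace
# directions under a naive per-matrix SVD.
U1 = np.linalg.svd(W1)[0]
U1_perm = np.linalg.svd(W1_perm)[0]
print(np.allclose(U1, U1_perm))      # False
```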
https://arxiv.org/abs/2007.00810
Without properly reading the linked article: if that's all this is, it's not a particularly new result. Nevertheless this direction of proofs is imo at the core of understanding neural nets.
Why else would they put so much money into something if not to try and get more out of it?
Capitalists' morals are driven by their social position. To them this is right because it's rewarding. To us it's akin to an abomination we create that destroys us.
But the problem isn't inherently the tech. It's how society is structured around it that allows it to be used against us.
For CNNs, the 'Universal Subspace' is simply the strong inductive bias (locality) forcing filters into standard signal processing shapes (Laplacian/Gabor) regardless of the data. Since CNNs are just a constrained subset of operations, this convergence is inevitable, not magical.
For Transformers, which lack these local constraints, the authors had to rely on fine-tuning (shared initialization) to find a subspace. This confirms that 'Universality' here is really just a mix of CNN geometric constraints and the stability of pre-training, rather than a discovered intrinsic property of learning.
- I know what I do not know.
-- I do not know AI.
This is a little outside my area, but I think the relevant part of that abstract is "Gradient-based optimization follows horizontal lifts across low-dimensional subspaces in the Grassmannian Gr(r, p), where r ≪ p is the rank of the Hessian at the optimum"
I think this question is super interesting though: why can massively overparametrised models can still generalise?
[0]: https://opt-ml.org/papers/2025/paper90.pdf
Here's what I've got. You walk into a big room full of lego models of all kinds. You start taking them apart and find that they're not just made of lego blocks, but of the same set of components, each made of lego blocks.
If I've got that right it seems to be an opportunity for compression.
> We analyze over 1,100 deep neural networks—including 500 Mistral-7B LoRAs and 500 Vision Transformers. We provide the first large-scale empirical evidence that networks systematically converge to shared, low-dimensional spectral subspaces, regardless of initialization, task, or domain.
I instantly thought of the muon optimizer, which provides high-rank gradient updates, and Kimi-k2, which is trained using muon, and I see no related references.
The 'universal' in the title is not that universal.
And that: "Defining 'novel' as 'not something that you've said before even though your using all the same words, concepts, linguistic tools, etc., doesn't actually make it 'novel'"
Point being, yeah duh, what's the difference between what any of these models are doing anyway? It would be far more surprising if they discovered a *different* or highly-unique subspace for each one!
Someone gives you a magic lamp and the genie comes out and says "what do you wish for"?
That's still the question. The question was never "why do all the genies seem to be able to give you whatever you want?"