Modular Manifolds
Posted 3 months ago · Active 3 months ago
thinkingmachines.ai · Tech story · High profile
Tone: calm, mixed · Debate: 60/100
Key topics
Machine Learning
Differential Geometry
Optimization Algorithms
The post discusses using modular manifolds to constrain weight matrices in neural networks, sparking a discussion on the novelty and effectiveness of this approach among HN commenters.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 18m after posting
Peak period: 48 comments in 0-6h
Average per period: 13.5 comments
Comment distribution: 54 data points (based on 54 loaded comments)
Key moments
1. Story posted: Sep 26, 2025 at 1:06 PM EDT (3 months ago)
2. First comment: Sep 26, 2025 at 1:24 PM EDT (18m after posting)
3. Peak activity: 48 comments in 0-6h (hottest window of the conversation)
4. Latest activity: Sep 28, 2025 at 5:51 PM EDT (3 months ago)
ID: 45388728 · Type: story · Last synced: 11/20/2025, 8:42:02 PM
Has DAWNBench been done with manifold Muon (with a more appropriate architecture)?
Here's the top model on DAWNBench - https://github.com/apple/ml-cifar-10-faster/blob/main/fast_c...
It trains for 15 epochs and, like all the others, is a 9-layer ResNet.
In fact, beating SOTA is often the least interesting part of an interesting paper, and SOTA-blind reviewers often use it as a gatekeeping device.
I probably should have made the 9-layer ResNet part more front-and-center, more central to my point.
Now, are there actual meaningful improvements to obtain, and do they stick around all the way to frontier runs? Unclear, really. So far, it looks like opening a can of hyperparameters.
Happy to see this opinion expressed here, too. The more math skeptics there are out there, the longer I get to keep my job. :)
If people are happy with a job or a role that does not need math, that's fine.
Familiarity with maths lets you rise to the occasion, to become more than a replaceable cog.
The thing is, unless you are trained in math you wouldn't even recognise the opportunity: that a certain kind of math could have been used here. In fact, even if you are trained in math you may not see it till much later -- it needs a special eye and the right moment.
Polyhedra were studied century after century by top-notch mathematicians. All missed Euler's formula, except perhaps Descartes.
Often what happens is that some nontrivial branch of mathematics suddenly finds a novel and impactful application. Then crowds jump in to learn that math, but by then it's mostly a little too late for them; they have missed the bus.
The best case is when you already know the math beforehand, without knowing which part will turn out to be handy. It helps if you love the subject and can afford to invest time in learning it for its own sake. Once in a while you find yourself in the right place at the right time, with the right tools.
However, in the meantime, the experts in that math have "missed the bus" on the application area itself, which they know too little about because they were studying math instead.
Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago. The ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working. The higher LR did not translate to a speed up: "Manifold Muon increased the wall clock time per step compared to AdamW..."
More fundamentally, I am a bit skeptical that test accuracy is the right goal for LLMs, because statistical learning theory does not adequately model the macro-behavior of very large models.
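As a concrete reference point for the "projected gradient methods" mentioned above, here is a minimal, generic sketch (not taken from the post): a plain gradient step in the ambient space followed by a retraction back onto the Stiefel manifold via the polar factor. The matrix sizes and learning rate are arbitrary placeholders.

```python
import numpy as np

def project_to_stiefel(W):
    """Retract a matrix onto the Stiefel manifold (W^T W = I) via its polar
    factor, computed from the thin SVD."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def projected_gradient_step(W, grad, lr=0.1):
    """Classic projected gradient descent: take a plain gradient step in the
    ambient space, then project back onto the constraint set."""
    return project_to_stiefel(W - lr * grad)

# Toy usage: start on the manifold, take one step, check the constraint holds.
rng = np.random.default_rng(0)
W = project_to_stiefel(rng.standard_normal((8, 4)))
grad = rng.standard_normal((8, 4))                 # stand-in for a real gradient
W_next = projected_gradient_step(W, grad)
print(np.allclose(W_next.T @ W_next, np.eye(4)))   # True: columns stay orthonormal
```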
Might you please elaborate on this? I recognize that "artificial neural networks are lossy de/compression algorithms" does not enumerate the nuances of these structures, but am curious whether anything in particular is both interesting and missing from SLT.
Sounds like it might help for online RL training regimes, as those are naturally quite vulnerable to overfitting.
Higher LR does not mean there’s overfitting.
What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
What's your point? Sometimes things need to be retried. Sometimes there are small, subtle details that make or break an idea. So what's the point of acting dismissively? If an old idea that didn't work now works, then what's the problem? Besides, progress is typically iterative, not made in leaps and bounds. The vast majority of things that look like leaps only look that way because we don't see the steps in between.
The reason I'm saying this is that this sentiment is often used to pass over working solutions and slow down their progress. So even if unintentional, it should cause us to rethink how we respond. Otherwise we end up with silly claims like "Einstein just used tensors" and "Nash just used topology". In some sense these are accurate, but they are descriptions at too high a level (and these are real dismissals; which, again, so what? If it works, it works).
Why is "novelty" even a question? Novelty is only ever in the eyes of the beholder.
Honestly, I do not know, but I'll give you my best read on it.
1) Scale: Don't underestimate the importance of this. While I don't think scale is all you need, it certainly is a critical factor.
2) Different optimization: I may be missing something, but it looks like they are using a different optimizer. They mention that they're using the Muon optimizer while constraining weights to a Stiefel manifold. Neither of those things is unique on its own, but is their combination? This is where I'm uncertain, because such a thing would be easy to miss. Maybe someone tried it and was unsuccessful. Maybe someone did, but not at scale. Maybe someone did, it worked, and nobody noticed (that happens a lot!).
So I think this is quite similar to how 99% of progress and breakthroughs are made: putting together ideas that seem unrelated and inventing some glue to generalize the process. At a high level this always looks like you're just putting existing things together, but that glue is really hard to make. And to continue that analogy, if we do a good enough job gluing things together then to anyone but an expert it'll look like there is no glue. It can be surprisingly difficult to tell if something is glued, welded, mated, milled, printed, or whatever. It usually takes a very keen eye to determine the answer non-destructively.
> I figured out how to solve manifold Muon in the square case late last year, but I was unable to solve the full rectangular case and thus posed the problem as an open problem on the Modula docs. Jianlin Su solved the problem this summer
It sounds like the generalisation of projected gradient descent to "Muon" is what they're focusing on, but the derivation is all about the retraction map on the Stiefel manifold? I think I'm missing some background here.
If it helps, here's the "paper" for the Muon optimizer[0] and here's a follow-up[1]. Muon is definitely a gradient descent technique, but so are Adam, SGD, Ada, and many more[2].
The big thing in Muon is the Newton-Schulz step, NS_5. You update parameters as θ_t = θ_{t-1} - η·[NS_5(μB_{t-1} + ∇L(θ_{t-1}))]. I bracketed it so you can see that this is just a specific instance of θ_t = θ_{t-1} - η·F(∇L(θ_{t-1}), ...), and standard gradient descent, θ - η∇L(θ), belongs to that same class of updates. So we should be careful not to over-generalize and say that this is "just gradient descent". By that logic you could even say [1] is "just [0] but with weight decay" (or go look at the Adam and AdamW algorithms ;)
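To make that update concrete, here is a small illustrative sketch of a Muon-style step on a single 2-D weight matrix: a momentum buffer followed by approximate orthogonalization with a quintic Newton-Schulz iteration. The coefficients are the ones quoted in the Muon write-up [0]; the step size, momentum value, and function names here are placeholders, not the canonical implementation.

```python
import numpy as np

def newton_schulz_5(G, steps=5, eps=1e-7, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately orthogonalize G (push its singular values toward 1) with a
    quintic Newton-Schulz iteration. Coefficients quoted from the Muon write-up [0];
    treat the whole routine as illustrative."""
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + eps)   # Frobenius-normalize so the iteration converges
    flipped = X.shape[0] > X.shape[1]
    if flipped:                         # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flipped else X

def muon_style_step(theta, grad, buf, lr=0.02, mu=0.95):
    """One update for a single weight matrix, matching the form above:
    buf <- mu*buf + grad;  theta <- theta - lr * NS_5(buf)."""
    buf = mu * buf + grad
    return theta - lr * newton_schulz_5(buf), buf
```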
But one thing I should add is that gradient descent algorithms aren't topologically aware. I was able to find a post that asks a related question: under what conditions does a surface's geodesic align with gradient descent (note that Newton's method differs from GD too). I don't think this paper creates a solution where the GD formulation ends up following a geodesic to the minimum, but my take is that it is working in that direction. And to clarify, we'd want to follow the geodesic because it gives the shortest or most energy-efficient path (whichever perspective you prefer). In optimization we want to accomplish these two things (and more!): 1) take the "best" path to an optimum, and 2) find the best optimum. Unfortunately these are ill-defined and there aren't always objective answers. But in an ideal gradient descent algorithm we'd want to go to the global minimum and take the fastest path, right? So it helps to be aware of the geometry (part of why people look at the Hessian, though that comes at the cost of increased computation, even if the additional information gets us there in fewer steps; so that's not (always) "the best").
I know this isn't a full answer, and maybe with more reading I'll have a better one for you. But I'm hoping my answer can at least help you see some of the underlying, nuanced problems that (_I think_) the authors are trying to get at. Hopefully I'm not too far off base lol. I'm hoping someone with more expertise can jump in and provide corrections/clarifications in the meantime.
[0] https://kellerjordan.github.io/posts/muon/
[1] https://arxiv.org/abs/2502.16982
[2] (far from a complete list) https://docs.pytorch.org/docs/stable/optim.html#algorithms
[3] (I think similar types of questions may also be fruitful) https://mathoverflow.net/questions/42617/functions-whose-gra...
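Continuing the "geometry-aware" point above, here is a minimal sketch (my own, not from the paper or the Muon write-up) of a Riemannian-style step on the Stiefel manifold: project the ambient gradient onto the tangent space at W, take the step, then retract with the polar factor. It is only meant to contrast with a plain Euclidean step; the metric and retraction choices are assumptions.

```python
import numpy as np

def tangent_project(W, G):
    """Project an ambient gradient G onto the tangent space of the Stiefel
    manifold at W (Euclidean metric): G - W * sym(W^T G)."""
    WtG = W.T @ G
    return G - W @ ((WtG + WtG.T) / 2)

def riemannian_step(W, G, lr=0.1):
    """A geometry-aware step: project the gradient onto the tangent space at W,
    move, then retract back onto the manifold via the polar factor (thin SVD)."""
    U, _, Vt = np.linalg.svd(W - lr * tangent_project(W, G), full_matrices=False)
    return U @ Vt
```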
It remains to be seen if it works better than conventional training schemes.
Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the margin, etc.
Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js - but don't know more details)
Thanks
Specifically, we linearize the emergent KVQ operations of an arbitrary prompt in any arbitrary model by way of interleaving error-correcting code (ECC).
ECC tokens are out-of-band tokens, e.g., from Unicode's Private Use Area (PUA), interleaved with raw context tokens. This construction induces an in-context associative memory.
Any sort of interleaved labeling basis, e.g., A1, quick brown fox, A2, jumped lazy dog, induces a similar effect for chaining recall & reasoning more reliably.
This trick works because PUA tokens are generally untrained, hence their initial embeddings are still random Gaussian w.h.p. Similar effects can be achieved by simply using token combos unlikely to exist, and in practice these are often more effective, since PUA characters (like emojis or Mandarin characters) are often 2, 3, or 4 tokens after tokenization, vs. codeword combos like zy-qu-qwerty every k content tokens, where k can be variable.
Building attention architectures using modular manifolds in white/gray-box models, as this new work does, vs. prompt-based black-box injection is a natural next step, so we can at least anecdotally validate what they're building ahead of the next paper or two.
Which is all to say, absolutely great to see others building in this way!
Suppose we use 3 codeword lanes per codeword, which is our default. Each lane of tokens is based on some prime p, so collectively they form a CRT-driven codeword (Chinese Remainder Theorem). This is discretely equivalent to labeling every k tokens with a globally unique indexing grammar.
That interleaving also corresponds to a triple of adjacent, roughly orthogonal embeddings, since those tokens still retain random Gaussian embeddings. The net effect is that we similarly slice the latent space into a spaced chain of modular manifolds every k content tokens.
We also refer to that interleaving as Stiefel frames, for reasons similar to those in the post. We began work this spring or so to inject that construction inside the model, with early results pointing in a similar direction to what the post describes. That's another way of saying this sort of approach lets us make that chained atlas (wc?) of modular manifolds as tight as possible within the dimensional limits of the embedding, floating-point precision, etc.
We somewhat tongue-in-cheek refer to this as the "retokenization group" at the prompt level, in reference to the renormalization group / tensor nets / etc. "Relayering group", or perhaps "reconnection group", is the same intuition at the architecture level.
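For what it's worth, a toy sketch of the CRT-style labeling described above might look like the following; the lane primes, the codeword placeholder, and the helper names are all invented here for illustration and are not the poster's actual scheme.

```python
# Three "lanes", one small prime each; labels are unique for the first
# 5*7*11 = 385 codeword positions by the Chinese Remainder Theorem.
PRIMES = (5, 7, 11)

def crt_label(block_index):
    """Label a block by its residues modulo each lane's prime."""
    return tuple(block_index % p for p in PRIMES)

def interleave(tokens, k=4):
    """Insert a codeword placeholder carrying the CRT label every k content tokens."""
    out = []
    for i in range(0, len(tokens), k):
        out.append(("CODEWORD", crt_label(i // k)))
        out.extend(tokens[i:i + k])
    return out

print(interleave("the quick brown fox jumped over the lazy dog".split(), k=3))
```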
You are talking about latent space during inference, not weight space during training, and you are talking about interleaving tokens with random Gaussian tokens, not constraining values to lie on a manifold within a larger space. Whether or not the thing you are describing is meaningful or useful, it is basically unrelated to the original article, and you are not using the term "modular manifold" to refer to the same thing.
We're already working on pushing the construction deeper into the model, both architecture and training. Currently that's for fine-tuning, and ultimately full architecture shrinkage/pruning and raw training vs. just fine-tuning, etc.
And it was just great to see someone else using modular manifolds, even if they are using them at the training stage vs. the inference stage. They're exploiting the modular form at training time; we're doing it at inference time. Cool to see.
[0] https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...
If you like to scroll on mobile :)
I find the idea compelling, but it's too early to know if it will work well at scale, you know, with large models, in the real world.
--
[a] https://en.wikipedia.org/wiki/Condition_number
--
EDIT: On the other hand, yesterday I saw a paper about doing basically the opposite: letting unit values in DNNs get as big or small as they need to get... by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu. I also found this opposing idea oddly compelling, but I don't know how well it works either, because it hasn't been tested at scale.
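As a toy illustration only (not the linked paper's actual method): a complex logarithm can represent any nonzero real value, including negatives, while compressing enormous dynamic ranges into modest real parts.

```python
import numpy as np

# Values spanning ~21 orders of magnitude, including negatives.
x = np.array([-1e12, -0.5, 3e-9, 7.0])
z = np.log(x.astype(complex))   # ln|x| + i*pi for negative inputs
print(z)                        # real parts stay modest; the sign lives in the imaginary part
print(np.exp(z).real)           # round-trips back to the original values
```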
Their modular norm paper (https://arxiv.org/abs/2405.14813) has several more examples; see their appendix D in particular, but those are also mystifying. Yes, they're interested in how things scale, but am I the only one to whom it seems that the training losses they report are just not competitive with what's currently being used?