Language Models Pack Billions of Concepts Into 12k Dimensions
Posted 4 months ago · Active 4 months ago
Source: nickyoder.com
Key topics: AI, Language Models, Mathematics
The article discusses how language models can pack billions of concepts into a relatively low-dimensional space, sparking a discussion on the validity of this claim and the underlying mathematics.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion. First comment after 1h; peak period: 96 comments in the 0-12h window; average 17.5 comments per period. Distribution based on 140 loaded comments.
Key moments
- Story posted: Sep 14, 2025 at 11:54 PM EDT (4 months ago)
- First comment: Sep 15, 2025 at 1:21 AM EDT (1h after posting)
- Peak activity: 96 comments in the 0-12h window, the hottest period of the conversation
- Latest activity: Sep 21, 2025 at 8:27 AM EDT (4 months ago)
ID: 45245948 · Type: story · Last synced: 11/20/2025, 6:56:52 PM
In all seriousness, this is not at all how resilience to cosmic interference works in practice, and the probability of any executed instruction or even any other bit being flipped is far greater than the one specific bit you are addressing.
Edit: this is wrong as respondents point out. Clearly I shouldn't be commenting before having my first coffee.
Also, why do you believe dot product cannot be trusted?
But it's quite similar to what the top comment is saying about spherical codes. I think my comment is also about using coding theory to represent concepts.
Other than that, I don't have any issue with dot product over bitvectors - it's just not very useful for the above.
In the case of binary vectors, don't forget you are working with the finite field of two elements {0, 1}, and use XOR.
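For concreteness, here is a minimal sketch (mine, not from the thread) of what working over GF(2) with XOR looks like next to a plain dot product on bitvectors:

```python
# Minimal sketch (not from the thread) of comparing bitvectors with XOR/Hamming
# distance rather than a plain dot product; Python ints stand in for bitvectors.

def hamming_distance(a: int, b: int) -> int:
    """Number of positions where the two bitvectors differ: XOR, then popcount."""
    return bin(a ^ b).count("1")

def dot_product_bits(a: int, b: int) -> int:
    """Dot product over {0, 1}: positions where both bits are 1 (AND, then popcount)."""
    return bin(a & b).count("1")

a = 0b1011_0110
b = 0b1001_1110
print(hamming_distance(a, b))   # 2 positions disagree
print(dot_product_bits(a, b))   # 4 positions where both bits are 1
```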
A lot of the ideas are explored in more detail in Anthropic's 2022 paper that's one of the foundational papers in SAE research: https://transformer-circuits.pub/2022/toy_model/index.html
If somebody's paper does not get assigned as mandatory reading for random reviewers, but people read it anyway and cite it in their own work, they're doing a form of post-publication peer review. What additional information do you think pre-publication peer review would give you?
The sloppiness of the circuits thread blog posts has been very damaging to the health of the field, in my opinion. People first learn about mech interp from these blog posts, and then they adopt a similarly sloppy style in discussion.
Frankly, the whole field currently is just a big circle jerk, and it's hard not to think these blog posts are responsible for that.
I mean do you actually think this kind of slop would be publishable in NeurIPS if they submitted the blog post as it is?
In theory, yes. Let's not pretend actual peer review would do this.
Peer review has nothing to do with "being published in some fancy-looking formatted PDF in some journal after passing an arbitrary committee" or whatever, it's literally review by your peers.
Now, do I have problems with this specific paper and how it's written in a semi-magical way that surely requires the reader suspend disbelief? For sure, but that's completely independent of the "peer-review" aspect of it.
Reviewing a paper can easily take 3 weeks full time work.
Looking at a paper and assuming it is correct, followed by citing it, can literally take seconds.
I'm a researcher and there are definitely two modes of reading papers: review mode and usage mode.
A few thoughts:
(1) As others have commented, I think peer review in ML is pretty widely accepted to be dysfunctional right now. I think most people who have published in ML conferences would agree. It's not unusual for early PhD students and sometimes even undergrads to review, and reviewers are overburdened to the point where they can't carefully consider all their papers. Everything I've said so far is just anecdote and opinion though. A more objective test was the NeurIPS 2021 Consistency Experiment ( https://blog.neurips.cc/2021/12/08/the-neurips-2021-consiste... ) which found that if a paper was accepted by the conference, there was only a ~50% chance that a parallel review process would come to the same conclusion.
(2) Peer review as we know it is a relatively recent invention, arising in post-WW2 science as the scientific community grew dramatically, and there was a need for more systematized ways to make decisions about publication, funding, jobs, etc. Famously, Einstein was offended by one of his papers being sent for review. I don't think it's at all obvious that this transition has been good for science! I see lots of people writing papers for reviewers, rather than with the goal of doing the most impactful science they can.
(3) As background, I spent 5 years of my life running a scientific journal ( https://distill.pub/ ), trying to have excellent review processes and enable non-traditional papers to be peer reviewed. I honestly just burnt out on this. Now I just want to do good research.
(4) We do circulate draft papers to researchers working on similar topics at other industry groups, and in academia. As other comments have noted, this sometimes leads to public comments on our papers. In many cases, these are a much deeper review than you'd see in typical peer review processes, such as independent reproduction of experiments.
It means you get 12,000! (factorial) concepts in the limit case, more than enough room to fit a taxonomy.
If we assume that a "concept" is something that can be uniquely encoded as a finite string of English text, you could go up to concepts that are so complex that every single one would take all the matter in the universe to encode (so say 10^80 universes, each with 10^80 particles), and out of 10^43741 concepts you’d still have 10^43741 left undefined.
Things like novels come from that space. We sample it all the time. Extremely, extremely sparsely, of course.
Or to put it another way, identifying a specific element of a space of a given size takes log2 of that size in bits, not something the size of the space itself. 10^43741 is a very large space by our standards, but the log2 of it is not impossibly large.
If it seems weird for models to work in this space, remember that, as the models themselves in their full glory clock in at multiple hundreds of gigabytes, the space of possible AIs using this neural architecture is itself 2^trillion-ish, which makes 10^43741 look pedestrian. Understanding how to do anything useful with that amount of possibility is quite the challenge.
Instead, a naive mental model of a language model is to have a positive, negative or zero trit in each axis: 3^12,000 concepts, which is a much lower number than 12000!. Then in practice, almost every vector in the model has all but a few dozen identified axes zeroed because of the limitations of training time.
I’m not even referring to the values within that subspace yet, and so once you pick a concept you still get the N degrees of freedom to create a complex manifold.
The main value of the mental model is to build an intuition for how “sparse” high dimensional vectors are without resorting to a 3D sphere.
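A quick back-of-the-envelope computation (my own, under the trit mental model above) comparing the two counts via their base-10 logarithms:

```python
# Back-of-the-envelope comparison of the counts discussed above, via base-10 logs.
import math

d = 12_000
log10_trits = d * math.log10(3)                      # log10(3^12000)
log10_factorial = math.lgamma(d + 1) / math.log(10)  # log10(12000!)

print(f"3^{d}   ~ 10^{log10_trits:.0f}")       # roughly 10^5725
print(f"{d}! ~ 10^{log10_factorial:.0f}")      # roughly 10^43741
```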
There’s a lot of devil in this detail.
So the claim that "This research suggests that current embedding dimensions (1,000-20,000) provide more than adequate capacity for representing human knowledge and reasoning." is way too optimistic in my opinion.
Btw. this is not an original idea from the linked blog or the youtube video it references. The relevance of this lemma for AI (or at least neural machine learning) was brought up more than a decade ago by C. Eliasmith as far as I know. So it has been around long before architectures like GPT that could actually be realistically trained on such insanely high dimensional world knowledge.
In classification tasks, each feature is normalized independently. Otherwise an entry with features Foo and Bar would, depending on the value of Bar, be made out to be less Foo after normalization.
These vectors are not normalized onto n-spheres, and their codomain ends up being a hypercube.
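A small illustration (my own sketch, assuming NumPy) of the two regimes: per-feature scaling maps rows into a hypercube, while per-row L2 normalization puts them on the unit sphere.

```python
# Illustration of the two normalization regimes described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3)) * np.array([1.0, 10.0, 100.0])  # features on very different scales

# Per-feature min-max scaling: each column lands in [0, 1], so rows live in a hypercube.
X_cube = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Per-row L2 normalization: each row lands on the unit sphere.
X_sphere = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_cube.min(), X_cube.max())        # 0.0 and 1.0
print(np.linalg.norm(X_sphere, axis=1))  # all ~1.0
```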
Re: distance metrics and curvilinear spaces and skew coordinates: https://news.ycombinator.com/item?id=41873650 :
> How does the distance metric vary with feature order?
> Do algorithmic outputs diverge or converge given variance in sequence order of all orthogonal axes? Does it matter which order the dimensions are stated in; is the output sensitive to feature order, but does it converge regardless? [...]
>> Are the [features] described with high-dimensional spaces really all 90° geometrically orthogonal?
> If the features are not statistically independent, I don't think it's likely that they're truly orthogonal; which might not affect the utility of a distance metric that assumes that they are all orthogonal
Which statistical models disclaim that their output is insignificant if used with non-independent features? Naive Bayes, Linear Regression and Logistic Regression, LDA, PCA, and linear models in general are unreliable with non-independent features.
What are some of the hazards of L1 Lasso and L2 Ridge regularization? What are some of the worst cases with outliers? What does regularization do if applied to non-independent and/or non-orthogonal and/or non-linear data?
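As a toy demonstration of that unreliability (my own sketch, not from the thread; assumes NumPy): ordinary least squares with two nearly collinear features gives coefficients that swing wildly under tiny perturbations, even though predictions barely move.

```python
# Toy sketch: OLS with two nearly collinear features. Individual coefficients
# are unstable, even though their sum (and the predictions) stay roughly fixed.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)            # almost a copy of x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)   # true signal depends only on x1

for _ in range(3):
    jitter = 1e-3 * rng.normal(size=n)         # tiny perturbation of the second feature
    A = np.column_stack([x1, x2 + jitter])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coef, coef.sum())                    # coefficients jump around; their sum stays near 3
```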
Impressive but probably insufficient because [non-orthogonality] cannot be so compressed.
There is also the standing question of whether there can be simultaneous encoding in a fundamental gbit.
Johnson-Lindenstrauss guarantees a distance-preserving embedding of a finite set of points into a space whose dimension depends on the number of points.
It does not say anything about preserving the underlying topology of the continuous high-dimensional manifold; that would be Takens/Whitney-style embedding results (and Sauer–Yorke for attractors).
The embedding dimensions needed to fulfil Takens are related to the original manifold's dimension and not the number of points.
It's quite probable that we observe violations of topological features of the original manifold when using our too-low-dimensional embedded version to interpolate.
I used AI to sort the hodgepodge of math in my head into something another human could understand; the edited result is below:
=== AI in use === If you want to resolve an attractor down to a spatial scale rho, you need about n ≈ C * rho^(-d_B) sample points (here d_B is the box-counting/fractal dimension).
The Johnson–Lindenstrauss (JL) lemma says that to preserve all pairwise distances among n points within a factor 1±ε, you need a target dimension
k ≳ (d_B / ε^2) * log(C / rho).
So as you ask for finer resolution (rho → 0), the required k must grow. If you keep k fixed (i.e., you embed into a dimension that’s too low), there is a smallest resolvable scale
rho* (roughly rho* ≳ C * exp(-(ε^2/d_B) * k), up to constants),
below which you can’t keep all distances separated: points that are far on the true attractor will show up close after projection. That’s called “folding” and might be the source of some of the problems we observe.
=== AI end ===
Bottom line: JL protects distance geometry for a finite sample at a chosen resolution; if you push the resolution finer without increasing k, collisions are inevitable. This is perfectly consistent with the embedding theorems for dynamical systems, which require higher dimensions to get a globally one-to-one (no-folds) representation of the entire attractor.
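A quick numerical check of that bottom line (my sketch, assuming NumPy): project a fixed finite sample with a random Gaussian map into k ~ eps^-2 * log(n) dimensions and look at how far pairwise distances move.

```python
# Numerical check: a random Gaussian projection into k ~ eps^-2 * log(n) dimensions
# keeps the pairwise distances of a finite sample within roughly a 1 +/- eps factor.
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 500, 2_000, 0.3
k = int(np.ceil(8 * np.log(n) / eps**2))     # target dimension, ~553 here

X = rng.normal(size=(n, d))                  # n points in d dimensions
P = rng.normal(size=(d, k)) / np.sqrt(k)     # random projection, scaled to preserve norms on average
Y = X @ P

def pairwise_dists(Z):
    sq = (Z ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))

mask = ~np.eye(n, dtype=bool)
ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
print(ratios.min(), ratios.max())            # typically well inside (1 - eps, 1 + eps)
```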
If someone is bored and would like to discuss this, feel free to email me.
[1] https://en.m.wikipedia.org/wiki/Map_projection
Violation of topology means that a surface is wrongly mapped to one intersecting itself: think Klein bottle.
https://en.wikipedia.org/wiki/Klein_bottle
Like many things in ML, this might be a problem in theory but empirically it isn’t important, or is very low on the stack rank of issues with our models.
It's a problem for the simplest of reasons: information is lost. You cannot reconstruct the original topology.
In terms of the model, it now can't distinguish between what were completely different regions.
From the Klein bottle perspective, a 4D shape gets projected into a 3D shape. On most of the bottle, there is still a 1 to 1 topological mapping from 3D to 4D versions.
But where two surfaces now intersect, there is now no way to distinguish between previously unrelated information. The model won’t be able to do anything sensible with that.
TLDR; We don't like folding.
Beyond that, this mathematical observation is genuinely fascinating. It points to a crucial insight into how large language models and other AI systems function. By delving into the way high-dimensional data can be projected into lower-dimensional spaces while preserving its structure, we see a crucial mechanism that allows these models to operate efficiently and scale effectively.
Reading LLM text feels a lot like watching a Dragon Ball Z filler episode.
[0] - https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
I liked this bit, among others:
> LLMs overuse the 'rule of three'—"the good, the bad, and the ugly". This can take different forms from "adjective, adjective, adjective" to "short phrase, short phrase, and short phrase".[2]
> While the 'rule of three', used sparingly, is common in creative, argumentative, or promotional writing, it is less appropriate for purely informational texts, and LLMs often use this structure to make superficial analyses appear more comprehensive.
> Examples:
> "The Amaze Conference brings together global SEO professionals, marketing experts, and growth hackers to discuss the latest trends in digital marketing. The event features keynote sessions, panel discussions, and networking opportunities."
> While exploring this question, I discovered something unexpected that led to an interesting collaboration with Grant and a deeper understanding of vector space geometry.
> When I shared these findings with Grant, his response exemplified the collaborative spirit that makes the mathematics community so rewarding. He not only appreciated the technical correction but invited me to share these insights with the 3Blue1Brown audience. This article is that response, expanded to explore the broader implications of these geometric properties for machine learning and dimensionality reduction.
> The fascinating history of this result speaks to the interconnected nature of mathematical discovery.
> His work consistently inspires deeper exploration of mathematical concepts, and his openness to collaboration exemplifies the best aspects of the mathematical community. The opportunity to contribute to this discussion has been both an honor and a genuine pleasure.
I don't know how to express it, maybe it's because I'm not a native English speaker, but my brain has become used to this kind of tone in AI-generated content and I find it distracting to read. I don't mean to diminish this blog post, which is otherwise very interesting. I'm just pointing out an increasing (and understandable) trend of relying on AI to "improve" prose, but I think it sometimes leads to a uniformity of style, which I find a bit sad.
There's room for lyricism and artistry in poetry or other forms of writing meant to entertain or evoke particular feelings, but not in a thesis. A thesis should get to the point using the path of least action. It's the difference between calling something a "many-sided pseudopolygon with no vertices halfway in tint between blue and red" and calling something a "purple circle."
It is completely reasonable to read that signal, and completely reasonable to conclude that you shouldn't ask me to care more about your content than you did as the "creator".
Moreover, it suggests that your "content" may just be a prompt to an AI, and there is no great value to caching the output of an AI on the web, or asking me to read it. In six months I could take the same prompt and get something better.
Finally, if your article looks like it was generated by AI, AI is still frankly not really at the point where long form output of it on a technical concept is quite a safe thing to consume for deep understanding. It still makes a lot of little errors fairly pervasively, and significant errors not only still occur often, but when they do they tend to corrupt the entire rest of the output as the attention mechanism basically causes the AI to justify its errors with additional correct-sounding output. And if your article sounds like it was generated with AI, am I justified in assuming that you've prevented this from happening with your own expertise? Statistically, probably not.
Disliking the "default tone" of the AI may seem irrational but there are perfectly rational reasons to develop that distaste. I suppose we should be grateful that LLM's "default voice" has been developed into something that wasn't used by many humans prior to them staking a claim on it. I've heard a few people complain about getting tagged as AIs for using emdashes or bullet points prior to AIs, but it's not a terribly common complaint.
I've also always been proud of using em and en dashes correctly—including using en dashes for ranges like 12–2pm—but nearly everyone thinks they're an LLM exclusive... so now I really go out of my way to use them just out of spite.
In fact they can pack complete poems, with or without typos, and you can ask where in the poem the typo is; that is exactly what happens if you paste one into GPT: somewhere in an internal layer it will distinguish exactly that.
So nothing earth shattering here. The article is also filled with words like "remarkable", "fascinating", "profound", etc. that make me feel like some level of subliminal manipulation is going on. Maybe some use of an LLM?
Semantic similarity in embedding space is a convenient accident, not a design constraint. The model's real "understanding" emerges from the full forward pass, not the embedding geometry.
The SAE's job is to try to pull apart the sparse nearly-orthogonal 'concepts' from a given embedding vector, by decomposing the dense vector into sparse activations over an over-complete basis. They tend to find that this works well, and even allows matching embedding spaces between different LLMs efficiently.
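For readers new to the setup, a minimal sketch of that decomposition (my own simplification, assuming PyTorch; not Anthropic's actual code): a wide ReLU encoder produces sparse activations over an over-complete dictionary, and a linear decoder reconstructs the dense embedding.

```python
# Minimal sparse-autoencoder sketch (simplified; hypothetical dimensions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # embedding -> over-complete feature activations
        self.decoder = nn.Linear(d_dict, d_model)  # activations -> reconstructed embedding

    def forward(self, x):
        z = torch.relu(self.encoder(x))            # sparse, non-negative activations
        return self.decoder(z), z

sae = SparseAutoencoder()
x = torch.randn(32, 768)                           # stand-in for a batch of residual-stream vectors
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()   # reconstruction + L1 sparsity penalty
print(loss.item(), (z > 0).float().mean().item())          # loss and fraction of active features
```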
Well yes, humans have real-world concepts.
> Semantic properties do not require any human-level understanding
Strawman.
> a Python script has specific semantics one may use to discuss its properties
These are human-attributed semantics. To say that a static script "has" semantics is a category mistake--certainly it doesn't "have" them the way LLMs are purported by the OP to have concepts.
> it has become increasingly clear that LLMs can reason (as in derive knowable facts, extract logical conclusions, compare it to different alternatives
These are highly controversial claims. LLMs present conclusions textually that are implicit in the training data. To get from there to the claim that they can reason is a huge leap. Certainly we know (from studies by Anthropic and elsewhere) that the reasoning steps that LLMs claim to go through are not actual states of the LLM.
I'm not going to say more about this ... it has been discussed at length in the academic literature.
Of course it can -- pattern matching matches on context.
If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2---that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,888 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
Pointing out the errors is a more helpful way of stating problems with the article, which you have also done.
In that particular picture, you're probably correct to interpret it as C vs N as stated.
It's a very helpful way of saying it shouldn't be bothered to be read. After all, if they couldn't be bothered to write it, I can't be bothered to read it.
I have been using embeddings for almost a decade and am well versed with their intricacies. I think this article has merit. The direction of the investigation and the conclusion are interesting; it's good to have people thinking about how many distinct concepts can be packed into our usual embedding dimensions. I wonder how small you can make embeddings before a model becomes noticeably worse, given constant parameter count.
If your content is as bad as AI slop it doesn't really matter if it is or not, but I think it's safe to assume that when a verbose and grandiose post is internally inconsistent and was written after 2022, it's slop[0]
0 https://pyxis.nymag.com/v1/imgs/f0e/0bb/d9346e02d8d7173a6a9d...
Not really.
I think the first chapter of [1] is a good introduction to general facts about high-dimensional stuff. I think this is where I first learned about "high-dimensional oranges" and so on.
For something more specifically about the problem of "packing data into a vector" in the context of deep learning, last year I wrote a blog post meant to give some exposition [2].
One really nice approach to this general subject is to think in terms of information theory. For example, take the fact that, for a fixed epsilon > 0, we can find exp(C d) vectors in R^d with all pairwise inner products smaller than epsilon in absolute value. (Here C is some constant depending on epsilon.) People usually find this surprising geometrically. But now, say you want to communicate a symbol by transmitting d numbers through a Gaussian channel. Information theory says that, on average, I should be able to use these d numbers to transmit C d nats of information. (C is called the channel capacity, and depends on the magnitude of the noise and e.g. the range of values I can transmit.) The statement that there exist exp(C d) vectors with small inner products is related to a certain simple protocol to transmit a symbol from an alphabet of size exp(C d) with small error rate. (I'm being quite informal with the constants C.)
[1] https://people.math.ethz.ch/~abandeira//BandeiraSingerStrohm... [2] https://cgad.ski/blog/when-numbers-are-bits.html
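A tiny experiment behind the "exponentially many nearly orthogonal vectors" fact (my sketch, assuming NumPy): sample random unit vectors in R^d and check the largest pairwise inner product.

```python
# Random unit vectors in R^d are nearly orthogonal with high probability,
# which is the phenomenon behind the exp(C d) packing fact discussed above.
import numpy as np

rng = np.random.default_rng(0)
d, n = 1_000, 2_000
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # n random points on the unit sphere

G = V @ V.T                                     # all pairwise inner products
np.fill_diagonal(G, 0.0)
print(np.abs(G).max())                          # around 0.15-0.2 here, nowhere near 1
```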
I’m not even sure why that would be a problem, because he’s projecting N basis vectors into K dimensions and C is some measure of the error this introduces into mapping points in the N-space onto the K-space, or something. I’m glad to be shown why this is inconsistent with the graph, but your argument doesn’t talk about this idea at all.
C is meant to be the smallest constant so that, for each (N, k, epsilon) with k > C epsilon^-2 log N and epsilon > 0, there exists some arrangement of N vectors in R^k with inner products smaller than epsilon in absolute value. For each (N, k), the author optimized epsilon and reported the empirical value k epsilon^2 / log N, which is the smallest value of C for which our condition holds restricted to the given values of (N, k). (Of course, this is my good-faith interpretation---the article introduces C in the context of a JL distortion bound, and it takes an extra step to turn that into a bound on inner products.)
It can be shown that C = 4 satisfies this condition, when log is the natural log. See [1], for example. Based on the graph, the article claims to do better: "for sufficiently large spaces," it says we can put C = 0.2. This would be a very significant improvement.
For k = 2, the graph shows that C will be lower than 0.2 for sufficiently large N. (The color scheme is confusing! The line for k = 2 is the one that starts at just under 0.8 when N = 1.) Already for k = 3, the graph doesn't give us reason to believe it will be lower than 0.2---you correctly observed it gets to around 0.3. For larger values of k, the graph doesn't seem to show what we can expect for large N: the curves go up, but do not come down. This is what I meant with my comment: the conclusion that C <= 0.2 as N -> infinity is only justified by the behavior at k = 2.
Now, do these results make sense? In the case k = 2, we're trying to put a large number (N) of vectors on the unit circle, and thinking about how small the maximum inner product (epsilon) between any pair of vectors can be. As N -> infinity, the vectors will be packed very tightly and the maximum inner product epsilon will come close to 1. Overall, C = k epsilon^2 / log N will become arbitrarily small. In fact, the same happens for every k.
So, just in connection to this graph, the article makes three mistakes:
1) The article's interpretation of its experiment is wrong: the graph alone doesn't show that C < 0.2 for "large spaces".
2) However, it should be obvious a priori that, for all values of k, the reported values C should converge to 0 for large N (albeit very slowly, at a rate of 1/log N).
3) Unfortunately, this doesn't tell us anything about the minimum value of k / log(N) for a given epsilon, and so it doesn't support the conclusion of the article.
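To make point (2) concrete, a quick computation for k = 2 (my sketch, assuming NumPy): spread N directions over a half-circle, take epsilon as the largest pairwise |inner product|, and watch the reported C = k * epsilon^2 / log(N) drift toward zero as N grows.

```python
# Point (2) for k = 2: any concrete arrangement of N unit vectors upper-bounds the
# optimal epsilon, and hence the reported C = k * eps^2 / log(N). Spreading N
# directions over a half-circle gives eps = cos(pi / N), pinned near 1 for large N.
import numpy as np

k = 2
for N in [100, 10_000, 1_000_000, 10**9]:
    eps = np.cos(np.pi / N)            # closest pair of directions dominates
    C = k * eps**2 / np.log(N)
    print(f"N = {N:>13,}   eps ~ {eps:.6f}   C ~ {C:.3f}")
# C falls below 0.2 only because eps is stuck near 1 while log(N) keeps growing.
```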
The problem with this kind of LLM-driven article is that it gives uncareful work the _appearance_ of careful work but none of the other qualities that usually come with care.
[1] https://lmao.bearblog.dev/exponential-vectors/
I do think some more rigor is needed here (especially wrt whether his derivation for C is sensical, as you allude to, since it obviously can’t converge to 0) but I hope we can agree that we are well into the “amateur who has trouble presenting as such” territory rather than the “AI slop” territory.
People do indeed rediscover previously existing math, especially when the old content is hidden under non-obvious jargon.
LLMs are designed for Western concepts of attributes, not holistic, or Eastern. There's not one shred of interdependence, each prediction is decontextualized, the attempt to reorganize by correction only slightly contextualizes. It's the object/individual illusion in arbitrary words that's meaningless. Anyone studying Gentner, Nisbett, Halliday can take a look at how LLMs use language to see how vacant they are. This list proves this. LLMs are the equivalent of circus act using language.
"Let's consider what we mean by "concepts" in an embedding space. Language models don't deal with perfectly orthogonal relationships – real-world concepts exhibit varying degrees of similarity and difference. Consider these examples of words chosen at random: "Archery" shares some semantic space with "precision" and "sport" "Fire" overlaps with both "heat" and "passion" "Gelatinous" relates to physical properties and food textures "Southern-ness" encompasses culture, geography, and dialect "Basketball" connects to both athletics and geometry "Green" spans color perception and environmental consciousness "Altruistic" links moral philosophy with behavioral patterns"
isn’t learning the probabilistic relationships between tokens an attempt to approximate those exact semantic relationships between words?
https://pubmed.ncbi.nlm.nih.gov/38579270/
edit: looking into this, in terms of the brain and arbitrariness this is likely highly paradoxical, even oxymoronic
>>isn’t learning the probabilistic relationships between tokens an attempt to approximate those exact semantic relationships between words?
This is really a poor manner of resolving the conduit metaphor condition to arbitrary signals, to falsify them as specific, which is always impossible. This is simple linguistics via animal signal science. If you can't duplicate any response with a high degree of certainty from output, then the signal is only valid in the most limited time-space condition and yet it is still arbitrary. CS has no understanding of this.
What should I read to better understand this claim?
> LLMs are the equivalent of circles act using language.
Circled apes?
That's the tip of the iceberg
edit: As CS doesn't understand the parasitic or viral aspects of language and simply idealizes it, it can't access it. It's more of a black box than the coding of these. I can't understand how CS assumed this would ever work. It makes no sense to exclude the very thing that language is and then automate it.
Because there is a large number of combinations of those 12k dimensions? You don’t need a whole dimension for “evil scientist” if you can have a high loading on “evil” and “scientist.” There is quickly a combinatorial explosion of expressible concepts.
I may be missing something but it doesn’t seem like we need any fancy math to resolve this puzzle.
Sometimes these things are patched with cosine distance (or even Pearson correlation), vide https://p.migdal.pl/blog/2025/01/dont-use-cosine-similarity. Ideally we wouldn't need to, and the vectors would properly occupy the space.
I am kind of surprised that the original article does not mention batch normalization and similar operations - these are pretty much created to automatically de-bias and de-correlate values at each layer.
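For reference, a tiny sketch (mine, assuming NumPy) of the similarity measures under discussion; Pearson correlation is just cosine similarity after mean-centering.

```python
# Dot product vs cosine similarity vs Pearson correlation on the same pair of vectors.
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.5, 4.0, 9.0])

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
pearson = np.corrcoef(a, b)[0, 1]   # equals cosine similarity of the mean-centered vectors

print(dot, cosine, pearson)
```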
This article uses theory to imply a high bound for semantic capacity in a vector space.
However, this recent article (https://arxiv.org/pdf/2508.21038) empirically characterizes the semantic capacity of embedding vectors, finding inadequate capacity for some use cases.
These two articles seem at odds. Can anyone help put these two findings in context and explain their seeming contradictions?
(where x is a number dependent on architectural features like MLHA, GQA...)
There is this thing called KV cache which holds an enormous latent state.
Sure, a 12k vector space has a significant number of individual values, but not concepts. This is ridiculous. I mean, Shannon would like to have a word with you.
The magic of many current valuable models is simply that they can combine abstract "concepts" like "ruler" + "male" and get "king."
This is perhaps the easiest way to understand the lossy text compression that constitutes many LLMs. They're operating in the embedding space, so abstract concepts can be manipulated between input and output. It's like compiling C using something like LLVM: there's an intermediate representation. (obviously not exactly because generally compiler output is deterministic).
This is also present in image models: "edge" + "four corners" is square, etc.
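As a toy illustration of that kind of concept arithmetic (made-up 3-dimensional vectors, not real embeddings):

```python
# Toy "ruler + male ~ king" arithmetic with made-up vectors (not real embeddings).
import numpy as np

emb = {
    "ruler": np.array([0.9, 0.1, 0.0]),
    "male":  np.array([0.0, 0.0, 1.0]),
    "king":  np.array([0.8, 0.1, 0.9]),
    "queen": np.array([0.8, 0.1, -0.9]),
    "chair": np.array([0.0, 1.0, 0.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

query = emb["ruler"] + emb["male"]
best = max((w for w in emb if w not in ("ruler", "male")),
           key=lambda w: cosine(query, emb[w]))
print(best)   # "king": the nearest remaining vector to ruler + male
```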
Now try to separate the "learning the language" from "learning the data".
If we have a model pre-trained on language, does it then learn concepts more quickly, at the same rate, or differently?
Can we compress just the data, lossily, into an LLM-like kernel which regenerates the input to a given level of fidelity?
https://lmao.bearblog.dev/exponential-vectors/
For those who are interested in the more "math-y" side of things.
For what it's worth, I don't fully understand the connection between the JL lemma and this "exponentially many vectors" statement, other than the fact that their proof relies on similar concentration behavior.
(And in handling one token, the layers give ~60 chances to mix in previous 'thoughts' via the attention mechanism, and mix in stuff from training via the FFNs! You can start to see how this whole thing ends able to convert your Bash to Python or do word problems.)
Of course, you don't expect it to be 100% space-efficient, detailed mathematical arguments aside. You want blending two vectors with different strengths to work well, and I wouldn't expect the training to settle into the absolute most efficient way to pack the RAM available. But even if you think of this as an upper bound, it's a very different reference point for what 'ought' to be theoretically possible to cram into a bunch of high-dimensional vectors.