The Theoretical Limitations of Embedding-Based Retrieval
Key topics
A new paper explores the theoretical limitations of embedding-based retrieval, sparking a lively debate about the capacity of dense vectors versus sparse models like BM25. Commenters weighed in on the merits of sparse embeddings, with some pointing out that models like Matryoshka embeddings and SPLADE can scale to tens of thousands of dimensions, while others clarified that Matryoshka embeddings aren't truly sparse. The discussion converged on the view that a hybrid approach, combining multiple retrieval methods such as Euclidean embeddings, hyperbolic embeddings, and sparse lexical search, might be the holy grail for optimal performance. The thread is relevant now as it highlights the ongoing quest for more effective information retrieval methods.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 1h after posting
- Peak period: 12 comments in the 12-18h window
- Avg per period: 4.8 comments
- Based on 38 loaded comments
Key moments
- Story posted: Aug 29, 2025 at 4:25 PM EDT
- First comment: Aug 29, 2025 at 5:39 PM EDT (1h after posting)
- Peak activity: 12 comments in the 12-18h window, the hottest stretch of the conversation
- Latest activity: Sep 2, 2025 at 1:43 AM EDT
Sparse models like BM25 have a huge dimension and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions.
It seems like the holy grail is a sparse semantic model. I wonder how SPLADE would do?
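To make the "huge dimension" point concrete, here is a minimal sketch (not from the thread) of BM25 viewed as a vector model: each document becomes a vocabulary-sized sparse vector with non-zero weights only on its own terms, and query scoring is a sparse dot product. The toy corpus, whitespace tokenizer, and parameter values are illustrative only.

```python
import math
from collections import Counter

# Toy corpus; the "dimension" of this model is the vocabulary size.
docs = ["sparse models score exact terms",
        "dense embeddings capture semantics",
        "hybrid retrieval combines sparse and dense signals"]
tokenized = [d.split() for d in docs]
N = len(docs)
avgdl = sum(len(t) for t in tokenized) / N
df = Counter(term for t in tokenized for term in set(t))
k1, b = 1.5, 0.75  # standard BM25 hyperparameters

def bm25_vector(tokens):
    """Precompute a document as a sparse {term: weight} vector over the vocabulary."""
    tf = Counter(tokens)
    return {t: (math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)        # IDF
                * tf[t] * (k1 + 1)
                / (tf[t] + k1 * (1 - b + b * len(tokens) / avgdl)))    # TF normalization
            for t in tf}

doc_vecs = [bm25_vector(t) for t in tokenized]

def score(query, doc_vec):
    """Scoring a query is just a sparse dot product over the query's terms."""
    return sum(doc_vec.get(t, 0.0) for t in query.split())

print(sorted(range(N), key=lambda i: -score("sparse retrieval", doc_vecs[i])))
```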
True sparsity would imply keeping different important vector bases for different documents, but MRL doesn't magically shuffle vector bases around depending on what your document contains; were that the case, cosine similarity between the resulting document embeddings would simply make no sense as a similarity measure.
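A small illustration of this point (my own sketch, with made-up numbers): Matryoshka-style truncation keeps the same leading coordinates for every document, whereas a genuinely sparse representation activates different coordinates for different documents.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 16))           # four dense document embeddings, d = 16

# MRL-style truncation: every document keeps the *same* leading coordinates,
# so cosine similarity on the prefix still compares like with like.
mrl_prefix = docs[:, :8]

# True sparsity: each document activates a *different* subset of coordinates
# (chosen at random here just to show the contrast).
sparse = np.zeros_like(docs)
for i in range(len(docs)):
    active = rng.choice(16, size=4, replace=False)
    sparse[i, active] = docs[i, active]
```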
In fact, I think we can do it in d=2k dimensions, if we're willing to have arbitrarily precise query vectors.
Embed our points as (sin(theta), cos(theta), sin(2 theta), cos(2 theta), ..., sin(k theta), cos(k theta)), with theta uniformly spaced around the circle, and we should be able to select any k of them.
Using a few more dimensions we can then ease the precision requirements on the query.
Let H(z) = (z - exp(i theta)) (z - exp(i phi))
H(z) is zero only for our two points. We'll now take the norm^2, to get something that's minimized on only our two chosen points.
|H(exp(i x))|^2 = H(exp(i x)) conj(H(exp(i x))) = sum from j = -2 to 2 of c_j exp(i j x)
For some c_j (just multiply it out). Note that this is just a Fourier series with highest harmonic 2 (or k in general), so the c_j tell us the coefficients to use: the real and imaginary parts of c_1 and c_2 supply the four sin/cos weights, while the constant c_0 shifts every score equally and can be ignored.
For a simpler, but numerically nastier construction, instead use (t, t^2, t^3, ...), the real moment curve. Then given k points, take the degree k poly with those zeros. Square it, negate, and we get a degree 2k polynomial that's maximized only on our selected k points. The coefficients of this poly are the coefficients of our query vector, as once expanded onto the moment curve we can evaluate polynomials by dot product with the coefficients.
The complex version is exactly the same idea but with z^k and z ranging on the unit circle instead.
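The construction above is easy to check numerically. Below is a minimal sketch (mine, not from the thread) of the trigonometric version: n points uniformly spaced on the circle are embedded in d = 2k dimensions, and for any chosen subset of k points a query vector is built from the Fourier coefficients of |H(e^{ix})|^2, so its dot product is maximized exactly on that subset.

```python
import numpy as np

def embed(thetas, k):
    """Embed angles as (sin t, cos t, sin 2t, cos 2t, ..., sin kt, cos kt) in R^{2k}."""
    feats = []
    for j in range(1, k + 1):
        feats.append(np.sin(j * thetas))
        feats.append(np.cos(j * thetas))
    return np.stack(feats, axis=1)                    # shape (n, 2k)

def query_for_subset(thetas, subset):
    """Query vector whose dot product with the embeddings peaks exactly on `subset`."""
    k = len(subset)
    # H(z) = product over the chosen points of (z - exp(i theta)); h holds its
    # coefficients lowest-degree first, so H(e^{ix}) = sum_m h[m] e^{imx}.
    h = np.poly(np.exp(1j * thetas[subset]))[::-1]
    q = np.empty(2 * k)
    for j in range(1, k + 1):
        c_j = np.sum(h[j:] * np.conj(h[:-j]))         # harmonic-j coefficient of |H|^2
        q[2 * (j - 1)] = 2 * c_j.imag                 # weight on sin(j x)
        q[2 * (j - 1) + 1] = -2 * c_j.real            # weight on cos(j x)
    return q                                          # score(x) = c_0 - |H(e^{ix})|^2

n, k = 50, 4
thetas = np.linspace(0, 2 * np.pi, n, endpoint=False)
X = embed(thetas, k)                                  # "document" embeddings, d = 2k = 8
subset = np.array([3, 17, 29, 44])                    # any k of the n points
q = query_for_subset(thetas, subset)
top_k = np.argsort(X @ q)[-k:]
assert set(top_k) == set(subset)                      # exact top-k retrieval in 2k dims
```

With n = 50 and k = 4 this picks out any of the roughly 230,000 possible 4-subsets using an 8-dimensional query; the catch, as the comment notes, is that the query coefficients need increasing precision as n grows, which the extra dimensions mentioned above are meant to relax.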
In the end they still rely on Maximum Inner Product Search, just with several lookups over smaller partitions of the full embedding. The largest dataset is Books, where this paper suggests you'd need more than 512 embedding dimensions, and MoL with 256-dimensional embeddings split into 8 parts of 32 each has an abysmal hit rate.
So that's hardly a demonstration that arbitrary high-rank distributions can be approximated well. MoL seems to approximate them better than other approaches, but all of them are clearly hampered by the small embedding size.
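For readers unfamiliar with the acronym, MoL here refers to a mixture-of-logits-style scorer: query and document embeddings are split into several low-dimensional parts, each part contributes its own inner product, and a gating mechanism mixes them. The sketch below is only my reading of that idea with a stand-in uniform gate, not the paper's implementation; the split of 256 dimensions into 8 parts of 32 mirrors the setup the comment criticizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, parts = 256, 8
part_dim = d // parts                                  # 8 parts of 32 dims each

q = rng.normal(size=d)                                 # query embedding
docs = rng.normal(size=(1000, d))                      # document embeddings

def mol_scores(q, docs, gate_weights):
    """Mixture-of-logits-style score: gated sum of per-part inner products."""
    q_parts = q.reshape(parts, part_dim)               # (P, 32)
    d_parts = docs.reshape(len(docs), parts, part_dim) # (N, P, 32)
    logits = np.einsum("npk,pk->np", d_parts, q_parts) # per-part inner products (N, P)
    return logits @ gate_weights                       # mixed score per document (N,)

# Stand-in gate: in the actual model the mixture weights come from a small
# learned network conditioned on the query/document pair, not a fixed vector.
gate = np.full(parts, 1.0 / parts)
top10 = np.argsort(mol_scores(q, docs, gate))[-10:]
```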
When a human tries to retrieve information in a library, they first locate a book by category or by using a metadata keyword search. Then they open the table of contents (ToC) to find the relevant section, and repeat this process as needed. Therefore, I believe the future of AI retrieval systems should mimic this process. The recently popular PageIndex approach (see this discussion: https://news.ycombinator.com/item?id=45036944) also belongs to this category, as it generates a table-of-contents-like tree for LLMs to reason over. Again, it is a form of lossy compression, so its limitations can be discussed. However, this approach is the closest to how humans perform retrieval.
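A rough sketch of the table-of-contents-style retrieval described above (my own simplification, not the PageIndex implementation): the corpus is a tree of titled nodes, and retrieval repeatedly scores the children of the current node and descends. The `relevance` function here is a word-overlap stand-in for whatever judgment an LLM would make while reasoning over the ToC.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    text: str = ""                          # leaf content; internal nodes may be empty
    children: list = field(default_factory=list)

def relevance(query: str, node: Node) -> float:
    """Stand-in scorer: word overlap between the query and the node title."""
    q, t = set(query.lower().split()), set(node.title.lower().split())
    return len(q & t) / (len(q) or 1)

def toc_retrieve(query: str, node: Node) -> Node:
    """Descend the ToC tree, at each level following the most relevant child."""
    while node.children:
        node = max(node.children, key=lambda c: relevance(query, c))
    return node

library = Node("library", children=[
    Node("Information retrieval", children=[
        Node("Sparse lexical search", text="BM25, SPLADE, ..."),
        Node("Dense embedding search", text="bi-encoders, ANN indexes, ..."),
    ]),
    Node("Databases", children=[Node("Query planning", text="...")]),
])

print(toc_retrieve("dense embedding retrieval", library).title)
```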
As an example, current AI is famously very poor at learning relationships between non-scalar types, like complex geometry, which humans learn with ease. That isn’t too surprising because the same representation problem exists in non-AI computer science.
The no-free-lunch theorem only works because it assumes you care about every single value of noise. Nobody does. There's a free lunch to be had, and it's order. You don't care about a single-pixel difference between two cat pictures; NFL does.
Lossy compression is precisely where NFL does not apply.
I recommend Judith Flanders' "A Place for Everything" as both a history and a survey of the constraints on sorting and organising information in an alphabetic language. It's also a fun read!
tl;dr: why would we want an LLM to do something as inefficiently as a human?
I mean, I've heard about experiments running Turing machine simulations on neural networks, or even simulating some physics on NNs, but I haven't seen any good survey on these topics, and they could be very interesting subjects.