The Theoretical Limitations of Embedding-Based Retrieval
Key topics: Information Retrieval, Embedding-Based Retrieval, NLP
A research paper discusses the theoretical limitations of embedding-based retrieval, sparking discussion on the implications for information retrieval and potential alternatives like BM25.
Multi-vector models
Multi-vector models gain expressivity by using multiple vectors per sequence combined with the MaxSim operator [Khattab and Zaharia, 2020]. These models show promise on the LIMIT dataset, scoring well above the single-vector models despite using a smaller backbone (ModernBERT, Warner et al. [2024]). However, these models are not generally used for instruction-following or reasoning-based tasks, leaving open the question of how well multi-vector techniques will transfer to these more advanced tasks.
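As a minimal sketch (not the paper's implementation), the MaxSim scoring used by late-interaction models can be written as follows, assuming pre-computed, L2-normalized per-token embeddings held in NumPy arrays:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token embedding, take the
    maximum similarity over all document token embeddings, then sum.

    query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_tokens, dim).
    Assumes rows are L2-normalized, so the dot product equals cosine similarity.
    """
    sim = query_vecs @ doc_vecs.T        # (Q, D) token-to-token similarities
    return float(sim.max(axis=1).sum())  # max over doc tokens, sum over query tokens
```

Because the score keeps one vector per token rather than pooling everything into a single vector, two documents can still be told apart even when their pooled embeddings would collide, which is one intuition for the stronger LIMIT results.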
Sparse models
Sparse models (both lexical and neural) can be thought of as single-vector models with very high dimensionality. This dimensionality helps BM25 avoid the problems of the neural embedding models, as seen in Figure 3. Since the dimensionality of their vectors is high, they can scale to many more combinations than their dense counterparts. However, it is less clear how to apply sparse models to instruction-following and reasoning-based tasks where there is no lexical or even paraphrase-like overlap. We leave this direction to future work.
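To illustrate the "single vector with very high dimensionality" view, here is a hedged sketch of BM25-style sparse scoring, where each document becomes a mapping from terms (the vocabulary dimensions) to weights; the `idf` table, `avg_len`, and the k1/b defaults are conventional assumptions, not values from the paper:

```python
from collections import Counter

def bm25_vector(doc_tokens, idf, avg_len, k1=1.5, b=0.75):
    """Map a tokenized document to a sparse vector of BM25 term weights.
    The implicit dimensionality is the whole vocabulary; only terms that
    actually occur in the document get non-zero entries.
    """
    tf = Counter(doc_tokens)
    norm = k1 * (1.0 - b + b * len(doc_tokens) / avg_len)
    return {t: idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm)
            for t, f in tf.items()}

def sparse_dot(query_vec, doc_vec):
    """Dot product in vocabulary space: only overlapping terms contribute,
    which is why purely lexical models struggle when a relevant document
    shares no words with the query.
    """
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
```

The huge (vocabulary-sized) dimension is what lets these models represent many more top-k combinations than dense vectors, while the overlap-only scoring is exactly the limitation the paper flags for reasoning-based tasks.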
In other words, the paper suggests that both multi-vector (i.e., late-interaction) and sparse models hold promise.