11/19/2025, 12:53:22 PM

RAG Is Set Consumption, Not Ranking: A Metric Designed for RAG Evaluation

2 points
1 comment

Mood

thoughtful

Sentiment

neutral

Category

tech

Key topics

RAG evaluation

information retrieval

natural language processing

The post introduces a small family of set-based metrics for evaluating RAG (Retrieval-Augmented Generation) systems, arguing that classic ranking metrics such as nDCG, MAP, and MRR do not match how an LLM consumes a fixed top-K context. The only comment so far, from the author, summarizes the metrics and asks for feedback.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

First comment

2m

Peak period

1

Hour 1

Avg / period

1

Comment distribution: 1 data point

Based on 1 loaded comment

Key moments

  1. Story posted: 11/19/2025, 12:53:22 PM
  2. First comment: 11/19/2025, 12:55:16 PM (2m after posting)
  3. Peak activity: 1 comment in Hour 1
  4. Latest activity: 11/19/2025, 12:55:16 PM

Discussion (1 comment)
etoud
6h ago
This post argues that production RAG should be evaluated as set consumption, not as a user scrolling a ranked list. Classic IR metrics (nDCG / MAP / MRR) assume a human reader stepping through positions with a monotone position discount, which doesn’t match how an LLM ingests a fixed top-K context.

I propose a small family of set-based metrics:

• RA-nWG@K – “How good is the actual top-K set we fed the LLM vs the global oracle on the labeled corpus?”

• PROC@K – pool-restricted oracle ceiling: “How well could we have done with this retrieval pool if selection were perfect?”

• %PROC@K – reranker/selection efficiency: “Given that ceiling, how much did our actual top-K realize?”

The goal is to cleanly separate retrieval quality from reranking headroom instead of squinting at one nDCG number.
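
To make the decomposition concrete, here is a minimal Python sketch of how the three numbers could relate. The comment gives no formulas, so everything here is an assumption: the gain of a set is taken to be the plain sum of graded relevance labels, all three quantities are normalized as described in the comments, and the function names (`top_k_gain`, `set_metrics`) are hypothetical. The original post may weight grades differently.

```python
from typing import Dict, List, Sequence


def top_k_gain(gains: Sequence[float], k: int) -> float:
    """Best total gain any K-item set can reach: the sum of the k largest gains."""
    return sum(sorted(gains, reverse=True)[:k])


def set_metrics(
    labels: Dict[str, float],  # doc id -> graded gain for this query (assumed scale)
    pool: List[str],           # unique doc ids returned by retrieval, before reranking
    selected: List[str],       # doc ids actually fed to the LLM
    k: int,
) -> Dict[str, float]:
    # Unlabeled documents are assumed to contribute zero gain.
    actual = sum(labels.get(d, 0.0) for d in selected[:k])
    global_oracle = top_k_gain(list(labels.values()), k)
    pool_oracle = top_k_gain([labels.get(d, 0.0) for d in pool], k)

    # Assumed normalizations: actual set and pool ceiling against the global
    # oracle, and actual set against the pool ceiling.
    ra_nwg = actual / global_oracle if global_oracle > 0 else 0.0
    proc = pool_oracle / global_oracle if global_oracle > 0 else 0.0
    pct_proc = actual / pool_oracle if pool_oracle > 0 else 0.0
    return {"RA-nWG@K": ra_nwg, "PROC@K": proc, "%PROC@K": pct_proc}


# Toy query: the retriever missed "b", and selection then picked the two
# weakest documents in the pool.
labels = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
scores = set_metrics(labels, pool=["a", "c", "d"], selected=["c", "d"], k=2)
print(scores)
```

Under these assumptions the three numbers satisfy RA-nWG@K = PROC@K × %PROC@K, so a low end-to-end score can be attributed either to the retrieval pool (low PROC@K) or to selection within it (low %PROC@K). In the toy query the pool ceiling is 4/5 of the global oracle, while selection realizes only 1/4 of that ceiling.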

I’m actively refining this; if you see flaws, better decompositions, or edge cases where this breaks, I’d really like to hear them.

ID: 45979009 | Type: story | Last synced: 11/19/2025, 2:29:10 PM