RAG Is Set Consumption, Not Ranking: A Metric Designed for RAG Evaluation
The post introduces a small family of set-based metrics for evaluating RAG (Retrieval-Augmented Generation) systems, arguing that traditional ranking metrics such as nDCG are a poor fit because the LLM consumes the retrieved top-K as a set rather than as a ranked list. The discussion highlights the need for better evaluation methods.
Story posted: 11/19/2025, 12:53:22 PM
I propose a small family of set-based metrics:
• RA-nWG@K – “How good is the actual top-K set we fed the LLM vs the global oracle on the labeled corpus?”
• PROC@K – pool-restricted oracle ceiling: “How good could we have done with this retrieval pool if selection were perfect?”
• %PROC@K – reranker/selection efficiency: “Given that ceiling, how much did our actual top-K realize?”
The goal is to cleanly separate retrieval quality from reranking headroom instead of squinting at one nDCG number.
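The post does not spell out the formulas, so the following is only a minimal sketch of how the three quantities might relate. It assumes each labeled document has a non-negative graded utility, that the gain of a set is the plain sum of its members' utilities, and that the retrieved top-K is drawn from the retrieval pool. The function names, the `utility` mapping, and the sum-of-utilities gain are all hypothetical choices for illustration; the author's actual weighting in RA-nWG may differ.

```python
from typing import Dict, Iterable, List


def set_gain(doc_ids: Iterable[str], utility: Dict[str, float]) -> float:
    """Gain of a set: sum of utilities, order ignored (assumed gain function)."""
    return sum(utility.get(d, 0.0) for d in set(doc_ids))


def oracle_gain(candidates: Iterable[str], utility: Dict[str, float], k: int) -> float:
    """Best gain achievable by choosing any k documents from `candidates`."""
    best = sorted((utility.get(d, 0.0) for d in set(candidates)), reverse=True)[:k]
    return sum(best)


def rag_set_metrics(
    top_k: List[str], pool: List[str], utility: Dict[str, float], k: int
) -> Dict[str, float]:
    """Hypothetical decomposition into the three set-based metrics."""
    global_best = oracle_gain(utility.keys(), utility, k)  # global oracle on labeled corpus
    if global_best == 0:
        return {"RA-nWG@K": 0.0, "PROC@K": 0.0, "%PROC@K": 0.0}
    actual = set_gain(top_k[:k], utility)      # what was actually fed to the LLM
    pool_best = oracle_gain(pool, utility, k)  # perfect selection within the pool
    return {
        "RA-nWG@K": actual / global_best,
        "PROC@K": pool_best / global_best,
        "%PROC@K": actual / pool_best if pool_best > 0 else 0.0,
    }


# Toy labeled corpus: document id -> graded utility (made-up values).
utility = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 1.0, "e": 0.0}
pool = ["b", "d", "e", "c", "x"]  # retrieval pool; "x" is unlabeled, so utility 0
top_k = ["b", "d", "e"]           # what the reranker actually selected
metrics = rag_set_metrics(top_k, pool, utility, k=3)
print(metrics)
```

In this toy case the global oracle gain is 7 (documents a, b, c), the pool ceiling is 5 (b, c, d), and the actual set scores 3, so RA-nWG@3 = 3/7, PROC@3 = 5/7, and %PROC@3 = 3/5. Under these assumptions RA-nWG@K equals PROC@K times %PROC@K, which is what lets a low score be attributed either to the retriever (a low ceiling) or to selection (low efficiency against that ceiling).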
I’m actively refining this; if you see flaws, better decompositions, or edge cases where this breaks, I’d really like to hear them.