High Rate of LLM (gpt5) Hallucinations in Dense Stats Domains (cricket)
I'm doing a small experiment to see whether models "know when they know" on T20 international cricket scorecards (source: cricsheet.com). The idea is to test models on publicly available data they likely saw during training and see whether they hallucinate or admit they don't know.
Setup: Each question is about a single T20 match. The model must return an answer (numeric, or a choice from the given options) or `no_answer`.
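For concreteness, here is a minimal sketch (not the repo's actual code; the reply/gold formats are assumptions) of how a single reply could be graded under this protocol:

```python
# Hypothetical grading helper: the reply/gold formats are assumptions for
# illustration, not the actual schema used in the repo.
def grade(reply: str, gold: str) -> dict:
    """Classify one model reply as abstention, correct answer, or wrong answer."""
    normalized = reply.strip().lower()
    if normalized == "no_answer":
        return {"answered": False, "correct": False}
    try:
        # Numeric questions: compare as numbers ("52" matches "52.0").
        correct = float(normalized) == float(gold)
    except ValueError:
        # Multiple-choice questions: compare normalized strings.
        correct = normalized == gold.strip().lower()
    return {"answered": True, "correct": correct}
```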
Results (N=100 questions per model; see the scoring sketch after the table for how each metric is derived):
- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3
Note: most of the remaining "errors" with search are obscure or disputed cases where public sources disagree.
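Assuming per-question records shaped like the grading output above, the reported metrics can be reconstructed roughly like this (a sketch, not the repo's exact scoring script):

```python
def score(records: list[dict]) -> dict:
    """Aggregate per-question grades into the metrics reported above."""
    n = len(records)
    answered = [r for r in records if r["answered"]]
    correct = [r for r in records if r["correct"]]
    wrong = [r for r in answered if not r["correct"]]
    return {
        "answer_rate": len(answered) / n,
        "accuracy": len(correct) / n,  # correct out of all questions, answered or not
        "accuracy_answered": len(correct) / len(answered) if answered else 0.0,
        "hallucination_answered": len(wrong) / len(answered) if answered else 0.0,
        "wrong_per_100": 100 * len(wrong) / n,
    }
```

Note that "Accuracy" here is correct answers over all questions (so heavy abstention drags it down), while "Accuracy (answered)" and "Hallucination (answered)" condition on the model actually committing to an answer.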
It seems that for domains where models have likely seen some of the data, it's better to rely on abstention plus RAG than on a larger model with more coverage but a worse hallucination rate (see the sketch below).
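A minimal sketch of that abstention-gated retrieval pattern; `ask_model` and `lookup_scorecard` are hypothetical placeholders for whatever LLM call and retrieval step you use, not functions from the repo:

```python
from typing import Callable

def answer_with_fallback(
    question: str,
    ask_model: Callable[[str], str],         # hypothetical: wraps the LLM API in use
    lookup_scorecard: Callable[[str], str],  # hypothetical: fetches the relevant match scorecard
) -> str:
    """Try a closed-book answer first; retry with retrieved context only on abstention."""
    reply = ask_model(question)
    if reply.strip().lower() != "no_answer":
        return reply
    context = lookup_scorecard(question)
    return ask_model(f"{context}\n\n{question}")
```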
Code/Data: https://github.com/jobswithgpt/llmcriceval