High Rate of LLM (gpt5) Hallucinations in Dense Stats Domains (cricket)
I'm doing a small experiment to see whether models "know when they know" on T20 international cricket scorecards (source: cricsheet.com). The idea is to test models on publicly available data they likely saw during training and see whether they hallucinate or admit they don't know.
Setup: Each question is about a single T20 match. The model must return an answer (numeric, or a choice from the given options) or `no_answer`.
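For concreteness, here is a minimal sketch (not the repo's actual code; the reply/gold formats are assumptions) of how a single reply could be graded under this protocol:

```python
# Hypothetical grading helper: the reply/gold formats are assumptions for
# illustration, not the actual schema used in the repo.
def grade(reply: str, gold: str) -> dict:
    """Classify one model reply as abstention, correct answer, or wrong answer."""
    normalized = reply.strip().lower()
    if normalized == "no_answer":
        return {"answered": False, "correct": False}
    try:
        # Numeric questions: compare as numbers ("52" matches "52.0").
        correct = float(normalized) == float(gold)
    except ValueError:
        # Multiple-choice questions: compare normalized strings.
        correct = normalized == gold.strip().lower()
    return {"answered": True, "correct": correct}
```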
Results (N=100 questions per model; see the scoring sketch after the table for how each metric is derived):
- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3
Note: most of the remaining "errors" with search are obscure or disputed cases where public sources disagree.
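Assuming per-question records shaped like the grading output above, the reported metrics can be reconstructed roughly like this (a sketch, not the repo's exact scoring script):

```python
def score(records: list[dict]) -> dict:
    """Aggregate per-question grades into the metrics reported above."""
    n = len(records)
    answered = [r for r in records if r["answered"]]
    correct = [r for r in records if r["correct"]]
    wrong = [r for r in answered if not r["correct"]]
    return {
        "answer_rate": len(answered) / n,
        "accuracy": len(correct) / n,  # correct out of all questions, answered or not
        "accuracy_answered": len(correct) / len(answered) if answered else 0.0,
        "hallucination_answered": len(wrong) / len(answered) if answered else 0.0,
        "wrong_per_100": 100 * len(wrong) / n,
    }
```

Note that "Accuracy" here is correct answers over all questions (so heavy abstention drags it down), while "Accuracy (answered)" and "Hallucination (answered)" condition on the model actually committing to an answer.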
It seems that for domains where models have likely seen some of the data, it's better to rely on abstention plus RAG than on a larger model with more coverage but a worse hallucination rate (see the sketch below).
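A minimal sketch of that abstention-gated retrieval pattern; `ask_model` and `lookup_scorecard` are hypothetical placeholders for whatever LLM call and retrieval step you use, not functions from the repo:

```python
from typing import Callable

def answer_with_fallback(
    question: str,
    ask_model: Callable[[str], str],         # hypothetical: wraps the LLM API in use
    lookup_scorecard: Callable[[str], str],  # hypothetical: fetches the relevant match scorecard
) -> str:
    """Try a closed-book answer first; retry with retrieved context only on abstention."""
    reply = ask_model(question)
    if reply.strip().lower() != "no_answer":
        return reply
    context = lookup_scorecard(question)
    return ask_model(f"{context}\n\n{question}")
```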
Code/Data: https://github.com/jobswithgpt/llmcriceval