Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks
Key topics
A research paper evaluates probabilistic reasoning in large language models (LLMs) through language-only decision tasks. A commenter questions the validity of the results, asking whether the model's performance reflects genuine probabilistic reasoning or confounds such as prompt bias and pattern matching.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
- First comment: 18m after posting
- Peak period: 1 comment in 0-1h
- Avg / period: 1
Key moments
- Story posted: Nov 5, 2025 at 4:45 PM EST (about 2 months ago)
- First comment: Nov 5, 2025 at 5:02 PM EST (18m after posting)
- Peak activity: 1 comment in the 0-1h window (the hottest window of the conversation)
- Latest activity: Nov 5, 2025 at 5:02 PM EST (about 2 months ago)
For full context, read the primary article or the live Hacker News thread.
The paper claims that Qwen3-4B achieved an 89.2% best-arm selection rate, attributing it to superior "probabilistic reasoning". But this is a 2-armed bandit, where random guessing converges to ~50% over 500 runs of 25 iterations each. An 89% rate is suspiciously high and suggests to me that something else is happening, such as prompt bias or the model pattern-matching rather than reasoning.
When they increase from 2 to 5 arms, Qwen3-4B drops from 89% to 6.5% accuracy, which is well below even the 20% chance level for uniform random selection among 5 arms. If it truly had probabilistic reasoning capability, I'd expect performance to degrade more gracefully (see the baseline sketch below).
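To make the chance baselines concrete, here is a minimal Monte Carlo sketch of the random-guessing baseline. It assumes the reported setup of 500 runs of 25 pulls each and counts "best-arm selection" as the fraction of pulls that hit the best arm; that metric is my assumption, not something stated in the paper. Under uniform random choice the rate converges to 1/k: about 50% for 2 arms and 20% for 5.

```python
import random

def random_baseline(num_arms: int, runs: int = 500, pulls: int = 25) -> float:
    """Fraction of pulls on which a uniform random guesser picks the best arm."""
    best_arm = 0  # which index is 'best' doesn't matter for a uniform guesser
    hits = 0
    for _ in range(runs):
        for _ in range(pulls):
            if random.randrange(num_arms) == best_arm:
                hits += 1
    return hits / (runs * pulls)

if __name__ == "__main__":
    for k in (2, 5):
        print(f"{k} arms: random-guess best-arm rate ~ {random_baseline(k):.3f}")
```

Running it prints roughly 0.50 for 2 arms and 0.20 for 5, which is the baseline both reported numbers should be compared against.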
The "overthinking" explanation is hand-wavy. I don't see evidence or chain of reasoning. This is just a post-hoc story to explain unexpected results.
No discussion of variance, confidence intervals, or statistical significance. With 500 runs, these should be straightforward to calculate.
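As a rough illustration of how easy that calculation would be, here is a sketch of a 95% confidence interval for the reported rate, assuming (my assumption, not the paper's) that 89.2% is a binomial success rate over n = 500 independent runs. It uses a Wilson score interval, so it needs nothing beyond the standard library.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

if __name__ == "__main__":
    lo, hi = wilson_ci(successes=round(0.892 * 500), n=500)  # 446 of 500 runs (assumed)
    print(f"89.2% over 500 runs -> 95% CI ~ [{lo:.3f}, {hi:.3f}]")
```

Even a rough interval like this (a few percentage points wide) would show whether the reported rates are cleanly separated from the chance baselines.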
Does the reported 89% accuracy on a binary choice task strike anyone else as implausibly high for what is being claimed?